Banking & Insurance Dataset for Data Analysis in RStudio | R-bloggers

When you are working on a project involving data analysis or statistical modeling, it's crucial to understand the dataset you're using. In this guide, we'll explore a synthetic dataset created for customers in the banking and insurance sectors. Whether you're a researcher, a student, or a business analyst, understanding how data is structured and analyzed can make a huge difference. This data comes with a variety of features that offer insights into customer behaviors, financial statuses, and policy preferences.

Dataset Origin and Context

The dataset, designed for analysis in tools like RStudio or SPSS, combines customer details such as age, account balance, and insurance premiums. Businesses in the finance and insurance industries need data like this to help them optimize customer experiences, improve retention rates, and refine risk assessment models.

Dataset Structure

In any data analysis, understanding the basic structure of your dataset is key. This dataset consists of 1,000 rows (representing individual customers) and 10 columns. The columns include a mix of categorical variables (like Gender and Marital Status) and numeric variables (like Account Balance and Credit Score). This combination allows you to explore relationships and trends across various customer attributes.

File Formats and Access

The data is provided in CSV format, making it easy to load into tools such as RStudio, Excel, or SPSS. For those who need assistance with data analysis or want to perform statistical tests, this format is ideal for quick importing and processing.

Variables

| Variable | Type | Description | Distribution / Levels |
|---|---|---|---|
| CustomerID | Categorical | Unique identifier for each customer | CUST0001 – CUST1000 |
| Gender | Categorical | Gender of the customer | Male, Female (≈49%/51%) |
| MaritalStatus | Categorical | Marital status | Single, Married, Divorced, Widowed |
| EducationLevel | Categorical | Highest education attained | High School, College, Graduate, Post-Graduate, Doctorate |
| IncomeCategory | Categorical | Annual income bracket | Brackets up to >120K |
| PolicyType | Categorical | Type of insurance policy held | Life, Health, Auto, Home, Travel |
| Age | Numeric | Age in years | Normal distribution, μ = 45, σ = 12 |
| AccountBalance | Numeric | Bank account balance in USD | Normal distribution, μ = 20,000, σ = 5,000 |
| CreditScore | Numeric | FICO credit score | Normal distribution, μ = 715, σ = 50 |
| InsurancePremium | Numeric | Annual premium paid in USD | Normal distribution, μ = 1,000, σ = 300 |
| ClaimAmount | Numeric | Total claims paid in USD per year | Normal distribution, μ = 5,000, σ = 2,000 |

Categorical Variables

Categorical variables are important because they represent grouped or qualitative data. In this dataset, you'll find attributes like Gender (Male/Female), Marital Status (Single, Married, etc.), and Policy Type (Health, Auto, Home, etc.). Understanding these helps in analyzing demographics and preferences. For example, a company could use this information to understand the market distribution of different insurance products.

Numeric Variables

Numeric variables like Age, Account Balance, and Credit Score are continuous and provide a clear, measurable view of each customer's financial standing. These variables allow for in-depth statistical analysis, such as regression models or predictive analytics, to forecast customer behavior or policy outcomes. A business could use these variables to assess financial health or risk levels for insurance.
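As a starting point, here is a minimal sketch of loading and profiling the file in R. It assumes the CSV shipped with the post ("Bank and insurance.csv") is in the working directory and uses the column names from the table above.

```r
# Minimal sketch: load the CSV and profile the variables described above.
library(readr)
library(dplyr)

bank <- read_csv("Bank and insurance.csv")

# Quick structural check on rows and columns
dim(bank)
glimpse(bank)

# Categorical variables: frequency tables for demographics and policy mix
bank |> count(Gender)
bank |> count(PolicyType, sort = TRUE)

# Numeric variables: summary statistics for financial standing
bank |>
  summarise(
    mean_age     = mean(Age),
    mean_balance = mean(AccountBalance),
    mean_score   = mean(CreditScore)
  )
```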
Distributional Assumptions

The numeric variables, like Age and Account Balance, follow normal distributions, meaning the values are centered around a mean with a set standard deviation. This ensures the dataset mirrors real-world scenarios, where values tend to follow a natural spread. Understanding these distributions helps in applying appropriate statistical methods when analyzing the data.

Data Quality and Validation

Missing Value Treatment

Before conducting any analysis, it's essential to address missing data. This dataset has been cleaned and preprocessed so that missing values are handled appropriately, whether by imputation or removal. Clean data ensures that the results of your analysis are valid and reliable.

Outlier Detection and Handling

Outliers can significantly skew an analysis. Methods like z-scores or boxplots detect outliers in variables like Insurance Premium or Claim Amount. Once detected, these outliers can be adjusted or removed, ensuring your analysis reflects true patterns rather than anomalies.

Consistency Checks (e.g., Income Category vs. Account Balance)

Data consistency is crucial for making accurate predictions. For example, customers with an Income Category of ">120K" should logically have a higher Account Balance. Consistency checks across variables ensure that the dataset aligns with real-world logic.

Usage and Analysis Examples

Demographic Profiling

Understanding customer demographics helps businesses create targeted marketing campaigns or personalized product offerings. This dataset allows you to analyze how age, marital status, and education level correlate with preferences for certain types of insurance policies or account balances.

Credit Risk Modeling

One of the most common applications of this data is credit risk modeling. By analyzing Credit Score alongside Account Balance, you can build models to predict a customer's likelihood of defaulting on payments or making insurance claims.

Insurance Claim Prediction

Predicting insurance claims is another use case for this dataset. By studying the relationship between Age, Policy Type, and Claim Amount, businesses can create more accurate models to predict future claims and optimize policy pricing. A sketch of these checks and a simple claim model appears at the end of this post.

Documentation and Maintenance

Versioning and Change Log

As datasets evolve, it is important to maintain version control. Any changes to the dataset are documented with clear versioning and change logs, so users know exactly when and why adjustments were made.

Contact and Governance

If you require further assistance with data analysis, the team at RStudioDatalab offers support through Zoom, Google Meet, chat, and email, whether you need guidance on statistical tests or further clarification on the dataset.

Download: Bank and insurance.csv (100 KB)

Transform your raw data into actionable insights with R and advanced data analysis techniques. For a personalized consultation, contact contact@rstudiodatalab.com or schedule a discovery call.
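Tying the sections above together, this is a minimal sketch of the outlier detection, consistency check, and claim-prediction ideas. The thresholds and the model form are illustrative choices, not prescribed by the post.

```r
library(readr)
library(dplyr)

bank <- read_csv("Bank and insurance.csv")

# Outlier detection: flag premiums more than 3 standard deviations from the mean
bank_flagged <- bank |>
  mutate(
    premium_z       = (InsurancePremium - mean(InsurancePremium)) / sd(InsurancePremium),
    premium_outlier = abs(premium_z) > 3
  )
table(bank_flagged$premium_outlier)

# Boxplot view of Claim Amount outliers
boxplot(bank$ClaimAmount, main = "Claim Amount (USD)")

# Consistency check: mean Account Balance should rise with the income bracket
bank |>
  group_by(IncomeCategory) |>
  summarise(mean_balance = mean(AccountBalance), n = n())

# Simple claim-prediction model: Claim Amount as a function of Age and Policy Type
claim_fit <- lm(ClaimAmount ~ Age + PolicyType, data = bank)
summary(claim_fit)
```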

Exploring `RSQLite` With `DBI`: A Note To Myself | R-bloggers

I messed around with DBI and RSQLite and learned it's actually pretty simple to use in R: just connect, write tables, and use SQL queries without all the complicated server stuff. Thanks to Alec Wong for suggesting this!

Motivation

After our last blog, my friend Alec Wong suggested that I switch from storing data in CSV files to SQLite when building a Plumber API. I had no idea that CSV files can get corrupted when multiple users hit the API at the same time! SQLite handles this automatically and lets you validate your data without needing to set up any complicated server stuff. It's actually pretty straightforward; here is a note to myself of some simple and frequently used functions.

Objectives

- Connecting to A Database
- List Tables
- Check Data
- Add Data
- Query Data Using glue_sql
- Remove Data
- Disconnect
- Lessons Learnt

Connecting to A Database

The workflow starts by loading DBI, RSQLite, and the tidyverse, opening a connection object `con`, and, when reading data back, converting the stored dates with `as_datetime(date, tz = "America/New_York")`. A sketch of the full set of functions is included at the end of this post.

Lessons Learnt

- Lots of goodies on the DBI official website
- Learnt how to set up SQLite on the Rpi and incorporated it into the previous migraine logger
- Definitely need to be comfortable with SQL to use this
- Might be a good idea to add this to the pressure logger too! Maybe in the same database but a different table!

If you like this article, please feel free to send me a comment or visit my other blogs, follow me on BlueSky, Twitter, GitHub or Mastodon, and contact me if you would like to collaborate.
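Here is a minimal sketch of the steps listed above (connect, list tables, check, add, query with glue_sql, remove, disconnect). The database file, table name (`migraine_log`), and columns are assumptions for illustration, not necessarily the author's exact schema.

```r
library(DBI)
library(RSQLite)
library(glue)
library(lubridate)
library(dplyr)

# Connecting to a database (the file is created if it does not exist)
con <- dbConnect(RSQLite::SQLite(), "migraine.sqlite")

# Add data (hypothetical table and columns; SQLite stores the dates as text)
df <- tibble(date = as.character(now(tzone = "America/New_York")), note = "test entry")
if (!dbExistsTable(con, "migraine_log")) dbCreateTable(con, "migraine_log", df)
dbAppendTable(con, "migraine_log", df)

# List tables and check data
dbListTables(con)
dbReadTable(con, "migraine_log")

# Query data using glue_sql, which safely interpolates values into SQL
note_filter <- "test entry"
sql <- glue_sql("SELECT * FROM migraine_log WHERE note = {note_filter}", .con = con)
dbGetQuery(con, sql) |>
  as_tibble() |>
  mutate(date = as_datetime(date, tz = "America/New_York"))

# Remove data
dbExecute(con, glue_sql("DELETE FROM migraine_log WHERE note = {note_filter}", .con = con))

# Disconnect
dbDisconnect(con)
```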

Getting My Feet Wet With `Plumber` and JavaScript | R-bloggers

Tried out plumber and a bit of JavaScript to build a simple local API for logging migraine events 🧠💻. Just a quick tap on my phone now records the time to a CSV—pretty handy! 📱✅

Motivation

After our previous blog on barometric pressure monitoring, my friend Alec Wong said, "Won't it be great if we can just hit a button and it will record an event?" In this case, the reason for recording barometric pressure is to see if there is a link between migraine events and barometric pressure values or changes. And yes, it would be great if we could create an app of some sort to make recording much easier! There are many ways to do this. The way that maximizes learning within the R environment is to use plumber to create an API we can interact with to record events. Our use case is actually quite straightforward: we just need something that records the current timestamp when a button is clicked. Simple! But since I've never used plumber before, this is a great opportunity to explore it, and a bit of JavaScript too. Again, this blog is more for my benefit, where it serves as a note to myself. Here we go!

Objectives:

- Big Picture
- plumber.R
- How to run it?
- One Click on iOS
- Opportunities for Improvement
- Lessons Learnt

Big Picture

As the image above shows, we want an app on our phone that, once clicked, will somehow change a CSV data frame. All of this can be done by plumber setting up an API in front of the CSV. Since I just want to be able to do this on a local network from a different device (e.g. a Raspberry Pi), we don't need to deploy this to DigitalOcean or a server per se. We can run it in the background and set up systemctl in case the Rpi restarts, point it to 0.0.0.0, and we can GET/POST via the device's IP. Yes, unfortunately this will not work if we're no longer on the local network, which, at least for my use, will be just fine. No need to expose port forwarding. The safer way would be to use a DigitalOcean droplet, so you're not exposing your own IP and an open port to the public. That also means you may have to pay some 💰 (e.g. ~$5/month). Maybe someday, when it can incorporate the barometric pressure and/or other metrics.

plumber.R

The API file loads library(plumber) and library(readr) and points `file` at the CSV that stores the log. On the phone side, a small piece of JavaScript calls this API with fetch and handles the result as a Promise chain:

- .then(response => response.json()): after the fetch request completes, this takes the response from the server and calls the .json() method on it, which parses the JSON response body into a JavaScript object. This method also returns a Promise that resolves to the parsed JSON data.
- .then(data => { const resultDiv = document.getElementById("result"); resultDiv.textContent = data[0]; resultDiv.style.display = "block"; }): this is the next step in the Promise chain. Once the JSON is parsed, it finds the HTML element with the ID "result", sets its text content to the first item in the data array (data[0]), and makes the element visible by setting its CSS display property to "block".
- .catch(error => { const resultDiv = document.getElementById("result"); resultDiv.textContent = error.message }): this catches any errors that occur during the fetch operation or when processing the response. If an error happens, it finds the HTML element with ID "result" and sets its text content to the error message.

The interesting thing I've not come across before is the arrow function: response => response.json() means function(response) { return response.json() }.

More Plumber API functions: the logging endpoint itself is annotated with #* @post /log and its handler records the current date. A sketch of what plumber.R can look like is included at the end of this post.
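As a rough sketch (not the author's exact file), a plumber.R along these lines exposes a single POST endpoint that appends the current timestamp to a CSV, which is the behavior described above. The CSV path, column name, and port are assumptions for illustration.

```r
# plumber.R: minimal sketch of a timestamp-logging API
library(plumber)
library(readr)

file <- "migraine_log.csv"   # hypothetical path to the log CSV

#* Log an event by appending the current timestamp to the CSV
#* @post /log
function() {
  date_now <- format(Sys.time(), tz = "America/New_York", usetz = TRUE)
  write_csv(data.frame(date = date_now), file, append = file.exists(file))
  list(paste("Logged:", date_now))
}

# To serve it on the local network (e.g. from a Raspberry Pi), run from R:
# plumber::pr("plumber.R") |> plumber::pr_run(host = "0.0.0.0", port = 8000)
```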

Exploring a 3-D Synthetic Dataset | R-bloggers

Exploring the HistData package

Over on BlueSky, I have been working through a few challenges. For the months of February and March, I participated in the DuBois Challenge, where you take a week to recreate some of the powerful visualizations that came out of the Paris Exposition from W.E.B. Du Bois. My work there, complete with code, can be found on my GitHub.

Inspired by this, I've also been doing the #30DayChartChallenge, where you make a chart a day on a theme that changes each day. I have taken this as an opportunity to explore Michael Friendly's HistData package, which draws from his excellent book with Howard Wainer. I have done posts on John Snow, the Trial of the Pyx, Florence Nightingale, and others on my GitHub.

However, one dataset that a simple plot doesn't do justice to is the Pollen dataset. This dataset, like mtcars and flights, is a synthetic dataset that was used as a data challenge (the other two are now basic datasets for reprexes as well). This dataset, however, shows the power of plotly.

Code in R

library(tidyverse)
library(HistData)
library(plotly)
data("Pollen")
head(Pollen)

# A tibble: 6 × 5
   ridge    nub  crack  weight density
1  -2.35   3.63   5.03  10.9     -1.39
2  -1.15   1.48   3.24  -0.594    2.12
3  -2.52  -6.86  -2.80   8.46    -3.41
4   5.75  -6.51  -5.15   4.35   -10.3
5   8.75  -3.90  -1.38 -14.9     -2.42
6  10.4   -3.16  12.8  -14.9     -6.49

The first three variables are meant to be plotted on the x, y, and z axes, while the other variables describe the grains of pollen. Doing a quick correlation shows that there is at least one strong correlation that can be seen through the use of color: weight is highly correlated with the x-axis.

Code in R

... |>
  add_markers(color = ~weight, size = 2) |>
  layout(title = "David Coleman's Synthetic Pollen Dataset") |>
  config(displayModeBar = FALSE)

Can you see it? Image below to recreate… Eureka! (A fuller reconstruction of the plotting call is sketched after the citation below.)

Citation (BibTeX):

@online{russell2025,
  author = {Russell, John},
  title = {Exploring a {3-D} {Synthetic} {Dataset}},
  date = {2025-04-12},
  url = {https://drjohnrussell.github.io/posts/2025-04-12-Plotly-Pollen-Dataset/},
  langid = {en}
}

For attribution, please cite this work as: Russell, John. 2025. "Exploring a 3-D Synthetic Dataset." April 12, 2025. https://drjohnrussell.github.io/posts/2025-04-12-Plotly-Pollen-Dataset/.
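For anyone who wants to recreate the figure, here is a reconstruction of the plotting call: the x, y, and z mapping follows the note above that ridge, nub, and crack are the plotting axes, and the remaining calls come from the fragments quoted in the post. The correlation step and the exact plot_ly() arguments are assumptions, not necessarily the author's original code.

```r
library(HistData)
library(plotly)

data("Pollen")

# Quick correlation check: weight is strongly related to ridge (the x-axis)
cor(Pollen)

# 3-D scatter of the synthetic pollen grains, coloured by weight
plot_ly(Pollen, x = ~ridge, y = ~nub, z = ~crack) |>
  add_markers(color = ~weight, size = 2) |>
  layout(title = "David Coleman's Synthetic Pollen Dataset") |>
  config(displayModeBar = FALSE)
```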