# Install and load packages

```{r}
# Install necessary packages if not already installed
pkgs <- c("tidyverse", "janitor", "Hmisc", "car", "corrplot", "MASS",
          "ggpubr", "ggcorrplot", "olsrr", "caret")
for (p in pkgs) {
  if (!requireNamespace(p, quietly = TRUE)) install.packages(p)
}

# Load the packages
library(tidyverse)  # Data wrangling
library(janitor)    # To clean column names
library(Hmisc)      # To create correlation coefficients with p-values
library(car)        # Durbin-Watson test and VIF
library(corrplot)   # To visualise a correlation matrix
library(MASS)       # To create a histogram with truehist()
library(ggpubr)     # Customising plots
library(ggcorrplot) # Correlation matrix heatmap
library(olsrr)      # Variable selection
library(caret)      # Cross-validation
```

Note that `MASS` masks `dplyr::select()`, which is why we call `dplyr::select()` explicitly below.
### About the data
The dataset used in this analysis comes from a study of 94 participants performing treadmill exercises. Data were collected on the following variables: Maximum Volume of Oxygen used per minute per kilogram of body weight (max VO2), duration of exercise (duration), maximum heart rate during exercise (heart rate), age, height, weight, sex (1 = male, 2 = female), ethnicity (1 = Habesha, 2 = White, 3 = West African), and whether there was an energy shortage during exercise (energy; 1 = yes, 0 = no).
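The data are read in with `readr` and the column names standardised with `janitor::clean_names()`:

```{r}
excercise <- read_csv("Data/excercise.csv") %>% clean_names()
```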
```
Rows: 94 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): duration, max VO2, HeartRate, age, height, weight, ethnicity, sex, ...
```
### Exploratory data analysis

Let's investigate the relationships among the continuous variables in our dataset.
```{r}
# Summary of continuous variables
summary(excercise)
```

```
   duration         max_vo2        heart_rate         age            height
Min.   :180.0   Min.   :15.30   Min.   :104.0   Min.   :21.0   Min.   :154.0
1st Qu.:520.8   1st Qu.:30.40   1st Qu.:168.0   1st Qu.:44.0   1st Qu.:168.0
Median :599.5   Median :35.85   Median :175.0   Median :49.5   Median :174.0
Mean   :576.7   Mean   :35.63   Mean   :175.4   Mean   :49.6   Mean   :173.9
3rd Qu.:647.8   3rd Qu.:41.38   3rd Qu.:188.0   3rd Qu.:57.0   3rd Qu.:180.0
Max.   :827.0   Max.   :50.90   Max.   :210.0   Max.   :74.0   Max.   :193.0

    weight         ethnicity         sex         energy_shortage
Min.   : 47.00   Min.   :1.00   Min.   :1.000   Min.   :0.0000
1st Qu.: 65.25   1st Qu.:1.25   1st Qu.:1.000   1st Qu.:0.0000
Median : 76.50   Median :2.00   Median :1.000   Median :0.0000
Mean   : 73.31   Mean   :2.17   Mean   :1.362   Mean   :0.4362
3rd Qu.: 82.00   3rd Qu.:3.00   3rd Qu.:2.000   3rd Qu.:1.0000
Max.   :106.00   Max.   :3.00   Max.   :2.000   Max.   :1.0000
```
To better understand the data, we will label the numeric categories for sex, ethnicity, and energy shortage as factors.
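```{r}
excercise <- excercise %>%
  mutate(
    sex = if_else(sex == 1, "male", if_else(sex == 2, "female", NA)),
    ethnicity = if_else(ethnicity == 1, "Habesha",
                        if_else(ethnicity == 2, "White",
                                if_else(ethnicity == 3, "West African", NA))),
    energy = if_else(energy_shortage == 1, "yes",
                     if_else(energy_shortage == 0, "no", NA)),
    # Derive BMI from weight (kg) and height (cm)
    BMI = round(weight / (height / 100)^2)
  ) %>%
  dplyr::select(-energy_shortage)
```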
#### Correlation Analysis

Correlation measures the strength and direction of the linear relationship between two variables. We'll explore Pearson's, Spearman's, and Kendall's correlation coefficients.
#### Correlation Matrix

We'll start by creating a correlation matrix for the continuous variables using the `cor()` function.
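```{r}
# Select continuous variables
cont_var <- excercise %>% dplyr::select(duration:weight, BMI)

# Create correlation matrix
cor(cont_var)
```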
**The `rcorr()` function** from `Hmisc` creates a correlation matrix together with p-values: the first matrix shows the correlation coefficients between the variables and the second shows the corresponding p-values.
```{r}
# Create matrix of correlation coefficients and p-values
rcorr(as.matrix(cont_var))
```
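#### Visualizing Correlation

We can visualize the correlation matrix using the `corrplot()` function from the `corrplot` package.

```{r}
# Visualize the correlation matrix
corrplot(cor(cont_var))
```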
##### Pearson's Correlation

Pearson's correlation measures the linear relationship between two continuous variables. It ranges from -1 to 1, where 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 means no linear relationship.
To calculate Pearson's correlation coefficient between the `max_vo2` and `duration` variables in the `exercise` dataset, use the following code:
```{r}
# Pearson's correlation between max_vo2 and duration
cor(excercise$max_vo2, excercise$duration, method = "pearson")
```

```
[1] 0.9063969
```
##### Spearman's Correlation for Ranked Variables
Spearman’s correlation measures the monotonic relationship between two ranked variables. It is suitable for ordinal data or continuous data that do not meet the assumptions of Pearson’s correlation.
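```{r}
# Spearman's correlation between max_vo2 and heart_rate
cor(excercise$max_vo2, excercise$heart_rate, method = "spearman")
```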
##### Kendall's Correlation

Kendall's correlation coefficient is used when the sample size is small and there are many tied ranks.
```{r}
# Kendall's correlation between max_vo2 and age
cor(excercise$max_vo2, excercise$age, method = "kendall")
```

```
[1] -0.4221638
```
**Properties of the correlation coefficient (r):**

- `r` only measures the strength of a linear relationship.
- `r` is always between -1 and 1 inclusive: -1 means perfect negative linear correlation and +1 means perfect positive linear correlation.
- `r` does not change if the independent (x) and dependent (y) variables are interchanged.
- `r` does not change if the scale on either variable is changed. You may multiply, divide, add, or subtract a value from all the x-values or y-values without changing the value of r.
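As a quick illustrative check of the last two properties, we can verify on our data that swapping the variables or rescaling `duration` leaves r unchanged:

```{r}
# Illustrative check: r is symmetric and scale-invariant
cor(excercise$duration, excercise$max_vo2)          # r
cor(excercise$max_vo2, excercise$duration)          # same r with x and y swapped
cor(excercise$duration * 2 + 10, excercise$max_vo2) # same r after rescaling x
```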
##### Scatter Plot
Let’s create a scatter plot to visualize the relationship between max_vo2 and duration.
```{r}
# Scatter plot
excercise %>%
  ggplot() +
  # Add colour and transparency
  geom_point(aes(x = duration, y = max_vo2, colour = sex), alpha = 1/2) +
  # Add main title, axis labels, and legend titles
  labs(title = "Correlation between maximum volume of oxygen and durations \n doing exercise on treadmill (n = 94)",
       subtitle = "Source: Online",
       y = "Max Vo2 (minutes/kg)", x = "Duration (minutes)",
       colour = "Sex") +
  # Customise the theme
  theme_classic()
```
*Figure 1: Scatter plot - the relationship between maximum volume of oxygen and duration of exercise.*

As we can see, the scatter plot displays the form and direction of the relationship between maximum volume of oxygen and duration of exercise: the maximum volume of oxygen increases as the duration of exercise increases.
### Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data. The explanatory variable (x) and the dependent variable (y) are key components, with y being numeric (continuous). In Figure 1, for the relationship between maximum volume of oxygen (y) and duration of exercise (x), the fitted line is the one that best predicts `max_vo2` from `duration` by minimizing the sum of the squares of the vertical distances of the points from the line.
::: callout-note
#### Assumptions of Linear Regression

1. **Independence of Observations**: Residuals should not be correlated.
2. **Linearity**: The relationship between the independent and dependent variables should be linear.
3. **Homoscedasticity**: Residuals should have constant variance.
4. **Normality**: Residuals should follow a normal distribution.
:::
##### 1. Independence of Observations (No Autocorrelation)

The observations must be independent of each other, meaning the residuals (errors) from one observation should not correlate with the residuals of another. This assumption can be checked with the [Durbin-Watson test](https://www.statology.org/durbin-watson-test/), which we run after fitting a model. A significant test statistic (a small corresponding p-value) indicates that the residuals in the regression model are autocorrelated.
```{r}
# Fit the model
model <- lm(max_vo2 ~ duration, data = excercise)

# Perform the Durbin-Watson test
durbinWatsonTest(model)
```
```
 lag Autocorrelation D-W Statistic p-value
   1    -0.007425459      1.976968   0.862
 Alternative hypothesis: rho != 0
```

From the output above we can see that the p-value is 0.862, indicating that the residuals in this regression model are not autocorrelated.
##### 2. Linearity
The relationship between the independent (duration) and dependent (max_vo2) variables must be linear. This can be assessed by plotting the variables and looking for a linear pattern.
```{r}
# Plot to check linearity
plot(max_vo2 ~ duration, data = excercise)
```
The points in the plot above fall roughly on a straight line, indicating a linear relationship between max_vo2 and duration.
##### 3. Homoscedasticity

The equal-variance assumption is that the residuals have constant variance at every level of x. This means that the prediction error doesn't change significantly over the range of predictions of the model. We evaluate this assumption with a fitted values vs. residuals plot.
```{r}
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))
```

The residuals vs. fitted plot shows the residuals scattered fairly evenly around zero across the range of fitted values, consistent with constant variance.
##### 4. Normality

The residuals (errors) should follow a normal distribution. This assumption can be checked with a histogram of the residuals or a Q-Q plot.
Example of histogram:
```{r}
# prob = TRUE plots densities, so the y-axis is labelled accordingly
truehist(excercise$max_vo2, xlab = "Max_VO2", ylab = "Density",
         main = "Max_VO2 per minute per kilogram of body weight ~ Normal",
         prob = TRUE)
curve(dnorm(x, mean = mean(excercise$max_vo2), sd = sd(excercise$max_vo2)),
      add = TRUE)
```
*Figure 3: Histogram (with fitted Normal distribution) of Max_VO2 (minutes/kg)*

The distribution (Figure 3) extends from roughly 15 to 51, plausible low and high values for the maximum volume of oxygen used per minute per kilogram of body weight. The distribution appears slightly asymmetric (skewed to the left).
Example of Q-Q plot:
```{r}
# Titles and labels go on qqnorm(); qqline() adds the reference line
qqnorm(excercise$max_vo2, main = 'Q-Q Plot for Normality',
       xlab = 'Theoretical Dist', ylab = 'Sample Dist')
qqline(excercise$max_vo2, col = 'steelblue')
```
Alternatively, we can plot directly from the model:

```{r}
plot(model, 2)
```
Shapiro-Wilk Test is a common statistical test of normality. The test provides a test statistic W along with a corresponding p-value. If the p-value is less than α =.05, there is sufficient evidence to say that the sample does not come from a population that is normally distributed.
```{r}
shapiro.test(excercise$max_vo2)
```

```
	Shapiro-Wilk normality test

data:  excercise$max_vo2
W = 0.97977, p-value = 0.1539
```
The Shapiro-Wilk test p-value (0.154) is greater than 0.05, so we cannot reject the hypothesis that the maximum volume of oxygen is normally distributed.

Our checks show that all the assumptions are met.
**How does linear regression work?**
The most common method for fitting a linear regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line.
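To make the least-squares idea concrete, here is a small illustrative sketch that computes the slope and intercept from the closed-form formulas and compares them with `lm()`:

```{r}
# Least-squares estimates from the closed-form formulas
b1 <- cov(excercise$duration, excercise$max_vo2) / var(excercise$duration)
a  <- mean(excercise$max_vo2) - b1 * mean(excercise$duration)
c(intercept = a, slope = b1)

# Should match the coefficients returned by lm()
coef(lm(max_vo2 ~ duration, data = excercise))
```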
#### Simple Linear Regression
A regression model aims to model the magnitude of the relationship between one independent variable and a dependent variable. The simple linear regression model can be written with an explicit error term as:
$$
y = a + b_1 x + e
$$
Where \(a\) is the intercept of the regression line (the value of \(y\) when \(x\) is zero); \(b_1\) is the slope of the regression line; and \(e\) is a random error component, the difference between the actual and the estimated dependent variable for a given value of the independent variable. The value of the error \(e\) will vary from subject to subject, even if the independent variable remains the same. Note that both \(a\) and \(b_1\) are population parameters, usually unknown, and hence estimated from the data.

Let's fit this model to our data:
```{r}
# Fit the simple linear regression model
model <- lm(max_vo2 ~ duration, data = excercise)
```
The model results are printed with the `summary()` function:

```{r}
# Model output
summary(model)
```
```
Call:
lm(formula = max_vo2 ~ duration, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8341 -1.7243  0.0338  1.8036  7.7629 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.06473    1.56876   2.591   0.0111 *  
duration     0.05474    0.00266  20.581   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.19 on 92 degrees of freedom
Multiple R-squared:  0.8216,	Adjusted R-squared:  0.8196
F-statistic: 423.6 on 1 and 92 DF,  p-value: < 2.2e-16
```
- *Estimate*: the estimated coefficient (here, for the intercept and for duration).
- *R-squared*: the proportion of variance explained by the model; it measures the strength of the relationship between the model and the dependent variable.
- *p-value and t-statistic*: they measure the extent to which a given coefficient is statistically significant; a larger t-statistic corresponds to a smaller p-value, which indicates stronger evidence against the null hypothesis.

To get the confidence intervals we can use the `confint()` function.
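```{r}
# Confidence intervals for the model coefficients
confint(model)
```

The fitted model of max_vo2 by duration (minutes) is

$$
max\_vo2 = 4.1 + 0.055 \times duration
$$

A one-minute increase in duration is associated with an expected increase of 0.055 minutes/kg in max_vo2.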
#### Multiple Linear Regression

Multiple linear regression models the magnitude of the relationship between an explanatory variable and a dependent variable while taking into account the effects of other explanatory variables.
$$
y = a + b_1 x_1 + b_2 x_2 + \dots + b_n x_n + e
$$
**Multicollinearity**, dependency between predictors, is an additional assumption to check when we fit a multiple regression model. Its presence can show up as inflated coefficient estimates and spurious relationships.
```{r}
# Compute correlations to 2 decimal places
corr_matrix <- round(cor(cont_var), 2)

# Compute and show the result
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower", lab = TRUE)
```
We can see two strong correlations, with values higher than 0.8:
- Max_vo2 and duration: 0.91
- BMI and weight: 0.86
This result makes sense because BMI is computed from weight and height. In this case, we can drop either BMI or weight from our model.

Let's illustrate this by adding more variables to the previous model:
```{r}
# Fit a multiple linear regression model with all predictors
model2 <- lm(max_vo2 ~ ., data = excercise)
summary(model2)
```
The model summary alone doesn't reveal these issues directly. To check formally, let's use a measure called the VIF (Variance Inflation Factor):
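```{r}
# Variance inflation factors for each predictor
vif(model2)
```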
The VIF (Variance Inflation Factor) indicates that multicollinearity is present in our model. As a general guideline, a VIF value exceeding 10 is a cause for concern. In our case, height, weight, and BMI are contributing to this issue. Since BMI is derived from weight and height, we should include only one of these variables in our fitness model to address the multicollinearity problem.
```{r}
formula <- max_vo2 ~ duration + heart_rate + age + ethnicity + weight + sex + energy
model3 <- lm(formula, data = excercise)
```
Check the VIF again before interpreting the results; for model3 it indicates that multicollinearity is no longer an issue.
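```{r}
vif(model3)
```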
Explore the final model and interpret the results:

```{r}
summary(model3)
```
```
Call:
lm(formula = formula, data = excercise)

Residuals:
   Min     1Q Median     3Q    Max 
-7.543 -1.428  0.310  1.485  5.285 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
(Intercept)             8.451716   5.652625   1.495    0.139    
duration                0.039455   0.004159   9.486 5.59e-15 ***
heart_rate              0.024679   0.021822   1.131    0.261    
age                    -0.006641   0.039216  -0.169    0.866    
ethnicityWest African   0.497364   1.009112   0.493    0.623    
ethnicityWhite          0.257835   0.805013   0.320    0.750    
weight                 -0.014511   0.024418  -0.594    0.554    
sexmale                -0.847960   0.746998  -1.135    0.260    
energyyes               3.990363   0.876642   4.552 1.76e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.686 on 85 degrees of freedom
Multiple R-squared:  0.8831,	Adjusted R-squared:  0.8721
F-statistic: 80.26 on 8 and 85 DF,  p-value: < 2.2e-16
```
##### Model selection

The process of selecting significant predictors for a model involves using criteria like R², Akaike Information Criteria (AIC), or Bayesian Information Criteria (BIC). There are various methods such as forward selection, backward selection, and stepwise selection, which are similar and yield comparable final models. Here, forward selection is used, which follows these steps:

1. Start with all predictors "not in the model."
2. Iterate over these predictors, refitting the model with each variable.
3. Add the variable with the lowest p-value to the model if its p-value is less than α. Repeat until no more variables can be added.
```{r}
# Run the forward selection algorithm using an alpha of 0.05
model2 <- lm(max_vo2 ~ ., data = excercise)
fwdmodel2 <- ols_step_forward_p(model2, penter = 0.05)
fwdmodel2
```
```
                          Stepwise Summary
-------------------------------------------------------------------------
Step    Variable       AIC        SBC       SBIC        R2       Adj. R2
-------------------------------------------------------------------------
 0      Base Model   648.809    653.896    378.795    0.00000    0.00000
 1      duration     488.802    496.432    220.812    0.82156    0.81962
 2      energy       455.051    465.224    188.681    0.87801    0.87533
 3      heart_rate   455.102    467.818    189.037    0.88051    0.87653
 4      sex          455.598    470.857    189.879    0.88241    0.87712
-------------------------------------------------------------------------

Final Model Output
------------------

                         Model Summary
---------------------------------------------------------------
R                    0.939       RMSE                  2.562
R-Squared            0.882       MSE                   6.930
Adj. R-Squared       0.877       Coef. Var             7.388
Pred R-Squared       0.868       AIC                 455.598
MAE                  1.949       SBC                 470.857
---------------------------------------------------------------
 RMSE: Root Mean Square Error
 MSE: Mean Square Error
 MAE: Mean Absolute Error
 AIC: Akaike Information Criteria
 SBC: Schwarz Bayesian Criteria

                             ANOVA
---------------------------------------------------------------------
                 Sum of
                Squares    DF    Mean Square       F         Sig.
---------------------------------------------------------------------
Regression     4628.510     4       1157.128    166.966    0.0000
Residual        616.798    89          6.930
Total          5245.308    93
---------------------------------------------------------------------

                         Parameter Estimates
----------------------------------------------------------------------------------------
      model     Beta    Std. Error    Std. Beta       t       Sig      lower     upper
----------------------------------------------------------------------------------------
(Intercept)    6.416         3.005                 2.135    0.036      0.445    12.388
   duration    0.040         0.004        0.670   11.513    0.000      0.033     0.047
  energyyes    4.154         0.714        0.276    5.815    0.000      2.734     5.573
 heart_rate    0.026         0.020        0.065    1.314    0.192     -0.013     0.066
    sexmale   -0.861         0.719       -0.055   -1.198    0.234     -2.289     0.567
----------------------------------------------------------------------------------------
```
Variable selection balances making the model realistic and keeping it simple. Let's subset the data, keeping only the selected variables.
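```{r}
excercise <- excercise %>% dplyr::select(max_vo2, duration, energy, sex)
```

Let's first fit the model with duration only: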
```{r}
formula2 <- max_vo2 ~ duration
model4 <- lm(formula2, data = excercise)
summary(model4)
```
```
Call:
lm(formula = formula2, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8341 -1.7243  0.0338  1.8036  7.7629 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.06473    1.56876   2.591   0.0111 *  
duration     0.05474    0.00266  20.581   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.19 on 92 degrees of freedom
Multiple R-squared:  0.8216,	Adjusted R-squared:  0.8196
F-statistic: 423.6 on 1 and 92 DF,  p-value: < 2.2e-16
```
Add energy: R² increases from 0.82 to 0.88.
```{r}
formula3 <- max_vo2 ~ duration + energy
model5 <- lm(formula3, data = excercise)
summary(model5)
```
```
Call:
lm(formula = formula3, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2909 -1.4316  0.4028  1.7800  6.0130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.109167   1.445456   5.610 2.16e-07 ***
duration    0.044390   0.002727  16.281  < 2e-16 ***
energyyes   4.413022   0.680037   6.489 4.43e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.652 on 91 degrees of freedom
Multiple R-squared:  0.878,	Adjusted R-squared:  0.8753
F-statistic: 327.5 on 2 and 91 DF,  p-value: < 2.2e-16
```
Add sex: this time there is only a small change from the previous model.

```{r}
formula4 <- max_vo2 ~ duration + energy + sex
model6 <- lm(formula4, data = excercise)

# Print the estimated coefficients in a tidy format
broom::tidy(model6)
```
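Check the assumptions for this model:

```{r}
par(mfrow = c(2, 2))
plot(model6)
par(mfrow = c(1, 1))
```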
We can also plot the diagnostics one by one to better understand each output.

**Linearity**: the relationship between the response (max_vo2) and the regressors is linear; the residuals should scatter around a straight line through the fitted values.
```{r}
plot(model6, which = 1)
```
**Normality**: the errors are normally distributed.

```{r}
plot(model6, which = 2)
```
**Homogeneity of variance**: the error term ε has constant variance.

```{r}
plot(model6, which = 3)
```
**Results**:

In the final model, the F-statistic has a p-value of less than 2.2e-16, indicating strong statistical significance. This suggests that at least one predictor variable significantly affects the outcome. The R² tells us the model has improved by about 5% over the duration-only model (model4). To see only the coefficients:
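```{r}
summary(model6)$coefficients
```

The fitted model (coefficients as reported in the stepwise output above):

$$
max\_vo2 = 6.416 + 0.04 \times duration + 4.15 \times energy_{yes}
$$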
We estimate that the Max_vo2 value will be 6.416 per minute per kilogram of body weight when duration is zero and energy is "no".
Coefficient for duration (0.04): Max_vo2 increases by 0.04 units for each additional minute of exercise.
Coefficient for energy (4.15): Max_vo2 increases by 4.15 units if the energy condition is “yes.”
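As an illustration, we can use the fitted model to predict max_vo2 for a hypothetical participant (the values below are made up for demonstration, not taken from the study):

```{r}
# Hypothetical new observation (illustrative values)
new_obs <- data.frame(duration = 600, energy = "yes", sex = "male")
predict(model6, newdata = new_obs, interval = "prediction")
```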
::: callout-note
#### Model Diagnostics

We check the model assumptions using various diagnostic plots and tests.

1. **Independence of Observations**: Durbin-Watson test
2. **Linearity**: Scatter plot of residuals vs. fitted values
3. **Homoscedasticity**: Fitted values vs. residuals plot
4. **Normality**: Histogram and Q-Q plot of residuals
:::
#### Train-Test Split and Model Performance

Historically, the performance of a model was judged largely by goodness-of-fit tests and an assessment of residuals. For predictive modeling, we perform cross-validation to ensure the model performs well on resampled data, comparing predicted values to actual values via a loss function. The simplest approach involves:

1. Splitting the data into a training set and a test set.
2. Fitting the model to the training set.
3. Predicting responses in the test set.
4. Evaluating the prediction quality.
Are there any rules for how to split the data into training and test sets? There are [no specific rules](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio) for the percentage split between training and test data, but two common proportions are 80:20 and 70:30. The key is to balance maximizing the data for fitting the model while retaining enough data to assess its performance effectively.

There are different mechanisms for splitting the data into training and test sets. For demonstration, we cover simple random sampling and the `caret` package.
```{r}
set.seed(123) # for reproducibility
n_train <- sample(1:nrow(excercise), round(nrow(excercise) * 0.8))
train_data <- excercise[n_train, ]
test_data  <- excercise[-n_train, ]
```
Using the `caret` package:

```{r}
# Split the data randomly into a training set and a test set
set.seed(100)
n_train2 <- createDataPartition(excercise$max_vo2, p = 0.8, list = FALSE)
train_data2 <- excercise[n_train2, ]
test_data2  <- excercise[-n_train2, ]
```
Mean Absolute Error (MAE) and Mean Squared Error (MSE), along with the Root Mean Squared Error (RMSE), are the most common metrics to use.

Fit the model on the training data:

```{r}
model4 <- lm(max_vo2 ~ ., data = train_data2)
predictions <- predict(model4, test_data2)
```
Measure performance by comparing the predictions with the observed data using multiple criteria. The model with the smaller RMSE and MAE is optimal.
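We compute R², RMSE, and MAE with `caret`'s helper functions:

```{r}
r_sq <- R2(predictions, test_data2$max_vo2)
rmse <- RMSE(predictions, test_data2$max_vo2)
mae  <- MAE(predictions, test_data2$max_vo2)
print(c(R2 = r_sq, RMSE = rmse, MAE = mae))
```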
This means that, on average, our prediction of max_vo2 (per minute per kilogram of body weight) is off by about 3 minutes/kg, i.e., a prediction error of about 8%.
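The prediction error rate is the RMSE divided by the mean of the observed outcome:

```{r}
pred_error_rate <- rmse / mean(test_data2$max_vo2)
pred_error_rate
```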