Install and load packages

# Install necessary packages if not already installed
if (!requireNamespace("tidyverse", quietly = TRUE)) install.packages("tidyverse")

if (!requireNamespace("janitor", quietly = TRUE)) install.packages("janitor")

if (!requireNamespace("car", quietly = TRUE)) install.packages("car")

if (!requireNamespace("Hmisc", quietly = TRUE)) install.packages("Hmisc")

if (!requireNamespace("corrplot", quietly = TRUE)) install.packages("corrplot")

if (!requireNamespace("MASS", quietly = TRUE)) install.packages("MASS")

if (!requireNamespace("ggpubr", quietly = TRUE)) install.packages("ggpubr")

if (!requireNamespace("ggcorrplot", quietly = TRUE)) install.packages("ggcorrplot")

if (!requireNamespace("olsrr", quietly = TRUE)) install.packages("olsrr")

if (!requireNamespace("caret", quietly = TRUE)) install.packages("caret")

# Load the packages
library(tidyverse)   # For data wrganling

Warning: package 'tidyverse' was built under R version 4.3.3

Warning: package 'ggplot2' was built under R version 4.3.3

Warning: package 'tibble' was built under R version 4.3.3

Warning: package 'tidyr' was built under R version 4.3.3

Warning: package 'readr' was built under R version 4.3.3

Warning: package 'purrr' was built under R version 4.3.3

Warning: package 'forcats' was built under R version 4.3.3

Warning: package 'lubridate' was built under R version 4.3.3

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(janitor)     # To clean column names

Warning: package 'janitor' was built under R version 4.3.3


Attaching package: 'janitor'

The following objects are masked from 'package:stats':

    chisq.test, fisher.test

library(Hmisc)       # To create a correlation coeficient

Warning: package 'Hmisc' was built under R version 4.3.3


Attaching package: 'Hmisc'

The following objects are masked from 'package:dplyr':

    src, summarize

The following objects are masked from 'package:base':

    format.pval, units

library(car)         # Independent test

Warning: package 'car' was built under R version 4.3.3

Loading required package: carData

Warning: package 'carData' was built under R version 4.3.3


Attaching package: 'car'

The following object is masked from 'package:dplyr':

    recode

The following object is masked from 'package:purrr':

    some

library(corrplot)    # To visualise correlation matrix

Warning: package 'corrplot' was built under R version 4.3.3

corrplot 0.92 loaded

library(MASS)        # Create a hsitogram


Attaching package: 'MASS'

The following object is masked from 'package:dplyr':

    select

library(ggpubr)      # customizing the plot

Warning: package 'ggpubr' was built under R version 4.3.3

library(ggcorrplot)  # correlation matrix

Warning: package 'ggcorrplot' was built under R version 4.3.3

library(olsrr)       # variable selection

Warning: package 'olsrr' was built under R version 4.3.3


Attaching package: 'olsrr'

The following object is masked from 'package:MASS':

    cement

The following object is masked from 'package:datasets':

    rivers

library(caret)       # cross validation

Warning: package 'caret' was built under R version 4.3.3

Loading required package: lattice

Attaching package: 'caret'

The following object is masked from 'package:purrr':

    lift

About the data

The dataset used in this analysis comes from a study of 94 participants performing treadmill exercises. Data were collected on the following variables: Maximum Volume of Oxygen used per minute per kilogram of body weight (max VO2), duration of exercise (duration), maximum heart rate during exercise (heart rate), age, height, weight, sex (1 = male, 2 = female), ethnicity (1 = Habesha, 2 = White, 3 = West African), and whether there was an energy shortage during exercise (energy; 1 = yes, 0 = no).

excercise <- read_csv("Data/excercise.csv") %>% 
  clean_names()

Rows: 94 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): duration, max VO2, HeartRate, age, height, weight, ethnicity, sex, ...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Exploratory data analysis

Lets investigate the relationships among the continuous variables in our dataset.

#Summary of continous variable
summary(excercise)

    duration        max_vo2        heart_rate         age           height     
 Min.   :180.0   Min.   :15.30   Min.   :104.0   Min.   :21.0   Min.   :154.0  
 1st Qu.:520.8   1st Qu.:30.40   1st Qu.:168.0   1st Qu.:44.0   1st Qu.:168.0  
 Median :599.5   Median :35.85   Median :175.0   Median :49.5   Median :174.0  
 Mean   :576.7   Mean   :35.63   Mean   :175.4   Mean   :49.6   Mean   :173.9  
 3rd Qu.:647.8   3rd Qu.:41.38   3rd Qu.:188.0   3rd Qu.:57.0   3rd Qu.:180.0  
 Max.   :827.0   Max.   :50.90   Max.   :210.0   Max.   :74.0   Max.   :193.0  
     weight         ethnicity         sex        energy_shortage 
 Min.   : 47.00   Min.   :1.00   Min.   :1.000   Min.   :0.0000  
 1st Qu.: 65.25   1st Qu.:1.25   1st Qu.:1.000   1st Qu.:0.0000  
 Median : 76.50   Median :2.00   Median :1.000   Median :0.0000  
 Mean   : 73.31   Mean   :2.17   Mean   :1.362   Mean   :0.4362  
 3rd Qu.: 82.00   3rd Qu.:3.00   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :106.00   Max.   :3.00   Max.   :2.000   Max.   :1.0000

To better understand the data, we will label the numeric categories for sex, ethnicity, and energy shortage as factors.

excercise <- excercise %>%
  mutate(
    sex = if_else(sex == 1, "male",
                  if_else(sex == 2, "female", NA)),
    ethnicity = if_else(ethnicity == 1, "Habesha",
                        if_else(ethnicity == 2, "White",
                                if_else(ethnicity == 3, "West African", NA))),
    energy = if_else(energy_shortage == 1, "yes",
                     if_else(energy_shortage==0, "no", NA)),
  BMI = round(weight/(height/100)^2)
  ) %>% 
  dplyr::select(-energy_shortage)

Correlation Analysis

Correlation measures the strength and direction of the linear relationship between two variables. We’ll explore Pearson’s, Spearman’s, and Kendall’s correlation coefficients.

Correlation Matrix

We’ll start by creating a correlation matrix for the continuous variables using the cor() function.

# Select continuous variables
cont_var <- excercise %>% 
  dplyr::select(duration:weight, BMI)

# Create correlation matrix
cor(cont_var)

             duration     max_vo2   heart_rate        age      height
duration    1.0000000  0.90639686  0.677786803 -0.6570770 -0.07834020
max_vo2     0.9063969  1.00000000  0.647399860 -0.6247552 -0.04562266
heart_rate  0.6777868  0.64739986  1.000000000 -0.5859920  0.09109966
age        -0.6570770 -0.62475521 -0.585992041  1.0000000 -0.14772007
height     -0.0783402 -0.04562266  0.091099661 -0.1477201  1.00000000
weight     -0.1789527 -0.16159781 -0.006629393 -0.1834303  0.76228720
BMI        -0.1875713 -0.19115742 -0.078175596 -0.1400085  0.34483241
                 weight        BMI
duration   -0.178952680 -0.1875713
max_vo2    -0.161597810 -0.1911574
heart_rate -0.006629393 -0.0781756
age        -0.183430258 -0.1400085
height      0.762287200  0.3448324
weight      1.000000000  0.8633786
BMI         0.863378561  1.0000000

The rcorr () Function

It will create a correlation matric with p value. The first matrix shows the correlation coefficients between the variables and the second matrix shows the corresponding p-values.

#create matrix of correlation coefficients and p-values
rcorr(as.matrix(cont_var))

           duration max_vo2 heart_rate   age height weight   BMI
duration       1.00    0.91       0.68 -0.66  -0.08  -0.18 -0.19
max_vo2        0.91    1.00       0.65 -0.62  -0.05  -0.16 -0.19
heart_rate     0.68    0.65       1.00 -0.59   0.09  -0.01 -0.08
age           -0.66   -0.62      -0.59  1.00  -0.15  -0.18 -0.14
height        -0.08   -0.05       0.09 -0.15   1.00   0.76  0.34
weight        -0.18   -0.16      -0.01 -0.18   0.76   1.00  0.86
BMI           -0.19   -0.19      -0.08 -0.14   0.34   0.86  1.00

n= 94 


P
           duration max_vo2 heart_rate age    height weight BMI   
duration            0.0000  0.0000     0.0000 0.4529 0.0844 0.0702
max_vo2    0.0000           0.0000     0.0000 0.6624 0.1197 0.0650
heart_rate 0.0000   0.0000             0.0000 0.3825 0.9494 0.4539
age        0.0000   0.0000  0.0000            0.1554 0.0768 0.1783
height     0.4529   0.6624  0.3825     0.1554        0.0000 0.0007
weight     0.0844   0.1197  0.9494     0.0768 0.0000        0.0000
BMI        0.0702   0.0650  0.4539     0.1783 0.0007 0.0000

Visualizing Correlation

We can visualize the correlation matrix using the corrplot function from the corrplot package.

#visualize correlation matrix
corrplot(cor(cont_var))

Correlation Coefficients

Pearson Correlation

Pearson’s correlation measures the linear relationship between two continuous variables. It ranges from -1 to 1, where 1 means a perfect positive linear relationship, -1 means a perfect negative linear relationship, and 0 means no linear relationship.

To calculate Pearson’s correlation coefficient between the max_vo2 and duartion variables in the exercise dataset, use the following code:

# Pearson's correlation between max_vo2 and duration
cor(excercise$max_vo2, excercise$duration, method = "pearson")

[1] 0.9063969

Spearman’s Correlation for Ranked Variables

Spearman’s correlation measures the monotonic relationship between two ranked variables. It is suitable for ordinal data or continuous data that do not meet the assumptions of Pearson’s correlation.

cor(excercise$max_vo2, excercise$heart_rate, method = "spearman")

[1] 0.5390491

Kendall’s Correlation

Kendall’s correlation coefficient is used when the sample size is small and there are many tied ranks.

# Spearman's correlation between max_vo2 and heart_rate
cor(excercise$max_vo2, excercise$age, method = "kendall")

[1] -0.4221638

Properties of correlation coefficient (r) are:

r only measures the strength of a linear relationship.
r is always between -1 and 1 inclusive. -1 means perfect negative linear correlation and +1 means perfect positive linear correlation
r does not change if the independent (x) and dependent (y) variables are inter- changed
r does not change if the scale on either variable is changed. You may multiply, divide, add, or subtract a value from all the x-values or y-values without changing the value of r.

Scatter Plot

Let’s create a scatter plot to visualize the relationship between max_vo2 and duration.

 # Scatter plot
excercise %>% 
  ggplot() +
  # Add color
  # Add shape
  # Add transparency
  geom_point(aes(x=duration, y=max_vo2,
                 colour = sex),
             alpha = 1/2) +
  # Add main title, axis labels, and legend titles
  labs(title = "Correlation between maximum volume of oxygen and durations \n doing exercise on treadmill (n =94)", 
       subtitle = "Source: Online",
       y = "Max Vo2 (minutes/kg)", 
       x = "Duration (minutes)",
       colour="Sex") +
  # Customize theme
  theme_classic()

Figure 1: Scatter plot - the relationship between maximum volume of oxygen and duration of exercise.

As we can see the scatter plot displays the form and directions of the relationship between maximum volume of oxygen and duration of exercise - the maximum volume of oxygen increase when the duration of exercise increased.

Linear Regression

Linear regression models the relationship between two variables by fitting a linear equation to observed data. The explanatory variable (x) and the dependent variable (y) are key components, with Y being numeric (continuous). In Figure 1, the relationship between the maximum volume of oxygen (Y) and duration of exercise (X) the line that best predicts max_vo2 from duration by minimizing the sum of the squares of the vertical distances of the points from the line.

Assumptions of Linear Regression

Independence of Observations: Residuals should not be correlated.
Linearity: The relationship between the independent and dependent variables should be linear.
Homoscedasticity: Residuals should have constant variance.
Normality: Residuals should follow a normal distribution

1. Independence of Observations (No Autocorrelation)

The observations must be independent of each other, meaning the residuals (errors) from one observation should not correlate with the residuals of another. This assumption could be check using the Durbin-Watson test. We will check after we fit a model. A significant test statistics corresponding p-value depict the residuals in the regression model is autocorrelated.

# Fitting the model
model = lm(max_vo2 ~ duration, data = excercise) 
#perform Durbin-Watson test
durbinWatsonTest(model)

 lag Autocorrelation D-W Statistic p-value
   1    -0.007425459      1.976968   0.862
 Alternative hypothesis: rho != 0

From the output above we can see that the p-value is 0.866 which is indicating that the residuals in this regression model are not autocorrelated.

2. Linearity

The relationship between the independent (duration) and dependent (max_vo2) variables must be linear. This can be assessed by plotting the variables and looking for a linear pattern.

# Plotting to check linearity
plot(max_vo2 ~ duration, data = excercise)

The points in the plot above look like they fall on roughly a straight line, which indicates that there is a linear relationship between max_vo2 and duration in exercise.

3. Homoscedasticity

The equal variance assumptions is that the residuals have constant variance at every level of x. This means that the prediction error doesn’t change significantly over the range of prediction of the model. The assumptions will be evaluated by creating a fitted value vs. residual plot.

par(mfrow=c(2,2))
plot(model)

par(mfrow=c(1,1))

The scatter plot shows the residual are closer the fitted values.

4. Normality

The residuals (errors) should follow a normal distribution. This assumption can be checked by plotting a histogram of the residuals, Q_Q plots.

Example of histogram:

truehist(excercise$max_vo2, xlab = "Max_VO2", ylab = "Percentage", main= "Max_VO2 per minute per kilogram of body weight~Normal", prob = TRUE)
curve (dnorm(x,mean=mean(excercise$max_vo2),sd=sd(excercise$max_vo2)), add=TRUE)

Figure 3: Histogram (with fitted Normal distribution) of Max_VO2(minutes/Kg)

The distribution (Figure3) extends from 20 to 50; a possible low and high values of the volume of oxygen, forcibly blown out in one second after a full intake of breath. The distribution seems asymmetric (skewed to the left).

Example of Q-Q plot:

qqnorm(excercise$max_vo2)
qqline(excercise$max_vo2,
       main = 'Q-Q Plot for Normality', xlab = 'Theoretical Dist',
       ylab = 'Sample dist', col = 'steelblue')

OR , we can plot from the model

plot(model, 2)

Shapiro-Wilk Test is a common statistical test of normality. The test provides a test statistic W along with a corresponding p-value. If the p-value is less than α =.05, there is sufficient evidence to say that the sample does not come from a population that is normally distributed.

shapiro.test(excercise$max_vo2)


    Shapiro-Wilk normality test

data:  excercise$max_vo2
W = 0.97977, p-value = 0.1539

The Shapiro-Wilk Normality test tells us the maximum volume of oxygen is distributed normally.

Our assumptions check showed as all the assumptions are met.

How linear regression works?

The most common method for fitting a linear regression line is the method of least-squares. This method calculates the best-fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line (if a point lies on the fitted line exactly, then its vertical deviation is 0). More precisely, the goal of regression is to minimize the sum of the squares of the vertical distances of the points from the line.

Simple Linear Regression

A regression model aims to model the magnitude of the relationship between one independent variable and a dependent variable. The simple linear regression model can be written with an explicit error term as:

\[ y= a+ b_1 (x) +e \]

Where \(a\) is the intercept of the regression line when the \(x\) is null; \(b_1\) is the slope of the regression line. and \(e\) is a random component. It is the difference between the actual and the estimated dependent variable based on the size of the independent variable. The value of error (e) will vary from subject to subject, even if independent variable remains the same. Note that both α and b are population parameters which are usually unknown and hence estimated from the data by a and b.

Below is the equation using our data

# Fit linear regression model
model<- lm(max_vo2 ~ duration, data = excercise)

The model results will be printed using the summary function

# model output
summary(model)


Call:
lm(formula = max_vo2 ~ duration, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8341 -1.7243  0.0338  1.8036  7.7629 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.06473    1.56876   2.591   0.0111 *  
duration     0.05474    0.00266  20.581   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.19 on 92 degrees of freedom
Multiple R-squared:  0.8216,    Adjusted R-squared:  0.8196 
F-statistic: 423.6 on 1 and 92 DF,  p-value: < 2.2e-16

Estimated: the coefficient for duration

R-square: the proportion of variance explained by the model and it measures the strength of the relationship between the model and the dependent variable.

p-value and t-statistic: they measure the extent to which a given coefficient is statistically significant - a higher t-statistics corresponds to a smaller p-value which indicates strong evidence against the null hypothesis.

To get the confidence interval we can use the confint() function.

# confidence interval
confint.lm(model)

                 2.5 %     97.5 %
(Intercept) 0.94903418 7.18041819
duration    0.04945863 0.06002395

We can also plot the residuals with histogram to check the normality assumptions.

Residuals <- residuals(model)
hist(Residuals, freq = FALSE)
curve(dnorm(x,mean = mean(Residuals),sd=sd(Residuals)), add = TRUE)

Interpreting the results:

The fitted model of max_vo2 by duaration (minutes):

\[ max_vo2 = 4.1 + 0.055 * durations (minutes) \]

A minutes increase in duration is associated with an expected increase of 0.055 minutes/kg in max_vo2.

The fitted model can be visualized with a regression line added to the scatter plot.

inc.graph<-ggplot(model,
                  aes(x=duration,
                      y=max_vo2 ))+
  geom_point() +
  labs(title = "The fitted model of max_vo2 by duration",
       y = "Max_vo2 (minutes/kg)",
       x = "Durations (minutes)") +
  theme_bw()
# print
inc.graph

Add the linear regression line to the plotted data

inc.graph <- inc.graph + geom_smooth(method="lm", col="red")

# print
inc.graph

`geom_smooth()` using formula = 'y ~ x'

Add the equation for the regression line.

inc.graph <- inc.graph +
  stat_regline_equation(label.x = 205, label.y = 27) + # for regression equation
  stat_cor(aes(label = after_stat(rr.label)), abel.x = 205, label.y = 24) # for R^2

Warning in stat_cor(aes(label = after_stat(rr.label)), abel.x = 205, label.y =
24): Ignoring unknown parameters: `abel.x`

inc.graph

`geom_smooth()` using formula = 'y ~ x'

Multiple Linear Regression

Multiple linear regression model aims to model the magnitude of the relationship between two variable while taking account the effect of other explantory variable and a dependent variable.

\[ y = a + bx_1 + bx_2 + bx_3 + bx_n + e \]

Multicollinearity - the dependency between predictors is another assumptions of linear regression analysis when we fit multiple regression analysis.

The presence of multidisciplinary can be checked if there is a an inflated estimated coefficients and untrue relationships.

# Compute correlation at 2 decimal places
corr_matrix = round(cor(cont_var), 2)

# Compute and show the  result
ggcorrplot(corr_matrix, hc.order = TRUE, type = "lower",
          lab = TRUE)

We can notice two strong correlations because their value is higher than 0.8.

- Max_vo2 and duration: 0.91

- BMI and weight: 0.86

This result makes sense because BMI is computed from weight and age. In this case, we can get rid of the either BMI or weight from our model.

Let’s illustrate this by adding more variable into the previous model

# Fit multiple linear regression model
model2 <- lm(lm(formula = max_vo2 ~., data = excercise))
summary(model2)


Call:
lm(formula = lm(formula = max_vo2 ~ ., data = excercise))

Residuals:
   Min     1Q Median     3Q    Max 
-7.647 -1.653  0.298  1.527  5.388 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)           26.265707  30.979448   0.848    0.399    
duration               0.039669   0.004217   9.406 1.01e-14 ***
heart_rate             0.020959   0.022136   0.947    0.346    
age                   -0.004533   0.040201  -0.113    0.911    
height                -0.092808   0.185259  -0.501    0.618    
weight                 0.137680   0.227125   0.606    0.546    
ethnicityWest African  0.607009   1.018407   0.596    0.553    
ethnicityWhite         0.381494   0.814760   0.468    0.641    
sexmale               -0.843426   0.750917  -1.123    0.265    
energyyes              4.010409   0.879724   4.559 1.76e-05 ***
BMI                   -0.519829   0.660626  -0.787    0.434    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.694 on 83 degrees of freedom
Multiple R-squared:  0.8851,    Adjusted R-squared:  0.8713 
F-statistic: 63.95 on 10 and 83 DF,  p-value: < 2.2e-16

Our model result above doesn’t indicate these issues present in our model, so multicollinearity is likely not an issue. To confirm this, let’s use a measure called the VIF (Variance Inflation Factor) computed as:

vif(model2)

                 GVIF Df GVIF^(1/(2*Df))
duration     3.523124  1        1.876999
heart_rate   2.169643  1        1.472971
age          2.406007  1        1.551131
height      32.491547  1        5.700136
weight     113.782522  1       10.666889
ethnicity    2.585810  2        1.268087
sex          1.685510  1        1.298272
energy       2.464164  1        1.569766
BMI         52.942047  1        7.276129

The VIF (Variance Inflation Factor) indicates that multicollinearity is present in our model. As a general guideline, a VIF value exceeding 10 is a cause for concern. In our case, height, weight, and BMI are contributing to this issue. Since BMI is derived from weight and height, we should include only one of these variables in our fitness model to address the multicollinearity problem.

formula =  max_vo2 ~ duration + heart_rate + age + ethnicity + weight + sex + energy
model3 <- lm(formula, data = excercise)

Check the VIF again before interpreting the results. The model3 results indicated that multicollinearity is not an issue.

vif(model3)

               GVIF Df GVIF^(1/(2*Df))
duration   3.448220  1        1.856938
heart_rate 2.122043  1        1.456723
age        2.304269  1        1.517982
ethnicity  2.509935  2        1.258681
weight     1.323586  1        1.150472
sex        1.678684  1        1.295641
energy     2.462661  1        1.569287

Explore the final model results and interprete the results

summary(model3)


Call:
lm(formula = formula, data = excercise)

Residuals:
   Min     1Q Median     3Q    Max 
-7.543 -1.428  0.310  1.485  5.285 

Coefficients:
                       Estimate Std. Error t value Pr(>|t|)    
(Intercept)            8.451716   5.652625   1.495    0.139    
duration               0.039455   0.004159   9.486 5.59e-15 ***
heart_rate             0.024679   0.021822   1.131    0.261    
age                   -0.006641   0.039216  -0.169    0.866    
ethnicityWest African  0.497364   1.009112   0.493    0.623    
ethnicityWhite         0.257835   0.805013   0.320    0.750    
weight                -0.014511   0.024418  -0.594    0.554    
sexmale               -0.847960   0.746998  -1.135    0.260    
energyyes              3.990363   0.876642   4.552 1.76e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.686 on 85 degrees of freedom
Multiple R-squared:  0.8831,    Adjusted R-squared:  0.8721 
F-statistic: 80.26 on 8 and 85 DF,  p-value: < 2.2e-16

Model selection

The process of selecting significant predictors for a model involves using criteria like R², Akaike Information Criteria (AIC), or Bayesian Information Criteria (BIC). There are various methods such as forward selection, backward selection, and stepwise selection, which are similar and yield comparable final models. Here, forward selection is used, which follows these steps:

Start with all predictors “not in the model.”
Iterate over these predictors, refitting the model with each variable.
Add the variable with the lowest p-value to the model if its p-value is less than α. Repeat until no more variables can be added.

# Run the forward selection algorithm usign a alpha of 0.05
model2 <- lm(lm(formula = max_vo2 ~., data = excercise))

fwdmodel2 <- ols_step_forward_p(model2, penter=0.05)
fwdmodel2


                             Stepwise Summary                              
-------------------------------------------------------------------------
Step    Variable        AIC        SBC       SBIC        R2       Adj. R2 
-------------------------------------------------------------------------
 0      Base Model    648.809    653.896    378.795    0.00000    0.00000 
 1      duration      488.802    496.432    220.812    0.82156    0.81962 
 2      energy        455.051    465.224    188.681    0.87801    0.87533 
 3      heart_rate    455.102    467.818    189.037    0.88051    0.87653 
 4      sex           455.598    470.857    189.879    0.88241    0.87712 
-------------------------------------------------------------------------

Final Model Output 
------------------

                         Model Summary                          
---------------------------------------------------------------
R                       0.939       RMSE                 2.562 
R-Squared               0.882       MSE                  6.930 
Adj. R-Squared          0.877       Coef. Var            7.388 
Pred R-Squared          0.868       AIC                455.598 
MAE                     1.949       SBC                470.857 
---------------------------------------------------------------
 RMSE: Root Mean Square Error 
 MSE: Mean Square Error 
 MAE: Mean Absolute Error 
 AIC: Akaike Information Criteria 
 SBC: Schwarz Bayesian Criteria 

                                ANOVA                                 
---------------------------------------------------------------------
                Sum of                                               
               Squares        DF    Mean Square       F         Sig. 
---------------------------------------------------------------------
Regression    4628.510         4       1157.128    166.966    0.0000 
Residual       616.798        89          6.930                      
Total         5245.308        93                                     
---------------------------------------------------------------------

                                  Parameter Estimates                                    
----------------------------------------------------------------------------------------
      model      Beta    Std. Error    Std. Beta      t        Sig      lower     upper 
----------------------------------------------------------------------------------------
(Intercept)     6.416         3.005                  2.135    0.036     0.445    12.388 
   duration     0.040         0.004        0.670    11.513    0.000     0.033     0.047 
  energyyes     4.154         0.714        0.276     5.815    0.000     2.734     5.573 
 heart_rate     0.026         0.020        0.065     1.314    0.192    -0.013     0.066 
    sexmale    -0.861         0.719       -0.055    -1.198    0.234    -2.289     0.567 
----------------------------------------------------------------------------------------

Variable selection balances making the model both realistic and simple. Let’s make a copy data set removing other variables

excercise <- excercise %>%dplyr:: select(max_vo2,duration,energy,sex)

Let’s iterate the model with only duration

formula2 =  max_vo2 ~ duration 
model4 <- lm(formula2, data = excercise)
summary(model4)


Call:
lm(formula = formula2, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-9.8341 -1.7243  0.0338  1.8036  7.7629 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.06473    1.56876   2.591   0.0111 *  
duration     0.05474    0.00266  20.581   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.19 on 92 degrees of freedom
Multiple R-squared:  0.8216,    Adjusted R-squared:  0.8196 
F-statistic: 423.6 on 1 and 92 DF,  p-value: < 2.2e-16

Add energy: there is a small change in R^2

formula3 =  max_vo2 ~ duration + energy     
model5 <- lm(formula3, data = excercise)
summary(model5)


Call:
lm(formula = formula3, data = excercise)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.2909 -1.4316  0.4028  1.7800  6.0130 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 8.109167   1.445456   5.610 2.16e-07 ***
duration    0.044390   0.002727  16.281  < 2e-16 ***
energyyes   4.413022   0.680037   6.489 4.43e-09 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.652 on 91 degrees of freedom
Multiple R-squared:  0.878, Adjusted R-squared:  0.8753 
F-statistic: 327.5 on 2 and 91 DF,  p-value: < 2.2e-16

Add sex: same for sex there is a small change from the previous model

formula4 =  max_vo2 ~ duration + energy + sex    
model6 <- lm(formula4, data = excercise)
#summary(model4)
# print estimated coefficient in a tidy model
broom::tidy(model6)

# A tibble: 4 × 5
  term        estimate std.error statistic  p.value
  <chr>          <dbl>     <dbl>     <dbl>    <dbl>
1 (Intercept)   9.55     1.84         5.20 1.26e- 6
2 duration      0.0431   0.00290     14.9  5.35e-26
3 energyyes     4.12     0.717        5.75 1.22e- 7
4 sexmale      -0.909    0.720       -1.26 2.10e- 1

Check assumptions:

par(mfrow = c(2, 2))
plot(model6)

we can also plot one by one to have a better understanding of each output.

Linearity : the relationship between the response (max_vo2) and the regression is linear - a straight line through the fitted values.

plot(model6, which = 1)

Normality: the errors are distributed normally.

plot(model6, which = 2)

Homogeneity of variance: the error term ε has constant variance

plot(model6, which = 3)

Results:

In model4, the F-statistic has a p-value of less than 2.2e-16, indicating strong statistical significance. This suggests that at least one predictor variable significantly affects the outcome. The r^2 result telling us the model has improved by 5% from model. To see the coefficient only result

summary(model6)$coefficient

               Estimate  Std. Error   t value     Pr(>|t|)
(Intercept)  9.54806984 1.837485216  5.196270 1.258410e-06
duration     0.04312345 0.002897214 14.884456 5.346798e-26
energyyes    4.11926717 0.716703611  5.747518 1.221488e-07
sexmale     -0.90907388 0.720484933 -1.261753 2.102980e-01

\[ max_vo2 = 6.416 + 0.04.duration(minutes) + 4.15.energy(yes) \]

We estimate that the Max_vo2 value will be 6.416 per minute per kilogram of body weight when both predictors (duration and energy) are zero.

Coefficient for duration (0.04): Max_vo2 increases by 0.04 units for each additional minute of exercise.

Coefficient for energy (4.15): Max_vo2 increases by 4.15 units if the energy condition is “yes.”

Model Diagnostics

We check the model assumptions using various diagnostic plots and tests.

Independence of Observations: Durbin-Watson test
Linearity: Scatter plot of residuals vs. fitted values
Homoscedasticity: Fitted values vs. residuals plot
Normality: Histogram and Q-Q plot of residuals

Train-Test Split and Model Performance

Historically the performance of the model was largely based on the goodness-of-fit test and assessment of residuals. For predictive modeling, perform cross-validation to ensure the model performs well with resampling data (loss function) - compare the predicted value to the actual value. The simplest approach involves:

Splitting the data into a training set and a test set.
Fitting the model to the training set.
Predicting responses in the test set.
Evaluating the prediction quality.

Is there any rules for the percentage of classifying the train and test data ? There are no specific rules for the percentage split between training and test data, but two common proportions are 80:20 and 70:30. The key is to balance maximizing the data for fitting the model while retaining enough data to assess its performance effectively.

There are different mechanism of splitting the data into train and test set. For demonstration, we will be covering the simple random sampling and the caret package.

set.seed(123) # for reproducibility

n_train <- sample(1:nrow(excercise), round(nrow(excercise)*0.8))

train_data <- excercise[n_train,]

test_data <- excercise[-n_train,]

Using the Caret package.:

# Split the data randomly into a training set and a test set. 
set.seed(100) 

n_train2 <- createDataPartition(excercise$max_vo2, p =0.8, list = FALSE) 


train_data2 <- excercise[n_train2, ] 

test_data2 <- excercise[-n_train2, ]

Mean Absolute Error (MAE), Mean Squared Error (MSE) along with the Root Mean Squared Error (RMSE) are the most common metrics to use.

Fit the model on the training data model

model4 <- lm(max_vo2 ~ ., data = train_data2)

predictions <- predict(model4, test_data2)

Measure performance by comparing the prediction with the data using multiple criterion. The model with the smaller the value of RMSE and MAE is optimal.

R_sq <- R2(predictions, test_data2$max_vo2)
RMSE <- RMSE(predictions, test_data2$max_vo2)
MSE <- MAE(predictions, test_data2$max_vo2)

print(c(R_sq, RMSE, MAE))

[[1]]
[1] 0.8712019

[[2]]
[1] 2.966111

[[3]]
function (pred, obs, na.rm = FALSE) 
mean(abs(pred - obs), na.rm = na.rm)
<bytecode: 0x000002468b9d68f8>
<environment: namespace:caret>

It means on average, our prediction for max_vo2 (per minute per kilogram of body weight) is varied by ~ 3 minutes/kg i.e., the prediction error is 8% .

Prediction error:

pred_error_rate <- RMSE/mean(test_data2$max_vo2)
pred_error_rate

[1] 0.08370499

Correlation and Linear Regression using R