Topic 8 Simple Regression
8.1 What is regression?
- Testing to see if we can make predictions based on data that are correlated
We found a strong correlation between treatment duration and aggression levels. Can we use these data to predict the aggression levels of other clients, based on their treatment duration?
- When we carry out regression, we get information about:
- How much variance in the outcome is explained by the predictor
- How confident we can be about these results generalising (i.e. significance)
- How much error we can expect from any predictions that we make (i.e. the standard error of the estimate)
- The figures we need to calculate a predicted outcome value (i.e. coefficient values)
8.2 How is regression calculated?
- When we run a regression analysis, a calculation is done to select the “line of best fit”
- This is a “prediction line” that minimises the overall amount of error
- Error = difference between the data points and the line
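As a sketch of what "minimising error" means, we can compute the sum of squared differences between some data points and a candidate line (the data here are made up purely for illustration):

```r
# Toy data (made up for illustration)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)

# Sum of squared errors for a candidate line: y = slope * x + intercept
sse <- function(slope, intercept) {
  predicted <- slope * x + intercept
  sum((y - predicted)^2)
}

sse(2, 0)   # a line close to the true pattern: small total error
sse(1, 1)   # a worse line: much larger total error
```

The "line of best fit" that regression selects is simply the slope and intercept pair with the smallest possible sum of squared errors.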
8.3 The regression equation
Once the line of best fit is calculated, predictions are based on this line
To make predictions we need the intercept and slope of the line
- Intercept or constant = where the line crosses the y axis
- Slope or beta = the steepness of the line (how much the outcome changes for each one-unit change in the predictor)
Predictions are made using the calculation for a line: Y = bX + c
You can think of the equation like this:
predicted outcome value = beta coefficient * value of predictor + constant
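Plugging in numbers makes this concrete. Using the coefficient values reported later in this topic (intercept 12.33, beta -0.6933), a prediction for a hypothetical client with a treatment duration of 5 could be sketched as:

```r
# Coefficients taken from the model output shown later in this topic
intercept <- 12.33
beta <- -0.6933

# Predicted aggression level for a hypothetical treatment duration of 5
treatment_duration <- 5
predicted_aggression <- beta * treatment_duration + intercept
predicted_aggression
```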
8.4 Running regression in R
- Step 1: Run regression
- Step 2: Check assumptions
- Data
- Distribution
- Linearity
- Homogeneity of variance
- Uncorrelated predictors
- Independence of residuals
- No influential cases / outliers
- Step 3: Check R^2 value
- Step 4: Check model significance
- Step 5: Check coefficient values
8.5 Run regression
- We use the lm() command to run regression while saving the results
- We then use the summary() function to check the results
model1 <- lm(formula = aggression_level ~ treatment_duration, data = regression_data)
summary(model1)
##
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4251 -1.1493 -0.0593  0.8814  3.4542
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
## treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
8.6 What are residuals?
- In regression, the assumptions apply to the residuals, not the data themselves
- Residual just means the difference between the data point and the regression line
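This can be confirmed directly in R: the residuals stored in a fitted model are exactly the observed values minus the fitted (predicted) values. A small self-contained sketch with made-up data:

```r
# Made-up data for illustration
x <- c(1, 2, 3, 4, 5, 6)
y <- c(10, 9, 7.5, 6, 5.5, 4)

fit <- lm(y ~ x)

# residual = observed value - value predicted by the line
manual_residuals <- y - fitted(fit)
all.equal(unname(residuals(fit)), unname(manual_residuals))
```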
8.7 Check assumptions: distribution
- Using the plot() command on our regression model will give us some useful diagnostic plots
- The second plot that it outputs is a normal Q-Q plot, which shows whether the residuals are normally distributed
plot(model1, which=2)
- We could also use a histogram to check the distribution
- Notice how we can use the $ sign to get the residuals from the model
hist(model1$residuals)
8.8 Check assumptions: linearity
- Using the plot() command on our regression model will give us some useful diagnostic plots
- The first plot that it outputs shows the residuals vs the fitted values
- Here, we want to see them spread out, with the line being horizontal and straight
plot(model1, which=1)
- There is a slight amount of curvilinearity here but nothing to be worried about
8.9 Check assumptions: Homogeneity of Variance #1
- We can use the same plot to check homogeneity of variance
- We want the variance to be constant across the data set. We do not want the variance to change at different points in the data
plot(model1, which=1)
- A violation of Homogeneity of Variance would usually look like a funnel, with the data narrowing
8.10 Check assumptions: Influential cases
- We need to check that there are no extreme outliers - they could throw off our predictions
- We are looking for participants that have high residuals + high leverage
- Some guidance suggests anything higher than 1 is an influential case
- Others suggest 4/n is the cut off point (4 divided by number of participants)
plot(model1, which=4)
- We are looking for participants that have high residuals + high leverage
- No cases over 1
- Many are over 0.04 (4/n = 0.04)
plot(model1, which=5)
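The Cook's distance values behind these plots can also be extracted and compared against the 4/n cut-off directly. A sketch with made-up data (with this topic's data set you would pass model1 instead of the toy fit):

```r
# Made-up data for illustration
set.seed(1)
x <- 1:20
y <- 2 * x + rnorm(20)
fit <- lm(y ~ x)

# Cook's distance for each case, and the 4/n rule of thumb
d <- cooks.distance(fit)
cutoff <- 4 / length(x)    # 4 divided by number of participants
which(d > cutoff)          # cases flagged as potentially influential
```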
8.11 Check the r squared value
- r^2 = the amount of variance in the outcome that is explained by the predictor(s)
- The closer this value is to 1, the more useful our regression model is for predicting the outcome
modelSummary <- summary(model1)
modelSummary
##
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4251 -1.1493 -0.0593  0.8814  3.4542
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
## treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
- The r^2 of 0.482052 means that 48% of the variance in aggression level is explained by treatment duration
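The R-squared value can also be pulled out of the summary object directly rather than read off the printout, and for a simple regression (one predictor) it equals the squared Pearson correlation between predictor and outcome. A self-contained sketch with toy data:

```r
# Toy data for illustration
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(12, 11, 10.5, 9, 8.2, 7.9, 6.5, 6)

fit <- lm(y ~ x)
r2 <- summary(fit)$r.squared

# With one predictor, R-squared is the squared correlation
all.equal(r2, cor(x, y)^2)
```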
8.12 Check model significance
- The model significance is displayed at the very end of the output
- p-value: 1.146e-15
- As p < 0.05, the model is significant
modelSummary
##
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4251 -1.1493 -0.0593  0.8814  3.4542
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
## treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
8.13 Check coefficient values
- The coefficient values are displayed in the coefficients table
- If we have more than one predictor, they are all listed here
modelSummary$coefficients
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        12.3300211 0.75087601 16.420848 6.840516e-30
## treatment_duration -0.6933201 0.07259671 -9.550297 1.145898e-15
- The beta coefficient for treatment duration is in the Estimate column
- For every unit increase in treatment duration, aggression level decreases by 0.69
8.14 The regression equation
- The regression equation is:
Outcome = predictor value * beta coefficient + constant
- For this model, that is:
Aggression level = treatment duration * -0.69 + 12.33
modelSummary$coefficients
##                      Estimate Std. Error   t value     Pr(>|t|)
## (Intercept)        12.3300211 0.75087601 16.420848 6.840516e-30
## treatment_duration -0.6933201 0.07259671 -9.550297 1.145898e-15
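The same prediction can be computed in R from the coefficients table. A sketch using the coefficient values from the output above (in practice you would normally use predict() on the fitted model, e.g. predict(model1, newdata = data.frame(treatment_duration = 10)), rather than typing numbers by hand):

```r
# Coefficients copied from the output above
b <- c("(Intercept)" = 12.3300211, "treatment_duration" = -0.6933201)

# Predicted aggression level for a treatment duration of 10
new_duration <- 10
prediction <- b["treatment_duration"] * new_duration + b["(Intercept)"]
unname(prediction)
```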
8.15 Accounting for error in predictions
- We also know that the accuracy of predictions will be within a certain margin of error
- This is known as standard error of the estimate or residual standard error
modelSummary
##
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.4251 -1.1493 -0.0593  0.8814  3.4542
##
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
## treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
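Here the residual standard error is 1.551, so predictions from this model are typically off by about 1.55 points of aggression level. R can turn this into formal bounds via predict() with interval = "prediction". A self-contained sketch with toy data (with this topic's data set you would pass model1 and a new treatment_duration value instead):

```r
# Toy data for illustration
set.seed(42)
x <- 1:30
y <- 12 - 0.7 * x + rnorm(30, sd = 1.5)
fit <- lm(y ~ x)

# Residual standard error: the typical size of a prediction error
sigma(fit)

# A 95% prediction interval for a new observation at x = 15
predict(fit, newdata = data.frame(x = 15), interval = "prediction")
```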