Topic 8 Simple Regression

8.1 What is regression?

  • Testing to see if we can make predictions based on data that are correlated

We found a strong correlation between treatment duration and agression levels. Can we use this data to predict aggression levels of other clients, based on their treatment duration?

  • When we carry out regression, we get a information about:
    • How much variance in the outcome is explained by the predictor
    • How confident we can be about these results generalising (i.e. significance)
    • How much error we can expect from anu predictions that we make (i.e. standard error of the estimate)
    • The figures we need to calculate a predicted outcome value (i.e. coefficient values)

8.2 How is regression calculated?

  • When we run a regression analysis, a calculation is done to select the “line of best fit”
  • This is a “prediction line” that minimises the overall amount of error
    • Error = difference between the data points and the line

8.3 The regression equation

  • Once the line of best fit is calculated, predictions are based on this line

  • To make predictions we need the intercept and slope of the line

    • Intercept or constant= where the line crosses the y axis
    • Slope or beta = the angle of the line
  • Predictions are made using the calculation for a line: Y = bX + c

  • You can think of the equation like this:

predicted outcome value = beta coefficient * value of predictor + constant

8.4 Running regression in R

  • Step 1: Run regression
  • Step 2: Check assumptions
    • Data
    • Distribution
    • Linearity
    • Homogeneity of variance
    • Uncorrelated predictors
    • Indpendence of residuals
    • No influental cases / outliers
  • Step 3: Check R^2 value
  • Step 4: Check model significance
  • Step 5: Check coefficient values

8.5 Run regression

  • We use the lm() command to run regression while saving the results
  • We then use the summary() function to check the results
model1 <- lm(formula= aggression_level ~ treatment_duration ,data=regression_data)
summary(model1)
## 
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q 
## -3.4251 -1.1493 -0.0593  0.8814 
##     Max 
##  3.4542 
## 
## Coefficients:
##                    Estimate Std. Error
## (Intercept)         12.3300     0.7509
## treatment_duration  -0.6933     0.0726
##                    t value Pr(>|t|)
## (Intercept)          16.42  < 2e-16
## treatment_duration   -9.55 1.15e-15
##                       
## (Intercept)        ***
## treatment_duration ***
## ---
## Signif. codes:  
##   0 '***' 0.001 '**' 0.01 '*' 0.05
##   '.' 0.1 ' ' 1
## 
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768 
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15

8.6 What are residuals?

  • In regression, the assumptions apply to the residuals, not the data themselves
  • Residual just means the difference between the data point and the regression line

8.7 Check assumptions: distribution

  • Using the plot() command on our regression model will give us some useful diagnostic plots
  • The second plot that it outputs shows the normality
plot(model1, which=2)

  • We could also use a histogram to check the distribution
  • Notice how we can use the $ sign to get the residuals from the model
hist(model1$residuals)

8.8 Check assumptions: linearity

  • Using the plot() command on our regression model will give us some useful diagnostic plots
  • The first plot that it outputs shows the residuals vs the fitted values
  • Here, we want to see them spread out, with the line being horizontal and straight
plot(model1, which=1)

  • There is a slight amount of curvilinearity here but nothing to be worried about

8.9 Check assumptions: Homogeneity of Variance #1

  • We can use the sample plot to check Homogeneity of Variance
  • We want the variance to be constant across the data set. We do not want the variance to change at different points in the data
plot(model1, which=1)

  • A violation of Homogeneity of Variance would usually look like a funnel, with the data narrowing

8.10 Check assumptions: Influential cases

  • We need to check that there are no extreme outliers - they could throw off our predictions
  • We are looking for participants that have high rediduals + high leverage
    • Some guidance suggests anything higher than 1 is an influential case
    • Others suggest 4/n is the cut off point (4 divided by number of participants)
plot(model1, which=4)

  • We are looking for participants that have high rediduals + high leverage
    • No cases over 1
    • Many are over 0.04 (4/n = 0.04)
plot(model1, which=5)

8.11 Check the r squared value

  • r^2 = the amount of variance in the outcome that is explained by the predictor(s)
  • The closer this value is to 1, the more useful our regression model is for predicting the outcome
modelSummary <- summary(model1)
modelSummary
## 
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q 
## -3.4251 -1.1493 -0.0593  0.8814 
##     Max 
##  3.4542 
## 
## Coefficients:
##                    Estimate Std. Error
## (Intercept)         12.3300     0.7509
## treatment_duration  -0.6933     0.0726
##                    t value Pr(>|t|)
## (Intercept)          16.42  < 2e-16
## treatment_duration   -9.55 1.15e-15
##                       
## (Intercept)        ***
## treatment_duration ***
## ---
## Signif. codes:  
##   0 '***' 0.001 '**' 0.01 '*' 0.05
##   '.' 0.1 ' ' 1
## 
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768 
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
  • The r^2 of 0.482052 means that 48% of the variance in aggression level is explained by treatment duration

8.12 Check model significance

  • The model significance is displayed at the very end of the output
    • p-value: 1.146e-15
    • As p < 0.05, the model is significant
modelSummary
## 
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q 
## -3.4251 -1.1493 -0.0593  0.8814 
##     Max 
##  3.4542 
## 
## Coefficients:
##                    Estimate Std. Error
## (Intercept)         12.3300     0.7509
## treatment_duration  -0.6933     0.0726
##                    t value Pr(>|t|)
## (Intercept)          16.42  < 2e-16
## treatment_duration   -9.55 1.15e-15
##                       
## (Intercept)        ***
## treatment_duration ***
## ---
## Signif. codes:  
##   0 '***' 0.001 '**' 0.01 '*' 0.05
##   '.' 0.1 ' ' 1
## 
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768 
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15

8.13 Check coefficient values

  • The coefficient values are displayed in the coefficients table
  • If we have more than one predictor, they are all listed here
modelSummary$coefficients
##                      Estimate
## (Intercept)        12.3300211
## treatment_duration -0.6933201
##                    Std. Error
## (Intercept)        0.75087601
## treatment_duration 0.07259671
##                      t value
## (Intercept)        16.420848
## treatment_duration -9.550297
##                        Pr(>|t|)
## (Intercept)        6.840516e-30
## treatment_duration 1.145898e-15
  • The beta coefficient for treatment duration is in the Estimate column
  • For every unit increase in treatment duration, aggression level decreases by 0.69

8.14 The regression equation

  • The regression equation is:

Outcome = predictor value * beta coefficient + constant

  • For this model, that is:

Aggression level = treatment duration * -0.69 + 12.33

modelSummary$coefficients
##                      Estimate
## (Intercept)        12.3300211
## treatment_duration -0.6933201
##                    Std. Error
## (Intercept)        0.75087601
## treatment_duration 0.07259671
##                      t value
## (Intercept)        16.420848
## treatment_duration -9.550297
##                        Pr(>|t|)
## (Intercept)        6.840516e-30
## treatment_duration 1.145898e-15

8.15 Accounting for error in predictions

  • We also know that the accuracy of predictions will be within a certain margin of error
  • This is known as standard error of the estimate or residual standard error
modelSummary
## 
## Call:
## lm(formula = aggression_level ~ treatment_duration, data = regression_data)
## 
## Residuals:
##     Min      1Q  Median      3Q 
## -3.4251 -1.1493 -0.0593  0.8814 
##     Max 
##  3.4542 
## 
## Coefficients:
##                    Estimate Std. Error
## (Intercept)         12.3300     0.7509
## treatment_duration  -0.6933     0.0726
##                    t value Pr(>|t|)
## (Intercept)          16.42  < 2e-16
## treatment_duration   -9.55 1.15e-15
##                       
## (Intercept)        ***
## treatment_duration ***
## ---
## Signif. codes:  
##   0 '***' 0.001 '**' 0.01 '*' 0.05
##   '.' 0.1 ' ' 1
## 
## Residual standard error: 1.551 on 98 degrees of freedom
## Multiple R-squared:  0.4821, Adjusted R-squared:  0.4768 
## F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15