5  Correlation

5.1 What is Correlation?

  • The relationship between 2 variables
  • Question: Is treatment duration related to aggression levels?

5.2 How is correlation calculated?

  • Think of this as covariance divided by individual variance
  • If the changes are consistent with both variables, the final value will be higher

5.3 Running correlation in R

  • Step 1: Check assumptions
    • Data,distribution,linearity
  • Step 2: Run correlation
  • Step 3: Check R value
  • Step 4: Check significance

5.3.1 Check assumptions: data

  • Parametric tests require interval or ratio data
  • If the data are ordinal then a non-parametric correlation is used

What type of data are treatment duration and aggression level?

5.3.2 Check assumptions: distribution

  • Parametric tests require normally distributed data

5.3.3 Check assumptions: distribution #2

  • Parametric tests require normally distributed data
shapiro.test(regression_data$treatment_duration)

    Shapiro-Wilk normality test

data:  regression_data$treatment_duration
W = 0.94971, p-value = 0.0007939
shapiro.test(regression_data$aggression_level)

    Shapiro-Wilk normality test

data:  regression_data$aggression_level
W = 0.9928, p-value = 0.8756
  • The normality assumption is less of an issue when sample size is > 30

5.3.4 Checking assumptions: linearity

regression_data %>% ggplot(aes(x=treatment_duration,y=aggression_level)) +
  geom_point()

  • Here we are looking to see if the relationship is linear

5.3.5 Run correlation

  • R can run correlations using the cor.test() command
cor.test(regression_data$treatment_duration,regression_data$aggression_level)

    Pearson's product-moment correlation

data:  regression_data$treatment_duration and regression_data$aggression_level
t = -9.5503, df = 98, p-value = 1.146e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7838251 -0.5765006
sample estimates:
       cor 
-0.6942996 

5.3.6 Check r Value (correlation value)

  • The r value tells us the strength and direction of the relationship
  • In the output it is labelled as “cor” (short for correlation)
cor.test(regression_data$treatment_duration,regression_data$aggression_level)

    Pearson's product-moment correlation

data:  regression_data$treatment_duration and regression_data$aggression_level
t = -9.5503, df = 98, p-value = 1.146e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7838251 -0.5765006
sample estimates:
       cor 
-0.6942996 

5.3.7 Check the significance of the correlation

  • We can see that the significance by looking at the p value
    • The significance is 1.146^-15
    • This means: 0.0000000000000001146
  • Therefore p value < 0.05
cor.test(regression_data$treatment_duration,regression_data$aggression_level)

    Pearson's product-moment correlation

data:  regression_data$treatment_duration and regression_data$aggression_level
t = -9.5503, df = 98, p-value = 1.146e-15
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.7838251 -0.5765006
sample estimates:
       cor 
-0.6942996 

6 Simple Regression

6.1 What is regression?

  • Testing to see if we can make predictions based on data that are correlated

We found a strong correlation between treatment duration and agression levels. Can we use this data to predict aggression levels of other clients, based on their treatment duration?

  • When we carry out regression, we get a information about:
    • How much variance in the outcome is explained by the predictor
    • How confident we can be about these results generalising (i.e. significance)
    • How much error we can expect from anu predictions that we make (i.e. standard error of the estimate)
    • The figures we need to calculate a predicted outcome value (i.e. coefficient values)

6.2 How is regression calculated?

  • When we run a regression analysis, a calculation is done to select the “line of best fit”
  • This is a “prediction line” that minimises the overall amount of error
    • Error = difference between the data points and the line

6.3 The regression equation

  • Once the line of best fit is calculated, predictions are based on this line

  • To make predictions we need the intercept and slope of the line

    • Intercept or constant= where the line crosses the y axis
    • Slope or beta = the angle of the line
  • Predictions are made using the calculation for a line: Y = bX + c

  • You can think of the equation like this:

predicted outcome value = beta coefficient * value of predictor + constant

6.4 Running regression in R

  • Step 1: Run regression
  • Step 2: Check assumptions
    • Data
    • Distribution
    • Linearity
    • Homogeneity of variance
    • Uncorrelated predictors
    • Indpendence of residuals
    • No influental cases / outliers
  • Step 3: Check R^2 value
  • Step 4: Check model significance
  • Step 5: Check coefficient values

6.5 Run regression

  • We use the lm() command to run regression while saving the results
  • We then use the summary() function to check the results
model1 <- lm(formula= aggression_level ~ treatment_duration ,data=regression_data)
summary(model1)

Call:
lm(formula = aggression_level ~ treatment_duration, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4251 -1.1493 -0.0593  0.8814  3.4542 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.551 on 98 degrees of freedom
Multiple R-squared:  0.4821,    Adjusted R-squared:  0.4768 
F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15

6.6 What are residuals?

  • In regression, the assumptions apply to the residuals, not the data themselves
  • Residual just means the difference between the data point and the regression line

6.7 Check assumptions: distribution

  • Using the plot() command on our regression model will give us some useful diagnostic plots
  • The second plot that it outputs shows the normality
plot(model1, which=2)

  • We could also use a histogram to check the distribution
  • Notice how we can use the $ sign to get the residuals from the model
hist(model1$residuals)

6.8 Check assumptions: linearity

  • Using the plot() command on our regression model will give us some useful diagnostic plots
  • The first plot that it outputs shows the residuals vs the fitted values
  • Here, we want to see them spread out, with the line being horizontal and straight
plot(model1, which=1)

  • There is a slight amount of curvilinearity here but nothing to be worried about

6.9 Check assumptions: Homogeneity of Variance #1

  • We can use the sample plot to check Homogeneity of Variance
  • We want the variance to be constant across the data set. We do not want the variance to change at different points in the data
plot(model1, which=1)

  • A violation of Homogeneity of Variance would usually look like a funnel, with the data narrowing

6.10 Check assumptions: Influential cases

  • We need to check that there are no extreme outliers - they could throw off our predictions
  • We are looking for participants that have high rediduals + high leverage
    • Some guidance suggests anything higher than 1 is an influential case
    • Others suggest 4/n is the cut off point (4 divided by number of participants)
plot(model1, which=4)

  • We are looking for participants that have high rediduals + high leverage
    • No cases over 1
    • Many are over 0.04 (4/n = 0.04)
plot(model1, which=5)

6.11 Check the r squared value

  • r^2 = the amount of variance in the outcome that is explained by the predictor(s)
  • The closer this value is to 1, the more useful our regression model is for predicting the outcome
modelSummary <- summary(model1)
modelSummary

Call:
lm(formula = aggression_level ~ treatment_duration, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4251 -1.1493 -0.0593  0.8814  3.4542 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.551 on 98 degrees of freedom
Multiple R-squared:  0.4821,    Adjusted R-squared:  0.4768 
F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15
  • The r^2 of 0.482052 means that 48% of the variance in aggression level is explained by treatment duration

6.12 Check model significance

  • The model significance is displayed at the very end of the output
    • p-value: 1.146e-15
    • As p < 0.05, the model is significant
modelSummary

Call:
lm(formula = aggression_level ~ treatment_duration, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4251 -1.1493 -0.0593  0.8814  3.4542 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.551 on 98 degrees of freedom
Multiple R-squared:  0.4821,    Adjusted R-squared:  0.4768 
F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15

6.13 Check coefficient values

  • The coefficient values are displayed in the coefficients table
  • If we have more than one predictor, they are all listed here
modelSummary$coefficients
                     Estimate Std. Error   t value     Pr(>|t|)
(Intercept)        12.3300211 0.75087601 16.420848 6.840516e-30
treatment_duration -0.6933201 0.07259671 -9.550297 1.145898e-15
  • The beta coefficient for treatment duration is in the Estimate column
  • For every unit increase in treatment duration, aggression level decreases by 0.69

6.14 The regression equation

  • The regression equation is:

Outcome = predictor value * beta coefficient + constant

  • For this model, that is:

Aggression level = treatment duration * -0.69 + 12.33

modelSummary$coefficients
                     Estimate Std. Error   t value     Pr(>|t|)
(Intercept)        12.3300211 0.75087601 16.420848 6.840516e-30
treatment_duration -0.6933201 0.07259671 -9.550297 1.145898e-15

6.15 Accounting for error in predictions

  • We also know that the accuracy of predictions will be within a certain margin of error
  • This is known as standard error of the estimate or residual standard error
modelSummary

Call:
lm(formula = aggression_level ~ treatment_duration, data = regression_data)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.4251 -1.1493 -0.0593  0.8814  3.4542 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)         12.3300     0.7509   16.42  < 2e-16 ***
treatment_duration  -0.6933     0.0726   -9.55 1.15e-15 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.551 on 98 degrees of freedom
Multiple R-squared:  0.4821,    Adjusted R-squared:  0.4768 
F-statistic: 91.21 on 1 and 98 DF,  p-value: 1.146e-15