In this chapter, we will learn how to perform basic statistical tests in R. These are all tests that you should be familiar with already from your statistics courses. We will cover the following tests:
Independent t-test
Paired t-test
Wilcoxon signed-rank test
Mann-Whitney U test
Chi-squared test
Correlation
ANOVA (one-way and factorial)
These examples will not be exhaustive, but they should give you a good starting point for performing these tests in R. For theoretical background, you can refer to any standard statistics textbook.
At the end of this chapter, you will be able to:
Conduct several basic inferential tests with R
These examples will all use built-in datasets in R. Each example will include the necessary code to load the dataset and perform the test. However, you can also use your own datasets by loading them into R yourself and replacing the dataset name in the examples.
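As a rough sketch of what swapping in your own data might look like, the snippet below writes mtcars to a temporary CSV file (a stand-in for your own file, so that the example is self-contained), reads it back with read.csv(), and inspects it. The names csv_path and my_data are hypothetical placeholders, not part of the chapter's examples.

```r
# Stand-in for your own file: save a built-in dataset to a temporary CSV
data(mtcars)
csv_path <- tempfile(fileext = ".csv")
write.csv(mtcars, csv_path, row.names = FALSE)

# Load the file as you would load your own data
my_data <- read.csv(csv_path)

# Check the variable names and types before running any test
str(my_data)
head(my_data)
```

In the examples that follow, you would then use my_data (and your own variable names) wherever mtcars or sleep appears.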
5.1 Independent t-test
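The independent t-test is used to compare the means of two independent groups. In R, you can use the t.test() function to perform an independent t-test. By default, t.test() runs Welch's version of the test, which does not assume equal variances in the two groups. Here is an example: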
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# Independent t-test example: Is there a difference in fuel efficiency between automatic and manual cars?
# The variable am is a binary variable indicating the type of transmission (0 = automatic, 1 = manual)

# Perform the independent t-test
t_test_result <- t.test(mpg ~ am, data = mtcars)

# Print the result
t_test_result
Welch Two Sample t-test
data: mpg by am
t = -3.7671, df = 18.332, p-value = 0.001374
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-11.280194 -3.209684
sample estimates:
mean in group 0 mean in group 1
17.14737 24.39231
In this example, we are comparing the fuel efficiency (mpg) of automatic and manual cars in the mtcars dataset. The mpg variable is the dependent variable, and the am variable is the independent variable. The t.test() function is used to perform the independent t-test, and the result is stored in the t_test_result variable.
If we look at the output, we can see the following:
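The type of test (Welch Two Sample t-test) and the test result (t statistic, degrees of freedom, and p-value) are presented.
The alternative hypothesis is that the difference in means between the two groups is not equal to 0.
The 95% confidence interval for the difference in means is presented.
The two sample means are presented.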
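The values listed above can also be pulled out of the result object individually, which is handy when you want to report or reuse them. The following is a minimal sketch; it relies on the standard components of the object returned by t.test().

```r
data(mtcars)
t_test_result <- t.test(mpg ~ am, data = mtcars)

t_test_result$statistic   # t statistic
t_test_result$parameter   # degrees of freedom
t_test_result$p.value     # p-value
t_test_result$conf.int    # 95% confidence interval for the difference in means
t_test_result$estimate    # the two group means
```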
5.2 Paired t-test
The paired t-test is used to compare the means of two related groups. In R, you can use the t.test() function with the paired = TRUE argument to perform a paired t-test. Here is an example:
# This example uses the sleep dataset, which is a built-in dataset in R that contains data on the effect of two soporific drugs on sleep duration.
# Load the sleep dataset
data(sleep)

# Paired t-test example: Is there a difference in sleep duration between the two drugs?

# Perform the paired t-test
paired_t_test_result <- t.test(sleep$extra ~ sleep$group, paired = TRUE)

# Print the result
paired_t_test_result
Paired t-test
data: sleep$extra by sleep$group
t = -4.0621, df = 9, p-value = 0.002833
alternative hypothesis: true mean difference is not equal to 0
95 percent confidence interval:
-2.4598858 -0.7001142
sample estimates:
mean difference
-1.58
In this example, we are comparing the increase in hours of sleep (extra) under two different drugs in the sleep dataset (note: the same 10 people were measured under each drug, which is why the observations are paired). The extra variable is the dependent variable, and the group variable is the independent variable. The t.test() function is used to perform the paired t-test, and the result is stored in the paired_t_test_result variable.
If we look at the output, we can see the following:
The t-test result (t statistic, degrees of freedom, and p-value) is presented.
The alternative hypothesis is that the mean of the paired differences is not equal to 0.
The 95% confidence interval for the mean difference is presented.
The mean difference is presented.
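As far as we are aware, recent versions of R (4.4.0 and later) no longer accept paired = TRUE together with a formula in t.test(), so the call above may raise an error depending on your installation. The sketch below shows an alternative that passes the two sets of measurements as separate vectors. It assumes that the rows of sleep are in the same person order within each group, which is how the built-in dataset is arranged. The names drug1, drug2, paired_alt, and diff_test are our own.

```r
data(sleep)

# Split the measurements by drug; rows are ordered by person within each group
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Paired t-test on two vectors of matched observations
paired_alt <- t.test(drug1, drug2, paired = TRUE)
paired_alt

# Equivalent one-sample t-test on the within-person differences
diff_test <- t.test(drug1 - drug2, mu = 0)
diff_test
```

The second call illustrates what the paired test does: it is a one-sample t-test on the within-person differences.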
5.3 Wilcoxon signed-rank test
The Wilcoxon signed-rank test is a non-parametric test used to compare two related groups. In R, you can use the wilcox.test() function to perform a Wilcoxon signed-rank test. Here is an example:
# This example uses the sleep dataset, which is a built-in dataset in R that contains data on the effect of two soporific drugs on sleep duration.
# Load the sleep dataset
data(sleep)

# Wilcoxon signed-rank test example: Is there a difference in sleep duration between the two drugs?

# Perform the Wilcoxon signed-rank test
wilcoxon_test_result <- wilcox.test(sleep$extra ~ sleep$group, paired = TRUE)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with zeroes
# Print the result
wilcoxon_test_result
Wilcoxon signed rank test with continuity correction
data: sleep$extra by sleep$group
V = 0, p-value = 0.009091
alternative hypothesis: true location shift is not equal to 0
In this example, we are comparing the increase in hours of sleep (extra) under two different drugs in the sleep dataset, where the same 10 people were measured under each drug. The extra variable is the dependent variable, and the group variable is the independent variable. The wilcox.test() function is used to perform the Wilcoxon signed-rank test, and the result is stored in the wilcoxon_test_result variable.
If we look at the output, we can see the following:
The Wilcoxon signed-rank test result is presented (V is the sum of the ranks of the positive differences between the pairs of observations; here V = 0 because nobody had a larger value under drug 1 than under drug 2).
The alternative hypothesis (true location shift is not equal to 0) is presented. Roughly, this means that the paired differences tend to fall above or below 0 rather than being centred on it.
Note: if there are ties or zero differences in the data, the exact p-value is not calculated. Instead, an approximate p-value is presented. This is what the two warnings refer to.
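To make the V statistic less of a black box, here is a minimal sketch that recomputes it by hand for the sleep data. It is an illustration of the definition rather than a copy of R's internal code; the names drug1, drug2, d, and r are our own.

```r
data(sleep)
drug1 <- sleep$extra[sleep$group == 1]
drug2 <- sleep$extra[sleep$group == 2]

# Within-person differences, dropping exact zeroes as the test does
d <- drug1 - drug2
d <- d[d != 0]

# Rank the absolute differences (ties get the average rank)
r <- rank(abs(d))

# V is the sum of the ranks belonging to positive differences
V <- sum(r[d > 0])
V
```

Every non-zero difference in this dataset is negative, so no ranks are summed and V is 0, matching the output of wilcox.test().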
5.4 Mann-Whitney U test
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a non-parametric test used to compare the distributions of two independent groups. It is based on the ranks of the observations rather than on their means. In R, you can use the wilcox.test() function to perform a Mann-Whitney U test. Note: we will not include the paired = TRUE argument that we used in the previous example. Here is an example:
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# Mann-Whitney U test example: Is there a difference in fuel efficiency between automatic and manual cars?

# Perform the Mann-Whitney U test
mann_whitney_test_result <- wilcox.test(mpg ~ am, data = mtcars)
Warning in wilcox.test.default(x = DATA[[1L]], y = DATA[[2L]], ...): cannot
compute exact p-value with ties
# Print the result
mann_whitney_test_result
Wilcoxon rank sum test with continuity correction
data: mpg by am
W = 42, p-value = 0.001871
alternative hypothesis: true location shift is not equal to 0
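The output is interpreted in the same way as the Wilcoxon signed-rank test output, except that the test statistic W is based on the ranks of the observations in the first group (here, automatic cars) among all observations, rather than on the ranks of paired differences.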
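The W statistic reported by wilcox.test() can be reproduced by hand, which may help when comparing it with textbook definitions of U. The sketch below is an illustration under that reading of the statistic; the names r, n_auto, and rank_sum_auto are our own.

```r
data(mtcars)

# Rank all mpg values together (ties get the average rank)
r <- rank(mtcars$mpg)

# Sum the ranks of the first group (am == 0, automatic cars)
n_auto <- sum(mtcars$am == 0)
rank_sum_auto <- sum(r[mtcars$am == 0])

# Subtract the smallest possible rank sum for a group of that size
W <- rank_sum_auto - n_auto * (n_auto + 1) / 2
W
```

The result should match the W = 42 shown in the wilcox.test() output.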
5.5 Chi-squared test
The chi-squared test is used to test the association between two categorical variables. In R, you can use the chisq.test() function to perform a chi-squared test. Here is an example:
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# Chi-squared test example: Is there an association between the number of cylinders and the type of transmission?

# Create a contingency table
contingency_table <- table(mtcars$cyl, mtcars$am)

# Perform the chi-squared test
chi_squared_test_result <- chisq.test(contingency_table)
Warning in chisq.test(contingency_table): Chi-squared approximation may be
incorrect
# Print the result
chi_squared_test_result
In this example, we are testing the association between the number of cylinders (cyl) and the type of transmission (am) in the mtcars dataset. The chisq.test() function is used to perform the chi-squared test, and the result is stored in the chi_squared_test_result variable.
If we look at the output, we can see the following:
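The chi-squared test result (X-squared statistic, degrees of freedom, and p-value) is presented when the result object is printed.
The warning appears because some expected cell counts are small, in which case the chi-squared approximation can be inaccurate.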
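To see where the warning comes from, you can inspect the expected counts stored in the result object. The sketch below does that and then runs Fisher's exact test, which is one common alternative when expected counts are small; treat it as a suggestion rather than a required step.

```r
data(mtcars)
contingency_table <- table(mtcars$cyl, mtcars$am)

# suppressWarnings() hides the approximation warning while we inspect the result
chi_squared_test_result <- suppressWarnings(chisq.test(contingency_table))

# Observed and expected counts
chi_squared_test_result$observed
chi_squared_test_result$expected

# Are any expected counts below 5?
any(chi_squared_test_result$expected < 5)

# Fisher's exact test does not rely on the chi-squared approximation
fisher.test(contingency_table)
```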
5.6 Correlation
Correlation is used to test the relationship between two continuous variables. In R, you can use the cor.test() function to calculate the correlation coefficient (Pearson's r by default) along with a significance test. Here is an example:
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# Correlation example: Is there a relationship between fuel efficiency and horsepower?

# Calculate the correlation coefficient
correlation_result <- cor.test(mtcars$mpg, mtcars$hp)

# Print the result
correlation_result
Pearson's product-moment correlation
data: mtcars$mpg and mtcars$hp
t = -6.7424, df = 30, p-value = 1.788e-07
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
-0.8852686 -0.5860994
sample estimates:
cor
-0.7761684
In this example, we are testing the relationship between fuel efficiency (mpg) and horsepower (hp) in the mtcars dataset. The cor.test() function is used to calculate the correlation coefficient, and the result is stored in the correlation_result variable.
If we look at the output, we can see the following:
The correlation coefficient is presented in the last line of the output.
Whether the correlation differs significantly from 0 is tested using a t-test, which is also presented in the output (t statistic, degrees of freedom, and p-value).
The confidence interval for the correlation coefficient is presented.
The alternative hypothesis is that the correlation is not equal to 0.
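The t statistic in the output is a direct function of the correlation coefficient and the sample size: t = r * sqrt(df / (1 - r^2)), with df = n - 2. As a small sketch of that relationship, the code below recomputes t and the two-sided p-value from cor(); the variable names are our own.

```r
data(mtcars)

# Pearson correlation and degrees of freedom
r <- cor(mtcars$mpg, mtcars$hp)
n <- nrow(mtcars)
df <- n - 2

# t statistic and two-sided p-value for the null hypothesis of zero correlation
t_stat <- r * sqrt(df / (1 - r^2))
p_value <- 2 * pt(-abs(t_stat), df)

c(r = r, t = t_stat, df = df, p = p_value)
```

These values should agree with the t, df, and p-value lines printed by cor.test().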
5.7 One-way ANOVA
One-way ANOVA (Analysis of Variance) is used to test the differences between the means of three or more groups. In R, you can use the aov() function to perform an ANOVA. Here is an example:
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# One-way ANOVA example: Is there a difference in fuel efficiency between cars with different numbers of cylinders?

# cyl should be a factor
mtcars$cyl <- as.factor(mtcars$cyl)

# Perform the ANOVA
anova_result <- aov(mpg ~ cyl, data = mtcars)

# Print the result
summary(anova_result)
            Df Sum Sq Mean Sq F value   Pr(>F)
cyl          2  824.8   412.4    39.7 4.98e-09 ***
Residuals   29  301.3    10.4
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, we are testing the differences in fuel efficiency (mpg) between cars with different numbers of cylinders (cyl) in the mtcars dataset. The aov() function is used to perform the ANOVA, and the result is stored in the anova_result variable.
If we look at the output, we can see the following:
The ANOVA table is printed by the summary() function. It includes the F-statistic, the p-value, and significance stars for the test of cyl.
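If you need the F-statistic or p-value as numbers rather than printed text, they can be taken from the table that summary() returns. This is a minimal sketch; the names anova_table, f_value, and p_value are our own, and rows are selected by position because the row names of the table are padded with spaces.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
anova_result <- aov(mpg ~ cyl, data = mtcars)

# summary() returns a list; its first element is the ANOVA table
anova_table <- summary(anova_result)[[1]]

# Row 1 is the cyl term, row 2 is the residuals
f_value <- anova_table[["F value"]][1]
p_value <- anova_table[["Pr(>F)"]][1]

c(F = f_value, p = p_value)
```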
5.8 Factorial ANOVA
Factorial ANOVA is used to test the effects of two or more independent variables on a dependent variable. In R, you can use the aov() function with interaction terms to perform a factorial ANOVA. Here is an example:
# This example uses the mtcars dataset, which is a built-in dataset in R that contains data on various car models.
# Load the mtcars dataset
data(mtcars)

# Factorial ANOVA example: Is there an interaction effect between the number of cylinders and the type of transmission on fuel efficiency?

# cyl and am should be factors
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Perform the factorial ANOVA
factorial_anova_result <- aov(mpg ~ cyl * am, data = mtcars)

# Print the result
summary(factorial_anova_result)
            Df Sum Sq Mean Sq F value   Pr(>F)
cyl          2  824.8   412.4  44.852 3.73e-09 ***
am           1   36.8    36.8   3.999   0.0561 .
cyl:am       2   25.4    12.7   1.383   0.2686
Residuals   26  239.1     9.2
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
In this example, we are testing the interaction effect between the number of cylinders (cyl) and the type of transmission (am) on fuel efficiency (mpg) in the mtcars dataset. The aov() function is used to perform the factorial ANOVA, and the result is stored in the factorial_anova_result variable.
The * operator in the formula requests both main effects and their interaction: cyl * am is shorthand for cyl + am + cyl:am.
If we look at the output, we can see the following:
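The ANOVA table is printed by the summary() function. It includes the F-statistic, the p-value, and significance stars for each term tested in the ANOVA.
Each independent variable is tested separately (main effects), and the interaction effect is also tested. Note that aov() tests terms sequentially, in the order they appear in the formula, so with unbalanced data such as mtcars the main-effect results can change if the order of cyl and am is swapped.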
The interaction effect is denoted by the cyl:am term in the output.
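As a quick check on how the formula is read, the sketch below fits the model twice, once with the * shorthand and once with the main effects and the interaction written out, and confirms that the two fits agree. The names fit_star and fit_long are our own.

```r
data(mtcars)
mtcars$cyl <- as.factor(mtcars$cyl)
mtcars$am <- as.factor(mtcars$am)

# Shorthand: main effects plus interaction
fit_star <- aov(mpg ~ cyl * am, data = mtcars)

# The same model with every term written out
fit_long <- aov(mpg ~ cyl + am + cyl:am, data = mtcars)

# TRUE if both formulas give the same coefficients
all.equal(coef(fit_star), coef(fit_long))

summary(fit_long)
```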
5.9 Conclusion
In this chapter, we have covered several basic statistical tests that you can perform in R. These are included for reference and to help you get started with performing statistical tests in R. However, we will be focusing on modelling our data, using different types of regression models, rather than re-learning these tests.