In this chapter, you will learn how to carry out analyses involving the related concepts of sampling, power, and effect size.
Effect size is a measure of the strength of the relationship between two variables in a statistical population. It quantifies the magnitude of a difference between two groups or the strength of an association between two variables. In this section, we will learn how to calculate effect sizes for different types of data.
The size of an effect is important when planning a study and trying to determine the sample size required. If an effect is small, we need a larger sample size to detect it. If an effect is large, we can detect it with a smaller sample size.
The ability of a study to detect an effect, using its sample, is called statistical power. Power is the probability that a study will correctly reject a false null hypothesis. In other words, power is the probability that a study will find a true effect when there is one (avoiding a Type 2 error).
Tip
The power of a study is influenced by the sample size, the effect size, and the significance level. A larger sample size, a larger effect size, and a higher significance level all increase the power of a study.
However, since our significance level is usually set at 0.05 and the effect size is determined by the data, the sample size is the only factor that we can control to increase the power of a study.
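As a quick illustration, base R's power.t.test() can show this numerically. The sketch below holds an assumed effect fixed (a mean difference of 0.5 with sd = 1, values chosen purely for demonstration) and varies only the sample size:
# Power of a two-sample t-test at two sample sizes
# (delta = 0.5, sd = 1, and alpha = 0.05 are assumed demonstration values)
power.t.test(n = 20, delta = 0.5, sd = 1, sig.level = 0.05)$power
power.t.test(n = 80, delta = 0.5, sd = 1, sig.level = 0.05)$power
With the larger sample, the power to detect the same effect is substantially higher.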
4.1 Different measures of effect size
There are different ways to calculate effect size depending on the type of data and the statistical test used. Here are some common effect size measures:
Cohen’s d: This is a measure of the difference between two means in standard deviation units. It is commonly used with t-tests and for pairwise comparisons between groups.
Eta-squared (\(\eta^2\)): This is a measure of the proportion of variance in the dependent variable that is explained by the independent variable. It is commonly used in ANOVA tests.
Phi coefficient (\(\phi\)): This is a measure of the association between two binary variables. It is commonly used in chi-square tests.
Correlation coefficient (\(r\)): This is a measure of the strength and direction of the relationship between two continuous variables. It is commonly used in correlation tests.
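As a quick illustration of the last of these, the correlation coefficient can be computed directly in R with cor(); the toy vectors below are made up purely for demonstration:
# Correlation coefficient r between two continuous variables (toy data)
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
cor(x, y)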
In the following sections, we will calculate the effect size for different types of data using some of these measures.
4.2 Calculating Cohen’s d
Cohen’s d is a measure of the difference between two means in standard deviation units. It is calculated as the difference between the means divided by the pooled standard deviation. The formula for Cohen’s d is:
\[ d = \frac{{\bar{X}_1 - \bar{X}_2}}{{s_p}} \]
where:
\(\bar{X}_1\) and \(\bar{X}_2\) are the means of the two groups.
\(s_p\) is the pooled standard deviation, calculated as:
\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} \]
where:
\(n_1\) and \(n_2\) are the sample sizes of the two groups.
\(s_1\) and \(s_2\) are the standard deviations of the two groups.
Let’s calculate Cohen’s d for a hypothetical dataset with two groups. The dataset contains the following information:
Group 1: Mean = 10, Standard deviation = 2, Sample size = 30
Group 2: Mean = 12, Standard deviation = 3, Sample size = 30
To calculate Cohen’s d, we first calculate the pooled standard deviation (\(s_p\)) using the formula above, then divide the difference between the means by \(s_p\).
Let’s calculate Cohen’s d for this dataset using R:
# Calculate Cohen's d

# Group 1
mean1 <- 10
sd1 <- 2
n1 <- 30

# Group 2
mean2 <- 12
sd2 <- 3
n2 <- 30

# Calculate pooled standard deviation
sp <- sqrt(((n1 - 1) * sd1^2 + (n2 - 1) * sd2^2) / (n1 + n2 - 2))

# Calculate Cohen's d
d <- (mean1 - mean2) / sp
d
[1] -0.7844645
The calculated value of Cohen’s d is -0.7844645. This negative value indicates that the mean of Group 1 is smaller than the mean of Group 2 by approximately 0.78 standard deviations.
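If you have the raw data rather than summary statistics, packages such as effsize can compute Cohen’s d directly. A minimal sketch (the two simulated vectors below are stand-ins, not the dataset above):
# Cohen's d from raw data using the effsize package
# (simulated vectors used as stand-ins for real group data)
library(effsize)
g1 <- rnorm(30, mean = 10, sd = 2)
g2 <- rnorm(30, mean = 12, sd = 3)
cohen.d(g1, g2)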
4.3 Calculating Eta-squared and other effect size measures
It is possible to calculate other effect size measures, such as Eta-squared and the Phi coefficient. However, these measures are most commonly calculated from the output of statistical tests such as ANOVA and chi-square tests. To obtain these measures for the purpose of sample size calculation, you would usually look at previous studies or meta-analyses to determine the expected effect size. For clinical research, you may also use the minimal clinically important difference (MCID) as a guide to determine the effect size.
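That said, if you do have raw data from a pilot or a previous study, Eta-squared is straightforward to compute from ANOVA output as the ratio of the effect sum of squares to the total sum of squares. A minimal sketch using R’s built-in iris data (chosen purely for illustration):
# Eta-squared from a one-way ANOVA: SS_effect / SS_total
fit <- aov(Sepal.Length ~ Species, data = iris)
ss <- summary(fit)[[1]][["Sum Sq"]]  # sums of squares: effect, then residual
eta_sq <- ss[1] / sum(ss)
eta_sq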
4.4 Power analysis
Power analysis is a method used to determine the sample size required to detect an effect of a given size with a certain level of confidence. It is important to conduct a power analysis before conducting a study to ensure that the sample size is adequate to detect the effect of interest.
To conduct a power analysis, you need to specify the following parameters:
The effect size: The size of the effect you want to detect. This is usually determined based on previous studies, meta-analyses or MCID.
The significance level (\(\alpha\)): The probability of rejecting the null hypothesis when it is true (Type 1 error rate). This is commonly set at 0.05.
The power (\(1 - \beta\)): The probability of correctly rejecting the null hypothesis when it is false (1 - Type 2 error rate). This is commonly set at 0.80 or 0.90.
The number of groups or conditions: The number of groups or conditions in the study.
The sample size required to achieve a desired power level can be calculated using power analysis functions in R. There are many R packages for power analysis; pwr is one that provides functions to calculate the required sample size for different types of statistical tests.
The functions for some basic research designs are:
pwr.t.test(): For t-tests
pwr.anova.test(): For ANOVA tests
pwr.chisq.test(): For chi-square tests
pwr.f2.test(): For regression models
Let’s calculate the sample size required to achieve a power of 0.80 for a t-test using the example data from earlier. We will use the pwr.t.test() function from the pwr package with the following parameters:
Effect size (Cohen’s d) = -0.7844645
Significance level (\(\alpha\)) = 0.05
# Load the pwr package
library(pwr)

# Calculate the sample size required for a t-test
# using the d value calculated earlier
pwr.t.test(d = d, sig.level = 0.05, power = 0.80)
     Two-sample t test power calculation 

              n = 26.50429
              d = 0.7844645
      sig.level = 0.05
          power = 0.8
    alternative = two.sided

NOTE: n is number in *each* group
The output of the pwr.t.test() function provides the sample size required to achieve a power of 0.80 for a t-test with the specified effect size and significance level. The output includes the following information:
n = The sample size required for each group to achieve a power of 0.80.
d = The effect size (Cohen’s d) used in the power analysis.
sig.level = The significance level used in the power analysis.
power = The power level achieved with the specified sample size.
The sample size required to achieve a power of 0.80 for a t-test with the specified effect size and significance level is 26.5 for each group. Since the sample size must be a whole number, we would need to round up to the nearest whole number. Therefore, the sample size required for each group is 27.
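The other pwr functions follow the same pattern. For example, here is a sketch of the equivalent calculation for a one-way ANOVA with three groups, assuming a medium effect size of f = 0.25 (an assumed value, not derived from our data):
# Sample size per group for a one-way ANOVA with 3 groups
# (f = 0.25 is an assumed "medium" effect size)
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.80)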
4.5 Power analysis by simulation
For more complex designs, different approaches to power analysis might be necessary, such as simulation. This approach is based on simulating the type of data you expect to see in your study.
That is to say, if you are planning a study based on a difference between two means, you would simulate the data based on the expected means and standard deviations. If you are planning a regression study, you would instead simulate data based on the expected regression coefficients. By doing this, you can create many simulated datasets that reflect the design of your planned study.
You would then analyze the simulated data using the statistical test you plan to use in your study and record whether the null hypothesis is rejected. By repeating this process many times (for example, creating 1000 simulated data sets), you can estimate the power of your study based on the proportion of simulations where the null hypothesis is rejected. This represents the power of your study to detect the effect of interest.
4.5.1 Step 1: Simulate the data
library(dplyr)
# simulate some data:
# 2 groups, treatment and control
# wellbeing outcome measure with different means
n <- 20   # number of participants per group
m1 <- 20  # mean for group 1 (control)
m2 <- 30  # mean for group 2 (treatment)

# create outcome data for each group
outcome1 <- rnorm(n, m1, sd = 1)
outcome2 <- rnorm(n, m2, sd = 1)

# combine both of these groups into a single vector
outcome <- c(outcome1, outcome2)

# create grouping variable
group1 <- rep("control", n)
group2 <- rep("treatment", n)
group <- c(group1, group2)

# combine into data frame
data <- data.frame(group, outcome)
The above code simulates data for a study with two groups (treatment and control) and a continuous outcome variable (wellbeing). The treatment group has a higher mean wellbeing score than the control group. The difference between the means is the important element here, as it represents the effect size we want to detect. When planning a study, you would set these values based on previous research or clinical relevance.
Tip
Why do we keep the standard deviations the same for each group in the simulation?
In the simulation, we keep the standard deviations the same for each group to simplify the analysis and focus on the effect of the difference in means. By keeping the standard deviations constant, we can isolate the impact of the mean difference on the statistical power of the test. However, there might be situations where the standard deviations differ between groups, and in such cases, the simulation can be adjusted accordingly to reflect the expected variability in the data.
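For example, that adjustment might look like the sketch below, reusing n, m1, and m2 from the simulation above (the sd values here are assumptions):
# Sketch: simulating groups with unequal spread (sd values are assumed)
outcome1 <- rnorm(n, m1, sd = 1)    # control group
outcome2 <- rnorm(n, m2, sd = 2.5)  # treatment group with greater variability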
Tip
Do the exact mean values matter?
Yes and no: the exact values of the means matter much less than the difference between them (the effect size). In power analysis, we are primarily concerned with the magnitude of the effect we want to detect, rather than the specific values of the means. Therefore, when simulating data for power analysis, focus on the difference between the means rather than on their absolute values.
Tip
Can we simulate a specific effect size directly?
Yes, you can simulate a specific effect size directly by calculating the means based on the desired effect size and standard deviation. For example, if you want to simulate a Cohen’s d of 0.2 with a standard deviation of 1, you can set the means of the two groups accordingly. In such a case, the mean of group 1 could be set to 0, and the mean of group 2 could be set to 0.2 (since Cohen’s d = (mean2 - mean1) / sd). This way, you can directly control the effect size in your simulation.
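A minimal sketch of that idea (the group size of 100 is an arbitrary choice):
# Simulate a specific Cohen's d directly: with sd = 1 in both groups,
# the mean difference equals d
d_target <- 0.2
g1 <- rnorm(100, mean = 0, sd = 1)
g2 <- rnorm(100, mean = d_target, sd = 1)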
4.5.2 Step 2: Analyze the data
# run a t-test
test <- t.test(data$outcome ~ data$group)

# get the values from the t-test
pval <- test$p.value
is_sig <- test$p.value < 0.05
data.frame(pval, is_sig)
pval is_sig
1 1.774442e-26 TRUE
The above code runs a t-test on the simulated data and extracts the p-value and whether the result is statistically significant (p < 0.05). This is the analysis you would perform on your actual study data. By creating a column indicating whether the result is significant, we can later use this information to estimate the power of the study.
4.5.3 Step 3: Repeat the simulation
In order to repeat the simulation many times, we can wrap the above code into a function and then use a loop or an apply function to run the simulation multiple times. The complete function code would look like this:
sim_data <- function(n, m1, m2) {
  # create outcome data for each group
  outcome1 <- rnorm(n, m1, sd = 1)
  outcome2 <- rnorm(n, m2, sd = 1)

  # combine both of these groups into a single vector
  outcome <- c(outcome1, outcome2)

  # create grouping variable
  group1 <- rep("control", n)
  group2 <- rep("treatment", n)
  group <- c(group1, group2)

  # combine into data frame
  data <- data.frame(group, outcome)

  # run a t-test
  test <- t.test(data$outcome ~ data$group)

  # get the values from the t-test
  pval <- test$p.value
  is_sig <- test$p.value < 0.05
  data.frame(pval, is_sig)
}
The code above is the same as before, but wrapped in a function called sim_data(). The function takes three arguments: n (the sample size per group), m1 (the mean for group 1), and m2 (the mean for group 2). The function returns a data frame with the p-value and whether the result is significant. By using this function, we can easily repeat the simulation multiple times.
Tip
Why use a function for the simulation?
Using a function for the simulation allows us to encapsulate the simulation logic in a reusable way. This makes it easier to run the simulation multiple times with different parameters (e.g., sample size, means) without having to rewrite the code each time. It also improves code organization and readability, making it clear what the simulation does and what inputs it requires.
Tip
What are the inputs to the function?
These are the values we might want to vary when planning a study. The sample size per group (n) is important because it directly affects the power of the study. The means for each group (m1 and m2) determine the effect size we want to detect. By varying these inputs, we can modify the sample size and effect size in our simulations to see how they impact the power of the study.
4.5.4 Step 4: Estimate power
results_data <- list()

# run the simulation 1000 times
for (i in 1:1000) {
  simdata <- sim_data(n, m1, m2)
  results_data[[i]] <- simdata
}

# bind the rows
results_data <- bind_rows(results_data)

# calculate power:
# what proportion of simulations were significant?
mean(results_data$is_sig)
[1] 1
There are a few things happening in the above code. First, we create an empty list called results_data to store the results of each simulation.
Then, we use a for loop to run the sim_data() function 1000 times, storing the results in the results_data list. After the loop, we combine all the results into a single data frame using bind_rows().
Finally, we calculate the power by taking the mean of the is_sig column, which gives us the proportion of simulations where the result was statistically significant.
Normally, we would want the power to be at least 0.80, meaning that we would correctly reject the null hypothesis in at least 80% of the simulations when there is a true effect. If the calculated power is lower than this threshold, we may need to increase the sample size to achieve the desired power level.
Tip
Why do we use a loop to run the simulation multiple times?
This is a core part of the simulation approach to power analysis. By running the simulation multiple times, we can estimate the variability in the results and get a more accurate estimate of the power of the study. Each simulation represents a possible outcome of the study, and by aggregating the results, we can determine how often we would correctly reject the null hypothesis when there is a true effect.
Tip
What does the final power estimate represent?
The final power estimate represents the probability of correctly rejecting the null hypothesis when there is a true effect in the population. It is calculated as the proportion of simulations where the result was statistically significant (p < 0.05). A higher power estimate indicates a greater likelihood of detecting a true effect, while a lower power estimate suggests that the study may not have enough sensitivity to detect the effect of interest. Our usual goal is to achieve a power of at least 0.80. This would mean that when we run our real study, we have an 80% chance of detecting an effect of the same size, if it truly exists.
Tip
Can I increase the required power level?
Yes, you can increase the required power level to 0.90 or even higher if you want to be more confident in detecting a true effect. To do this, you would just need to increase the n value (sample size per group) in the simulation function until the power estimate reaches the desired level.
However, keep in mind that increasing the power level will also increase the required sample size, which may have practical implications for the feasibility and cost of the study. It is important to balance the desired power level with the available resources and constraints of the study. In most cases, a power level of 0.80 is considered sufficient.
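Rather than adjusting n by hand, one option is to loop over several candidate sample sizes and estimate the power at each, reusing sim_data() from above. In the sketch below, the candidate n values, the 500 repetitions, and the smaller mean difference (chosen so the power curve is informative) are all assumptions made to keep the example quick:
# Estimate power across a range of candidate sample sizes
# (candidate n values and mean difference are assumed for illustration)
n_values <- c(5, 10, 15, 20)
power_by_n <- sapply(n_values, function(n_try) {
  sims <- replicate(500, sim_data(n_try, m1 = 20, m2 = 21)$is_sig)
  mean(sims)
})
data.frame(n = n_values, power = power_by_n)
The resulting table shows how the estimated power changes with sample size, and you can read off the smallest n that reaches your target power.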