DClin Research Methods 1
Teesside University
One of the common issues with clinical research is that studies are underpowered.
Today we are going to look at Power and Effect Size, and how these relate to sample size.
Statistical significance, statistical power and effect size are three interrelated concepts that are important to understand when designing a study.
Does anyone remember what statistical significance is?
The null hypothesis is the hypothesis that there is no difference between groups, or no relationship between variables.
For example, if we are comparing the mean score on a measure between two groups, the null hypothesis is that, in the population, the mean score on the measure is the same for both groups.
When we test hypotheses, we are trying to see if our data are compatible with the null hypothesis or not.
In short: statistical significance is a test of how compatible the data are with the null hypothesis. It does not tell us what our data prove, only how unlikely data this extreme would be if the null hypothesis were true.
Imagine we could see everyone in the population, and knew that there was not much difference between those who received treatment and those who did not.
What would the results say?
In psychological research, we usually use a p-value of 0.05 as the threshold for statistical significance.
This means that, even if the null hypothesis were true, if we were to repeat the study 100 times, we could still find an “extreme” difference about 5 times (5% of the time).
This 5% threshold is an arbitrary convention, not a choice made for any theoretical reason.
For example, when comparing two means, the standardised difference will fall within ±1.96 standard errors 95% of the time if the null hypothesis is true.
That leaves a 5% chance (2.5% in each tail) of obtaining a standardised difference beyond ±1.96, even when the null hypothesis is true.
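To make this concrete, here is a minimal simulation in Python (my own sketch, not from the lecture materials): both groups are drawn from the same population, yet the standardised difference still exceeds ±1.96 in roughly 5% of samples.

```python
# Simulate the null: both groups come from the SAME population, so any
# observed difference is pure sampling variation.
import numpy as np

rng = np.random.default_rng(42)
n, trials = 50, 10_000

count = 0
for _ in range(trials):
    a = rng.normal(0, 1, n)
    b = rng.normal(0, 1, n)
    # standard error of the difference between the two means
    se = np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
    z = (a.mean() - b.mean()) / se
    if abs(z) > 1.96:
        count += 1

print(f"'Significant' differences under a true null: {count / trials:.1%}")  # ~5%
```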
Small samples increase the likelihood that a significant result is a false positive (a result that does not reflect the population)
This is because the true pattern in the population is not perfectly reflected in small samples.
Within small samples, there will be random variation that can lead to differences between groups.
Even when the null hypothesis is true, we can still obtain a range of differences when we sample from the population.
Some of these differences will even result in p-values below 0.05.
It’s important to recognise that false positives (Type I errors) can happen: smaller samples are subject to more random variation, and a lack of control in study design can lead to spurious conclusions.
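As an illustration (a hedged sketch of my own, with made-up score distributions), the following simulation draws many small samples from a single population: the observed group differences vary widely, and p still falls below .05 in about 5% of samples even though the null is true.

```python
# Draw many SMALL samples (n = 10 per group) from one population and see
# how much the observed difference bounces around purely by chance.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, trials = 10, 5_000

diffs, false_positives = [], 0
for _ in range(trials):
    a = rng.normal(100, 15, n)  # same population sampled twice
    b = rng.normal(100, 15, n)
    diffs.append(a.mean() - b.mean())
    if stats.ttest_ind(a, b).pvalue < 0.05:
        false_positives += 1

print(f"Observed differences range from {min(diffs):.1f} to {max(diffs):.1f}")
print(f"p < .05 in {false_positives / trials:.1%} of samples")  # ~5%
```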
“It’s science’s dirtiest secret: The ‘scientific method’ of testing hypotheses by statistical analysis stands on a flimsy foundation.” (Siegfried, 2010)
In February 2014, George Cobb, Professor Emeritus of Mathematics and Statistics at Mount Holyoke College, posed these questions to an ASA discussion forum:
Q: Why do so many colleges and grad schools teach p = 0.05?
A: Because that’s still what the scientific community and journal editors use.
Q: Why do so many people still use p = 0.05?
A: Because that’s what they were taught in college or grad school.
“Researchers commonly use p-values to answer the question: How strongly does the evidence favor the alternative hypothesis relative to the null hypothesis? p-Values themselves do not directly answer this question and are often misinterpreted” (Benjamin & Berger, 2019)
p-values can put people into a binary mindset about results (yes/no) when the data collected are likely more complex
p-values are not stable from one study to the next, so be careful about interpreting them.
p-values are not a good measure of evidence for or against a hypothesis, only how likely the data are to have occurred if the null hypothesis were true.
There is no such thing as a “near-significant” result. A p-value of 0.048 is not more “important” than a p-value of 0.052.
[Figure: p-values and confidence intervals for 25 samples]
In a later class, Dr Alan Bowman will discuss clinically meaningful results
To report confidence intervals in APA style, we need to report the lower and upper bounds of the interval, and the confidence level.
For example:
The results of the t-test showed a significant difference between the two groups, t(18) = 2.43, p = .03, 95% CI [0.02, 0.45].
This tells us that the true mean difference between the two groups is likely to be between 0.02 and 0.45. We can then decide (depending on our research question) whether a difference in that range is meaningful to us.
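For illustration, here is how these values might be computed in Python (hypothetical data; the `confidence_interval()` method requires scipy 1.10 or later):

```python
# Independent-samples t-test with a 95% CI, formatted APA-style.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
group_a = rng.normal(0.55, 0.3, 10)  # hypothetical scores
group_b = rng.normal(0.30, 0.3, 10)

res = stats.ttest_ind(group_a, group_b)
ci = res.confidence_interval(confidence_level=0.95)
df = len(group_a) + len(group_b) - 2

print(f"t({df}) = {res.statistic:.2f}, p = {res.pvalue:.3f}, "
      f"95% CI [{ci.low:.2f}, {ci.high:.2f}]")
```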
p-values are not a good measure of the size of an effect or the importance of a result.
Confidence intervals are a better way to interpret results than p-values.
Confidence intervals can tell us the likely range of values for the effect, and give a better sense of how precise our estimate of the true value is.
There’s not much difference between the two groups. Does that mean there is no difference?
Effect size is a standardized measure of the magnitude of an effect
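One widely used example is Cohen’s d: the difference between two group means divided by the pooled standard deviation. A minimal sketch in Python (my own illustration, with hypothetical scores):

```python
# Cohen's d for two independent groups.
import numpy as np

def cohens_d(a, b):
    na, nb = len(a), len(b)
    # pooled standard deviation of the two groups
    pooled_sd = np.sqrt(((na - 1) * np.var(a, ddof=1) +
                         (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2))
    return (np.mean(a) - np.mean(b)) / pooled_sd

treatment = [24, 27, 31, 29, 26, 33, 28, 30]  # hypothetical scores
control   = [22, 25, 24, 27, 23, 26, 25, 24]
print(f"d = {cohens_d(treatment, control):.2f}")
```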
Often people talk in terms of small, medium and large effect sizes. This is not very helpful, as it is not clear what these terms mean.
It is better to understand the literature in your field, and to know what effect sizes are typically found in studies similar to yours.
However, we need to be careful (Albers & Lakens, 2018):
effect size estimates from small studies are inaccurate
publication bias inflates effect sizes.
“Follow-up” bias is the problem of people deciding whether or not to conduct a full study based on the effects found in pilot data
This can lead to seriously underpowered study designs, especially when the sample size of the pilot and/or the true effect size is small to medium.
Not only is this approach inaccurate, it is inefficient.
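A rough simulation (my own sketch, with assumed numbers) shows the problem: repeated pilots of the same true effect yield very different effect size estimates, and therefore very different “required” sample sizes.

```python
# Small pilots (n = 15 per group) of a true effect of d = 0.3 give noisy
# effect size estimates, so the planned sample size swings wildly.
import numpy as np
from statsmodels.stats.power import TTestIndPower

rng = np.random.default_rng(3)
true_d, pilot_n = 0.3, 15
analysis = TTestIndPower()

for _ in range(5):
    a = rng.normal(true_d, 1, pilot_n)
    b = rng.normal(0, 1, pilot_n)
    d_hat = (a.mean() - b.mean()) / np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    if d_hat <= 0:
        print(f"pilot d = {d_hat:.2f} -> study abandoned?")
        continue
    n = analysis.solve_power(effect_size=d_hat, power=0.8, alpha=0.05)
    print(f"pilot d = {d_hat:.2f} -> 'required' n per group = {np.ceil(n):.0f}")
```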
Instead, we should decide on our sample size in advance, using an a priori power analysis.
What does this mean?
We first need to decide what effect size we are interested in detecting. This can be called the smallest effect size of interest (SESOI).
We then need to decide how much power we want to have to detect this effect size.
A common choice is 80% power, but this is arbitrary. This means:
With the given effect size and the calculated sample size, we will be able to detect the effect (if it exists) 80% of the time.
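In practice this calculation is usually done with software. A sketch using statsmodels (the SESOI of d = 0.5 here is an assumption for illustration, not a recommendation):

```python
# A priori power analysis for an independent-samples t-test:
# how many participants per group to detect d = 0.5 with 80% power?
from statsmodels.stats.power import TTestIndPower

n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
print(f"Required sample size per group: {n:.0f}")  # ~64 per group
```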
Power is the ability to detect an effect if it exists
Before conducting a study, we need to know what effect size we are interested in detecting
We need to ensure that our study has sufficient power to detect this effect size
Very large samples will show very small effects as statistically significant (but these may not be important)
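A quick sketch of that last point (assumed numbers): with 50,000 participants per group, even a trivially small true effect of d = 0.05 produces a vanishingly small p-value.

```python
# Huge samples make tiny effects 'statistically significant'.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n = 50_000                   # per group
a = rng.normal(0.05, 1, n)   # true difference of only 0.05 SDs
b = rng.normal(0.00, 1, n)

res = stats.ttest_ind(a, b)
print(f"t = {res.statistic:.2f}, p = {res.pvalue:.2g}")  # p will be tiny
```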
