Understanding Data Dredging (p-hacking) in Multiple Comparison Tests

When performing multiple statistical tests on the same dataset, each test has a chance of producing a significant result. The more tests you perform, the greater the probability that at least one of these tests will yield a statistically significant result, even if there is no true effect. This phenomenon, often referred to as data dredging or p-hacking, is a common statistical malpractice associated with multiple comparison tests. Other types of data dredging include splitting data into subgroups until a significant result is found or selectively reporting variables that show significance.

For example, suppose you have a dataset of 30 individuals and want to find out if any specific factor(s) or variable(s) (e.g., age, diet, exercise, or stress levels) out of 20 random variables correlates with blood glucose levels. The more variables you test, the greater the likelihood that one or more will show a significant correlation by random chance, regardless of any true relationship to blood glucose.

This concept can be illustrated with the following code:

library(ggplot2)

library(dplyr)


# Set seed for reproducibility

set.seed(42)


# Generate synthetic data: 30 individuals with 20 unrelated variables

n_individuals <- 30

n_variables <- 20


# Simulate blood glucose level

blood_glucose_level <- rnorm(n_individuals, mean = 100, sd = 10)


# Simulate 20 unrelated variables

data <- data.frame(blood_glucose_level)

for (i in 1:n_variables) {

  data[[paste0("Variable_", i)]] <- rnorm(n_individuals)

}


# Perform multiple comparisons: Test each variable's correlation with blood glucose level

results <- data.frame(Variable = character(0), p_value = numeric(0))

for (i in 1:n_variables) {

  test <- cor.test(data$blood_glucose_level, data[[paste0("Variable_", i)]])

  results <- rbind(results, data.frame(Variable = paste0("Variable_", i), p_value = test$p.value))

}


# Add a significance threshold line

alpha <- 0.05


# Plot p-values for each test

ggplot(results, aes(x = Variable, y = p_value)) +

  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +

  geom_hline(yintercept = alpha, color = "red", linetype = "dashed") +

  ggtitle("Multiple Comparisons: P-Values of 20 Variables") +

  ylab("P-Value") +

  xlab("Variables") +

  theme_minimal() +

  coord_flip()


# Identify variables that would be considered significant (p < 0.05)

significant_vars <- results %>% filter(p_value < alpha)

significant_vars

So, we get one variable (Variable_20) that is significantly correlated (p = 0.02, below the red dotted line) with blood glucose. However, this "significant" result may be a false positive due to the large number of tests conducted. To avoid such pitfalls, it's important to apply appropriate corrections for multiple comparisons. One of the most common methods for correcting multiple comparisons is the Bonferroni correction.

Bonferroni correction 

The Bonferroni correction adjusts the significance level by dividing the original significance threshold (𝛼) by the total number of tests (𝑚). This new threshold (𝛼'=𝛼/𝑚) is used to determine if a result is statistically significant. 

For example, if you have performed 20 tests (𝑚) in the problem above and set your original significance level (𝛼) at 0.05, the Bonferroni-adjusted threshold (𝛼') would be: 0.05/20 = 0.0025. Thus, results with p-values less than 0.0025 are considered statistically significant after applying the Bonferroni correction.  

# Bonferroni correction

alpha <- 0.05

bonferroni_alpha <- alpha / n_variables


# Plot p-values for each test with Bonferroni correction

ggplot(results, aes(x = Variable, y = p_value)) +

  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +

  geom_hline(yintercept = alpha, color = "red", linetype = "dashed", size = 1) +

  geom_hline(yintercept = bonferroni_alpha, color = "green", linetype = "dotted", size = 1) +

  ggtitle("P-Values with Bonferroni Correction") +

  ylab("P-Value") +

  xlab("Variables") +

  theme_minimal() +

  coord_flip()

The Bonferroni-adjusted threshold is shown in the green dotted line (very close to the Y-axis). It reveals that no variable is significantly correlated with blood glucose.

One drawback of the Bonferroni correction is that it is very conservative. While it effectively lowers the threshold, reducing the likelihood of false positives (Type I errors), it also increases the risk of missing true effects, or false negatives (Type II errors).

To overcome this stringency, less conservative approaches can be used to correct for multiple comparisons. These approaches aim to balance the risk of false positives with the power to detect true effects. Two such approaches are discussed below:

Holm-Bonferroni correction

The Holm-Bonferroni method sorts p-values in ascending order and adjusts the significance threshold for each p-value. For the smallest p-value, the threshold is 𝛼/𝑚; for the next smallest it is 𝛼/(𝑚−1), and so on. This sequential adjustment is less conservative than the Bonferroni correction.

For example, as you have performed 20 tests (𝑚) in the above problem and set your original significance level (𝛼) at 0.05, the Holm-Bonferroni correction will be applied as: i) the smallest p-value is compared to (0.05/20), ii) the second smallest p-value is compared to (0.05/19), and so on

# Apply Holm-Bonferroni correction

results$p_adj_holm <- p.adjust(results$p_value, method = "holm")


# Plot original and Holm-Bonferroni adjusted p-values

ggplot(results, aes(x = Variable, y = p_value)) +

  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +

  geom_hline(yintercept = alpha, color = "red", linetype = "dashed") +

  geom_point(aes(y = p_adj_holm, color = "Holm-Bonferroni"), size = 3) +

  scale_y_continuous(name = "P-Value", sec.axis = sec_axis(~ ., name = "Adjusted P-Value (Holm)")) +

  ggtitle("P-Values with Holm-Bonferroni Correction") +

  xlab("Variables") +

  theme_minimal() +

  coord_flip()

The Holm-Bonferroni correction (indicated by red circles) reveals that no variable is significantly correlated with blood glucose.

Benjamini-Hochberg (BH) correction 

The BH procedure controls the False Discovery Rate (FDR). It ranks the p-values from smallest to largest and compares each p-value to its rank, divided by the total number of tests, multiplied by the desired FDR level. The adjusted p-values are then compared to the desired FDR level.

In the above example, as you have performed 20 tests (𝑚) and set the FDR level (𝛼) at 0.05, the BH procedure adjusts each p-value based on its rank. For instance, if a p-value is ranked 5th out of 20, it is compared to: [(5/20) * 0.05 = 0.0125]. Thus, if the 5th p-value is less than 0.0125, it is considered significant, and similar calculations are performed for all the p-values.

In the BH procedure, you can control the amount of false positive results allowed in your test by setting the desired FDR level. By default, the FDR value (i.e. 𝛼) is set at 0.05, or you are permitting 5% false positive results in your analysis. The choice of cutoff for adjusted p-values in BH typically depends on the specific context of the analysis and the desired balance between Type I and Type II errors. 

# Set the FDR level (alpha)

alpha <- 0.05


# Apply Benjamini-Hochberg correction

results$p_adj_bh <- p.adjust(results$p_value, method = "BH")


# Identify significant variables

results$significant <- ifelse(results$p_adj_bh < alpha, "Significant", "Not Significant")


# Plot original and Benjamini-Hochberg adjusted p-values

ggplot(results, aes(x = Variable, y = p_value)) +

  geom_bar(stat = "identity", fill = "blue", alpha = 0.7) +

  geom_hline(yintercept = alpha, color = "red", linetype = "dashed", 

             aes(yintercept = alpha)) +

  geom_point(aes(y = p_adj_bh, color = significant), size = 3) +

  scale_y_continuous(name = "P-Value", sec.axis = sec_axis(~ ., name = "Adjusted P-Value (BH)")) +

  ggtitle("P-Values with Benjamini-Hochberg Correction") +

  xlab("Variables") +

  theme_minimal() +

  coord_flip() +

  scale_color_manual(values = c("Significant" = "red", "Not Significant" = "green"))

Try different cutoff levels of 𝛼 below to check if you can get any significant variable that is associated with the blood glucose level.

Lastly, performing that many tests and adjusting for multiple corrections is not the ideal statistic for this kind of dataset. In such cases, you could have used i) multiple regression (if your goal is to understand the individual contributions of several independent variables to the dependent variable), ii) linear models (if you want to control the contribution of independent variables at different levels, i.e., individual-level or group-level effects) or, iii) partial correlation (when you suspect a direct relationship between two continuous variables while controlling the effects of others).