The Overfitting Dilemma: Finding Balance in Statistical Insights

Overfitting occurs when a statistical model or analysis is too closely tailored to the specific dataset on which it was trained. As a result, the model captures noise or random fluctuations in the data rather than the underlying pattern or trend. This means that while the model may perform very well on the data it was trained on, it fails to generalize to new, unseen data.

Suppose you have the following two parameters in your experiment: 

a) Microbiome Diversity: A measure representing the diversity of microbial species in a sample.

b) Health Score: An outcome that is influenced by the microbiome diversity. 

You hypothesize that there is some degree of relationship between these two parameters. You have been given the option to choose one of the following two statistical models:

a) A simple linear regression model relating health score to microbiome diversity.

b) A high-degree (10th-degree) polynomial regression model relating health score to microbiome diversity.

How do you decide which model to use? Let's create some simulated data and train both models on it.

# Load libraries
library(ggplot2)
library(caret)

# Set seed for reproducibility
set.seed(42)

# Generate synthetic microbiome data: Microbiome diversity vs. Health Score
n_samples <- 30
microbiome_diversity <- runif(n_samples, 1, 10)  # Microbiome diversity index
health_score <- 5 * microbiome_diversity + rnorm(n_samples, mean = 0, sd = 2)  # Health outcome score

# Create a data frame
data <- data.frame(microbiome_diversity, health_score)

# Split the data into training and test sets
set.seed(123)
train_index <- createDataPartition(data$health_score, p = 0.7, list = FALSE)
train_data <- data[train_index, ]
test_data <- data[-train_index, ]

#############################################################################
# Train data analysis (Test data should remain unseen)
#############################################################################

# Define the control using LOOCV
train_control <- trainControl(method = "LOOCV")

# Fit a simple linear model using LOOCV
linear_model_loocv <- train(health_score ~ microbiome_diversity,
                            data = train_data,
                            method = "lm",
                            trControl = train_control)


# Fit a high-degree polynomial model using LOOCV
# (method = "lm" so that both models are fitted with ordinary least squares)
poly_model_loocv <- train(health_score ~ poly(microbiome_diversity, 10),
                          data = train_data,
                          method = "lm",
                          trControl = train_control)


# Predictions on the train data using the LOOCV models
train_data$pred_linear_loocv <- predict(linear_model_loocv, train_data)
train_data$pred_poly_loocv <- predict(poly_model_loocv, train_data)

# Plot train data with fitted LOOCV models
ggplot(train_data, aes(x = microbiome_diversity, y = health_score)) +
  geom_point(color = "blue", size = 2) +
  geom_line(aes(y = pred_linear_loocv), color = "red", linetype = "dashed") +
  geom_line(aes(y = pred_poly_loocv), color = "green") +
  ggtitle("Training Data: Linear Model vs Polynomial Model (LOOCV)") +
  xlab("Microbiome Diversity Index") +
  ylab("Health Score") +
  theme_minimal()

The plot comparing the linear and polynomial models is shown below. The polynomial model (green line) appears to fit the training data better than the linear one (red dashed line).

Although the plot may seem to favor the polynomial model, appearances can be deceptive: you should weigh the complexity of each model before settling on one.
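
One quick, illustrative way to see the complexity gap (a minimal sketch using the fitted caret objects from above) is to count how many coefficients each model estimates:

# Illustrative check: number of estimated coefficients per model
# (intercept + 1 slope for the linear model vs. intercept + 10 polynomial terms)
cat("Linear model coefficients:", length(coef(linear_model_loocv$finalModel)), "\n")
cat("Polynomial model coefficients:", length(coef(poly_model_loocv$finalModel)), "\n")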

Overfitting happens when a model captures noise rather than the true underlying pattern. The polynomial model, with its flexibility, might fit the training data too well, including random fluctuations that do not generalize to new data. An overfitted model performs well on training data but poorly on test data, indicating that it has not learned the true relationship but rather memorized the data. 
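
To see this symptom in our own example (an illustrative check, reusing the training predictions computed above), compare the training-set errors of the two models; the flexible polynomial model will typically show the lower training error even though, as we will see, it does worse on unseen data:

# Illustrative check: training-set MSE of both models
train_mse_linear <- mean((train_data$health_score - train_data$pred_linear_loocv)^2)
train_mse_poly <- mean((train_data$health_score - train_data$pred_poly_loocv)^2)

cat("Training MSE (Linear Model):", round(train_mse_linear, 2), "\n")
cat("Training MSE (Polynomial Model):", round(train_mse_poly, 2), "\n")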

But these are only assumptions until we show that the polynomial model indeed overfits our data. Let's run our trained models on the test data and evaluate their performance there. The metric we will use for evaluation is the Mean Squared Error (MSE). It measures the average squared difference between predicted and actual values; a lower MSE indicates better predictive performance. By comparing the MSE of the linear and polynomial models on the test data, you can assess which model generalizes better to new data.
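
In code form, MSE is simply the mean of the squared residuals; the one-line helper below (an illustrative sketch, the evaluation code that follows computes the same quantity inline) makes the definition explicit:

# Illustrative helper: MSE = average of (actual - predicted)^2
mse <- function(actual, predicted) mean((actual - predicted)^2)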

#############################################################################
# Test data analysis
#############################################################################

# Predictions on the test data using the LOOCV models
test_data$pred_linear_loocv <- predict(linear_model_loocv, test_data)
test_data$pred_poly_loocv <- predict(poly_model_loocv, test_data)

# Evaluate model performance on test data
mse_linear <- mean((test_data$health_score - test_data$pred_linear_loocv)^2)
mse_poly <- mean((test_data$health_score - test_data$pred_poly_loocv)^2)

cat("Mean Squared Error on Test Data (Linear Model):", round(mse_linear, 2), "\n")
cat("Mean Squared Error on Test Data (Polynomial Model):", round(mse_poly, 2), "\n")

Which yields:

Mean Squared Error on Test Data (Linear Model): 8.42
Mean Squared Error on Test Data (Polynomial Model): 13.65

The higher MSE of the polynomial model (13.65) compared to the linear model (8.42) suggests overfitting in the former: the added complexity of the polynomial model is not justified by the underlying data structure, and the linear model is the better choice for this dataset.

Note that you could have gauged the robustness of the linear model relative to the polynomial one from the training data alone, without proceeding to the test data. The term 'LOOCV' used in the models above stands for Leave-One-Out Cross-Validation, a technique for evaluating the performance of a statistical model, particularly when dealing with small datasets. In LOOCV, the dataset is split into N folds, where N is the number of data points. For each fold, the model is trained on N-1 data points and tested on the single data point that was left out. This process is repeated N times, with each data point taking a turn as the test set (within the training dataset). The results from each iteration are then averaged to provide a final performance estimate. LOOCV has very low bias because every data point is used for both training and testing, though its estimates can have high variance and it can be computationally intensive with larger datasets.
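
To make the procedure concrete, here is a minimal manual LOOCV sketch for the linear model (assuming the train_data created above); its RMSE should match the value caret reports below, up to rounding:

# Manual LOOCV (illustrative): leave each training point out once,
# refit the linear model, and predict the held-out point
n_train <- nrow(train_data)
loocv_errors <- numeric(n_train)
for (i in seq_len(n_train)) {
  fit <- lm(health_score ~ microbiome_diversity, data = train_data[-i, ])
  loocv_errors[i] <- train_data$health_score[i] -
    predict(fit, newdata = train_data[i, , drop = FALSE])
}
cat("Manual LOOCV RMSE (Linear Model):", round(sqrt(mean(loocv_errors^2)), 2), "\n")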

Since LOOCV was already performed on the training dataset when fitting each model, you can check the LOOCV results of both models directly:

# LOOCV results on the training dataset
cat("Linear Model RMSE:", round(linear_model_loocv$results$RMSE, 2), "\n")
cat("Polynomial Model RMSE:", round(poly_model_loocv$results$RMSE, 2), "\n")

This indicates that the linear model performs better than the polynomial one:

Linear Model RMSE: 2.21
Polynomial Model RMSE: 2.67

Note that RMSE is simply the square root of MSE. RMSE is often preferred when reporting performance on the training dataset because it is expressed in the same units as the outcome, which makes it easier to interpret, whereas MSE penalizes larger errors more heavily due to the squaring of differences.
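
As a quick illustration of that relationship (reusing the test-set MSE values computed above), taking square roots puts the test errors back on the scale of the health score:

# Illustrative: RMSE on the test data is just the square root of the MSE
cat("Test RMSE (Linear Model):", round(sqrt(mse_linear), 2), "\n")
cat("Test RMSE (Polynomial Model):", round(sqrt(mse_poly), 2), "\n")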