Parallel Programming in Biological Data: The R vs Python Showdown
In the modern era of biology, the sheer scale of data has grown beyond what traditional methods can handle efficiently. From sequencing genomes to analyzing vast gene expression datasets, biologists increasingly find themselves facing computational bottlenecks. This is where parallel computation shines. By dividing tasks among multiple processors, parallel computing can dramatically reduce the time required for resource-heavy operations. In this post, we’ll explore how parallel computation strategies in R and Python can be applied to solve the same biological problem and compare their performance.
Let’s suppose you’re analyzing gene expression data for 100 genes under 10 conditions. This analysis involves simulating expression data multiple times and calculating simple statistics like mean and standard deviation for each gene. While the task is straightforward, repeating it hundreds or thousands of times can be computationally expensive, especially when running sequentially. The question then becomes: How much faster can we go by leveraging parallel computation?
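To make the workload concrete, here is a minimal sketch of one iteration of the task in Python with NumPy. It assumes standard-normal expression values, and the function name `simulate_and_summarize` is illustrative, not part of either implementation below:

```python
import numpy as np

def simulate_and_summarize(n_genes=100, n_conditions=10):
    """Simulate one expression matrix and summarize each gene."""
    # Rows are genes, columns are conditions
    expression = np.random.randn(n_genes, n_conditions)
    # Per-gene mean and sample standard deviation
    means = expression.mean(axis=1)
    sds = expression.std(axis=1, ddof=1)
    return means, sds

means, sds = simulate_and_summarize()
print(means.shape, sds.shape)  # one summary value per gene
```

Each call is cheap on its own; the cost comes from repeating it hundreds or thousands of times, which is exactly the shape of problem parallelism helps with.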
Solving the Problem in R
Strategy: Serial vs. Parallel Execution
In R, we use the doParallel and foreach packages to distribute the computation across six processor cores. Here's the step-by-step breakdown:
Serial Execution:
For each iteration, simulate gene expression data for 100 genes.
Compute the mean and standard deviation for each gene.
Parallel Execution:
Create a cluster using six cores.
Distribute iteration tasks across cores using the foreach function.
Implementation in R
library(foreach)
library(doParallel)
library(ggplot2)
# Number of iterations to time
iters <- seq(10, 1000, by = 10)
# Output time vectors for iteration sets
times_serial <- numeric(length(iters))
times_parallel <- numeric(length(iters))
# Loop over iteration sets
for (val in seq_along(iters)) {
  cat(val, ' of ', length(iters), '\n')
  to.iter <- iters[val]

  # Serial execution
  # List for collecting output
  gene_data_serial <- vector('list', length = to.iter)
  # Start time
  strt_serial <- Sys.time()
  # Serial for loop
  for (i in 1:to.iter) {
    # Simulate gene expression data for 100 genes
    gene_expression <- matrix(rnorm(100 * 10), nrow = 100)
    # Perform basic statistical analysis
    stats_summary <- apply(gene_expression, 1, function(x) c(mean = mean(x), sd = sd(x)))
    # Export
    gene_data_serial[[i]] <- stats_summary
  }
  # End time (force seconds so units stay consistent across runs)
  times_serial[val] <- as.numeric(difftime(Sys.time(), strt_serial, units = 'secs'))

  # Parallel execution
  # Start time
  strt_parallel <- Sys.time()
  # Create cluster with 6 cores
  cl <- makeCluster(6)
  registerDoParallel(cl)
  # Parallel foreach loop; the default combine returns a list,
  # matching the structure of gene_data_serial
  gene_data_parallel <- foreach(i = 1:to.iter) %dopar% {
    # Simulate gene expression data for 100 genes
    gene_expression <- matrix(rnorm(100 * 10), nrow = 100)
    # Perform basic statistical analysis
    apply(gene_expression, 1, function(x) c(mean = mean(x), sd = sd(x)))
  }
  # End time
  times_parallel[val] <- as.numeric(difftime(Sys.time(), strt_parallel, units = 'secs'))
  stopCluster(cl)
}
# Create data frame for plotting
to.plo <- data.frame(iters, times_serial, times_parallel)
# Plot the times with labels
ggplot(to.plo) +
  geom_point(aes(x = iters, y = times_serial, color = "Serial")) +
  geom_smooth(aes(x = iters, y = times_serial, color = "Serial"), se = FALSE) +
  geom_point(aes(x = iters, y = times_parallel, color = "Parallel")) +
  geom_smooth(aes(x = iters, y = times_parallel, color = "Parallel"), se = FALSE) +
  theme_bw() +
  scale_x_continuous('Number of loop iterations') +
  scale_y_continuous('Time in seconds') +
  ggtitle('Comparison of Serial and Parallel Execution Times for Gene Expression Analysis') +
  labs(color = "Execution Type") +
  scale_color_manual(values = c("Serial" = "blue", "Parallel" = "red")) +
  annotate("text", x = 500, y = max(to.plo$times_serial, to.plo$times_parallel),
           label = "Note: Parallel execution shows better scalability as iteration count increases.",
           color = "black", size = 3.5, hjust = 0.5)
The results were striking. The plot below shows that parallel execution outperformed the serial approach, especially as the iteration count increased. The scalability of the parallel method became evident, with the time savings growing larger as the task size expanded. This highlights the power of R’s parallel computing capabilities for handling biologically relevant data analysis tasks.
Solving the Problem in Python
Strategy: Serial vs. Parallel Execution
Python offers its own robust solutions for parallel computation. For this analysis, we used the ProcessPoolExecutor from the concurrent.futures module, which provides an easy-to-use interface for parallel processing. The implementation involves:
Serial Execution:
Similar to R, each iteration simulates gene expression and computes summary statistics.
Parallel Execution:
Use a pool of six processes to distribute computation tasks concurrently.
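Before walking through the full implementation, the pool interface itself can be seen in a minimal, self-contained sketch. Here `executor.map` is used as an alternative to `submit`; it distributes inputs across the workers and returns results in input order (the `square` function is purely illustrative):

```python
from concurrent.futures import ProcessPoolExecutor

def square(x):
    # Worker functions must be defined at module level
    # so they can be pickled and sent to child processes
    return x * x

if __name__ == '__main__':
    with ProcessPoolExecutor(max_workers=6) as executor:
        results = list(executor.map(square, range(8)))
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49]
```

The `if __name__ == '__main__':` guard matters on platforms that spawn (rather than fork) worker processes, such as Windows and macOS.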
Note for R users: if matplotlib and pandas are not already available, they can be installed from within a Jupyter Notebook using pip:
!pip install matplotlib
!pip install pandas
As expected, serial execution in Python also became slower as the number of iterations grew. To speed things up, we adopted a parallel strategy using Python’s ProcessPoolExecutor. This allowed us to divide the workload across six processes, with each process handling a portion of the iterations. The difference in performance was remarkable. Parallel execution not only reduced computation times significantly but also showed excellent scalability as the number of iterations increased.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from concurrent.futures import ProcessPoolExecutor
import time
# Number of iterations to time
iters = np.arange(10, 1000 + 10, 10)
# Output time vectors for iteration sets
times_serial = np.zeros(len(iters))
times_parallel = np.zeros(len(iters))
# Function to simulate gene expression data and perform basic statistical analysis
def process_data():
    # Simulate gene expression data for 100 genes
    gene_expression = np.random.randn(100, 10)
    # Per-gene mean and sample standard deviation (ddof=1 matches R's sd)
    stats_summary = np.apply_along_axis(
        lambda x: [np.mean(x), np.std(x, ddof=1)], 1, gene_expression)
    return stats_summary
# Loop over iteration sets
for idx, val in enumerate(iters):
    print(f"{idx + 1} of {len(iters)}")
    to_iter = val

    # Serial execution
    start_serial = time.time()
    gene_data_serial = [process_data() for _ in range(to_iter)]
    times_serial[idx] = time.time() - start_serial

    # Parallel execution (on Windows and macOS this should run under
    # an `if __name__ == '__main__':` guard)
    start_parallel = time.time()
    with ProcessPoolExecutor(max_workers=6) as executor:
        futures = [executor.submit(process_data) for _ in range(to_iter)]
        gene_data_parallel = [future.result() for future in futures]
    times_parallel[idx] = time.time() - start_parallel
# Create DataFrame for plotting
to_plot = pd.DataFrame({
'iters': iters,
'times_serial': times_serial,
'times_parallel': times_parallel
})
# Plot the times with labels
plt.figure(figsize=(10, 6))
plt.plot(to_plot['iters'], to_plot['times_serial'], 'o-', color='blue', label='Serial')
plt.plot(to_plot['iters'], to_plot['times_parallel'], 'o-', color='red', label='Parallel')
plt.xlabel('Number of loop iterations')
plt.ylabel('Time in seconds')
plt.title('Comparison of Serial and Parallel Execution Times for Gene Expression Analysis')
plt.legend()
plt.grid(True)
plt.annotate('Note: Parallel execution shows better scalability as iteration count increases.',
             xy=(500, max(to_plot['times_serial'].max(), to_plot['times_parallel'].max())),
             xytext=(550, max(to_plot['times_serial'].max(), to_plot['times_parallel'].max()) - 10),
             arrowprops=dict(facecolor='black', shrink=0.05))
plt.show()
The plot below illustrates the results. Python’s parallel execution demonstrated a similar pattern to R’s, with dramatic improvements in runtime (check the Y-axis) for larger iteration counts. This reinforced the notion that parallel computation is not just a luxury but a necessity for handling large-scale biological analyses.
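One way to quantify that scalability claim is to compute the speedup ratio, i.e. serial time divided by parallel time, at each iteration count. The sketch below uses made-up timing values for illustration only, not measurements from the runs above:

```python
import numpy as np

# Hypothetical timing vectors (seconds), for illustration only
iters = np.array([10, 100, 500, 1000])
times_serial = np.array([0.05, 0.52, 2.61, 5.20])
times_parallel = np.array([0.40, 0.55, 1.10, 1.60])

# Speedup > 1 means the parallel run was faster
speedup = times_serial / times_parallel
for n, s in zip(iters, speedup):
    print(f"{n:>5} iterations: {s:.2f}x speedup")
```

Note the pattern a speedup curve typically shows: below some iteration count the ratio sits under 1, because starting the worker pool costs more than the work it saves, and it climbs as the per-iteration work begins to dominate that fixed overhead.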
Verdict:
Both R and Python proved to be powerful tools for parallel computation, but there were subtle differences worth noting. In our runs, Python showed an edge over R at larger iteration counts, likely because its process pool carries relatively low per-task overhead, while the R setup pays PSOCK cluster start-up costs inside each timed block. However, R's simplicity and seamless integration with statistical workflows make it an excellent choice for biologists who primarily work within the R ecosystem.