Bakshi Lab @ Presidency, Kolkata

Introduction to R Graphics - From Basic to High-Quality Plots

This post is part of my presentation for the Computational Biology Workshop for Clinicians (March 6, 2025) at AIIMS Kalyani.

Basic Plots in R

A. Scatter Plots

R provides a versatile and flexible system for creating a variety of plots. In this section, we start with scatter plots, which are one of the simplest ways to visualize relationships between two numeric variables.

1. Generating Data

Before plotting, we generate some data for demonstration:

# Generate data

x <- 1:30

y <- rnorm(30, mean = x) # Normally distributed values centered around x

y2 <- rnorm(30, mean = x, sd = sqrt(x)) # More variation based on x

Here,

y is generated using rnorm() with a mean equal to x, simulating a linear trend with some randomness.
y2 has an increasing variance as x increases, using sd = sqrt(x).
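Because rnorm() draws new random numbers on every call, the plots in this section will look slightly different each time the code runs. A minimal sketch of making the example repeatable, assuming an arbitrary seed of 42 (the seed and the names sd_y and sd_y2 are illustrative and not part of the workshop code):

```r
# Fix the random number generator so the same data come back on every run
set.seed(42)  # assumption: any integer works; 42 is arbitrary

x  <- 1:30
y  <- rnorm(30, mean = x)                # constant spread (sd = 1)
y2 <- rnorm(30, mean = x, sd = sqrt(x))  # spread grows with x

# Compare the scatter around the trend line y = x for both series
sd_y  <- sd(y - x)
sd_y2 <- sd(y2 - x)
print(c(sd_y = sd_y, sd_y2 = sd_y2))
```

Calling set.seed() with the same value before generating the data reproduces exactly the same y and y2.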

2. Basic Scatter Plot

plot(x, y)

Explanation:

plot(x, y) creates a simple scatter plot with x on the x-axis and y on the y-axis.

Expected Output:

A scatter plot with default black circles representing the data points.

3. Line Plot Instead of Points

plot(x, y, type = "l")

Explanation:

type = "l" changes the representation from points to lines connecting the data points.

Expected Output:

A line plot instead of points.

4. Combining Points and Lines

plot(x, y, type = "b")

Explanation:

type = "b" displays both points and connecting lines.

Expected Output:

A plot where points and lines are shown together.

5. Changing Point Style

plot(x, y, type = "b", pch = 4)

Explanation:

pch = 4 changes the point shape to an “X” marker.

Expected Output:

A scatter plot with lines and X-shaped points.

6. Changing Point Color

plot(x, y, type = "b", pch = 2, col = "blue")

Explanation:

col = "blue" changes the color of the points and lines to blue.
pch = 2 changes the point shape to a triangle.

Expected Output:

A blue scatter plot with lines and triangle-shaped points.

7. Adding a Line to the Plot

abline(c(0, 1)) # Intercept = 0, Slope = 1

Explanation:

abline(c(0,1)) adds a reference line with intercept 0 and slope 1.

Expected Output:

A scatter plot with a diagonal reference line (y = x) running through the data.

8. Adding a Second Dataset to the Plot

points(x, y2, col = "red")

Explanation:

points(x, y2, col = "red") adds another dataset (y2) in red without replacing the existing plot.

Expected Output:

A scatter plot where the original dataset is in blue, and the new dataset is in red.
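Steps 6 to 8 build one figure in layers. The sketch below puts them together in a single run. The ylim argument and the legend() call are additions for illustration: points() never rescales axes that plot() has already drawn, so values of y2 outside the range of y would otherwise be cut off.

```r
# Simulated data as in step 1
x  <- 1:30
y  <- rnorm(30, mean = x)
y2 <- rnorm(30, mean = x, sd = sqrt(x))

# Choose a y-range that covers both series, because points() will not
# enlarge the axes after plot() has drawn them
y_range <- range(c(y, y2))

plot(x, y, type = "b", pch = 2, col = "blue", ylim = y_range)  # first series
abline(a = 0, b = 1)                                            # line y = x
points(x, y2, col = "red")                                      # second series
legend("topleft", legend = c("y", "y2"),
       col = c("blue", "red"), pch = c(2, 1))
```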

9. Customizing Axis Labels

plot(x, y2, col = "orange", xlab = "my x-label", ylab = "yyy")

Explanation:

xlab = "my x-label" sets a custom x-axis label.
ylab = "yyy" sets a custom y-axis label.
col = "orange" colors the points orange.

Expected Output:

A scatter plot with custom axis labels and orange points.

10. Setting Custom Axis Range

plot(x, y2, xlim = c(1,10), ylim = c(1,5))

Explanation:

xlim = c(1,10) restricts the x-axis range to 1 to 10, ignoring values outside this range.
ylim = c(1,5) restricts the y-axis range to 1 to 5, ignoring values outside this range.

Expected Output:

A scatter plot where only those points whose x-values are from 1 to 10 and y-values from 1 to 5 are displayed.

B. Histograms

Histograms are useful for visualizing the distribution of a dataset. They show how data points are distributed across different bins, helping to understand frequency patterns, skewness, and variability.

In this section, we will generate a dataset using a Poisson distribution and explore different ways to customize histograms in R.

1. Generating Data

# Create a random dataset of 100 numbers from a Poisson distribution with mean 3

d1 <- rpois(100, lambda = 3)

Explanation:

The rpois() function generates 100 random numbers from a Poisson distribution with a mean (λ) of 3.
This simulates count data, often used in biological and ecological studies.

2. Basic Histogram

hist(d1)

Explanation:

hist(d1) creates a default histogram of d1, automatically setting the number of bins.

Expected Output:

A histogram displaying the frequency distribution of values in d1, with automatically chosen bins.

3. Specifying the Number of Bins

hist(d1, breaks = 4)

Explanation:

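breaks = 4 asks R for about 4 bins. A single number is treated only as a suggestion: R passes it to pretty() to pick round break points, so the actual number of bins may differ.

Expected Output:

A histogram with roughly 4 wide bins, which shows the distribution in less detail.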
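One way to see how R handles a single number passed to breaks is to inspect the break points it actually chose. A small sketch (the variable h is introduced here for illustration):

```r
d1 <- rpois(100, lambda = 3)

# Compute the histogram without drawing it
h <- hist(d1, breaks = 4, plot = FALSE)

h$breaks          # the bin edges R settled on
length(h$counts)  # the number of bins actually used
```

If an exact set of bins is required, passing a vector of bin edges to breaks is the reliable route.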

4. Manually Defining Bin Edges

hist(d1, breaks = c(0, 1, 3, 5, 7, 11, 21))

Explanation:

breaks = c(0, 1, 3, 5, 7, 11, 21) manually defines the edges of the bins instead of relying on automatic calculations.

Expected Output:

A histogram where bin sizes are irregular, allowing finer control over how data is grouped. Because the bins have unequal widths, R plots density rather than counts on the y-axis by default.

5. Using Frequency (Default)

hist(d1, freq = TRUE)

Explanation:

freq = TRUE ensures the histogram shows absolute frequencies (the count of data points in each bin).
This is the default setting when bins are of equal size.

Expected Output:

A histogram displaying counts on the y-axis.

6. Using Density Instead of Frequency

hist(d1, freq = FALSE)

Explanation:

freq = FALSE normalizes the histogram to show density instead of raw counts.
The area under the histogram sums to 1, making it useful for comparing distributions.

Expected Output:

A histogram with density values on the y-axis instead of counts.
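The area property can be checked numerically: the area of each bar is its density multiplied by its bin width. A minimal sketch using the object returned by hist() (the names bar_areas and total_area are illustrative):

```r
d1 <- rpois(100, lambda = 3)

# Compute the histogram without drawing it
h <- hist(d1, plot = FALSE)

# Area of each bar = height (density) x width (difference between bin edges)
bar_areas  <- h$density * diff(h$breaks)
total_area <- sum(bar_areas)
total_area
```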

7. Creating a Histogram Without Plotting

z <- hist(d1, plot = FALSE)

Explanation:

plot = FALSE prevents R from displaying the histogram but still stores the results in z.
This is useful when you need to extract histogram properties without visualizing it.

No Expected Output (since it's not plotted).

8. Accessing Histogram Properties

z$counts # Number of values in each bin

z$mids # Midpoints of the bins

Explanation:

z$counts gives the number of values in each bin.
z$mids returns the midpoints of the bins.
These values can be used for further analysis, such as overlaying a density curve.

Expected Output (the exact numbers vary from run to run because d1 is random):

> z$counts # Number of values in each bin

[1] 25 22 17 13 11 7 4 0 0 0 1

> z$mids # Midpoints of the bins

[1] 0.5 1.5 2.5 3.5 4.5 5.5 6.5 7.5 8.5 9.5 10.5

C. Bar Plots

Bar plots are useful for visualizing categorical data by representing counts, proportions, or numerical values associated with categories. They help compare group sizes, distributions, and trends effectively. A bar plot is distinct from a histogram as it is specifically used for categorical data. Each bar represents a distinct category in a bar plot, and spaces separate the bars. In contrast, a histogram is used for continuous data, where data is grouped into bins, and the bars are adjacent to show the frequency distribution of values.

In this section, we will explore how to create bar plots in R using different datasets and customization options.

1. Basic Bar Plot (Using islands Dataset)

data(islands) # Load dataset containing areas of various islands

barplot(islands)

Explanation:

The islands dataset contains land areas (in 1000 square miles) of various world islands.
barplot(islands) creates a vertical bar plot, where each bar represents an island's area.

Expected Output:

A vertical bar plot showing different islands and their corresponding areas.

2. Horizontal Bar Plot

barplot(islands, horiz = TRUE)

Explanation:

Setting horiz = TRUE makes the bars horizontal, which is useful when dealing with long labels or emphasizing ranking.

Expected Output:

A horizontal bar plot with the same data.

3. Adjusting Label Orientation

barplot(islands, horiz = TRUE, las = 1)

Explanation:

las = 1 rotates axis labels horizontally for better readability.
Useful when category names are long or crowded.

Expected Output:

A horizontal bar plot with horizontally aligned labels.

4. Bar Plot Using iris Dataset

data(iris) # Load dataset

barplot(height = iris$Petal.Width, beside = TRUE, col = iris$Species)

Explanation:

The iris dataset contains measurements of Sepal and Petal dimensions across three species (setosa, versicolor, virginica).
height = iris$Petal.Width plots the petal width values as bars, one bar for each of the 150 flowers.
beside = TRUE only matters when height is a matrix, where it places the bars of each column side by side instead of stacking them. Here height is a vector, so the argument has no effect.
col = iris$Species colors bars based on species, distinguishing groups visually.

Expected Output:

A bar plot with 150 bars, one per flower, showing Petal Width and colored according to species.
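A grouped layout needs a matrix as height, with one column per group. The sketch below is one way to get there: it summarizes iris into mean petal length and width per species and passes that matrix to barplot(). The name petal_means and the color choices are illustrative.

```r
data(iris)

# Mean petal length and width per species: a 2 x 3 matrix
# (rows = measurements, columns = species)
petal_means <- sapply(
  split(iris[, c("Petal.Length", "Petal.Width")], iris$Species),
  colMeans
)

# With a matrix, beside = TRUE draws one cluster of bars per column
barplot(petal_means, beside = TRUE,
        col = c("steelblue", "orange"),
        legend.text = rownames(petal_means),
        args.legend = list(x = "topleft"),
        ylab = "Mean (cm)")
```

Dropping beside = TRUE from this call stacks the two measurements within each species instead.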

D. Box Plots

Box plots (also called box-and-whisker plots) are useful for visualizing the distribution, spread, and potential outliers in numerical data across different categories. They summarize key statistics, including:

Median (Q2) – the middle value
Interquartile Range (IQR) – the range between Q1 (25th percentile) and Q3 (75th percentile)
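Whiskers – extend from the box to the most extreme data points that lie within 1.5 × IQR of Q1 and Q3 (the default in R)
Outliers – values beyond the whiskers, drawn as individual points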

1. Creating a Box Plot with the iris Dataset

xx <- data.frame(iris) # Load iris dataset

boxplot(xx$Petal.Width ~ xx$Species, col = c("red", "green", "blue"))

Explanation:

The iris dataset contains petal and sepal measurements for three flower species: setosa, versicolor, and virginica.
xx$Petal.Width ~ xx$Species groups Petal Width by Species, displaying separate box plots for each species.
col = c("red", "green", "blue") assigns different colors to each species for clear differentiation.

Expected Output:

A colored box plot showing the distribution of Petal Width across the three species:

setosa (red)
versicolor (green)
virginica (blue)
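The numbers behind each box can be inspected with boxplot.stats(), which returns the five values that define the box and whiskers along with any points flagged as outliers. A short sketch for one species (the names setosa_width and s are illustrative):

```r
data(iris)

# Petal widths for a single species
setosa_width <- iris$Petal.Width[iris$Species == "setosa"]

s <- boxplot.stats(setosa_width)

s$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
s$out    # values that would be drawn as individual outlier points
```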

E. Density Plots

Density plots help visualize the distribution of continuous data in a smooth and interpretable way. Unlike histograms, density plots use kernel density estimation (KDE) to create a continuous probability distribution.

In this section, we use filled.contour() to represent a 3D surface in 2D, the same kind of display that is used to show 2D density estimates.

Creating a 2D Density Plot

# Generate normally distributed data

x <- sort(rnorm(100)) # 100 random values sorted

y <- sort(rnorm(50)) # 50 random values sorted

# Compute the outer product to create a grid of heights: z[i, j] = x[i] * y[j]

z <- x %o% y

# Draw the 3D surface as a filled 2D contour plot

filled.contour(z)

Explanation:

rnorm(n) generates n normally distributed random values (mean = 0, sd = 1 by default).
sort() ensures the values are arranged in increasing order.
x %o% y computes the outer product, a 100 × 50 matrix with z[i, j] = x[i] * y[j]. These products are the surface heights that get plotted; they are not a kernel density estimate.
filled.contour(z) produces a 2D filled contour plot, where:
- Contours join points of equal height (like a topographic map).
- Colors indicate the height of the surface, as shown in the color key beside the plot.

Expected Output:

A smooth filled contour plot where different colors indicate different heights of z. Because only z is supplied, both axes run from 0 to 1.
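For a kernel density estimate of paired observations, one option is kde2d() from the MASS package, whose result can be passed straight to filled.contour(). This is a sketch under assumptions, not the workshop's code: the seed, the sample size of 200, and the 50 × 50 grid are arbitrary choices.

```r
library(MASS)  # provides kde2d(); MASS ships with standard R installations

set.seed(1)    # assumption: arbitrary seed for repeatability
x <- rnorm(200)
y <- rnorm(200)  # kde2d() needs x and y of equal length

# Estimate the joint density on a 50 x 50 grid
dens <- kde2d(x, y, n = 50)

# dens is a list with components x, y and z, which filled.contour() accepts
filled.contour(dens, xlab = "x", ylab = "y")
```

Here the colors do represent estimated density, and the axes carry the original data units.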

Advanced Features in R Graphics

Advanced plotting in R improves visualization mainly through scaling, labeling, legends, themes, and facets. These features help make plots more informative, readable, and aesthetically appealing.

A. Scaling

In this section, we explore scaling options in ggplot2 to control axes, colors, and transformations for enhanced data visualization.

1. Continuous Scaling

Example: Scatter Plot with Continuous Scaling

# Load ggplot2

library(ggplot2)

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100))

# Scatter plot with continuous scaling

ggplot(data, aes(x = x, y = y)) +

geom_point() +

scale_x_continuous(name = "X-axis label") +

scale_y_continuous(name = "Y-axis label")

Explanation:

scale_x_continuous() and scale_y_continuous() explicitly define the axis labels for continuous variables.
This ensures proper interpretation of the numeric scale on both axes.

2. Discrete Scaling

Example: Bar Plot with Discrete Scaling

# Create a dataset

data <- data.frame(category = c("A", "B", "C", "D"), value = c(10, 15, 8, 12))

# Bar plot with discrete scaling

ggplot(data, aes(x = category, y = value, fill = category)) +

geom_bar(stat = "identity") +

scale_x_discrete(name = "Categories") +

scale_y_continuous(name = "Values")

Explanation:

scale_x_discrete() is used when the x-axis represents categorical data (e.g., "A", "B", "C", "D").
The fill aesthetic is mapped to the category variable, coloring the bars accordingly.

3. Logarithmic Scaling

Example: Line Plot with Logarithmic Scaling

# Create a dataset

data <- data.frame(x = 1:10, y = exp(1:10))

# Line plot with logarithmic scaling

ggplot(data, aes(x = x, y = y)) +

geom_line() +

scale_y_log10(name = "Logarithmic Scale")

Explanation:

scale_y_log10() applies a log transformation to the y-axis, useful for datasets with exponential growth.
This technique helps visualize skewed distributions or large numeric ranges.
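The reason a log axis suits exponential data can be shown with a little arithmetic: log10(exp(x)) equals x × log10(e), so equal steps in x become equal steps on the log scale and the curve turns into a straight line. A brief sketch (the name steps and the added geom_point() layer are illustrative):

```r
library(ggplot2)

data <- data.frame(x = 1:10, y = exp(1:10))

# Differences between successive log10(y) values are all log10(e), about 0.434
steps <- diff(log10(data$y))
steps

# On the log10 axis the exponential curve is drawn as a straight line
p <- ggplot(data, aes(x = x, y = y)) +
  geom_line() +
  geom_point() +
  scale_y_log10(name = "y (log10 scale)")
p
```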

4. Color Scaling

Example: Scatter Plot with Color Scaling

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100), color_var = rnorm(100))

# Scatter plot with color scaling

ggplot(data, aes(x = x, y = y, color = color_var)) +

geom_point() +

scale_color_continuous(name = "Color Legend")

Explanation:

scale_color_continuous() maps a continuous variable to color intensity.
The color gradient visually represents variations in the third numeric variable (color_var).

B. Labeling

Proper labeling in plots improves clarity, interpretation, and presentation. In this section, we cover:

Axis Labels
Plot Titles
Legend Titles

1. Adding Axis Labels

Example: Scatter Plot with Axis Labels

# Load ggplot2

library(ggplot2)

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100))

# Scatter plot with labeled axes

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(x = "X-axis label", y = "Y-axis label")

Explanation:

The labs(x = "...", y = "...") function adds custom labels for the x and y axes.
Helps in understanding variable meanings on each axis.

2. Adding a Plot Title

Example: Bar Plot with a Title

# Create a dataset

data <- data.frame(category = c("A", "B", "C", "D"), value = c(10, 15, 8, 12))

# Bar plot with a title

ggplot(data, aes(x = category, y = value, fill = category)) +

geom_bar(stat = "identity") +

labs(title = "Bar Plot with Title", x = "Categories", y = "Values")

Explanation:

labs(title = "...") adds a descriptive title to the plot.
Titles make plots more informative by summarizing insights.

3. Adding a Legend Title

Example: Scatter Plot with a Legend Title

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100), group = rep(c("A", "B"), each = 50))

# Scatter plot with legend title

ggplot(data, aes(x = x, y = y, color = group)) +

geom_point() +

labs(title = "Scatter Plot with Legend Title",

x = "X-axis label",

y = "Y-axis label",

color = "Group") +

theme(legend.title = element_text(size = 12))

Explanation:

labs(color = "Group") customizes the legend title.
theme(legend.title = element_text(size = 12)) adjusts the text size of the legend title.

C. Legends

Legends improve visualization by clarifying groupings and aesthetics. Here, we explore:

Color Legends
Shape Legends
Size Legends
Fill Legends
Legend Positioning

1. Adding a Color Legend

Example: Scatter Plot with Color Legend

library(ggplot2)

# Create a dataset

data <- data.frame(x = rnorm(150), y = rnorm(150), group = rep(c("A", "B", "C"), each = 50))

# Scatter plot with color legend

ggplot(data, aes(x = x, y = y, color = group)) +

geom_point() +

labs(title = "Scatter Plot with Color Legend", x = "X-axis label", y = "Y-axis label")

Explanation:

aes(color = group) assigns different colors to groups A, B and C.
ggplot2 automatically generates a legend for the color aesthetic.

2. Adding a Shape Legend

Example: Scatter Plot with Shape Legend

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100), group = rep(c("A", "B"), each = 50))

# Scatter plot with shape legend

ggplot(data, aes(x = x, y = y, shape = group)) +

geom_point(size = 3) +

labs(title = "Scatter Plot with Shape Legend", x = "X-axis label", y = "Y-axis label")

Explanation:

aes(shape = group) assigns different shapes to groups.
Useful when printing in black and white (avoids reliance on colors).

3. Adding a Size Legend

Example: Scatter Plot with Size Legend

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100), size_var = runif(100, 2, 8))

# Scatter plot with size legend

ggplot(data, aes(x = x, y = y, size = size_var)) +

geom_point() +

labs(title = "Scatter Plot with Size Legend", x = "X-axis label", y = "Y-axis label")

Explanation:

aes(size = size_var) scales point size by a continuous variable.
Ideal for emphasizing importance (e.g., population size).

4. Adding a Fill Legend

Example: Bar Plot with Fill Legend

# Create a dataset

data <- data.frame(category = c("A", "B", "C", "D"), value = c(10, 15, 8, 12), group = rep(c("X", "Y"), each = 2))

# Bar plot with fill legend

ggplot(data, aes(x = category, y = value, fill = group)) +

geom_bar(stat = "identity") +

labs(title = "Bar Plot with Fill Legend", x = "Categories", y = "Values")

Explanation:

aes(fill = group) colors bars by group.
Helps distinguish categories visually.

5. Positioning the Legend

Example: Moving the Legend to the Bottom-Right

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100), group = rep(c("A", "B"), each = 50))

# Scatter plot with repositioned legend

ggplot(data, aes(x = x, y = y, color = group)) +

geom_point() +

labs(title = "Scatter Plot with Bottom-Right Legend", x = "X-axis label", y = "Y-axis label") +

theme(legend.position = "bottom", legend.justification = "right")

Explanation:

legend.position = "bottom" moves the legend.
legend.justification = "right" aligns it to the right side.

D. Themes

Themes in ggplot2 allow customization of plot appearance. Here, we explore:

Default Theme
Minimal Theme
Classic Theme
Dark Theme
Custom Themes

1. Default Theme

library(ggplot2)

# Create a dataset

data <- data.frame(x = rnorm(100), y = rnorm(100))

# Scatter plot with default theme

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(title = "Scatter Plot with Default Theme", x = "X-axis label", y = "Y-axis label")

Explanation:

The default ggplot2 theme includes gray background and gridlines.
Good for quick exploratory analysis.

2. Minimal Theme

# Scatter plot with minimal theme

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(title = "Scatter Plot with Minimal Theme", x = "X-axis label", y = "Y-axis label") +

theme_minimal()

Explanation:

theme_minimal() removes the gray background and the panel border but keeps light gridlines.
Best for presentations and reports.

3. Classic Theme

# Scatter plot with classic theme

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(title = "Scatter Plot with Classic Theme", x = "X-axis label", y = "Y-axis label") +

theme_classic()

Explanation:

theme_classic() removes gridlines and background but keeps axis lines.
Good for academic papers.

4. Dark Theme

# Scatter plot with dark theme

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(title = "Scatter Plot with Dark Theme", x = "X-axis label", y = "Y-axis label") +

theme_dark()

Explanation:

theme_dark() gives the plot panel a dark gray background, which helps light or thin colored elements stand out.
Only the panel is darkened; the area around it, including the title and axis labels, stays light.

5. Custom Theme

# Scatter plot with custom theme

ggplot(data, aes(x = x, y = y)) +

geom_point() +

labs(title = "Scatter Plot with Custom Theme", x = "X-axis label", y = "Y-axis label") +

theme(

axis.text = element_text(size = 12, color = "blue"),

plot.title = element_text(hjust = 0.5, size = 16, face = "bold")

)

Explanation:

element_text(size = 12, color = "blue") changes axis text size and color.
plot.title = element_text(hjust = 0.5, size = 16, face = "bold") centers and bolds the title.
Highly customizable for publications and branding.
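When the same look is needed across many figures, the theme settings can be stored in an object and added to each plot. A minimal sketch, assuming a hypothetical object name theme_workshop and theme_minimal() as the starting point:

```r
library(ggplot2)

# A reusable theme: start from a built-in theme, then override elements
theme_workshop <- theme_minimal(base_size = 12) +
  theme(
    axis.text  = element_text(color = "blue"),
    plot.title = element_text(hjust = 0.5, size = 16, face = "bold")
  )

data <- data.frame(x = rnorm(100), y = rnorm(100))

# Add the stored theme like any other plot component
p <- ggplot(data, aes(x = x, y = y)) +
  geom_point() +
  labs(title = "Scatter Plot with a Reusable Theme") +
  theme_workshop
p
```

Keeping the theme in one place means a change to fonts or colors carries over to every figure that uses it.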

E. Facets

Facets allow visualization of subsets of data in separate panels within the same figure. We explore:

Facet Wrap (single categorical variable)
Facet Grid (two categorical variables)
Free Scales in Facets

1. Facet Wrap

library(ggplot2)

# Create a dataset

data <- data.frame(x = rnorm(200), y = rnorm(200), category = rep(c("A", "B"), each = 100))

# Scatter plot with facet wrap

ggplot(data, aes(x = x, y = y)) +

geom_point() +

facet_wrap(~ category) +

labs(title = "Scatter Plot with Facet Wrap", x = "X-axis label", y = "Y-axis label")

Explanation:

facet_wrap(~ category) creates separate plots for each category (A & B).
Useful when you have one categorical variable.
Arranges plots in a flexible grid.

2. Facet Grid

# Create a dataset with two categorical variables

data <- data.frame(x = rnorm(200), y = rnorm(200), category1 = rep(c("A", "B"), each = 100), category2 = rep(c("X", "Y"), times = 100))

# Scatter plot with facet grid

ggplot(data, aes(x = x, y = y)) +

geom_point() +

facet_grid(category1 ~ category2) +

labs(title = "Scatter Plot with Facet Grid", x = "X-axis label", y = "Y-axis label")

Explanation:

facet_grid(category1 ~ category2) creates a matrix-like layout.
Each combination of category1 (A & B) and category2 (X & Y) gets a separate panel.
Good for structured comparisons.

3. Free Scales in Facets

# Scatter plot with facet wrap and free scales

ggplot(data, aes(x = x, y = y)) +

geom_point() +

facet_wrap(~ category1, scales = "free") +

labs(title = "Scatter Plot with Facet Wrap and Free Scales", x = "X-axis label", y = "Y-axis label")

Explanation:

scales = "free" allows each facet to have independent axes.
Useful when data ranges vary greatly between groups.

Basic Plots with Advanced Features in R

These plots use the above-mentioned advanced features to create high-quality graphics. Please note that all input data files used in the construction of these plots can be downloaded from: https://github.com/utpalmtbi/R-Graphics

A. Scatter Plot

This scatter plot visualizes the relationship between Gene Expression and Cell Viability, with different colors representing distinct treatment groups.

# Load ggplot2 package

library(ggplot2)

# Read dataset

biological_data <- read.csv('adv_plot_1.csv')

# Scatter plot

ggplot(biological_data, aes(x = GeneExpression, y = CellViability, color = Treatment)) +

geom_point(size = 3, alpha = 0.8) + # Customizing point size and transparency

labs(title = "Scatter Plot of Gene Expression vs. Cell Viability",

x = "Gene Expression",

y = "Cell Viability") +

theme_minimal() + # Using a minimal theme for a clean look

scale_color_manual(values = c("blue", "red")) # Setting custom colors for treatments

What This Plot Does:

Plots Gene Expression vs. Cell Viability:
- The x-axis represents Gene Expression levels.
- The y-axis represents Cell Viability percentages.
Colors Different Treatment Groups:
- Each point represents a sample.
- The color of the point indicates the treatment group (e.g., Control vs. Treated).
Customizations for Better Readability:
- Point Size (size = 3) → Increases point visibility.
- Transparency (alpha = 0.8) → Lets overlapping points show through one another.
- Minimal Theme (theme_minimal()) → Makes the plot cleaner by removing the gray panel background.
- Manual Color Mapping (scale_color_manual(values = c("blue", "red"))) → Assigns colors in the order of the Treatment levels (alphabetical by default), so with groups named Control and Treated, Control samples are blue and Treated samples are red.

Why These Customizations Matter:

Coloring treatment groups makes it easier to distinguish patterns visually.
Adding transparency keeps overlapping points from hiding one another completely.
The minimal theme reduces clutter, making the graph more readable.
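An unnamed color vector is matched to groups by level order, so renaming or adding a group can silently shift the colors. One way to pin each color to a group is a named vector. The sketch below uses a small made-up data frame as a stand-in for adv_plot_1.csv; the values, the group names Control and Treated, and the name treatment_colors are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for adv_plot_1.csv (same column names, made-up values)
toy <- data.frame(
  GeneExpression = c(2.1, 3.4, 4.0, 5.2, 6.1, 7.3),
  CellViability  = c(91, 88, 84, 70, 62, 55),
  Treatment      = rep(c("Treated", "Control"), each = 3)
)

# Names tie each color to a group, independent of level order
treatment_colors <- c(Control = "blue", Treated = "red")

p <- ggplot(toy, aes(x = GeneExpression, y = CellViability, color = Treatment)) +
  geom_point(size = 3, alpha = 0.8) +
  scale_color_manual(values = treatment_colors) +
  theme_minimal()
p
```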

B. Line Plot

This line plot visualizes how Gene Expression changes over time for different groups. Each group is represented by a separate line, showing trends in expression levels across different time points.

# Load ggplot2 package

library(ggplot2)

# Read dataset

time_series_data <- read.csv('adv_plot_2.csv')

# Line plot

ggplot(time_series_data, aes(x = Time, y = GeneExpression, color = Group, group = Group)) +

geom_line(linewidth = 1.5) + # Customizing line thickness

geom_point(size = 3, shape = 21, fill = "white") + # Customizing point appearance

labs(title = "Time Series Plot of Gene Expression",

x = "Time Points",

y = "Gene Expression") +

theme_bw() + # Using a black and white theme for a classic look

scale_color_manual(values = c("blue", "green")) + # Setting custom colors for groups

guides(color = guide_legend(override.aes = list(shape = 21, size = 3))) # Adjusting legend appearance (the legend belongs to the color scale; fill is not mapped here)

What This Plot Does:

Plots Gene Expression over Time:
- The x-axis represents Time Points.
- The y-axis represents Gene Expression levels.
- Each line represents a different Group (e.g., Control vs. Treated).
Uses Lines and Points for Clarity:
- A line connects the points for each group to show trends over time.
- Points are added to indicate actual data values at each time point.
Customizations for Better Readability:
- Line Thickness (linewidth = 1.5) → Makes the trend lines clearer.
- Point Style (shape = 21, fill = "white") → Uses hollow circles to highlight data points.
- Black & White Theme (theme_bw()) → Provides a high-contrast, clean background.
- Manual Color Mapping (scale_color_manual(values = c("blue", "green"))) → Ensures different groups have distinct colors.
- Custom Legend (guides()) → Makes the legend easier to interpret.

Why These Customizations Matter:

Using both lines and points makes it easier to interpret trends while ensuring data points remain visible.
The black-and-white theme provides a clean and professional look.
Manually setting colors ensures consistency and better visual contrast.
A well-structured legend improves readability and clarity.

C. Box Plot

This box plot visually summarizes the distribution of Gene Expression levels for different treatment groups. It highlights key statistical properties such as medians, quartiles, and outliers.

# Load ggplot2 package

library(ggplot2)

# Read dataset

biological_data <- read.csv('adv_plot_3.csv')

# Box plot

ggplot(biological_data, aes(x = Treatment, y = GeneExpression, fill = Treatment)) +

geom_boxplot(width = 0.6, notch = TRUE, outlier.shape = 16, outlier.size = 3) + # Customize box appearance

labs(title = "Box Plot of Gene Expression by Treatment",

x = "Treatment",

y = "Gene Expression") +

theme_minimal() + # Use a minimal theme

scale_fill_manual(values = c("lightblue", "lightcoral")) + # Set custom fill colors

guides(fill = guide_legend(override.aes = list(shape = NA))) # Remove legend symbols

What This Plot Does:

Compares Gene Expression Across Treatment Groups:
- The x-axis represents Treatment Groups (e.g., Control vs. Treated).
- The y-axis represents Gene Expression levels.
- Each box represents the interquartile range (IQR), with the median shown as a horizontal line inside the box.
Identifies Data Spread & Outliers:
- Whiskers extend to the smallest and largest values that lie within 1.5 × IQR of the box edges (Q1 and Q3).
- Outliers (dots) are values outside this range.
Customizations for Clarity:
- Box Width (width = 0.6) → Adjusts the width for better spacing.
- Notched Boxes (notch = TRUE) → Adds a notch to compare medians visually.
- Outlier Shape & Size (outlier.shape = 16, outlier.size = 3) → Ensures outliers are clearly visible.
- Minimal Theme (theme_minimal()) → Removes the gray panel background for a clean look.
- Manual Fill Colors (scale_fill_manual()) → Ensures consistency in treatment group colors.
- Legend Cleanup (guides(fill = guide_legend(override.aes = list(shape = NA)))) → Asks the legend not to draw point symbols; box plot keys have none, so the visible effect here is small.

Why These Customizations Matter:

Notches give a rough visual test: if the notches of two boxes do not overlap, the medians likely differ.
Outlier customization ensures anomalies stand out.
Manual color selection keeps treatment colors consistent across figures.
A minimal theme improves focus on data rather than gridlines.
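The notch limits can also be looked at numerically. Base R's boxplot.stats() reports them as median ± 1.58 × (hinge spread) / sqrt(n); ggplot2 documents a closely related formula based on the IQR. The sketch below uses made-up expression values, and the names expr, s, and by_hand are illustrative.

```r
# Hypothetical expression values for one treatment group
expr <- c(4.1, 4.8, 5.0, 5.3, 5.5, 5.9, 6.2, 6.4, 7.0, 9.5)

s <- boxplot.stats(expr)

# Notch limits computed by boxplot.stats()
s$conf

# The same limits by hand: median +/- 1.58 * (hinge spread) / sqrt(n)
by_hand <- s$stats[3] + c(-1.58, 1.58) * diff(s$stats[c(2, 4)]) / sqrt(s$n)
by_hand
```

Comparing these intervals between groups is an informal check and does not replace a formal statistical test.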

D. Bar Plot

This bar plot visually represents the mean gene expression across different treatment groups. The addition of error bars provides insights into data variability.

# Load ggplot2 package

library(ggplot2)

# Read dataset

biological_data <- read.csv('adv_plot_4.csv')

# Bar plot

ggplot(biological_data, aes(x = Treatment, y = MeanExpression, fill = Treatment)) +

geom_bar(stat = "identity", position = "dodge", width = 0.6, color = "black") + # Customize bar appearance

geom_errorbar(aes(ymin = MeanExpression - SDExpression, ymax = MeanExpression + SDExpression),

position = position_dodge(0.6), width = 0.25) + # Add error bars

labs(title = "Bar Plot of Mean Gene Expression by Treatment",

x = "Treatment",

y = "Mean Gene Expression") +

theme_minimal() + # Use a minimal theme

scale_fill_manual(values = c("lightblue", "lightgreen", "lightcoral")) + # Set custom fill colors

guides(fill = guide_legend(override.aes = list(shape = NA))) # Remove legend symbols

What This Plot Does:

Compares Mean Gene Expression Across Treatment Groups:
- The x-axis represents Treatment Groups (e.g., Control, Treated, etc.).
- The y-axis represents Mean Gene Expression levels.
- Each bar represents the mean expression for a given treatment.
Includes Error Bars to Show Variability:
- Standard deviation (SD) is displayed using vertical error bars.
- The top and bottom of each error bar show Mean ± SD, indicating variability.
Customizations for Better Visualization:
- Dodged Bars (position = "dodge") → Places bars side by side when several bars share one x position; with a single bar per treatment, as here, the layout is unchanged.
- Bar Width (width = 0.6) → Ensures proper spacing between bars.
- Bar Borders (color = "black") → Enhances contrast for better visibility.
- Minimal Theme (theme_minimal()) → Removes the gray panel background.
- Custom Colors (scale_fill_manual()) → Assigns distinct colors for each treatment.
- Legend Cleanup (guides(fill = guide_legend(override.aes = list(shape = NA)))) → Asks the legend not to draw point symbols; bar keys have none, so the visible effect here is small.

Why These Customizations Matter:

Error bars provide a measure of data spread, making the plot more informative.
Dodged bars prevent overlap, making group comparisons clearer.
Manual color assignment ensures consistency in treatment representation.
Adding borders to bars enhances visual clarity.
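The plot above expects one row per treatment with the mean and SD already computed. If only raw measurements are available, a summary table of that shape can be built first. A minimal sketch with made-up values; the data frame raw, its Expression column, and the group names are assumptions, while MeanExpression and SDExpression match the column names used in the plot.

```r
# Hypothetical raw measurements (made-up values, three groups of five)
raw <- data.frame(
  Treatment  = rep(c("Control", "DrugA", "DrugB"), each = 5),
  Expression = c(1, 2, 3, 4, 5,
                 4, 5, 6, 7, 8,
                 2, 4, 6, 8, 10)
)

# Mean and standard deviation within each treatment group
group_mean <- tapply(raw$Expression, raw$Treatment, mean)
group_sd   <- tapply(raw$Expression, raw$Treatment, sd)

# One row per treatment, ready for geom_bar() and geom_errorbar()
summary_df <- data.frame(
  Treatment      = names(group_mean),
  MeanExpression = as.numeric(group_mean),
  SDExpression   = as.numeric(group_sd)
)
summary_df
```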

E. Violin Plot

This violin plot is a powerful way to visualize the distribution of gene expression across treatment groups. It combines features of a box plot and a density plot, showing both summary statistics and the full data distribution.

# Load ggplot2 package

library(ggplot2)

# Read dataset

biological_data <- read.csv('adv_plot_5.csv')

# Violin plot

ggplot(biological_data, aes(x = Treatment, y = GeneExpression, fill = Treatment)) +

geom_violin(width = 0.8, trim = FALSE, draw_quantiles = c(0.25, 0.5, 0.75)) + # Violin appearance; fill is taken from aes(fill = Treatment)

geom_jitter(position = position_jitter(width = 0.2), size = 2, color = "black") + # Add jittered points

labs(title = "Violin Plot of Gene Expression by Treatment",

x = "Treatment",

y = "Gene Expression") +

theme_minimal() + # Use a minimal theme

scale_fill_manual(values = c("lightblue", "lightcoral")) + # Set custom fill colors

guides(fill = guide_legend(override.aes = list(shape = NA))) # Remove legend symbols

What This Plot Does:

Displays Distribution Shape & Density:
- The violin shape shows where most data points are concentrated.
- A wider section indicates a higher density of values.
Adds Statistical Summaries:
- Quartiles (draw_quantiles = c(0.25, 0.5, 0.75)) → Draws lines at the first quartile, median, and third quartile.
- No trimming (trim = FALSE) → Lets the density tails extend beyond the observed minimum and maximum instead of cutting them at the data range.
Includes Individual Data Points:
- Jittered points (geom_jitter()) reduce overlap, improving readability.
- Each black dot represents an individual observation.
Customizations for Better Visualization:
- Violin Width (width = 0.8) → Ensures proper spacing.
- Custom Fill Colors (scale_fill_manual()) → Different colors for treatment groups.
- Minimal Theme (theme_minimal()) → Reduces distractions.
- Legend Cleanup (guides()) → Removes unnecessary legend symbols.

Why These Customizations Matter:

Violin plots provide a more informative alternative to box plots by showing both summary statistics and full data distribution.
Jittered points prevent overlapping, ensuring all individual data points are visible.
Displaying quartiles helps in statistical interpretation.
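Leaving the tails untrimmed (trim = FALSE) shows the full estimated density, but the tails can extend slightly beyond the observed minimum and maximum, so read the tips as an estimate rather than as actual data points.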
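If you do not have adv_plot_5.csv at hand, a simulated stand-in lets you try the violin plot code. The sketch below is only an illustration: the group names, sample sizes, and means are assumptions, not values from the workshop dataset. It also computes the sample quartiles per group, which are what the quartile lines roughly correspond to (ggplot2 derives its lines from the density estimate, so they may differ slightly).

```r
# Hypothetical stand-in for adv_plot_5.csv (all values are simulated)
set.seed(42)
sim_data <- data.frame(
  Treatment = rep(c("Control", "Treated"), each = 50),
  GeneExpression = c(rnorm(50, mean = 10, sd = 2),
                     rnorm(50, mean = 13, sd = 3))
)

# Sample quartiles per treatment group (rows: 25%, 50%, 75%)
quartile_table <- sapply(
  split(sim_data$GeneExpression, sim_data$Treatment),
  quantile, probs = c(0.25, 0.5, 0.75)
)
print(round(quartile_table, 2))
```

To plot it, pass sim_data to ggplot() in place of biological_data; the column names match those used in the violin plot code.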

F. Heatmap Visualization

The heatmap visualization provides an intuitive way to analyze biological data patterns across samples. Though there are several ways to visualize heatmaps, here we discuss two versions and their key features:

Version 1: ggplot2 + viridis Heatmap

# Load necessary libraries

library(ggplot2)

library(viridis)

library(reshape2)

# Read biological dataset

df <- read.csv('adv_plot_6.csv')

# Reshape data for ggplot2

melted_data <- melt(df, id.vars = "Gene")

# Generate heatmap

heatmap_plot <- ggplot(melted_data, aes(x = variable, y = Gene)) +

geom_tile(aes(fill = value), color = "white") + # Heatmap grid

scale_fill_viridis_c() + # Use viridis color scale

theme_minimal() +

labs(title = "Biological Data Heatmap",

x = "Samples", y = "Genes")

heatmap_plot

Customization Options:

Modify Axis Labels & Titles
heatmap_plot + labs(title = "Customized Heatmap", x = "Samples", y = "Genes")
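Change Tile Size (this adds a second, smaller tile layer on top of the original tiles; to shrink the tiles themselves, set width and height in the original geom_tile() call)
heatmap_plot + geom_tile(width = 0.8, height = 0.8, aes(fill = value), color = "white")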
Add Value Annotations
heatmap_plot + geom_text(aes(label = round(value, 2)), vjust = 1)
Customize Legend
heatmap_plot + guides(fill = guide_colorbar(title = "Expression Level"))

This version creates a customizable heatmap using ggplot2, where:

Data is first reshaped (reshape2::melt) for plotting.
geom_tile() fills the grid based on expression values.
The viridis color scale is perceptually uniform and stays readable in grayscale and for most forms of color blindness.
Customizations include axis labels, tile size, annotations, and legends.
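To see what the reshaping step does, here is a minimal sketch on a tiny made-up table. The gene and sample names are hypothetical, and the long table is built with base R so the example runs without reshape2; it is meant to mirror the layout that melt(df, id.vars = "Gene") returns (columns Gene, variable, value), not to replace it.

```r
# Hypothetical wide table: one row per gene, one column per sample
wide <- data.frame(
  Gene    = c("GeneA", "GeneB"),
  Sample1 = c(1.2, 3.4),
  Sample2 = c(2.1, 0.5)
)

sample_cols <- names(wide)[-1]  # every column except the id column

# Long format: one row per gene-sample pair
long <- data.frame(
  Gene     = rep(wide$Gene, times = length(sample_cols)),
  variable = factor(rep(sample_cols, each = nrow(wide)), levels = sample_cols),
  value    = unlist(wide[sample_cols], use.names = FALSE)
)
print(long)
```

Each row of the long table becomes one tile in geom_tile(): variable gives the x position, Gene the y position, and value the fill.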

Version 2: pheatmap for Clustered Heatmap

# Load library

library(pheatmap)

# Example dataset: the built-in mtcars data stands in for a gene expression matrix (rows = cars, columns = measurements)

data <- mtcars

heatmap_data <- data[c(1:7,9,11)]

annotation_data <- data[c(8,10)] # Metadata for annotation

# Define annotation colors

annotate <- list(

vs = c("0" = "blue", "1" = "red"),

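gear = gray.colors(100, start = 1, end = 0) # White-to-black gradient; wrapping this in palette() would change the global palette and return the previous one instead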

)

# Generate clustered heatmap

pheatmap(

heatmap_data,

annotation_row = annotation_data,

annotation_colors = annotate,

color = colorRampPalette(c("white", "blue", "red"))(100),

cellwidth = 40,

cellheight = 12,

fontsize_row = 5,

cluster_rows = TRUE,

cluster_cols = TRUE

)

This version uses pheatmap, which:

Automatically clusters genes/samples based on expression patterns.
Allows row and column annotations for extra metadata.
Supports custom color palettes (colorRampPalette(c("white","blue","red"))(100)).

Key Features of pheatmap:

Hierarchical Clustering (cluster_rows = TRUE, cluster_cols = TRUE)
Custom Color Palettes (colorRampPalette)
Row Annotations (annotation_row) for metadata
Adjustable Cell & Font Sizes (cellwidth, cellheight, fontsize_row)
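The clustering step can be reproduced by hand with base R, which helps when you want to check the row and column order. The sketch below assumes pheatmap's documented defaults (Euclidean distance, complete linkage); the ordering it prints should match the heatmap, but treat it as an illustration rather than a description of pheatmap's internals.

```r
# Same columns as the pheatmap example above
heatmap_data <- mtcars[c(1:7, 9, 11)]

# Hierarchical clustering of rows and columns
row_tree <- hclust(dist(heatmap_data, method = "euclidean"), method = "complete")
col_tree <- hclust(dist(t(heatmap_data), method = "euclidean"), method = "complete")

# Leaf order of each dendrogram
row_order <- rownames(heatmap_data)[row_tree$order]
col_order <- colnames(heatmap_data)[col_tree$order]
print(head(row_order))
print(col_order)

# The columns are on very different scales, so large-valued columns
# such as disp and hp dominate the Euclidean distances
print(round(apply(heatmap_data, 2, sd), 1))

# Optional: z-score each column before clustering
scaled_data <- scale(heatmap_data)
```

Because the mtcars columns differ so much in scale, it may be worth passing scale = "column" to pheatmap() (or clustering scaled_data) so that every variable contributes comparably.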

Which Version Should You Use?

Use ggplot2 version if you need highly customizable static heatmaps with precise control over aesthetics.
Use pheatmap version if you need automatic clustering and metadata annotations for gene/sample relationships.

G. Volcano Plot

This volcano plot is useful for visualizing differential expression in biological datasets. It helps identify significantly upregulated and downregulated genes based on fold change and statistical significance.

library(ggplot2)

# Read biological dataset

biological_data <- read.csv('adv_plot_7.csv') # GSE182964

# Generate volcano plot

# The code assumes that the 'log2FoldChange', 'pvalue', and 'padj' columns are already available from a differential expression analysis (the column names follow DESeq2-style output)

# Here the cutoff values are: |log₂ fold change| > 1 (a minimum 2-fold change in gene expression), p-value < 0.05, and adjusted p-value < 0.05

ggplot(biological_data, aes(x = log2FoldChange, y = -log10(pvalue),

color = abs(log2FoldChange) > 1 & pvalue < 0.05 & padj < 0.05)) +

geom_point(alpha = 0.7, size = 3) +

geom_text(aes(label = ifelse(abs(log2FoldChange) > 1 & pvalue < 0.05 & padj < 0.05, Gene, "")),

hjust = 0.5, vjust = -0.5) +

theme_minimal() + # Minimal theme for clarity

theme(legend.position = "none") + # Remove legend

geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "blue") + # P-value cutoff line

geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "darkgreen") # Fold change cutoff lines (match the |log2FoldChange| > 1 threshold)
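The plot above colors significant genes with a single flag. To separate upregulated from downregulated genes, the same cutoffs can be combined with the sign of the fold change. The sketch below uses simulated values as a hypothetical stand-in for adv_plot_7.csv (only the column names are taken from the code above), and the Direction column is an illustrative name.

```r
# Simulated differential expression table (hypothetical values)
set.seed(1)
n <- 200
sim_de <- data.frame(
  Gene = paste0("Gene", seq_len(n)),
  # 10 clearly up, 10 clearly down, 180 near zero
  log2FoldChange = c(rnorm(10, 3, 0.3), rnorm(10, -3, 0.3), rnorm(180, 0, 0.5)),
  # small p-values for the first 20 genes, uniform for the rest
  pvalue = c(runif(20, 0, 1e-4), runif(180))
)
sim_de$padj <- p.adjust(sim_de$pvalue, method = "BH")

# Same cutoffs as in the volcano plot
is_sig <- abs(sim_de$log2FoldChange) > 1 &
  sim_de$pvalue < 0.05 & sim_de$padj < 0.05

# Split significant genes by the sign of the fold change
sim_de$Direction <- ifelse(!is_sig, "NS",
                           ifelse(sim_de$log2FoldChange > 0, "Up", "Down"))
print(table(sim_de$Direction))
```

Mapping color = Direction in aes() would then give separate colors for upregulated and downregulated genes.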

If you want to be a bit more creative:

library(ggplot2)

library(dplyr)

# Read the data

biological_data <- read.csv('adv_plot_7.csv') # GSE182964

# Annotate significance category

biological_data <- biological_data %>%

mutate(Significance = case_when(

padj < 0.05 & pvalue < 0.05 & abs(log2FoldChange) > 1 ~ "Padj + p-value + log2 FC",

pvalue < 0.05 & abs(log2FoldChange) > 1 ~ "p-value + log2 FC",

abs(log2FoldChange) > 1 ~ "log2 FC",

TRUE ~ "NS"

))

# Define colors

colors <- c(

"NS" = "lightgray",

"log2 FC" = "#1b9e77",

"p-value + log2 FC" = "#d95f02",

"Padj + p-value + log2 FC" = "#7570b3"

)

# Fix the legend order, then generate the volcano plot

biological_data$Significance <- factor(

biological_data$Significance,

levels = c(

"NS",

"log2 FC",

"p-value + log2 FC",

"Padj + p-value + log2 FC"

)

)

ggplot(biological_data, aes(x = log2FoldChange, y = -log10(pvalue), color = Significance)) +

geom_point(alpha = 0.7, size = 3) +

geom_text(

data = subset(biological_data, Significance == "Padj + p-value + log2 FC"),

mapping = aes(x = log2FoldChange, y = -log10(pvalue), label = Gene),

size = 2.5,

vjust = -0.5,

check_overlap = TRUE,

inherit.aes = FALSE

) +

scale_color_manual(values = colors) +

theme_minimal(base_size = 12) +

theme(

legend.title = element_blank(),

legend.position = "top",

plot.title = element_text(hjust = 0.5)

) +

labs(

title = "Gm12840_KO_vs_WT",

x = expression(Log[2] ~ "fold change"),

y = expression(-Log[10] ~ "P")

) +

geom_hline(yintercept = -log10(0.05), linetype = "dashed", color = "darkgray") +

geom_vline(xintercept = c(-1, 1), linetype = "dashed", color = "darkgray") +

annotate(

"text",

x = max(biological_data$log2FoldChange, na.rm = TRUE),

y = min(-log10(biological_data$pvalue), na.rm = TRUE),

label = paste("total =", nrow(biological_data), "variables"),

hjust = 1, vjust = -1,

size = 3

)
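A note on the significance categories: case_when() checks its conditions in order and assigns the first one that matches, which is why the strictest category is listed first. The sketch below illustrates this on four made-up genes. Nested ifelse() calls are used as a base R stand-in for case_when() so the example runs without dplyr; unlike case_when(), ifelse() propagates NA values, so this is not a drop-in replacement.

```r
# Four hypothetical genes covering each category
toy <- data.frame(
  Gene           = c("g1", "g2", "g3", "g4"),
  log2FoldChange = c(2.0, -1.5, 1.2, 0.3),
  pvalue         = c(0.001, 0.01, 0.20, 0.001),
  padj           = c(0.01, 0.20, 0.50, 0.01)
)

big_fc <- abs(toy$log2FoldChange) > 1

# Conditions are tested from strictest to loosest; first match wins
toy$Significance <- ifelse(
  big_fc & toy$pvalue < 0.05 & toy$padj < 0.05, "Padj + p-value + log2 FC",
  ifelse(big_fc & toy$pvalue < 0.05, "p-value + log2 FC",
         ifelse(big_fc, "log2 FC", "NS"))
)
print(toy[, c("Gene", "Significance")])
```

Note that g4 ends up as "NS" despite its small p-values, because every category in this scheme requires |log2FoldChange| > 1.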
