The First Course on R – Foundations and Data Structures
— The post is part of my presentation for the Computational Biology Workshop for Clinicians (March 6, 2025) at AIIMS Kalyani.
Let's begin by using R as a basic calculator to sum the following numbers:
10 + 2 + 4 + 6 + 9 + 4 + 3 + 2
However, if you need to add another number (e.g., 10) and your list is extensive, manually rewriting the entire expression can be inefficient—especially if you do not remember the previous result.
10 + 2 + 4 + 6 + 9 + 4 + 3 + 2 + 10
This approach is not practical for larger datasets. A more efficient solution is to store the sum in a variable (think of it like a bag that stores the data) and perform operations on it:
x <- 10 + 2 + 4 + 6 + 9 + 4 + 3 + 2
x # Displays the stored sum
y <- x + 10 # Adds 10 to the previously computed sum
y # Displays the updated sum
By using variables, you can avoid redundant calculations and make your code more manageable.
Continuing with our analogy of variables as bags that store data, the object is the actual stuff inside the bag—it could be numbers, words, lists, or even entire tables of data. A variable is just a label for the bag, helping you find it later, but the real data lives inside the object. In R, data can be stored in different types of objects, like vectors, lists, matrices, and data frames, each designed to hold specific types of information efficiently. Just like different bags are used for different purposes (backpack for clothes, briefcase for documents), R provides different objects to organize and manage data effectively.
Vectors are the simplest R objects. Every element of a vector has the same data type, but R offers a separate kind of vector for each basic data type. Let's examine these data types and their associated vectors.
2.1.1. Numeric
Numeric values include both integers and decimals (floating-point numbers).
a <- 10 + 2 + 4
class(a) # Returns "numeric"
2.1.2. Integer
To explicitly define an integer, use the L suffix. This ensures the value is treated as an integer rather than a numeric (floating-point) value.
b <- 10L + 2L + 4L # 'L' specifies whole numbers without decimals
class(b) # Returns "integer"
2.1.3. Logical
Logical data types represent boolean values: TRUE or FALSE.
c <- TRUE
class(c) # Returns "logical"
2.1.4. Complex
Complex numbers in R include a real and an imaginary part, denoted by i.
d <- 1i + 2i + 3i # Complex numbers
class(d) # Returns "complex"
2.1.5. Character
Character data consists of text or string values, enclosed in single (') or double (") quotes.
e1 <- '10+2+4'
e2 <- "10L+2L+4L"
e3 <- 'TRUE'
e4 <- "1i+2i+3i"
class(e1) # Returns "character"
class(e2) # Returns "character"
class(e3) # Returns "character"
class(e4) # Returns "character"
2.1.6. Raw
The raw data type stores raw bytes. It is rarely used in standard data analysis but can be useful for handling binary data.
f <- raw(10) # Creates a raw vector of length 10
class(f) # Returns "raw"
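The examples above each store a single value. As a minimal sketch of multi-element vectors (the names ages, smoker, and mixed are illustrative and not from the workshop material), c() combines values into a vector, and mixing types forces R to coerce everything to one common type:

```r
# Combine values into vectors with c(); each vector holds one data type
ages <- c(34, 51, 29)            # numeric vector
smoker <- c(TRUE, FALSE, TRUE)   # logical vector

# Mixing types coerces every element to the most general type present
mixed <- c(1, "two", TRUE)

class(ages)    # "numeric"
class(smoker)  # "logical"
class(mixed)   # "character"
length(ages)   # 3
```

Because of this coercion, a single stray text value in a column of numbers turns the whole vector into character data.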
Lists in R are versatile data structures that can store elements of different types, including numbers, characters, vectors, and even other lists.
2.2.1. Creating a List
A list can contain elements of various data types, including named components.
my_list <- list(name = "John", age = 25, grades = c(90, 85, 78))
2.2.2. Accessing List Elements
Elements in a list can be accessed using either double square brackets ([[ ]]) or the $ operator.
print(my_list[["name"]]) # Accessing by name using double brackets
print(my_list$age) # Accessing by name using the $ operator
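Single square brackets also work on lists, but they behave differently from double brackets. The sketch below reuses the my_list definition from above; sub_list and grades are illustrative names:

```r
my_list <- list(name = "John", age = 25, grades = c(90, 85, 78))

sub_list <- my_list["grades"]    # single brackets: a list of length 1
grades <- my_list[["grades"]]    # double brackets: the numeric vector itself

class(sub_list)   # "list"
class(grades)     # "numeric"
my_list[[3]][2]   # second grade, accessed by position: 85
```

Use double brackets (or $) when you want the stored object, and single brackets when you want a smaller list.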
2.2.3. Creating a Heterogeneous List
Lists can contain different data types within the same structure.
mixed_list <- list("John", 25.5, c(90, 85, 78))
# The function c() combines multiple numeric values into a vector
2.2.4. Creating a Nested List
Lists can also be nested, meaning a list can contain another list as an element.
nested_list <- list(inner_list = list(a = 1, b = 2), c = 3)
2.2.5. Adding Elements to a List
You can append new elements to an existing list using the c() function.
my_list2 <- c(my_list, city = "New York")
2.2.6. Converting a List to a Data Frame
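A list can be converted into a data frame when its elements have the same length, or lengths that can be recycled to match the longest element. In my_list, the single name and age values are repeated for each of the three grades, producing a three-row data frame.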
my_data_frame <- data.frame(my_list)
Matrices in R are two-dimensional data structures where elements are arranged in rows and columns. They are primarily used for numerical computations and support various mathematical operations.
2.3.1. Creating a Matrix
A matrix can be created using the matrix() function by specifying the data, number of rows (nrow), and number of columns (ncol).
my_matrix <- matrix(c(1, 2, 3, 4, 5, 6, 7, 8, 9), nrow = 3, ncol = 3)
2.3.2. Accessing Matrix Elements
Matrix elements can be accessed using row and column indices.
print(my_matrix[2, 3]) # Accesses the element in the second row, third column
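R fills a matrix column by column unless told otherwise, which explains why my_matrix[2, 3] above returns 8. A small sketch of the difference, using the illustrative names by_col and by_row:

```r
by_col <- matrix(1:9, nrow = 3, ncol = 3)                # default: filled column by column
by_row <- matrix(1:9, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row

by_col[2, 3]  # 8
by_row[2, 3]  # 6
```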
2.3.3. Assigning Row and Column Names
To enhance readability, row and column names can be assigned to a matrix.
colnames(my_matrix) <- c("A", "B", "C")
rownames(my_matrix) <- c("X", "Y", "Z")
2.3.4. Performing Matrix Operations
Matrices support element-wise arithmetic with the +, -, *, and / operators. Matrix multiplication in the linear-algebra sense uses the %*% operator instead.
matrix_a <- matrix(c(1, 2, 3, 4), nrow = 2)
matrix_b <- matrix(c(5, 6, 7, 8), nrow = 2)
result_matrix <- matrix_a + matrix_b # Element-wise addition
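As a brief sketch of how element-wise and matrix multiplication differ, using the same two matrices (elementwise and matrix_product are illustrative names):

```r
matrix_a <- matrix(c(1, 2, 3, 4), nrow = 2)
matrix_b <- matrix(c(5, 6, 7, 8), nrow = 2)

elementwise <- matrix_a * matrix_b       # multiplies matching positions
matrix_product <- matrix_a %*% matrix_b  # rows of a times columns of b

elementwise[1, 2]     # 3 * 7 = 21
matrix_product[1, 2]  # 1 * 7 + 3 * 8 = 31
```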
2.3.5. Matrix Functions
Several built-in functions allow manipulation and analysis of matrices.
print(dim(my_matrix)) # Returns the dimensions (rows and columns)
print(t(my_matrix)) # Computes the transpose of the matrix
An array in R is a multi-dimensional data structure that can store elements of the same data type, typically arranged in two or more dimensions. Arrays are useful for handling complex datasets and performing multi-dimensional computations.
2.4.1. Creating an Array
An array is created using the array() function, where the data parameter specifies the elements, and the dim parameter defines the dimensions.
my_array <- array(data = c(1, 2, 3, 4, 5, 6, 7, 8, 9), dim = c(3, 3, 1))
This creates a 3×3 array with a single layer (depth of 1).
2.4.2. Accessing Array Elements
Elements within an array can be accessed using indices specifying the row, column, and depth.
print(my_array[2, 3, 1]) # Retrieves the element at row 2, column 3, depth 1
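The third index becomes more useful once an array has more than one layer. A minimal sketch with a hypothetical two-layer array named layered:

```r
layered <- array(1:12, dim = c(2, 3, 2))  # 2 rows, 3 columns, 2 layers

layered[1, 2, 2]        # row 1, column 2, layer 2: 9
first_layer <- layered[, , 1]  # leaving an index blank keeps that whole dimension
dim(first_layer)        # 2 3
```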
2.4.3. Performing Array Operations
Arrays support arithmetic operations, such as element-wise addition, subtraction, multiplication, and division.
array_a <- array(c(1, 2, 3, 4), dim = c(2, 2, 1))
array_b <- array(c(5, 6, 7, 8), dim = c(2, 2, 1))
result_array <- array_a + array_b # Element-wise addition
2.4.4. Array Functions
Several functions help analyze and manipulate arrays.
print(dim(my_array)) # Returns the dimensions (rows, columns, depth)
print(length(my_array)) # Returns the total number of elements
Factors in R are used to represent categorical data efficiently. They store both the values and the corresponding levels, making them ideal for handling qualitative data such as gender, education level, or survey responses.
2.5.1. Creating a Factor with Specified Levels
A factor can be created using the factor() function, where the levels argument explicitly defines the categories.
gender <- factor(c("Male", "Female", "Male", "Female"), levels = c("Male", "Female"))
2.5.2. Checking and Modifying Factor Levels
The levels() function allows us to view the categories in a factor.
print(levels(gender)) # Displays the defined levels
2.5.3. Ordering Factor Levels
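Factors can also be ordered, which is useful when representing ranked data. Listing the levels from lowest to highest sets their order, and ordered = TRUE tells R to treat the factor as ordinal so that comparisons such as < and > work.
grade <- factor(c("A", "B", "C"), levels = c("C", "B", "A"), ordered = TRUE) # Levels listed from lowest to highest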
print(levels(grade))
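Setting the level order alone does not make a factor ordinal; the ordered = TRUE argument does. The sketch below uses a hypothetical clinical severity scale (not from the workshop material) to show what an ordered factor allows:

```r
severity <- factor(c("mild", "severe", "moderate", "mild"),
                   levels = c("mild", "moderate", "severe"),
                   ordered = TRUE)

is.ordered(severity)                     # TRUE
severity[2] > severity[1]                # TRUE: severe ranks above mild
at_least_moderate <- severity >= "moderate"
at_least_moderate                        # FALSE TRUE TRUE FALSE
worst <- max(severity)                   # the highest level present: severe
```

Comparisons like these produce a warning and NA values on an unordered factor.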
2.5.4. Summarizing Factor Data
Factors can be summarized using the summary() function, which provides a count of occurrences for each level.
summary(gender)
Using factors ensures efficient storage and proper handling of categorical variables in R, making them crucial for statistical analysis and data visualization.
A data frame in R is a structured, table-like object used for storing and managing datasets. Unlike a matrix, which holds a single data type (usually numeric), a data frame can hold a different type in each column, including numerical, character, and categorical variables.
2.6.1. Creating a Data Frame
The data.frame() function is used to construct a data frame with multiple columns of different data types.
student_data <- data.frame(
Name = c("Alice", "Bob", "Charlie"),
Age = c(22, 25, 21),
Grade = c("A", "B", "C")
)
2.6.2. Accessing Data Frame Elements
Specific elements can be accessed using either column names or row-column indexing.
print(student_data$Name) # Accessing the 'Name' column
print(student_data[2, 3]) # Accessing the element in the second row, third column
2.6.3. Checking and Modifying Column Names and Data Types
The names() and str() functions provide insights into the structure of a data frame.
print(names(student_data)) # Displays the column names
str(student_data) # Shows the internal structure; str() prints by itself, so print() is not needed
2.6.4. Summarizing Data Frame Contents
The summary() function reports descriptive statistics for numeric columns and level counts for factor columns. Character columns show only their length, class, and mode.
summary(student_data)
2.6.5. Adding and Removing Columns
New columns can be added dynamically, and existing columns can be removed using indexing.
# Adding a new column
student_data$City <- c("New York", "San Francisco", "Chicago")
# Removing a column
student_data <- student_data[, -4] # Removes the fourth column (City)
Rows can be added to a data frame using the rbind() function, and unwanted rows can be removed by subsetting the data frame.
Adding a Row
To append a new row, ensure that the new row has the same column structure as the existing data frame.
new_student <- data.frame(Name = "David", Age = 23, Grade = "B")
student_data <- rbind(student_data, new_student)
Removing a Row
Rows can be removed using negative indexing.
student_data <- student_data[-2, ] # Removes the second row
2.6.7. Subsetting a Data Frame
A subset of the data can be extracted using logical conditions.
young_students <- student_data[student_data$Age < 25, ]
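Conditions can be combined, and the column index can restrict which columns are returned. A self-contained sketch follows; the students data frame repeats the original three rows, and the result names are illustrative:

```r
students <- data.frame(
  Name = c("Alice", "Bob", "Charlie"),
  Age = c(22, 25, 21),
  Grade = c("A", "B", "C")
)

# Rows: younger than 25 AND graded A or B; columns: Name and Age only
young_a_or_b <- students[students$Age < 25 & students$Grade %in% c("A", "B"),
                         c("Name", "Age")]

# subset() expresses the same filter without repeating the data frame name
same_rows <- subset(students, Age < 25 & Grade %in% c("A", "B"),
                    select = c(Name, Age))

young_a_or_b$Name  # "Alice"
```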
Modifying values in a data frame can be done by directly assigning new values to specific elements, rows, or columns.
Updating a Specific Value
A particular cell in the data frame can be updated using row and column indexing.
student_data[1, 2] <- 23 # Updates Alice's Age to 23
Updating an Entire Column
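A whole column can be modified by assigning a vector with exactly one value per row. After the additions and removals above, three rows remain (Alice, Charlie, and David).
student_data$Grade <- c("A+", "C+", "B+") # Updates all grades, one value per remaining row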
Updating Multiple Rows Based on a Condition
Rows that satisfy a condition can be updated efficiently.
student_data$Age[student_data$Name == "Charlie"] <- 22 # Updates Charlie's Age to 22
Row and column names in a data frame provide meaningful labels, improving data interpretation and ease of access.
Checking Row and Column Names
You can retrieve the names of rows and columns using the rownames() and colnames() functions.
print(rownames(student_data)) # Displays row names
print(colnames(student_data)) # Displays column names
Setting Column Names
Column names can be modified using the colnames() function.
colnames(student_data) <- c("Student_Name", "Student_Age", "Student_Grade")
Changing a Specific Column Name
To rename a single column, modify the corresponding index within colnames().
colnames(student_data)[2] <- "Age_in_Years" # Renames the second column
Setting Row Names
Row names can be assigned using the rownames() function.
rownames(student_data) <- c("S1", "S2", "S3")
Changing a Specific Row Name
To rename a specific row, update the corresponding index in rownames().
rownames(student_data)[1] <- "Student_A" # Renames the first row
Removing Row and Column Names
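To discard custom row names, assign NULL; R then falls back to the default row numbers (1, 2, 3, ...). Column names can be set to NULL in the same way, although columns of a data frame without names can no longer be accessed with the $ operator.
rownames(student_data) <- NULL # Resets row names to the default numbering
colnames(student_data) <- NULL # Removes column names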
Operators in R are used to perform calculations, comparisons, and logical evaluations. The main types of operators include arithmetic, relational, and logical operators.
Arithmetic operators perform basic mathematical computations.
# Addition ('+')
result <- 5 + 3
print(result) # Output: 8
# Subtraction ('-')
result <- 5 - 3
print(result) # Output: 2
# Multiplication ('*')
result <- 4 * 6
print(result) # Output: 24
# Division ('/')
result <- 10 / 2
print(result) # Output: 5
# Exponentiation ('^' or '**')
result <- 4 ^ 3
print(result) # Output: 64
Relational operators compare values and return a logical (TRUE or FALSE) output.
# Equal to ('==')
result <- 5 == 5
print(result) # Output: TRUE
# Not equal to ('!=')
result <- 3 != 7
print(result) # Output: TRUE
# Greater than ('>')
result <- 10 > 5
print(result) # Output: TRUE
# Less than ('<')
result <- 3 < 8
print(result) # Output: TRUE
# Greater than or equal to ('>=')
result <- 6 >= 6
print(result) # Output: TRUE
# Less than or equal to ('<=')
result <- 6 <= 6
print(result) # Output: TRUE
Logical operators are used to evaluate conditions and return Boolean (TRUE or FALSE) results.
# AND ('&' for element-wise, '&&' for single evaluation)
result <- FALSE & TRUE
print(result) # Output: FALSE
# OR ('|' for element-wise, '||' for single evaluation)
result <- TRUE | FALSE
print(result) # Output: TRUE
# NOT ('!')
result <- !TRUE
print(result) # Output: FALSE
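The comments above mention two forms of AND and OR. As a short sketch of the difference (the vectors a and b and the variable x are illustrative), the single forms compare vectors position by position, while the double forms evaluate one condition and stop as soon as the answer is known:

```r
a <- c(TRUE, TRUE, FALSE)
b <- c(TRUE, FALSE, FALSE)

both <- a & b     # TRUE FALSE FALSE
either <- a | b   # TRUE TRUE FALSE

x <- 7
in_range <- x > 5 && x < 10   # TRUE: a single condition, as used inside if ()

# Short-circuit: the right-hand side is never evaluated when the left is FALSE
skipped <- FALSE && stop("never reached")   # FALSE
```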
Decision-making structures allow R programs to execute specific code blocks based on certain conditions. The if, if-else, and if-else if statements are commonly used for conditional execution.
4.1.1. Simple if Statement
The if statement executes a block of code only if a specified condition evaluates to TRUE.
# Example: Checking if x is greater than 5
x <- 10
if (x > 5) {
print("x is greater than 5")
}
4.1.2. if-else Statement
The if-else statement provides an alternative block of code that runs when the condition is FALSE.
# Example: Checking if y is greater than 5
y <- 3
if (y > 5) {
print("y is greater than 5")
} else {
print("y is not greater than 5")
}
4.1.3. if-else if Statement
The if-else if structure allows checking multiple conditions sequentially. The first condition that evaluates to TRUE executes, and the remaining conditions are ignored.
# Example: Checking if z is positive, negative, or zero
z <- 0
if (z > 0) {
print("z is positive")
} else if (z < 0) {
print("z is negative")
} else {
print("z is zero")
}
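An if statement tests a single condition. To apply the same decision to every element of a vector, base R provides the vectorized ifelse() function. A minimal sketch with illustrative names:

```r
values <- c(-2, 0, 5)

# ifelse(test, value_if_true, value_if_false), evaluated element by element
signs <- ifelse(values > 0, "positive",
                ifelse(values < 0, "negative", "zero"))

signs  # "negative" "zero" "positive"
```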
The switch statement in R provides an efficient way to handle multiple conditions by evaluating an expression and executing the corresponding code block based on matching values. It is particularly useful when multiple conditions need to be checked against a single variable.
4.2.1. Using switch to Handle Multiple Conditions
# Example: Determining the type of day based on input
day <- "Monday"
switch(day,
"Monday" = {
print("It's the start of the week.")
},
"Wednesday" = {
print("It's the middle of the week.")
},
"Friday" = {
print("It's the end of the week.")
},
"Saturday" = {
print("It's the weekend.")
},
"Sunday" = {
print("It's the weekend.")
},
print("Invalid day.") # Default case if no match is found
)
The switch function evaluates the value of day.
If day matches one of the specified cases (e.g., "Monday", "Wednesday"), the corresponding block of code is executed.
If no match is found, the default case executes (print("Invalid day.")).
The switch statement simplifies decision-making by reducing the need for multiple if-else conditions, making the code more readable and efficient.
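The Saturday and Sunday branches above repeat the same code. One way to avoid this, sketched here with a hypothetical helper named day_type(), is to leave a case empty so that it falls through to the next one. The sketch also uses the fact that switch() returns a value:

```r
day_type <- function(day) {
  switch(day,
    "Saturday" = ,                  # empty case: falls through to the next value
    "Sunday" = "weekend",
    "Friday" = "end of the week",
    "other day"                     # unnamed final argument: the default
  )
}

day_type("Saturday")  # "weekend"
day_type("Tuesday")   # "other day"
```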
The for loop in R is used to iterate over sequences, vectors, and other data structures, allowing efficient execution of repetitive tasks.
5.1.1. Iterating Over a Sequence of Numbers
# Looping through numbers 1 to 5
for (i in 1:5) {
print(i)
}
The loop iterates from 1 to 5, printing each value.
5.1.2. Iterating Over Elements of a Vector
# Looping through a character vector
fruits <- c("apple", "banana", "orange", "grape")
for (x in fruits) {
print(x)
}
The loop iterates over each element in the fruits vector and prints it.
5.1.3. Performing Calculations Within a Loop
# Summing elements of a vector
numbers <- c(2, 4, 6, 8, 10)
result <- 0 # Start from zero so that result ends as the sum of the vector
for (num in numbers) {
result <- result + num
}
print(result)
The loop iterates through the numbers vector, adding each value to result.
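For a plain sum, R's vectorized built-in functions do the same work without an explicit loop. A brief sketch using the same numbers (total and running_total are illustrative names):

```r
numbers <- c(2, 4, 6, 8, 10)

total <- sum(numbers)             # 30
running_total <- cumsum(numbers)  # 2 6 12 20 30
```

Loops remain useful when each step depends on more than simple accumulation.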
5.1.4. Nested for Loops
# Example of nested loops
for (i in 1:3) {
for (j in 1:2) {
print(paste("i =", i, ", j =", j))
}
}
The inner loop runs completely for each iteration of the outer loop.
The paste() function combines i and j values into a formatted string.
The while loop in R is used for executing a block of code repeatedly as long as a specified condition remains TRUE. It is particularly useful when the number of iterations is not known in advance.
5.2.1. Simple Counting Using a while Loop
# Counting from 1 to 5 using a while loop
count <- 1
while (count <= 5) {
print(count)
count <- count + 1
}
The loop continues executing as long as count is less than or equal to 5.
The count variable is incremented in each iteration to prevent an infinite loop.
5.2.2. Summing Numbers Using a while Loop
# Summing elements of a vector using a while loop
numbers <- c(2, 4, 6, 8, 10)
sum_result <- 0
index <- 1
while (index <= length(numbers)) {
sum_result <- sum_result + numbers[index]
index <- index + 1
}
print(sum_result)
The loop iterates through the numbers vector, adding each element to sum_result.
The index variable ensures all elements are processed sequentially.
5.2.3. User Input Validation Using a while Loop
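# Validating user input to ensure it falls within a specific range
user_input <- -1
while (is.na(user_input) || user_input < 0 || user_input > 10) {
cat("Enter a number between 0 and 10: ")
user_input <- suppressWarnings(as.numeric(readline())) # Non-numeric input becomes NA
}
print(paste("You entered:", user_input))
The loop repeatedly prompts the user until they enter a valid number between 0 and 10.
The readline() function reads the input as text and only waits for input in an interactive session. as.numeric() converts the text to a number, and the is.na() check keeps the loop running when the input is not a number instead of stopping with an error.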
5.2.4. Handling Infinite Loops with a Break Statement
# Infinite loop with a break condition
count <- 1
while (TRUE) {
print(count)
count <- count + 1
if (count > 5) {
break # Exits the loop when count exceeds 5
}
}
The while (TRUE) construct creates an infinite loop.
The break statement ensures the loop exits once count exceeds 5.
Functions in R allow for code reusability and modular programming. They enable users to encapsulate logic, making code more readable and efficient. Functions can accept arguments, return values, and have default parameters.
6.1. Creating a Simple Function
# Function to compute the square of a number
square <- function(x) {
return(x^2)
}
# Calling the function
result <- square(6)
print(result)
Explanation:
The function square() takes a single argument x and returns its square.
It is then called with the value 6, and the result is printed.
6.2. Function with Multiple Arguments
# Function to calculate the sum of squares of two numbers
sum_of_squares <- function(a, b) {
return(a^2 + b^2)
}
# Calling the function
result <- sum_of_squares(3, 4)
print(result)
Explanation:
The function sum_of_squares() takes two arguments, a and b, and returns the sum of their squares.
6.3. Function with Default Arguments
# Function to compute the power of a number with a default exponent
power <- function(x, exponent = 2) {
return(x^exponent)
}
# Calling the function with and without specifying the exponent
result1 <- power(3) # Defaults to exponent 2
result2 <- power(3, 3) # Exponent explicitly set to 3
print(result1)
print(result2)
Explanation:
The function power() computes x raised to a given exponent.
If no exponent is provided, it defaults to 2.
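Arguments can also be passed by name, in any order, and because ^ is vectorized the same function accepts a whole vector. A short sketch that repeats the power() definition so it runs on its own:

```r
power <- function(x, exponent = 2) {
  return(x^exponent)
}

cube_of_two <- power(exponent = 3, x = 2)  # named arguments, any order: 8
squares <- power(c(1, 2, 3))               # applied to each element: 1 4 9
```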
6.4. Returning Multiple Values
# Function to compute summary statistics of a dataset
calculate_stats <- function(data) {
mean_val <- mean(data)
median_val <- median(data)
sd_val <- sd(data)
return(list(mean = mean_val, median = median_val, sd = sd_val))
}
# Calling the function
data <- c(1, 2, 3, 4, 5)
result <- calculate_stats(data)
# Accessing elements from the returned list
print(result$mean)
print(result$median)
print(result$sd)
Explanation:
The function calculate_stats() takes a numeric vector as input.
It calculates and returns a list containing the mean, median, and standard deviation of the data.
R provides extensive functionality for string manipulation, including creation, concatenation, indexing, length measurement, case conversion, pattern matching, and replacement.
7.1. Creating and Concatenating Strings
# Defining strings using single and double quotes
my_string1 <- 'Hello, World!'
my_string2 <- "R programming"
# Concatenating strings using paste()
combined_string <- paste(my_string1, my_string2)
print(combined_string)
Explanation:
Strings can be defined using either single or double quotes.
The paste() function combines multiple strings into a single string.
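By default paste() separates its arguments with a space. The sketch below shows the sep and collapse arguments and the paste0() shortcut; the strings are illustrative:

```r
spaced <- paste("Hello", "World")                  # "Hello World"
dashed <- paste("Hello", "World", sep = "-")       # "Hello-World"
labels <- paste0("Sample", 1:3)                    # "Sample1" "Sample2" "Sample3"
one_string <- paste(c("A", "B", "C"), collapse = ", ")  # "A, B, C"
```

sep goes between the arguments, while collapse joins the elements of a vector into a single string.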
7.2. Accessing Characters in a String
# Accessing individual characters
first_char <- substr(my_string1, 1, 1)
second_char <- substr(my_string1, 2, 2)
print(first_char)
print(second_char)
Explanation:
The substr() function allows accessing specific characters by specifying the start and stop positions.
7.3. Measuring String Length
# Getting the length of a string
length_of_string <- nchar(my_string1)
print(length_of_string)
Explanation:
The nchar() function returns the total number of characters in a string, including spaces and punctuation.
7.4. Extracting Substrings
# Extracting a portion of a string
substring_example <- substr(my_string1, start = 1, stop = 5)
print(substring_example)
Explanation:
The substr() function extracts a substring from a given start to stop position.
7.5. Changing Case
# Converting to uppercase and lowercase
uppercase_string <- toupper(my_string1)
lowercase_string <- tolower(my_string1)
print(uppercase_string)
print(lowercase_string)
Explanation:
The toupper() function converts a string to uppercase.
The tolower() function converts a string to lowercase.
7.6. Searching and Matching Strings
# Searching for a pattern in a string using grep()
matching_values <- grep("World", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
Explanation:
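The grep() function searches each element of a character vector for a pattern. With value = TRUE it returns the matching elements themselves; without it, it returns their positions.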
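As a quick sketch of the three common ways to ask "which elements match?" (the fruits vector is illustrative):

```r
fruits <- c("apple", "banana", "grape")

positions <- grep("ap", fruits)               # 1 3
matches <- grep("ap", fruits, value = TRUE)   # "apple" "grape"
flags <- grepl("ap", fruits)                  # TRUE FALSE TRUE
```

The logical vector from grepl() is convenient for subsetting a data frame by a text pattern.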
7.7. Replacing Substrings
# Replacing a substring using gsub()
modified_string <- gsub("Hello", "Hi", my_string1)
print(modified_string)
Explanation:
The gsub() function replaces all occurrences of a pattern with a specified replacement.
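The companion function sub() replaces only the first match, which the following short sketch contrasts with gsub():

```r
first_only <- sub("a", "A", "banana")    # "bAnana"
every_one <- gsub("a", "A", "banana")    # "bAnAnA"
```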
7.8. Comparing Strings
# Comparing two strings
comparison_result <- my_string1 == my_string2
print(comparison_result)
Explanation:
The == operator checks whether two strings are identical and returns TRUE or FALSE.
7.9. Combining Strings with Numbers
# Combining a string with a numeric value
age <- 25
info_string <- paste("My age is", age, "years.")
print(info_string)
Explanation:
The paste() function seamlessly combines strings and numeric values.
Regular expressions (RegEx) are powerful tools for pattern matching and text manipulation in R. The grep() function is commonly used to search for patterns in character vectors. Below are key RegEx concepts with examples.
The simplest pattern is a literal substring. Metacharacters such as [ ], { }, ^, $, and | extend it to more general searches.
my_string1 <- c("apple", "banana", "orange", "grape")
# Matching strings that contain "app"
matching_values <- grep("app", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
This searches for occurrences of "app" within the character vector, ignoring case sensitivity.
# Matching strings containing "g" or "r"
matching_values <- grep("[gr]", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The pattern [gr] matches any string containing either "g" or "r".
# Matching strings containing any letter from "o" to "s"
matching_values <- grep("[o-s]", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The pattern [o-s] matches strings that contain any character within the range "o" to "s".
# Matching strings with two or more consecutive occurrences of 'p'
matching_values <- grep("p{2,}", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The pattern "p{2,}" ensures that only strings with at least two consecutive "p"s are matched.
# Matching strings that start with "gr"
matching_values <- grep("^gr", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The caret (^) anchors the pattern to the beginning of the string.
# Matching strings that end with "a"
matching_values <- grep("a$", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The dollar sign ($) anchors the pattern to the end of the string.
# Matching a period (.)
my_string2 <- c("apple", "banana", "ora.nge", "grape")
matching_values <- grep("\\.", my_string2, value = TRUE, ignore.case = TRUE)
print(matching_values)
The pattern "\\." ensures that only strings containing an actual period (.) are matched.
# Matching strings containing either "ban" or "ora"
matching_values <- grep("ban|ora", my_string1, value = TRUE, ignore.case = TRUE)
print(matching_values)
The pipe (|) acts as an OR operator, matching either "ban" or "ora".
text_data <- c("This is a telephone directory",
"Call Sam at 123-456-7890",
"Office: 987-654-3210",
"No number here.")
# Selecting the strings that contain a phone number
phone_numbers <- grep("\\d{3}-\\d{3}-\\d{4}", text_data, value = TRUE)
print(phone_numbers)
\\d matches any digit (0-9).
{3}- ensures that three digits are followed by a hyphen.
{4} enforces a four-digit sequence at the end, forming a standard phone number format (XXX-XXX-XXXX).
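grep() with value = TRUE returns the whole strings that contain a match. To pull out only the phone numbers, one option in base R is to pair regexpr() with regmatches(). A minimal sketch using the same text_data (pattern and numbers_only are illustrative names):

```r
text_data <- c("This is a telephone directory",
               "Call Sam at 123-456-7890",
               "Office: 987-654-3210",
               "No number here.")

pattern <- "\\d{3}-\\d{3}-\\d{4}"

# regexpr() locates the first match in each string;
# regmatches() extracts the matched text and drops strings without a match
numbers_only <- regmatches(text_data, regexpr(pattern, text_data))

numbers_only  # "123-456-7890" "987-654-3210"
```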
Before working with files, it is important to ensure that the correct working directory is set.
# Check the current working directory
getwd()
# Set a new working directory
setwd("C:/Users/username/Desktop/R")
The getwd() function returns the current working directory, while setwd() allows setting a specific location where files will be read from or saved to.
# Read a CSV file into a data frame
data <- read.csv("file.csv")
The read.csv() function loads comma-separated data into an R data frame. The first row is assumed to contain column names unless specified otherwise.
# Read a text file with tab-separated values
data <- read.table("file.txt", header = TRUE, sep = "\t")
For non-CSV formats, read.table() provides more flexibility, allowing the user to specify custom delimiters such as tabs ("\t") or semicolons (";").
write.csv(data, "output_file.csv", row.names = FALSE)
The write.csv() function saves an R data frame to a CSV file. Setting row.names = FALSE prevents row indices from being written as an extra column.
write.table(data, "output_file.txt", sep = "\t", row.names = FALSE)
The write.table() function provides greater control over formatting by allowing custom separators such as tabs ("\t") or spaces.
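To try reading and writing without touching your own files, the round trip can be sketched with a temporary file. The patients data frame and the csv_path variable are made up for illustration:

```r
patients <- data.frame(id = 1:3, age = c(34.5, 51.2, 29.8))

csv_path <- tempfile(fileext = ".csv")   # a throwaway file path

write.csv(patients, csv_path, row.names = FALSE)  # save the data frame
restored <- read.csv(csv_path)                    # load it back

names(restored)  # "id" "age"
nrow(restored)   # 3

unlink(csv_path)  # delete the temporary file
```

Comparing the restored data with the original is a simple way to confirm that nothing was lost on the way to disk.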