This tutorial covers the basics of several statistical tests commonly used in biological research, the assumptions behind them, and how to apply them in R. Each test is illustrated with well-commented example R code using built-in R datasets or simulated data, notes on the underlying assumptions, and guidance on how to interpret the results.
A binomial test is used to determine if the proportion of successes in a sample is significantly different from a hypothesized proportion.
# Example: Testing if the proportion of male babies is 0.5
# Data: 60 males out of 100 births
# Perform binomial test
result <- binom.test(60, 100, p = 0.5)
# Output result
print(result)
# Interpretation: Check the p-value
if (result$p.value < 0.05) {
  cat("Significant difference from the hypothesized proportion (p < 0.05)\n")
} else {
  cat("No significant difference from the hypothesized proportion (p >= 0.05)\n")
}
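Beyond the p-value, binom.test() also reports the estimated proportion and an exact 95% confidence interval, which is often more informative; the short add-on below extracts them from the result object.
# Report the estimated proportion and its exact 95% confidence interval
cat("Estimated proportion:", result$estimate, "\n")
cat("95% CI:", result$conf.int[1], "to", result$conf.int[2], "\n")
# If the interval contains 0.5, the data are compatible with the hypothesized proportion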
A chi-square test of independence is used to examine the association between two categorical variables.
# Example: Chi-square test of independence
# Dataset: 'HairEyeColor' dataset from R
# Collapse over Sex to obtain a Hair colour x Eye colour contingency table
hair_eye <- margin.table(HairEyeColor, c(1, 2))
# Perform chi-square test
result <- chisq.test(hair_eye)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant association between hair and eye color (p < 0.05)\n")
} else {
  cat("No significant association between hair and eye color (p >= 0.05)\n")
}
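The chi-square test assumes the expected cell counts are reasonably large (a common rule of thumb is at least 5 per cell). One quick check, using the test object created above, is shown below as a sketch.
# Assumption check: inspect the expected cell counts
print(result$expected)
# If many expected counts fall below 5, consider chisq.test(..., simulate.p.value = TRUE)
# or, for small tables, Fisher's exact test (fisher.test)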
A one-sample t-test is used to compare the mean of a sample to a known value or hypothesized mean.
# Example: Test if the average speed of cars is different from 20 mph
# Dataset: 'cars' dataset
# Extract the speed variable from the built-in 'cars' dataset
speed <- cars$speed
# Perform one-sample t-test
result <- t.test(speed, mu = 20)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference from the hypothesized mean of 20 (p < 0.05)\n")
} else {
  cat("No significant difference from the hypothesized mean of 20 (p >= 0.05)\n")
}
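The one-sample t-test assumes the observations are approximately normally distributed (or that the sample is large enough for the central limit theorem to help). A Shapiro-Wilk test and a Q-Q plot are common informal checks; the sketch below applies them to the speed data.
# Assumption check: approximate normality of the sample
shapiro.test(speed)    # a small p-value suggests departure from normality
qqnorm(speed)          # points close to the reference line indicate approximate normality
qqline(speed)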
The two-sample t-test is used to compare means of two independent groups.
# Example: Test if there is a difference in weight between two species of plants
# Dataset: Simulated data
# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)
# Perform two-sample t-test
result <- t.test(plant_A, plant_B)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference in mean weights (p < 0.05)\n")
} else {
  cat("No significant difference in mean weights (p >= 0.05)\n")
}
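The two-sample t-test assumes approximate normality within each group; by default R uses the Welch version, which does not assume equal variances. The sketch below shows one way to check these assumptions for the simulated weights.
# Assumption checks for the two-sample t-test
shapiro.test(plant_A)        # normality within group A
shapiro.test(plant_B)        # normality within group B
var.test(plant_A, plant_B)   # F test for equality of variances
# If the variances look equal, t.test(plant_A, plant_B, var.equal = TRUE) gives the classic pooled test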
A paired t-test is used to compare means of two related groups (e.g., before and after measurements).
# Example: Test weight change after a treatment
# Dataset: Simulated data
# Simulate before and after weights
set.seed(42)
before <- rnorm(20, mean = 70, sd = 5)
after <- before + rnorm(20, mean = 2, sd = 3)
# Perform paired t-test
result <- t.test(before, after, paired = TRUE)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant change in weight after treatment (p < 0.05)\n")
} else {
  cat("No significant change in weight after treatment (p >= 0.05)\n")
}
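The paired t-test assumes that the differences (after minus before) are approximately normally distributed, not the raw measurements themselves; a quick check on the simulated data:
# Assumption check: normality of the paired differences
differences <- after - before
shapiro.test(differences)
hist(differences, main = "Paired Differences", xlab = "After - Before")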
The sign test is a non-parametric test used when the assumption of normality is not met for paired data.
# Example: Testing a median difference using the sign test
# Dataset: Simulated data
# Install 'BSDA' package for sign test
if (!require(BSDA)) install.packages("BSDA")
library(BSDA)
# Simulate before and after data
set.seed(42)
before <- rnorm(20, mean = 5)
after <- before + rnorm(20, mean = 1)
# Perform sign test
result <- SIGN.test(before, after)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference between paired data (p < 0.05)\n")
} else {
  cat("No significant difference between paired data (p >= 0.05)\n")
}
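For comparison, the Wilcoxon signed-rank test is a generally more powerful non-parametric alternative when the paired differences can be assumed to be symmetrically distributed; a minimal sketch on the same data:
# Alternative: Wilcoxon signed-rank test (assumes symmetric differences)
wilcox_result <- wilcox.test(before, after, paired = TRUE)
print(wilcox_result)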
The Mann-Whitney U test (also called the Wilcoxon rank-sum test) is a non-parametric test for comparing two independent groups; it is commonly interpreted as a test for a difference in medians when the two distributions have similar shapes.
# Example: Test if there is a difference in heights between two species of plants
# Dataset: Simulated data
# Simulate plant heights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1.5)
# Perform Mann-Whitney U test
result <- wilcox.test(plant_A, plant_B)
# Output result
print(result)
# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference in medians (p < 0.05)\n")
} else {
  cat("No significant difference in medians (p >= 0.05)\n")
}
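Setting conf.int = TRUE in wilcox.test() additionally reports the Hodges-Lehmann estimate of the location shift with a confidence interval, which helps gauge the size of the difference rather than just its significance; a small extension of the example above:
# Estimate the location shift with a 95% confidence interval
result_ci <- wilcox.test(plant_A, plant_B, conf.int = TRUE)
print(result_ci$estimate)   # Hodges-Lehmann estimate of the shift
print(result_ci$conf.int)   # 95% confidence interval for the shift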
A permutation test is a non-parametric method that tests the null hypothesis by comparing the observed test statistic with its distribution under rearrangements of the group labels; in practice a large random sample of permutations is used rather than enumerating all possible rearrangements.
# Example: Test if there is a significant difference in weights between two groups
# Dataset: Simulated data
# Simulate weights
set.seed(42)
group_A <- rnorm(15, mean = 10, sd = 2)
group_B <- rnorm(15, mean = 12, sd = 2)
# Define permutation function
perm_test <- function(x, y, num_permutations = 1000) {
  observed_diff <- abs(mean(x) - mean(y))
  combined <- c(x, y)
  count <- 0
  for (i in 1:num_permutations) {
    # Randomly reassign the pooled observations to two groups of the original sizes
    permuted <- sample(combined)
    x_perm <- permuted[1:length(x)]
    y_perm <- permuted[(length(x) + 1):length(combined)]
    perm_diff <- abs(mean(x_perm) - mean(y_perm))
    # Count permutations at least as extreme as the observed difference
    if (perm_diff >= observed_diff) {
      count <- count + 1
    }
  }
  # Proportion of permutations as extreme as the observed difference
  p_value <- count / num_permutations
  return(p_value)
}
# Perform permutation test
p_value <- perm_test(group_A, group_B)
# Output result
cat("P-value:", p_value, "\n")
# Interpretation: Check p-value for significance
if (p_value < 0.05) {
  cat("Significant difference between group means (p < 0.05)\n")
} else {
  cat("No significant difference between group means (p >= 0.05)\n")
}
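As an informal sanity check, the permutation p-value can be compared with the ordinary two-sample t-test on the same data; for roughly normal data the two should be broadly similar.
# Compare with the parametric two-sample t-test
t_result <- t.test(group_A, group_B)
cat("Permutation p-value:", p_value, "; t-test p-value:", t_result$p.value, "\n")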
Analysis of Variance (ANOVA) is used to compare the means of three or more groups. Tukey’s post-hoc test is used to determine which groups differ.
# Example: Test if there is a difference in weights between three plant species
# Dataset: Simulated data
# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)
plant_C <- rnorm(30, mean = 7, sd = 1)
# Combine data into a data frame
weights <- data.frame(
  weight = c(plant_A, plant_B, plant_C),
  group = factor(rep(c("A", "B", "C"), each = 30))
)
# Perform ANOVA
anova_result <- aov(weight ~ group, data = weights)
print(summary(anova_result))
# Perform Tukey's post-hoc test if the overall ANOVA is significant
p_anova <- summary(anova_result)[[1]][["Pr(>F)"]][1]
if (p_anova < 0.05) {
  tukey_result <- TukeyHSD(anova_result)
  print(tukey_result)
}
# Interpretation: Check ANOVA p-value for overall significance, then Tukey's test for pairwise comparisons
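ANOVA assumes approximately normal residuals and similar variances across groups. One common way to check both on the fitted model (a sketch, not a formal requirement):
# Assumption checks for the ANOVA model
shapiro.test(residuals(anova_result))          # normality of residuals
bartlett.test(weight ~ group, data = weights)  # homogeneity of variances across groups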
The Kruskal-Wallis test is a non-parametric test used to compare three or more groups (often described as comparing their medians). Dunn's test can be used for post-hoc pairwise comparisons if the Kruskal-Wallis test is significant.
# Example: Test if there is a difference in weights between three plant species
# Dataset: Simulated data
# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)
plant_C <- rnorm(30, mean = 7, sd = 1)
# Combine data into a data frame
weights <- data.frame(
  weight = c(plant_A, plant_B, plant_C),
  group = factor(rep(c("A", "B", "C"), each = 30))
)
# Perform Kruskal-Wallis test
kruskal_result <- kruskal.test(weight ~ group, data = weights)
print(kruskal_result)
# Perform Dunn's post-hoc test if significant
if (kruskal_result$p.value < 0.05) {
  if (!require(FSA)) install.packages("FSA")
  library(FSA)
  dunn_result <- dunnTest(weight ~ group, data = weights, method = "bonferroni")
  print(dunn_result)
}
# Interpretation: Check Kruskal-Wallis p-value for overall significance, then Dunn's test for pairwise comparisons
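A boxplot of the groups is a useful companion to the Kruskal-Wallis and Dunn's tests, since it shows which distributions appear shifted; a minimal plotting sketch:
# Visualize the group distributions
boxplot(weight ~ group, data = weights,
        main = "Weight by Species", xlab = "Species", ylab = "Weight")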
Monte Carlo methods approximate the distribution of a statistic through repeated random simulation; here the approach is to resample repeatedly from the observed data (a bootstrap-style scheme) to approximate the distribution of the sample mean.
# Example: Estimate the mean of a distribution using Monte Carlo simulation
# Dataset: Simulated data
# Simulate a dataset
set.seed(42)
data <- rnorm(100, mean = 5, sd = 2)
# Monte Carlo estimation of the mean
num_simulations <- 1000
means <- numeric(num_simulations)
for (i in 1:num_simulations) {
sample_data <- sample(data, size = 50, replace = TRUE)
means[i] <- mean(sample_data)
}
# Plot the distribution of simulated means
hist(means, main = "Monte Carlo Simulation of Means", xlab = "Mean")
# Interpretation: Use the histogram to estimate the distribution of the mean
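The simulated means can also be summarized numerically, for example with a 95% percentile interval that conveys the spread of the estimates; a short add-on to the simulation above:
# Summarize the Monte Carlo distribution of the mean
cat("Mean of simulated means:", mean(means), "\n")
print(quantile(means, c(0.025, 0.975)))  # approximate 95% percentile interval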
Simple linear regression is used to model the relationship between two continuous variables.
# Example: Test the relationship between speed and stopping distance
# Dataset: 'cars' dataset
# Extract variables from the built-in 'cars' dataset
speed <- cars$speed
distance <- cars$dist
# Fit linear regression model (the column is named 'dist' in the data frame)
model <- lm(dist ~ speed, data = cars)
print(summary(model))
# Plot regression line
plot(speed, distance, main = "Speed vs Stopping Distance", xlab = "Speed", ylab = "Distance")
abline(model, col = "red")
# Interpretation: Check p-value for slope and R-squared for goodness of fit
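Linear regression assumes a linear relationship, roughly constant residual variance, and approximately normal residuals. R's built-in diagnostic plots for lm objects are a standard way to inspect these; a brief sketch using the fitted model:
# Regression diagnostics for the fitted model
par(mfrow = c(2, 2))
plot(model)           # residuals vs fitted, Q-Q plot, scale-location, leverage
par(mfrow = c(1, 1))  # reset the plotting layout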
A generalized linear model (GLM) is used to model the relationship between predictors and a response variable when the response does not follow a normal distribution, for example binary or count data.
# Example: Logistic regression for binary response
# Dataset: Simulated data
# Simulate binary response and predictor
set.seed(42)
predictor <- rnorm(100)
response <- rbinom(100, 1, prob = plogis(0.5 * predictor))
# Fit logistic regression model
model <- glm(response ~ predictor, family = binomial)
print(summary(model))
# Interpretation: Check p-values for predictors to determine significance
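For logistic regression the coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are usually easier to interpret. The sketch below uses Wald-type intervals from confint.default(); profile-likelihood intervals are another option.
# Odds ratios with Wald-type 95% confidence intervals
odds_ratios <- exp(cbind(OR = coef(model), confint.default(model)))
print(odds_ratios)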
Mixed effects models are used when data have both fixed and random effects, such as repeated measures.
# Example: Test the effect of treatment with random effects for subject
# Dataset: Simulated data
# Install 'lmerTest' (loads 'lme4' and adds p-values for fixed effects)
if (!require(lmerTest)) install.packages("lmerTest")
library(lmerTest)
# Simulate data
set.seed(42)
subject <- factor(rep(1:20, each = 2))
treatment <- factor(rep(c("A", "B"), times = 20))
response <- rnorm(40, mean = ifelse(treatment == "A", 5, 6), sd = 1)
# Fit mixed effects model
model <- lmer(response ~ treatment + (1 | subject))
print(summary(model))
# Interpretation: Check p-values for fixed effects (lmerTest adds Satterthwaite-based p-values to the summary)
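Beyond the fixed effects, it is usually worth inspecting the random-effect variance components, which show how much variability is attributable to differences between subjects; a short add-on to the fitted model:
# Random-effect variance components and subject-level deviations
print(VarCorr(model))        # SD of the subject random intercepts and of the residual
head(ranef(model)$subject)   # estimated intercept deviations for the first few subjects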
Transformations are used to meet the assumptions of statistical tests, such as normality and homoscedasticity.
# Example: Log transformation to normalize data
# Dataset: Simulated data
# Simulate skewed data
set.seed(42)
data <- rexp(100, rate = 1)
# Apply log transformation
log_data <- log(data)
# Plot original and transformed data
par(mfrow = c(1, 2))
hist(data, main = "Original Data", xlab = "Value")
hist(log_data, main = "Log-Transformed Data", xlab = "Value")
# Interpretation: Compare histograms to see the effect of transformation
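The visual comparison can be backed up with a formal normality test on the original and transformed values; a quick sketch using the Shapiro-Wilk test (note that a log transform reduces skew but need not produce exactly normal data):
# Compare normality before and after the transformation
shapiro.test(data)       # exponential data: expect strong evidence against normality
shapiro.test(log_data)   # skew is reduced, though some non-normality may remain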
Correlation is used to measure the strength and direction of the relationship between two variables.
# Example: Calculate Pearson and Spearman correlation
# Dataset: 'mtcars' dataset
# Extract variables from the built-in 'mtcars' dataset
mpg <- mtcars$mpg
hp <- mtcars$hp
# Calculate Pearson correlation
pearson_corr <- cor.test(mpg, hp, method = "pearson")
print(pearson_corr)
# Calculate Spearman correlation (exact = FALSE avoids the exact-p-value warning caused by ties)
spearman_corr <- cor.test(mpg, hp, method = "spearman", exact = FALSE)
print(spearman_corr)
# Interpretation: Compare correlation coefficients and p-values for significance
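A scatter plot is a useful complement to the correlation coefficients, since it reveals non-linearity or outliers that a single number can hide; a minimal plotting sketch:
# Visualize the relationship between horsepower and fuel efficiency
plot(hp, mpg, main = "Horsepower vs Miles per Gallon",
     xlab = "Horsepower", ylab = "Miles per Gallon")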