Biostatistics Tutorial for Students

Overview

This tutorial covers statistical tests commonly used in biological research: what each test does, the assumptions behind it, and how to apply it in R. Each section includes commented example code using built-in R datasets or simulated data, along with notes on how to interpret the results.

1. Binomial Test

A binomial test is used to determine if the proportion of successes in a sample is significantly different from a hypothesized proportion.

  • Assumptions: Binary outcome, independent observations.
  • Inference Criteria: Look at the p-value to decide if the observed proportion differs significantly.
# Example: Testing if the proportion of male babies is 0.5
# Data: 60 males out of 100 births

# Perform binomial test
result <- binom.test(60, 100, p = 0.5)

# Output result
print(result)

# Interpretation: Check the p-value
if (result$p.value < 0.05) {
  cat("Significant difference from hypothesized proportion (p < 0.05)")
} else {
  cat("No significant difference from hypothesized proportion (p > 0.05)")
}
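
Beyond the p-value, the binom.test() result also stores an exact (Clopper-Pearson) confidence interval and a point estimate for the proportion, which are often more informative than the test decision alone; a minimal sketch of pulling them from the result object above:

# Exact 95% confidence interval for the true proportion
print(result$conf.int)

# Point estimate of the proportion of successes
print(result$estimate)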

2. Chi-Square Test

A chi-square test is used to examine the association between categorical variables.

  • Assumptions: Expected frequency in each cell should be ≥ 5, observations are independent.
  • Inference Criteria: Look at the p-value to determine if there is a significant association.
# Example: Chi-square test of independence
# Dataset: 'HairEyeColor' dataset from R

# Collapse the table over sex to obtain a Hair x Eye contingency table
hair_eye <- margin.table(HairEyeColor, c(1, 2))

# Perform chi-square test
result <- chisq.test(hair_eye)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant association between hair and eye color (p < 0.05)")
} else {
  cat("No significant association between hair and eye color (p > 0.05)")
}
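
The expected-frequency assumption listed above can be checked directly, because chisq.test() stores the expected counts in its result; a minimal sketch using the result object from the example:

# Expected counts under independence; each cell should be >= 5
print(result$expected)

# Flag any cells with small expected counts
if (any(result$expected < 5)) {
  cat("Warning: some expected counts are below 5; consider Fisher's exact test\n")
}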

3. One-Sample t-Test

A one-sample t-test is used to compare the mean of a sample to a known value or hypothesized mean.

  • Assumptions: Data are approximately normally distributed, observations are independent.
  • Inference Criteria: Compare the p-value against 0.05 to test if the sample mean is significantly different from the hypothesized mean.
# Example: Test if the average speed of cars is different from 20 mph
# Dataset: 'cars' dataset

# Extract the speed column from the built-in 'cars' dataset
speed <- cars$speed

# Perform one-sample t-test
result <- t.test(speed, mu = 20)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference from the hypothesized mean of 20 (p < 0.05)")
} else {
  cat("No significant difference from the hypothesized mean of 20 (p > 0.05)")
}
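
The normality assumption noted above can be assessed informally before running the test; a minimal sketch using a Shapiro-Wilk test and a Q-Q plot (both in base R):

# Shapiro-Wilk test: a small p-value suggests departure from normality
print(shapiro.test(speed))

# Q-Q plot: points close to the reference line suggest approximate normality
qqnorm(speed)
qqline(speed)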

4. Two-Sample t-Test

The two-sample t-test is used to compare means of two independent groups.

  • Assumptions: Independent samples, approximately normal data in each group; the classical test also assumes equal variances, but R's t.test() applies Welch's correction by default, which relaxes this (see the variance check after the example below).
  • Inference Criteria: Look at the p-value to determine if the means are significantly different.
# Example: Test if there is a difference in weight between two species of plants
# Dataset: Simulated data

# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)

# Perform two-sample t-test (Welch's t-test by default in R)
result <- t.test(plant_A, plant_B)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference in mean weights (p < 0.05)")
} else {
  cat("No significant difference in mean weights (p > 0.05)")
}
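
If you prefer the classical pooled-variance t-test, you can first compare the two variances and then set var.equal = TRUE; a sketch using the simulated data above:

# Compare the two variances (F test); a small p-value suggests unequal variances
print(var.test(plant_A, plant_B))

# Classical (pooled-variance) two-sample t-test
result_pooled <- t.test(plant_A, plant_B, var.equal = TRUE)
print(result_pooled)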

5. Paired t-Test

A paired t-test is used to compare means of two related groups (e.g., before and after measurements).

  • Assumptions: Differences are normally distributed.
  • Inference Criteria: Look at the p-value to determine if the paired samples differ significantly.
# Example: Test weight change after a treatment
# Dataset: Simulated data

# Simulate before and after weights
set.seed(42)
before <- rnorm(20, mean = 70, sd = 5)
after <- before + rnorm(20, mean = 2, sd = 3)

# Perform paired t-test
result <- t.test(before, after, paired = TRUE)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant change in weight after treatment (p < 0.05)")
} else {
  cat("No significant change in weight after treatment (p > 0.05)")
}
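
Because the paired t-test assumes the differences are normally distributed, it is worth examining the differences themselves; a minimal sketch:

# Compute the paired differences and check them for normality
differences <- after - before
print(shapiro.test(differences))
hist(differences, main = "Paired Differences", xlab = "Difference")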

6. Sign Test

The sign test is a non-parametric test used when the assumption of normality is not met for paired data.

  • Assumptions: Paired samples, no assumption about distribution.
  • Inference Criteria: Based on the number of positive and negative differences.
# Example: Testing a median difference using the sign test
# Dataset: Simulated data

# Install (if needed) and load the 'BSDA' package, which provides SIGN.test()
if (!require(BSDA)) install.packages("BSDA")
library(BSDA)

# Simulate before and after data
set.seed(42)
before <- rnorm(20, mean = 5)
after <- before + rnorm(20, mean = 1)

# Perform sign test
result <- SIGN.test(before, after)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference between paired data (p < 0.05)")
} else {
  cat("No significant difference between paired data (p > 0.05)")
}
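
If the differences can be assumed symmetric around their median, the Wilcoxon signed-rank test (base R) is usually more powerful than the sign test; a sketch on the same simulated data:

# Wilcoxon signed-rank test on the paired data
wilcox_result <- wilcox.test(before, after, paired = TRUE)
print(wilcox_result)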

7. Mann-Whitney U Test (Wilcoxon Rank-Sum Test)

The Mann-Whitney U test is a non-parametric test used to compare two independent groups. It tests whether values in one group tend to be larger than in the other, and is often interpreted as a comparison of medians when the two distributions have a similar shape.

  • Assumptions: Independent samples, ordinal data or continuous data that is not normally distributed.
  • Inference Criteria: Look at the p-value to determine if there is a significant difference between the groups.
# Example: Test if there is a difference in heights between two species of plants
# Dataset: Simulated data

# Simulate plant heights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1.5)

# Perform Mann-Whitney U test
result <- wilcox.test(plant_A, plant_B)

# Output result
print(result)

# Interpretation: Check p-value for significance
if (result$p.value < 0.05) {
  cat("Significant difference between the two groups (p < 0.05)")
} else {
  cat("No significant difference between the two groups (p >= 0.05)")
}
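
wilcox.test() can also return a confidence interval for the location shift (the Hodges-Lehmann estimate), which helps quantify the size of the difference; a minimal sketch:

# Request a point estimate and 95% confidence interval for the location shift
result_ci <- wilcox.test(plant_A, plant_B, conf.int = TRUE)
print(result_ci$estimate)   # Hodges-Lehmann estimate of the shift
print(result_ci$conf.int)   # 95% confidence interval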

8. Permutation Test

A permutation test is a non-parametric method that tests the null hypothesis by comparing the observed test statistic to its distribution under rearrangements (permutations) of the group labels. In practice, a large random sample of permutations is used rather than enumerating every possible rearrangement.

  • Assumptions: Observations are exchangeable under the null hypothesis; no specific distributional form is required.
  • Inference Criteria: Compare observed test statistic to distribution of permuted test statistics.
# Example: Test if there is a significant difference in weights between two groups
# Dataset: Simulated data

# Simulate weights
set.seed(42)
group_A <- rnorm(15, mean = 10, sd = 2)
group_B <- rnorm(15, mean = 12, sd = 2)

# Define permutation function
perm_test <- function(x, y, num_permutations = 1000) {
  observed_diff <- abs(mean(x) - mean(y))
  combined <- c(x, y)
  count <- 0
  
  for (i in 1:num_permutations) {
    permuted <- sample(combined)
    x_perm <- permuted[1:length(x)]
    y_perm <- permuted[(length(x) + 1):length(combined)]
    perm_diff <- abs(mean(x_perm) - mean(y_perm))
    if (perm_diff >= observed_diff) {
      count <- count + 1
    }
  }
  # Count the observed labelling as one permutation (add 1 to numerator and denominator)
  # so the p-value can never be exactly 0
  p_value <- (count + 1) / (num_permutations + 1)
  return(p_value)
}

# Perform permutation test
p_value <- perm_test(group_A, group_B)

# Output result
cat("P-value:", p_value, "\n")

# Interpretation: Check p-value for significance
if (p_value < 0.05) {
  cat("Significant difference between group means (p < 0.05)")
} else {
  cat("No significant difference between group means (p > 0.05)")
}

9. ANOVA & Tukey Post-hoc Test

Analysis of Variance (ANOVA) is used to compare the means of three or more groups. Tukey’s post-hoc test is used to determine which groups differ.

  • Assumptions: Normally distributed residuals, homogeneity of variances, independent observations.
  • Inference Criteria: Look at the p-value from the ANOVA to determine if at least one group differs, then use Tukey’s test to determine which pairs differ.
# Example: Test if there is a difference in weights between three plant species
# Dataset: Simulated data

# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)
plant_C <- rnorm(30, mean = 7, sd = 1)

# Combine data into a data frame
weights <- data.frame(
  weight = c(plant_A, plant_B, plant_C),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Perform ANOVA
anova_result <- aov(weight ~ group, data = weights)
print(summary(anova_result))

# Perform Tukey's post-hoc test if the overall ANOVA is significant
anova_p <- summary(anova_result)[[1]][["Pr(>F)"]][1]
if (anova_p < 0.05) {
  tukey_result <- TukeyHSD(anova_result)
  print(tukey_result)
}

# Interpretation: Check ANOVA p-value for overall significance, then Tukey's test for pairwise comparisons
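
The homogeneity-of-variance and normality assumptions listed above can be checked on the fitted model; a minimal sketch using base R:

# Bartlett's test for homogeneity of variances across groups
print(bartlett.test(weight ~ group, data = weights))

# Shapiro-Wilk test on the ANOVA residuals for normality
print(shapiro.test(residuals(anova_result)))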

10. Kruskal-Wallis Test & Dunn’s Post-hoc Test

The Kruskal-Wallis test is a non-parametric alternative to one-way ANOVA for comparing three or more independent groups; it tests whether the groups come from the same distribution and is often interpreted as a comparison of medians when the group distributions have similar shapes. Dunn's test can be used for post-hoc analysis if the Kruskal-Wallis result is significant.

  • Assumptions: Independent samples, ordinal data or continuous data that is not normally distributed.
  • Inference Criteria: Look at the p-value from the Kruskal-Wallis test to determine if there is a significant difference, then use Dunn’s test to determine which groups differ.
# Example: Test if there is a difference in weights between three plant species
# Dataset: Simulated data

# Simulate plant weights
set.seed(42)
plant_A <- rnorm(30, mean = 5, sd = 1)
plant_B <- rnorm(30, mean = 6, sd = 1)
plant_C <- rnorm(30, mean = 7, sd = 1)

# Combine data into a data frame
weights <- data.frame(
  weight = c(plant_A, plant_B, plant_C),
  group = factor(rep(c("A", "B", "C"), each = 30))
)

# Perform Kruskal-Wallis test
kruskal_result <- kruskal.test(weight ~ group, data = weights)
print(kruskal_result)

# Perform Dunn's post-hoc test if significant
if (kruskal_result$p.value < 0.05) {
  if (!require(FSA)) install.packages("FSA")
  library(FSA)
  dunn_result <- dunnTest(weight ~ group, data = weights, method = "bonferroni")
  print(dunn_result)
}

# Interpretation: Check Kruskal-Wallis p-value for overall significance, then Dunn's test for pairwise comparisons
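
It is useful to report the group medians alongside the test result; a minimal sketch:

# Median weight within each group
print(tapply(weights$weight, weights$group, median))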

11. Monte Carlo Methods

Monte Carlo methods approximate the distribution of a statistic by repeated random sampling. In the example below, we repeatedly resample from the observed data (a bootstrap-style simulation) to approximate the sampling distribution of the mean.

  • Assumptions: Depends on the type of analysis.
  • Inference Criteria: Generate distributions to make probabilistic conclusions.
# Example: Approximate the sampling distribution of the mean by repeated resampling
# Dataset: Simulated data

# Simulate a dataset
set.seed(42)
data <- rnorm(100, mean = 5, sd = 2)

# Monte Carlo estimation of the mean
num_simulations <- 1000
means <- numeric(num_simulations)

for (i in 1:num_simulations) {
  sample_data <- sample(data, size = 50, replace = TRUE)
  means[i] <- mean(sample_data)
}

# Plot the distribution of simulated means
hist(means, main = "Monte Carlo Simulation of Means", xlab = "Mean")

# Interpretation: Use the histogram to estimate the distribution of the mean
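
The simulated means can also be summarized numerically, for example with a percentile interval; a minimal sketch:

# Approximate 95% interval for the mean based on the simulated distribution
print(quantile(means, probs = c(0.025, 0.975)))

# Standard error of the mean estimated from the simulation
cat("Estimated standard error:", sd(means), "\n")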

12. Simple Linear Regression

Simple linear regression is used to model the relationship between two continuous variables.

  • Assumptions: Linear relationship, normality of residuals, homoscedasticity, independence of errors.
  • Inference Criteria: Look at the p-value for the slope to determine if there is a significant relationship.
# Example: Test the relationship between speed and stopping distance
# Dataset: 'cars' dataset

# Extract the predictor and response from the built-in 'cars' dataset
speed <- cars$speed
distance <- cars$dist

# Fit linear regression model (the response column in 'cars' is named 'dist')
model <- lm(dist ~ speed, data = cars)
print(summary(model))

# Plot regression line
plot(speed, distance, main = "Speed vs Stopping Distance", xlab = "Speed", ylab = "Distance")
abline(model, col = "red")

# Interpretation: Check p-value for slope and R-squared for goodness of fit
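
The regression assumptions listed above (normality of residuals, homoscedasticity) are usually checked with the model's standard diagnostic plots; a minimal sketch:

# Diagnostic plots: residuals vs fitted, Q-Q plot, scale-location, leverage
par(mfrow = c(2, 2))
plot(model)
par(mfrow = c(1, 1))  # Reset the plotting layout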

13. Generalized Linear Model (GLM)

GLM is used to model the relationship between predictors and a response variable when the response does not follow a normal distribution.

  • Assumptions: Depends on the type of GLM (e.g., logistic, Poisson).
  • Inference Criteria: Look at p-values for the coefficients to determine the significance of predictors.
# Example: Logistic regression for binary response
# Dataset: Simulated data

# Simulate binary response and predictor
set.seed(42)
predictor <- rnorm(100)
response <- rbinom(100, 1, prob = plogis(0.5 * predictor))

# Fit logistic regression model
model <- glm(response ~ predictor, family = binomial)
print(summary(model))

# Interpretation: Check p-values for predictors to determine significance
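
The assumptions bullet above also mentions Poisson regression, the usual GLM for count responses; a minimal sketch with simulated count data (the intercept of 1 and slope of 0.3 are arbitrary illustrative values):

# Simulate a count response whose log-mean depends on the predictor
set.seed(42)
predictor <- rnorm(100)
counts <- rpois(100, lambda = exp(1 + 0.3 * predictor))

# Fit Poisson regression model
pois_model <- glm(counts ~ predictor, family = poisson)
print(summary(pois_model))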

14. Mixed Effects Models

Mixed effects models are used when data have both fixed and random effects, such as repeated measures.

  • Assumptions: Normality of residuals, random effects structure.
  • Inference Criteria: Look at fixed effects to determine if there are significant predictors.
# Example: Test the effect of treatment with random effects for subject
# Dataset: Simulated data

# Install 'lme4' package for mixed models
if (!require(lme4)) install.packages("lme4")
library(lme4)

# Simulate data
set.seed(42)
subject <- factor(rep(1:20, each = 2))
treatment <- factor(rep(c("A", "B"), times = 20))
response <- rnorm(40, mean = ifelse(treatment == "A", 5, 6), sd = 1)

# Fit mixed effects model
model <- lmer(response ~ treatment + (1 | subject))
print(summary(model))

# Interpretation: Examine the fixed-effect estimates and t-values. Note that lme4 does not
# report p-values for fixed effects; see the lmerTest sketch below for one way to obtain them
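
Because lme4 omits p-values for fixed effects, a common option is the lmerTest package, which adds Satterthwaite-approximated p-values to the model summary; a sketch refitting the same model:

# Install (if needed) and load 'lmerTest', which extends lme4 with p-values
if (!require(lmerTest)) install.packages("lmerTest")
library(lmerTest)

# Refit the model; summary() now includes p-values for the fixed effects
model_p <- lmer(response ~ treatment + (1 | subject))
print(summary(model_p))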

15. Common Transformations

Transformations are used to meet the assumptions of statistical tests, such as normality and homoscedasticity.

  • Examples: Log, square root, Box-Cox (a Box-Cox sketch follows the example below).
# Example: Log transformation to normalize data
# Dataset: Simulated data

# Simulate skewed data
set.seed(42)
data <- rexp(100, rate = 1)

# Apply log transformation
log_data <- log(data)

# Plot original and transformed data
par(mfrow = c(1, 2))
hist(data, main = "Original Data", xlab = "Value")
hist(log_data, main = "Log-Transformed Data", xlab = "Value")
par(mfrow = c(1, 1))  # Reset the plotting layout

# Interpretation: Compare histograms to see the effect of transformation
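
The Box-Cox transformation mentioned above chooses a power transformation of a positive response by maximum likelihood; a minimal sketch using boxcox() from the MASS package (a recommended package installed with R) on the same skewed data:

# Load MASS for the boxcox() function
library(MASS)

# Profile the log-likelihood over the transformation parameter lambda
# (lambda near 0 corresponds to a log transformation)
bc <- boxcox(lm(data ~ 1))

# Lambda value with the highest log-likelihood
lambda_hat <- bc$x[which.max(bc$y)]
cat("Estimated lambda:", lambda_hat, "\n")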

16. Correlation (Spearman and Pearson)

Correlation is used to measure the strength and direction of the relationship between two variables.

  • Assumptions: Pearson measures a linear relationship and assumes approximately normally distributed variables for inference; Spearman is rank-based and assumes only a monotonic relationship.
  • Inference Criteria: Look at the correlation coefficient and p-value.
# Example: Calculate Pearson and Spearman correlation
# Dataset: 'mtcars' dataset

# Extract the variables of interest from the built-in 'mtcars' dataset
mpg <- mtcars$mpg
hp <- mtcars$hp

# Calculate Pearson correlation
pearson_corr <- cor.test(mpg, hp, method = "pearson")
print(pearson_corr)

# Calculate Spearman correlation (exact = FALSE uses the asymptotic p-value and avoids a ties warning)
spearman_corr <- cor.test(mpg, hp, method = "spearman", exact = FALSE)
print(spearman_corr)

# Interpretation: Compare correlation coefficients and p-values for significance
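
Before relying on either coefficient, it helps to plot the two variables, since Pearson assumes a roughly linear relationship and Spearman a monotonic one; a minimal sketch:

# Scatterplot of horsepower against fuel efficiency
plot(hp, mpg, main = "Horsepower vs MPG", xlab = "Horsepower", ylab = "Miles per Gallon")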