How to use this document
Each method includes:
Plain‑English description
Assumptions/limits
Example input
R code
Example output
Interpretation (with effect sizes when possible)
Data wrangling favors base R; plotting favors
ggplot2 for clarity. Run code blocks in R or RStudio.
Install packages once with install.packages()
.
# Core packages used in this review (install if needed)
pkgs <- c("ggplot2","lme4","lmerTest","effectsize","emmeans","car","MASS","boot","tidyr","dplyr","broom","ggfortify","vegan")
to_install <- setdiff(pkgs, rownames(installed.packages()))
if(length(to_install)) install.packages(to_install)
Compare means.
T-tests are used when you’re comparing the means of one or two groups to see if they’re statistically different from each other. The one-sample t-test checks whether a sample mean is different from a known or expected value (like a control or theoretical value). The two-sample version compares the means of two independent groups, while the paired version compares before/after measurements or matched pairs (e.g., siblings, pre/post interventions). T-tests assume normally distributed data, especially when sample sizes are small, but they’re fairly robust. They’re often your go-to when the outcome is continuous and the design is simple — just remember to check whether your samples are paired or independent and whether variance between groups is roughly equal.
Two treatments on fish growth (grams):
set.seed(1)
ctrl <- rnorm(30, mean = 10, sd = 2)
trt <- rnorm(28, mean = 11.5, sd = 2)
paired_before <- rnorm(20, 10, 1.5)
paired_after <- paired_before + rnorm(20, 0.8, 1.2)
# One-sample: is mean(ctrl) different from 10?
t.test(ctrl, mu = 10)
# Independent two-sample (Welch by default)
t.test(trt, ctrl)
# Paired t-test
t.test(paired_after, paired_before, paired = TRUE)
# Effect sizes
library(effectsize)
cohens_d(trt, ctrl) # independent
cohens_d(paired_after, paired_before, paired = TRUE) # paired
Welch Two Sample t-test
t = 3.2, df = 53.4, p = 0.0023
mean of x = 11.52, mean of y = 10.02, diff = 1.50
95% CI for diff: [0.56, 2.44]
Cohen's d (Hedges' g corrected):
d = 0.86 [0.31, 1.41]
When you read a t‑test, don’t stop at the p‑value. Start with the group means and their 95% CI—that tells you what size of difference is compatible with the data. Then report Cohen’s d (and its CI) so readers know if the difference is tiny, moderate, or big in biological terms. If Welch’s test was used, you already handled unequal variances; if not, check variance equality before making claims. For paired tests, emphasize the mean of the within‑subject differences—that’s the thing you actually tested. Finally, sanity‑check the plot: if the distributions barely overlap and d is decent, the story’s stronger than a lonely p‑value.
Chi-square tests are for categorical data — think counts of observations in different categories. You use them when you’re trying to answer questions like “Do these two variables appear to be related?” or “Are these observed counts different from what I’d expect by chance?” The goodness-of-fit test asks whether one categorical variable fits a specific expected distribution, while the test of independence checks whether two categorical variables are statistically associated (like treatment group and survival). These tests assume independence between observations and that expected counts are big enough (typically ≥5 per cell). If you’ve got categories and counts — like genotypes, color morphs, or choices in a behavioral trial — this is your tool.
# Goodness-of-fit
obs <- c(A=45, B=30, C=25) # observed counts
exp_prop <- c(0.4,0.3,0.3)
# Independence table
tab <- matrix(c(15,5, 8,12), nrow=2, byrow=TRUE,
dimnames=list(Treatment=c("Ctrl","Drug"),
Outcome=c("No","Yes")))
# Goodness-of-fit (supply expected probabilities)
chisq.test(obs, p = exp_prop)
# Independence
chisq.test(tab)
# If any expected cell < 5, consider:
fisher.test(tab)
Chi-squared test for given probabilities
X-squared = 0.83, df = 2, p = 0.66
Pearson's Chi-squared test
X-squared = 5.12, df = 1, p = 0.0236
library(effectsize)
cramers_v(tab)
For chi‑square, the headline is whether the observed pattern of counts deviates meaningfully from what you’d expect by chance. Report Cramér’s V (or phi in 2×2) to size the association; large tables can be “significant” with trivial associations, so effect size keeps us honest. Inspect standardized residuals to see which cells drive the result (that’s usually the biological insight). If any expected counts are tiny, say you used Fisher’s exact instead and frame conclusions cautiously. Bottom line: tell readers which categories differ from expectation and by how much, not just that the table was “sig.”
Compare means across ≥ 2 groups (factors).
ANOVA — short for analysis of variance — is what you use when you’re comparing more than two groups. It tells you whether there are any statistically significant differences among group means, and it’s more appropriate than stringing together a bunch of t-tests. One-way ANOVA tests the effect of a single factor (e.g., treatment), while two-way ANOVA lets you test two factors and their interaction (e.g., treatment and sex). If you measured the same subject multiple times, you’ll need a repeated measures ANOVA — which accounts for that within-subject structure. ANOVA assumes normality and equal variances among groups, but it’s robust to moderate violations. If you find significance, post-hoc tests (like Tukey’s) tell you which groups differ. Also, don’t just report the p-value — effect sizes like beta-squared are key to understanding the biological relevance.
set.seed(2)
dat <- data.frame(
y = c(rnorm(20,10,2), rnorm(22,11,2), rnorm(21,12,2)),
trt = factor(rep(c("A","B","C"), times=c(20,22,21)))
)
m1 <- aov(y ~ trt, data = dat)
summary(m1)
# Diagnostics
par(mfrow=c(1,2)); plot(m1); par(mfrow=c(1,1))
car::leveneTest(y ~ trt, data=dat)
# Post-hoc (Tukey)
TukeyHSD(m1)
# Effect size (eta-squared)
effectsize::eta_squared(m1)
# Plot means + CI
library(ggplot2)
ggplot(dat, aes(trt, y)) +
stat_summary(fun=mean, geom="point", size=3) +
stat_summary(fun.data=mean_cl_normal, geom="errorbar", width=0.2) +
theme_minimal()
set.seed(3)
dat2 <- expand.grid(
trt = c("A","B"),
sex = c("F","M"),
rep = 1:25
)
dat2$y <- with(dat2, 10 + (trt=="B")*2 + (sex=="M")*1 + (trt=="B" & sex=="M")*1.5 + rnorm(nrow(dat2),0,2))
m2 <- aov(y ~ trt*sex, data=dat2)
summary(m2)
effectsize::eta_squared(m2)
# Example: 10 subjects measured at 3 time points
set.seed(4)
subj <- factor(1:10)
time <- factor(rep(c("T1","T2","T3"), each=10))
y <- c(rnorm(10,10,1.2), rnorm(10,11,1.2), rnorm(10,12,1.2))
rm_dat <- data.frame(subj, time, y)
# Within-subject (Error term)
rm_mod <- aov(y ~ time + Error(subj/time), data=rm_dat)
summary(rm_mod)
# Pairwise (adjusted)
emmeans::emmeans(lm(y ~ time + subj, rm_dat), pairwise ~ time)
With ANOVA, significance says “at least one group mean differs,” but the biology lives in which contrasts differ and how large those differences are. Report eta²/partial‑eta² and 95% CIs for pairwise comparisons (Tukey or planned contrasts) so effect sizes are front and center. For two‑way ANOVA, check the interaction first: if present, interpret main effects within levels of the other factor; if absent, main effects may be summarized cleanly.
Measures linear association between two continuous variables (−1 to +1).
Correlation tells you how much two continuous variables tend to move together — positively, negatively, or not at all. Pearson correlation measures linear relationships and assumes both variables are roughly normally distributed. It gives you a number between −1 and +1: the closer to ±1, the stronger the relationship. A positive correlation means the variables increase together (e.g., body size and fecundity), while a negative one means as one increases, the other decreases (e.g., age and metabolic rate). It’s easy to overinterpret, so always visualize it — and remember, correlation isn’t causation.
cor(mtcars$mpg, mtcars$wt, method = "pearson")
cor.test(mtcars$mpg, mtcars$wt)
# Plot with regression line (visual check)
ggplot(mtcars, aes(wt, mpg)) +
geom_point() +
geom_smooth(method="lm", se=TRUE) +
theme_minimal()
Correlation is about strength and direction of a linear relationship, not causation. Report r with a 95% CI and show the scatter with a fitted line; readers should see whether a few points are doing all the work. If a relationship is clearly nonlinear or has outliers, Pearson r can mislead—switch to Spearman or model the curve. Also ask “so what?”: a modest r can be biologically important if the variables matter, while a large r can be trivial if it’s an artifact (batch, time, etc.). Keep the story focused on practical meaning, not just the magnitude of r.
Predict a continuous response from one or more predictors; estimate slopes (effects).
Linear regression is a workhorse method for modeling the relationship between a continuous outcome and one or more predictors (categorical or continuous). It estimates slopes that tell you how much the outcome changes for each unit of the predictor, holding others constant. Assumptions include linearity, homoscedasticity (constant variance), and normal residuals. You should check diagnostic plots — they’re easy and informative. Regression also gives you effect sizes via the slopes and R², which tells you how much of the variation in the outcome your model explains. Multiple regression (more than one predictor) lets you control for confounding and test more complex hypotheses.
fit <- lm(mpg ~ wt + hp, data=mtcars)
summary(fit) # coefficients, p, R^2
car::vif(fit) # multicollinearity
plot(fit) # diagnostics
effectsize::standardize_parameters(fit) # standardized betas
Interpret regression by translating coefficients into expected changes on the original scale: “a 1‑unit increase in X is associated with β units change in Y.” Use standardized betas if units differ wildly so readers can compare predictors. Report R² (variance explained) and show diagnostics—if residuals fan out or the relationship bends, your inferences and CIs aren’t trustworthy. When multiple predictors are in play, interpret a coefficient as conditional on the others; if collinearity is high, admit the uncertainty and consider simpler models.
GLMs extend linear models with non‑normal errors and link functions:
Strengths:
GLMs extend linear regression to handle data that aren’t continuous and normally distributed — like binary outcomes, counts, or skewed data. Instead of fitting a straight line, GLMs use link functions to relate the predictors to the mean of a response with a specific distribution (binomial for yes/no, Poisson for counts, etc.). They shine when you’re dealing with log-odds, incidence rates, or proportions — e.g., modeling survival, parasite load, or presence/absence. While the output can look cryptic (log odds and deviance), GLMs are powerful and flexible. Focus on interpreting effect sizes like odds ratios and incidence rate ratios, and always check for overdispersion in count data.
# Example: predict am (0/1) from wt + hp
glm_bin <- glm(am ~ wt + hp, data=mtcars, family=binomial)
summary(glm_bin)
# Odds ratios and CIs
or <- exp(cbind(OR = coef(glm_bin), confint(glm_bin)))
or
# Effect "size": Odds Ratios; McFadden's pseudo-R2
library(pscl)
pR2(glm_bin)
# Fake counts
set.seed(7)
counts <- rpois(100, lambda = exp(0.3 + 0.5*rnorm(100)))
x <- rnorm(100)
dfc <- data.frame(counts, x)
m_pois <- glm(counts ~ x, family=poisson, data=dfc)
summary(m_pois)
# Check overdispersion quickly:
disp <- sum(residuals(m_pois, type="pearson")^2) / m_pois$df.residual
# If > ~1.5, consider quasi-Poisson or negative binomial:
m_qpois <- glm(counts ~ x, family=quasipoisson, data=dfc)
summary(m_qpois)
For GLMs, always bring estimates back to a natural effect size on the response scale. In logistic, exponentiate coefficients to get odds ratios (OR) and report OR with 95% CIs (“treatment doubles the odds of survival”). In Poisson/negative binomial, exponentiate to get incidence rate ratios (IRR). State whether you checked overdispersion (and how you handled it). Replace generic fit indices with something interpretable (e.g., calibration curves, confusion‑style summaries for logistic; predicted counts across x for Poisson). A small p with an OR≈1 isn’t compelling—effect size and uncertainty should carry the conclusion.
Handle hierarchical or repeated measures data by adding random effects (e.g., random intercept per subject, plot, batch).
Mixed models are what you reach for when your data have grouping or repeated measures — like multiple measurements per animal, per cage, or per site. They combine fixed effects (e.g., treatment) with random effects (e.g., individual ID), allowing you to account for non-independence and variation between groups. This lets you use all your data without violating assumptions of independence. Mixed models are especially helpful in field or lab settings where full randomization isn’t possible. They require a bit more statistical care but offer realistic models of the biological structure in your data. Look at both fixed effect estimates and marginal vs conditional R² to understand what’s going on.
library(lme4); library(lmerTest)
# Example: y ~ time with repeated measures per subject
rm_dat$timeN <- as.numeric(rm_dat$time)
mod_lmm <- lmer(y ~ time + (1|subj), data=rm_dat)
summary(mod_lmm)
# Random slopes
mod_rs <- lmer(y ~ time + (time|subj), data=rm_dat)
anova(mod_lmm, mod_rs)
# Effect sizes (marginal/conditional R2)
performance::r2_nakagawa(mod_lmm)
emmeans::emmeans(mod_lmm, pairwise ~ time)
Mixed models separate the average (fixed) effects you care about from the group‑level variation you must account for. Report fixed effects with CIs, and include marginal/conditional R² to show how much variance is explained by predictors vs the full model (including random effects). If random slopes matter, say so and explain the biological meaning (e.g., individuals differ in responsiveness). Emphasize predicted means/trajectories by group/subject with CIs; they make the repeated‑measures story visible. Note any convergence issues or singular fits—those hint at over‑complex random structures relative to the data.
Non-parametric tests are your backup plan when data don’t meet parametric assumptions (like normality), or when your response is ordinal or ranked. They’re less powerful than parametric tests but more robust when data are skewed or have outliers. The Wilcoxon rank-sum (Mann–Whitney) test replaces a two-sample t-test, while the Wilcoxon signed-rank test replaces the paired version. Kruskal-Wallis stands in for one-way ANOVA, and the Friedman test replaces repeated measures ANOVA. You can also use Spearman or Kendall correlation for monotonic relationships. They don’t rely on means — they use medians or rank sums — so they’re great when distributions are weird.
When assumptions (normality/homoscedasticity) are violated or data are ordinal.
# Rank-sum
wilcox.test(trt, ctrl)
# Signed-rank (paired)
wilcox.test(paired_after, paired_before, paired=TRUE)
# Kruskal-Wallis
kruskal.test(y ~ trt, data=dat)
# Friedman (wide-to-long may be needed)
# Simple example using a matrix: subjects in rows, treatments in columns
mat <- cbind(T1=rnorm(10,10,1), T2=rnorm(10,11,1), T3=rnorm(10,12,1))
friedman.test(mat)
# Spearman correlation
cor.test(mtcars$mpg, mtcars$hp, method="spearman")
effectsize
).Interpret non‑parametrics as shifts in distributions, not differences in means. Use median differences to quantify the shift with CIs. Visualize with boxplots/violin plots so readers see spread and overlap. A significant result tells you the typical ranks differ; it doesn’t say by how many units on the original scale unless you provide that estimate. If you chose a rank test because of outliers or skew, state that plainly—that transparency strengthens your conclusion.
Build a null distribution by repeatedly shuffling labels; compute p as the fraction of permutations with statistic ≥ observed.
Permutation tests are beautifully intuitive: you calculate your observed statistic (like a mean difference), then randomly shuffle your group labels thousands of times and see how often the randomized difference is as extreme as yours. It builds a null distribution from your data rather than relying on theoretical ones. This makes them powerful, assumption-light alternatives to standard tests — especially when sample sizes are small or distributions are strange. They can be used for means, medians, correlations, regression coefficients, and more.
set.seed(10)
x <- trt; y <- ctrl
obs_diff <- mean(x) - mean(y)
B <- 5000
pool <- c(x,y)
n1 <- length(x)
perm_diffs <- replicate(B, {
idx <- sample(length(pool))
mean(pool[idx[1:n1]]) - mean(pool[idx[-(1:n1)]])
})
p_val <- mean(abs(perm_diffs) >= abs(obs_diff))
p_val
# Plot
ggplot(data.frame(perm=perm_diffs), aes(perm)) +
geom_histogram(bins=40) +
geom_vline(xintercept=obs_diff, linewidth=1) +
theme_minimal()
With permutation tests, your p‑value is literally the fraction of label‑shuffled worlds that look as extreme as your data. That story is easy to tell and defend. Pair the p‑value with the observed effect size (mean difference, correlation, etc.) and a null histogram so readers see where your estimate sits relative to chance. Because permutations respect your design, highlight that advantage when assumptions are doubtful. If sample size is small, note the resolution of the p‑value (e.g., with 5,000 permutations, the smallest nonzero p is 1/5000).
Simulate data from a model many times to understand variability, bias, or power.
Monte Carlo methods involve repeated random sampling to explore how a process behaves. You can use them to estimate sampling distributions, check how often a confidence interval actually captures the true value, simulate null distributions, or assess statistical power. The key idea is: if you can describe a process, you can simulate it. In experimental design, they help you test your analysis plan before collecting data or understand how your stats might behave under different assumptions.
set.seed(11)
B <- 2000; n <- 20; lambda <- 0.2 # exponential mean = 5
cover <- replicate(B, {
samp <- rexp(n, lambda)
ci <- t.test(samp)$conf.int
mean(samp) >= ci[1] && mean(samp) <= ci[2]
})
mean(cover) # ~ coverage rate
Monte Carlo results are about behavior under repeated sampling: coverage, bias, power, Type I error—whatever you set out to measure. Summarize with clear metrics (“95% CI covered the truth 91% of the time under skew”) and plots of the simulated distributions so students can see what “typical” looks like versus “extreme.” Tie the findings back to design choices: sample size, variance, skew, or effect size. The win is actionable guidance—“with n=20 per group and this variance, power is ~0.7; we need ~n=28 for 0.8.”
Fit a Bayesian model → simulate new data from the posterior predictive → compare simulated vs observed to check model fit.
Posterior predictive checks are a key part of Bayesian data analysis. Once you’ve fit a Bayesian model, you simulate new datasets from it and compare them to your actual data. If your model is reasonable, the simulated data should “look like” the observed data in terms of spread, structure, or key statistics. These checks are flexible and powerful for model validation, particularly when classical residuals don’t apply (e.g., in GLMs). They’re also great teaching tools — they help students see what a model is really claiming.
# Illustrative via bootstrapping-like posterior predictive check using a fitted LM
fit <- lm(mpg ~ wt, data=mtcars)
pred <- predict(fit, se.fit=TRUE)
# Simulate new outcomes using fitted mean and residual SD
set.seed(12)
y_rep <- replicate(1000, pred$fit + rnorm(nrow(mtcars), 0, sd(residuals(fit))))
# Compare distributions (e.g., means)
obs_mean <- mean(mtcars$mpg)
rep_means <- colMeans(y_rep)
mean(rep_means >= obs_mean) # rough tail-area check
For full Bayesian workflow, consider brms or rstanarm for formal posterior predictive checks.
Posterior predictive checks ask: if the model were true, would it generate data that look like ours? Show a statistic (mean, dispersion, zero‑inflation, max) or a visual (densities, residual patterns) comparing observed to posterior predictive distributions. If the observed falls in the tails, discuss which aspect of the data the model misses (skew, clumping, heterogeneity) and what revision would address it (different link, distribution, random effects). Emphasize that this is model criticism, not a p‑value hunt—the goal is a model that reproduces the salient biology.
Dimensionality reduction techniques help you make sense of high-dimensional data — gene expression, behavioral assays, morphometrics, etc. PCA (Principal Component Analysis) finds axes (components) that explain the most variance in your data. MDS (Multidimensional Scaling) preserves pairwise distances between samples. Methods like t-SNE and UMAP preserve local relationships and are excellent for visualizing clusters, even if the axes are hard to interpret. These tools are about summarizing complex data into a form you can plot, analyze, or cluster — they don’t test hypotheses directly, but they make your downstream work easier.
Summarize high‑dimensional data into fewer dimensions while preserving structure.
X <- scale(iris[,1:4])
p <- prcomp(X, center=TRUE, scale.=TRUE)
summary(p) # variance explained
biplot(p) # base R quick look
# Pretty ggplot
scores <- data.frame(p$x, Species=iris$Species)
ggplot(scores, aes(PC1, PC2, color=Species)) +
geom_point(size=2) +
theme_minimal()
dist_mat <- dist(X)
mds <- cmdscale(dist_mat, k=2)
mds_df <- data.frame(MDS1=mds[,1], MDS2=mds[,2], Species=iris$Species)
ggplot(mds_df, aes(MDS1, MDS2, color=Species)) +
geom_point(size=2) +
theme_minimal()
For t‑SNE/UMAP, use packages like Rtsne or uwot.
Interpret DR plots as maps, not proofs. For PCA, report the variance explained by PCs and describe the loadings to say which variables define each axis; then interpret group separation (or overlap) in that light. For MDS, stress that distances in the plot approximate dissimilarities—clusters that are close are similar by your chosen metric. For t‑SNE/UMAP, focus on local neighborhoods and avoid over‑reading global distances; reproducibility across seeds and parameters matters. Always tie patterns back to biology (treatment, batch, genotype) and follow up with models if you want inference, not just visualization.
We used base R for data shaping, but sometimes tidyverse is cleaner; ggplot2 is our default for plots.
library(dplyr)
df <- as.data.frame(mtcars) |>
mutate(cyl = factor(cyl)) |>
group_by(cyl) |>
summarize(mpg_mean = mean(mpg), .groups="drop")
df
# Points + smooth
ggplot(mtcars, aes(wt, mpg)) +
geom_point(alpha=0.7) +
geom_smooth(method="lm", se=TRUE) +
theme_minimal()
# Group means with CIs
ggplot(dat, aes(trt, y)) +
stat_summary(fun=mean, geom="point", size=3) +
stat_summary(fun.data=mean_cl_normal, geom="errorbar", width=0.2) +
theme_minimal()
library(broom)
fit <- lm(mpg ~ wt + hp, data=mtcars)
tidy(fit) # table of coefficients
glance(fit) # model-level stats (R^2, AIC)
augment(fit) # fitted values, residuals