Why this handout

This handout shows how to use natural‑language prompts with an AI assistant (such as ChatGPT) to produce clean, runnable R code in RStudio for common statistics and plots, without getting bogged down in syntax. It covers prompt patterns, checklists, and ready‑to‑run examples.


Quick rules for great prompts

  • Goal + Data + Output. Say what you want, which dataset/columns, and the output (plot, test, model).
  • Name tools. e.g., “Use ggplot2” or “base R only.”
  • Be concrete. Provide column names and data types.
  • Iterate. Ask for edits (titles, colors, grouping).
  • Ask for comments. “Add comments to each step.”
  • Paste errors back. “Explain this error and fix it.”
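
To make the "Be concrete" rule easy to follow, a short helper can print column names and types in a form you can paste straight into a prompt. This is a minimal sketch: `describe_columns` and the `demo` data frame are hypothetical names made up for illustration, not part of any package.

```r
# Hypothetical helper: list each column with its (first) class,
# e.g. "y (numeric), x (numeric), group (factor)"
describe_columns <- function(df) {
  types <- vapply(df, function(col) class(col)[1], character(1))
  paste0(names(types), " (", types, ")", collapse = ", ")
}

# Toy data frame standing in for your own data
demo <- data.frame(
  y     = c(1.2, 3.4),
  x     = c(0.5, 0.7),
  group = factor(c("a", "b"))
)

cat("Columns:", describe_columns(demo), "\n")
```

Paste the printed line into your prompt in place of a hand-typed column list.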

Copy‑me template:

I’m in RStudio. I have a data frame DF with columns: y (numeric), x (numeric), group (factor). Make a scatterplot of y vs x, color by group, add loess smoothers, and nice axis labels. Use ggplot2. Give me only runnable R code with comments.


Minimal dependencies

The examples below use base R and ggplot2.

if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

Prompt → Code: common tasks

1) Load a CSV and inspect

Good prompt:

“Read data/my_study.csv, show first rows, column names, and a basic summary. Report missing values by column. Code only.”

path <- "data/my_study.csv"
if (file.exists(path)) {
  dat <- read.csv(path, stringsAsFactors = FALSE)
  cat("Rows x Cols:", nrow(dat), "x", ncol(dat), "\n")
  cat("First rows:\n"); print(head(dat))
  cat("Column names:\n"); print(names(dat))
  cat("\nMissing values per column:\n"); print(colSums(is.na(dat)))
  cat("\nSummary:\n"); print(summary(dat))
} else {
  cat("Demo mode: file not found; using built-in 'mtcars'.\n")
  dat <- mtcars
  dat$cyl <- factor(dat$cyl)
  print(head(dat))
}
## Demo mode: file not found; using built-in 'mtcars'.
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2) Clean/transform (base R only)

Good prompt:

“Recode group to factor, drop rows with missing y or x, create log_y = log(y). Base R only, with comments.”

df <- dat
if ("cyl" %in% names(df)) df$cyl <- factor(df$cyl)  # demo grouping
df <- df[!is.na(df$mpg) & !is.na(df$wt), ]          # demo: keep complete rows
df$log_mpg <- log(df$mpg)
summary(df$log_mpg)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.342   2.736   2.955   2.958   3.127   3.523
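
The demo above adapts the prompt to mtcars. For a data frame that matches the prompt literally (columns y, x, group), the same steps might look like the sketch below. The values in `DF` are made up for illustration, and the guard against non-positive y is an added assumption, since log() is undefined there.

```r
# Toy data standing in for your own DF (values are invented)
DF <- data.frame(
  y     = c(2.5, NA, 7.1, 4.0),
  x     = c(1, 2, NA, 4),
  group = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

# 1. Recode group to a factor
DF$group <- factor(DF$group)

# 2. Drop rows with missing y or x
DF <- DF[complete.cases(DF[, c("y", "x")]), ]

# 3. Create log_y, leaving NA wherever y is not strictly positive
DF$log_y <- NA_real_
positive <- DF$y > 0
DF$log_y[positive] <- log(DF$y[positive])

DF
```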

3) Plot beautifully with ggplot2

Good prompt:

“Scatterplot mpg vs wt, color by cyl, smooth per group, minimal theme, clear labels.”

ggplot(df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(alpha = 0.8, size = 2.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # explicit defaults; one smoother per cyl group
  labs(
    title = "Fuel Efficiency vs Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal(base_size = 14)

4) Two‑group comparison (Welch’s t‑test) + effect size

Good prompt:

“Compare mpg between 4‑ and 6‑cyl. Welch’s t‑test, Cohen’s d, one‑sentence interpretation.”

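cars46 <- df[df$cyl %in% c("4", "6"), ]  # keep only 4- and 6-cylinder cars
x <- cars46$mpg[cars46$cyl == "4"]       # mpg, 4-cylinder
y <- cars46$mpg[cars46$cyl == "6"]       # mpg, 6-cylinder
tt <- t.test(x, y)  # Welch by default (var.equal = FALSE)
# Cohen's d using the unweighted average of the two variances;
# an n-weighted pooled SD gives a different value when group sizes differ
d  <- (mean(x) - mean(y)) / sqrt((sd(x)^2 + sd(y)^2)/2)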
tt
## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 4.7191, df = 12.956, p-value = 0.0004048
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.751376 10.090182
## sample estimates:
## mean of x mean of y 
##  26.66364  19.74286
cat(sprintf("Cohen's d = %.2f\n", d))
## Cohen's d = 2.07
p_txt <- if (tt$p.value < 0.001) "p<0.001" else sprintf("p=%.3f", tt$p.value)  # avoid printing p=0.000
cat(sprintf("Interpretation: Mean mpg differs between 4- and 6-cylinder cars (t=%.2f, %s); effect size d=%.2f.\n",
            unname(tt$statistic), p_txt, d))
## Interpretation: Mean mpg differs between 4- and 6-cylinder cars (t=4.72, p<0.001); effect size d=2.07.
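
The d above divides by the square root of the unweighted average of the two variances. With unequal group sizes (here 11 four-cylinder and 7 six-cylinder cars), many textbooks instead weight each variance by its degrees of freedom. The sketch below shows that variant; `cohens_d_pooled` is a hypothetical helper name, and which definition to report is a judgment call you should state in your write-up.

```r
# Hypothetical helper: Cohen's d with the n-weighted pooled SD
cohens_d_pooled <- function(x, y) {
  nx <- length(x)
  ny <- length(y)
  # Pooled SD: variances weighted by their degrees of freedom
  sp <- sqrt(((nx - 1) * var(x) + (ny - 1) * var(y)) / (nx + ny - 2))
  (mean(x) - mean(y)) / sp
}

x <- mtcars$mpg[mtcars$cyl == 4]  # 4-cylinder mpg
y <- mtcars$mpg[mtcars$cyl == 6]  # 6-cylinder mpg

d_pooled <- cohens_d_pooled(x, y)
cat(sprintf("Cohen's d (pooled SD) = %.2f\n", d_pooled))
```

When the two groups have the same size, both definitions give the same number, so the choice only matters for unbalanced data.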

5) One‑way ANOVA + diagnostics + Tukey

Good prompt:

“ANOVA mpg ~ cyl, residual plot + QQ plot + Shapiro, Tukey post‑hoc if significant. One‑line takeaway.”

fit <- aov(mpg ~ cyl, data = df)
summary(fit)
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          2  824.8   412.4    39.7 4.98e-09 ***
## Residuals   29  301.3    10.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted"); abline(h=0, lty=2)
qqnorm(resid(fit)); qqline(resid(fit))

par(mfrow = c(1, 1))

shapiro <- shapiro.test(resid(fit)); shapiro
## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.97065, p-value = 0.5177
p_anova <- summary(fit)[[1]]["cyl","Pr(>F)"]
if (!is.na(p_anova) && p_anova < 0.05) {
  tk <- TukeyHSD(fit, "cyl"); print(tk)
  cat("Takeaway: Cylinders explain mpg differences; see Tukey pairs.\n")
} else {
  cat("Takeaway: No strong evidence that mpg differs by cylinders.\n")
}
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg ~ cyl, data = df)
## 
## $cyl
##           diff        lwr        upr     p adj
## 6-4  -6.920779 -10.769350 -3.0722086 0.0003424
## 8-4 -11.563636 -14.770779 -8.3564942 0.0000000
## 8-6  -4.642857  -8.327583 -0.9581313 0.0112287
## 
## Takeaway: Cylinders explain mpg differences; see Tukey pairs.

6) Linear regression with standardized effects

Good prompt:

“Fit mpg ~ wt + hp. Report standardized betas and R². One‑sentence interpretation.”

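# Z-score the predictors only: mpg stays in its own units, so each coefficient
# is the change in mpg per 1 SD of that predictor (not a fully standardized beta)
Z <- scale(df[, c("wt", "hp")])
fit_lm <- lm(df$mpg ~ Z[, "wt"] + Z[, "hp"])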
summary(fit_lm)
## 
## Call:
## lm(formula = df$mpg ~ Z[, "wt"] + Z[, "hp"])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0906     0.4585  43.822  < 2e-16 ***
## Z[, "wt"]    -3.7943     0.6191  -6.129 1.12e-06 ***
## Z[, "hp"]    -2.1784     0.6191  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12
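b <- coef(fit_lm)
cat(sprintf("Interpretation: Holding the other predictor constant, a 1 SD increase in wt is associated with a change of %.2f mpg, and a 1 SD increase in hp with a change of %.2f mpg.\n", b[2], b[3]))
## Interpretation: Holding the other predictor constant, a 1 SD increase in wt is associated with a change of -3.79 mpg, and a 1 SD increase in hp with a change of -2.18 mpg.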
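
The model above scales only the predictors, so its coefficients are in mpg per SD. If you want fully standardized betas (SD of mpg per SD of predictor), one common approach is to z-score the outcome as well. The sketch below assumes that definition; `std` and `fit_std` are illustrative names.

```r
# Z-score outcome and predictors together
vars <- c("mpg", "wt", "hp")
std  <- as.data.frame(scale(mtcars[, vars]))

# Same model, now on the standardized scale
fit_std <- lm(mpg ~ wt + hp, data = std)
print(round(coef(fit_std), 3))  # intercept is ~0 by construction

# R-squared is unchanged by rescaling the variables
r2_std <- summary(fit_std)$r.squared
r2_raw <- summary(lm(mpg ~ wt + hp, data = mtcars))$r.squared
cat(sprintf("R2 standardized = %.4f, R2 raw = %.4f\n", r2_std, r2_raw))
```

Using `data = std` with plain column names also keeps the coefficient labels readable (wt, hp) instead of Z[, "wt"].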

Prompt patterns (copy/paste)

Explain + code

“Act as a stats tutor. In RStudio with DF(col1 numeric, col2 factor), run a Welch’s t‑test comparing col1 across col2 groups, include comments, and explain the output in 2 sentences.”

Plot with constraints

“Use base R only to draw side‑by‑side boxplots of y by group from DF. Add a title and axis labels. Code only.”
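
A reply to this prompt might look like the sketch below. ToothGrowth stands in for DF here (len plays the role of y, supp the role of group); the title and labels are illustrative choices.

```r
# Side-by-side boxplots with base R graphics only
bp <- boxplot(
  len ~ supp,
  data = ToothGrowth,
  main = "Tooth length by supplement",
  xlab = "Supplement",
  ylab = "Tooth length",
  col  = "grey90"
)
```

`boxplot()` also returns the plotted statistics invisibly, which is why the result is stored in `bp`.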

Reproduce & compare

“Give two solutions to plot y ~ x with a smooth: one using base R and one using ggplot2. Label axes and set a white background.”
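
As a rough idea of what to expect back, the sketch below draws the same relationship twice, with mtcars standing in for DF (wt as x, mpg as y). The base R version uses lowess() and the ggplot2 version uses a loess smoother, so the two curves are similar but not identical. The ggplot2 part runs only if the package is installed.

```r
# Solution 1: base R (the default graphics background is white)
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     main = "mpg vs wt (base R)", pch = 19)
sm <- lowess(mtcars$wt, mtcars$mpg)  # locally weighted smoother
lines(sm, lwd = 2)

# Solution 2: ggplot2, with theme_bw() for a white panel background
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
    labs(title = "mpg vs wt (ggplot2)",
         x = "Weight (1000 lbs)", y = "Miles per Gallon") +
    theme_bw()
  print(p)
}
```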

Troubleshoot

“I get object 'Sepal.Length' not found. Here is my code: [paste]. Explain the error and give a fixed version.”


Error‑driven workflow

  1. Paste the exact error into your AI prompt.
  2. Include the code (or the failing line).
  3. State what you expected (“I wanted a plot with three groups”).
  4. Ask for a fix + explanation.
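
The steps above can be practiced on a small, deliberate failure. The sketch below captures the exact error text with tryCatch() so it can be pasted into a prompt, then shows one possible fix. The failing line and the variable names are illustrative, and the message wording may vary with your R version and language settings.

```r
# Step 1-2: run the failing line and capture the exact error message
err_msg <- tryCatch(
  {
    mean(Sepal.Length)  # fails: column referenced without its data frame
    NA_character_
  },
  error = function(e) conditionMessage(e)
)
cat("Error to paste:", err_msg, "\n")

# Step 4 (one possible fix): refer to the column through the data frame
fixed <- mean(iris$Sepal.Length)
cat("Mean sepal length:", fixed, "\n")
```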

Built‑in practice datasets

  • iris — sepal/petal by species. Load: data(iris); head(iris)
  • mtcars — car performance stats. Load: data(mtcars); head(mtcars)
  • ToothGrowth — tooth length by supplement & dose. Load: data(ToothGrowth); head(ToothGrowth)

Exercises (prompt → code → interpret)

  1. T‑test (ToothGrowth): Compare tooth length by supp. Plot (violin/boxplot), Welch’s t‑test, effect size, one‑sentence interpretation.
  2. ANOVA (iris): Test whether Sepal.Length differs by Species. Diagnostics + Tukey; plot means with 95% CIs.
  3. Regression (mtcars): Model mpg ~ wt + hp. Report standardized betas, R², and a partial‑effect plot for wt controlling for hp.

Copy‑me prompt + scaffold

Prompt to paste into AI:

I’m in RStudio. I have DF with y (numeric outcome), x (numeric predictor), grp (factor). Please produce only runnable R code that:

  1. Removes rows with missing y or x.
  2. Makes a scatterplot of y vs x, color by grp, with a smooth per group.
  3. Fits y ~ x + grp and reports coefficients and R².
  4. Uses ggplot2, includes comments, runs as‑is.

Scaffold (mtcars stands in for your data):

DF <- mtcars; DF$grp <- factor(DF$cyl)  # replace with your data
DF <- DF[!is.na(DF$mpg) & !is.na(DF$wt), ]

ggplot(DF, aes(wt, mpg, color = grp)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # explicit defaults; one smoother per grp
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", color = "Cylinders") +
  theme_minimal(base_size = 14)

fit <- lm(mpg ~ wt + grp, data = DF)
summary(fit)
## 
## Call:
## lm(formula = mpg ~ wt + grp, data = DF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## grp6         -4.2556     1.3861  -3.070 0.004718 ** 
## grp8         -6.0709     1.6523  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11

Mini cheat‑sheet: which test?

  • Two groups, numeric outcome: Welch’s t‑test.
  • >2 groups, numeric outcome: ANOVA (+ Tukey).
  • Numeric outcome + predictors: Linear regression.
  • Counts or categories: Chi‑square test for contingency tables; Poisson GLM for count outcomes; logistic GLM for binary outcomes.
  • Assumption issues: Use non‑parametrics (Wilcoxon/Kruskal–Wallis), transforms, or appropriate GLMs.
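
The less common entries in the list above map to one-line calls in base R. The sketch below runs each on a built-in dataset; the variable pairings (for example carb as a count outcome, am as a binary outcome) are chosen only for illustration, not because they answer a meaningful research question.

```r
# Two independent groups, non-parametric: Wilcoxon rank-sum test
# (exact = FALSE requests the normal approximation, which tolerates ties)
w <- wilcox.test(len ~ supp, data = ToothGrowth, exact = FALSE)

# More than two groups, non-parametric: Kruskal-Wallis test
k <- kruskal.test(Sepal.Length ~ Species, data = iris)

# Two categorical variables: chi-square test on a contingency table
tab <- table(engine = mtcars$vs, transmission = mtcars$am)
ch  <- chisq.test(tab)

# Count outcome: Poisson GLM
pois <- glm(carb ~ wt, family = poisson, data = mtcars)

# Binary outcome: logistic GLM
logit <- glm(am ~ wt, family = binomial, data = mtcars)

# Collect the test p-values in one named vector
p_values <- c(wilcoxon = w$p.value, kruskal = k$p.value, chisq = ch$p.value)
print(signif(p_values, 3))
print(coef(summary(pois)))
print(coef(summary(logit)))
```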
