Why this handout

This handout shows how to use natural‑language prompts with an AI assistant (such as ChatGPT) to produce clean, runnable R code in RStudio for common statistics and plots, without over‑emphasizing syntax. It covers prompt patterns, checklists, and ready‑to‑run examples.

Quick rules for great prompts

Goal + Data + Output. Say what you want, which dataset/columns, and the output (plot, test, model).
Name tools. e.g., “Use ggplot2” or “base R only.”
Be concrete. Provide column names and data types.
Iterate. Ask for edits (titles, colors, grouping).
Ask for comments. “Add comments to each step.”
Paste errors back. “Explain this error and fix it.”
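
One way to follow the "be concrete" rule is to let R write the column description for you. The sketch below is only an illustration: `describe_cols` is a hypothetical helper name, and `iris` stands in for your own data frame.

```r
# Hypothetical helper: build a "name (type)" list to paste into a prompt
describe_cols <- function(d) {
  # Take the first class of each column (e.g. "numeric", "factor")
  types <- vapply(d, function(col) class(col)[1], character(1))
  paste0(names(d), " (", types, ")", collapse = ", ")
}

col_desc <- describe_cols(iris)
cat("I have a data frame with columns:", col_desc, "\n")
```

Pasting that line at the top of a prompt saves the assistant from guessing column names or types.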

Copy‑me template:

I’m in RStudio. I have a data frame DF with columns: y (numeric), x (numeric), group (factor). Make a scatterplot of y vs x, color by group, add loess smoothers, and nice axis labels. Use ggplot2. Give me only runnable R code with comments.

Minimal dependencies

The examples below use base R and ggplot2.

if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
library(ggplot2)

Prompt → Code: common tasks

1) Load a CSV and inspect

Good prompt:

“Read data/my_study.csv, show first rows, column names, and a basic summary. Report missing values by column. Code only.”

path <- "data/my_study.csv"
if (file.exists(path)) {
  # Read the CSV and report its shape, first rows, names, missingness, and summary
  dat <- read.csv(path, stringsAsFactors = FALSE)
  cat("Rows x Cols:", nrow(dat), "x", ncol(dat), "\n")
  cat("First rows:\n"); print(head(dat))
  cat("Column names:\n"); print(names(dat))
  cat("\nMissing values per column:\n"); print(colSums(is.na(dat)))
  cat("\nSummary:\n"); print(summary(dat))
} else {
  # Fallback so the handout runs even without the CSV
  cat("Demo mode: file not found; using built-in 'mtcars'.\n")
  dat <- mtcars
  dat$cyl <- factor(dat$cyl)
  print(head(dat))
}

## Demo mode: file not found; using built-in 'mtcars'.
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

2) Clean/transform (base R only)

Good prompt:

“Recode group to factor, drop rows with missing y or x, create log_y = log(y). Base R only, with comments.”

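# Work on a copy so the loaded data stay untouched
df <- dat
# Recode the grouping column to a factor (cyl stands in for group in the demo)
if ("cyl" %in% names(df)) df$cyl <- factor(df$cyl)
# Drop rows with a missing outcome (mpg) or predictor (wt)
df <- df[!is.na(df$mpg) & !is.na(df$wt), ]
# Log-transform the outcome; log() needs strictly positive values
df$log_mpg <- log(df$mpg)
summary(df$log_mpg)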

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.342   2.736   2.955   2.958   3.127   3.523
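
The demo above uses the mtcars columns. As a rough sketch of the same three steps on the generic names from the prompt, the toy data frame below (made-up values, used only for illustration) has columns `y`, `x`, and `group`:

```r
# Toy data, invented for illustration only
DF <- data.frame(
  y     = c(2.5, NA, 7.1, 4.0, 9.3),
  x     = c(1, 2, NA, 4, 5),
  group = c("a", "b", "a", "b", "a")
)

# 1) Recode group to a factor
DF$group <- factor(DF$group)

# 2) Drop rows where y or x is missing
DF <- DF[complete.cases(DF[, c("y", "x")]), ]

# 3) Create the log-transformed outcome
DF$log_y <- log(DF$y)

str(DF)
```

`complete.cases()` on the two columns does the same job as chaining `!is.na()` checks and scales better when more columns must be complete.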

3) Plot beautifully with ggplot2

Good prompt:

“Scatterplot mpg vs wt, color by cyl, smooth per group, minimal theme, clear labels.”

ggplot(df, aes(x = wt, y = mpg, color = cyl)) +
  geom_point(alpha = 0.8, size = 2.5) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # explicit method/formula avoids the default-choice message
  labs(
    title = "Fuel Efficiency vs Weight",
    x = "Weight (1000 lbs)",
    y = "Miles per Gallon",
    color = "Cylinders"
  ) +
  theme_minimal(base_size = 14)

4) Two‑group comparison (Welch’s t‑test) + effect size

Good prompt:

“Compare mpg between 4‑ and 6‑cyl. Welch’s t‑test, Cohen’s d, one‑sentence interpretation.”

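# Keep only the 4- and 6-cylinder cars
sub <- df[df$cyl %in% c("4", "6"), ]
x <- sub$mpg[sub$cyl == "4"]; y <- sub$mpg[sub$cyl == "6"]
tt <- t.test(x, y)  # Welch by default
# Cohen's d using the average of the two group variances (no weighting by n)
d  <- (mean(x) - mean(y)) / sqrt((sd(x)^2 + sd(y)^2)/2)
tt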

## 
##  Welch Two Sample t-test
## 
## data:  x and y
## t = 4.7191, df = 12.956, p-value = 0.0004048
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   3.751376 10.090182
## sample estimates:
## mean of x mean of y 
##  26.66364  19.74286

cat(sprintf("Cohen's d = %.2f\n", d))

## Cohen's d = 2.07

cat(sprintf("Interpretation: Mean mpg differs between 4- and 6-cylinder cars (t=%.2f, p=%.4f); effect size d=%.2f.\n",
            unname(tt$statistic), tt$p.value, d))

## Interpretation: Mean mpg differs between 4- and 6-cylinder cars (t=4.72, p=0.0004); effect size d=2.07.
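
The effect size above averages the two group variances without weighting by sample size. A common alternative, shown here only as a sketch, pools the variances weighted by n - 1; `cohens_d_pooled` is a hypothetical helper name, not something the handout or base R defines.

```r
# Assumption: pooled-SD form of Cohen's d, weighting each variance by n - 1
cohens_d_pooled <- function(a, b) {
  na <- length(a); nb <- length(b)
  sp <- sqrt(((na - 1) * var(a) + (nb - 1) * var(b)) / (na + nb - 2))
  (mean(a) - mean(b)) / sp
}

mpg4 <- mtcars$mpg[mtcars$cyl == 4]
mpg6 <- mtcars$mpg[mtcars$cyl == 6]

# Average-variance form, as used in the handout
d_avg <- (mean(mpg4) - mean(mpg6)) / sqrt((var(mpg4) + var(mpg6)) / 2)
# Pooled form
d_pooled <- cohens_d_pooled(mpg4, mpg6)

cat(sprintf("d (average variance) = %.2f; d (pooled SD) = %.2f\n", d_avg, d_pooled))
```

With unequal group sizes and unequal variances the two versions differ, so it is worth telling the assistant which one you want.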

5) One‑way ANOVA + diagnostics + Tukey

Good prompt:

“ANOVA mpg ~ cyl, residual plot + QQ plot + Shapiro, Tukey post‑hoc if significant. One‑line takeaway.”

fit <- aov(mpg ~ cyl, data = df)
summary(fit)

##             Df Sum Sq Mean Sq F value   Pr(>F)    
## cyl          2  824.8   412.4    39.7 4.98e-09 ***
## Residuals   29  301.3    10.4                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

par(mfrow = c(1, 2))
plot(fitted(fit), resid(fit), xlab="Fitted", ylab="Residuals", main="Residuals vs Fitted"); abline(h=0, lty=2)
qqnorm(resid(fit)); qqline(resid(fit))

par(mfrow = c(1, 1))

shapiro <- shapiro.test(resid(fit)); shapiro

## 
##  Shapiro-Wilk normality test
## 
## data:  resid(fit)
## W = 0.97065, p-value = 0.5177

p_anova <- summary(fit)[[1]]["cyl","Pr(>F)"]
if (!is.na(p_anova) && p_anova < 0.05) {
  tk <- TukeyHSD(fit, "cyl"); print(tk)
  cat("Takeaway: Mean mpg differs across cylinder groups; see Tukey pairs.\n")
} else {
  cat("Takeaway: No strong evidence that mpg differs by cylinders.\n")
}

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = mpg ~ cyl, data = df)
## 
## $cyl
##           diff        lwr        upr     p adj
## 6-4  -6.920779 -10.769350 -3.0722086 0.0003424
## 8-4 -11.563636 -14.770779 -8.3564942 0.0000000
## 8-6  -4.642857  -8.327583 -0.9581313 0.0112287
## 
## Takeaway: Mean mpg differs across cylinder groups; see Tukey pairs.
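
If you also want an effect size for the ANOVA, one option (an addition to the prompt above, sketched under the assumption that eta squared is the measure you want) is to divide the between-group sum of squares by the total sum of squares from the ANOVA table.

```r
# Refit on mtcars so the block runs on its own
fit_aov <- aov(mpg ~ factor(cyl), data = mtcars)

# The first element of summary() is the ANOVA table as a data frame
tab <- summary(fit_aov)[[1]]
ss  <- tab[["Sum Sq"]]          # c(between-group SS, residual SS)

# Eta squared: share of total variation attributable to the grouping
eta_sq <- ss[1] / sum(ss)
cat(sprintf("Eta squared = %.3f\n", eta_sq))
```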

6) Linear regression with standardized effects

Good prompt:

“Fit mpg ~ wt + hp. Report standardized betas and R². One‑sentence interpretation.”

Z <- scale(df[, c("wt", "hp")])
fit_lm <- lm(df$mpg ~ Z[, "wt"] + Z[, "hp"])
summary(fit_lm)

## 
## Call:
## lm(formula = df$mpg ~ Z[, "wt"] + Z[, "hp"])
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.941 -1.600 -0.182  1.050  5.854 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  20.0906     0.4585  43.822  < 2e-16 ***
## Z[, "wt"]    -3.7943     0.6191  -6.129 1.12e-06 ***
## Z[, "hp"]    -2.1784     0.6191  -3.519  0.00145 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.593 on 29 degrees of freedom
## Multiple R-squared:  0.8268, Adjusted R-squared:  0.8148 
## F-statistic: 69.21 on 2 and 29 DF,  p-value: 9.109e-12

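b <- coef(fit_lm)
cat(sprintf("Interpretation: Holding the other predictor constant, a 1 SD increase in wt changes predicted mpg by %.2f and a 1 SD increase in hp by %.2f (R^2 = %.2f). Only the predictors are scaled, so effects are in mpg units.\n",
            b[2], b[3], summary(fit_lm)$r.squared))

## Interpretation: Holding the other predictor constant, a 1 SD increase in wt changes predicted mpg by -3.79 and a 1 SD increase in hp by -2.18 (R^2 = 0.83). Only the predictors are scaled, so effects are in mpg units.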
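
The model above scales only the predictors, so its slopes are in mpg units per SD. If the prompt is read as asking for fully standardized betas (outcome scaled as well), a minimal sketch is to scale all three columns and fit through a data frame. The object names here are illustrative.

```r
# Scale outcome and predictors together, then fit with a formula on a data frame
std <- as.data.frame(scale(mtcars[, c("mpg", "wt", "hp")]))
fit_std <- lm(mpg ~ wt + hp, data = std)
betas <- coef(fit_std)[c("wt", "hp")]

# Cross-check: standardized beta = raw slope * sd(x) / sd(y)
fit_raw <- lm(mpg ~ wt + hp, data = mtcars)
beta_wt_check <- unname(coef(fit_raw)["wt"] * sd(mtcars$wt) / sd(mtcars$mpg))

print(round(betas, 3))
cat(sprintf("R^2 = %.3f (unchanged by scaling)\n", summary(fit_std)$r.squared))
```

Fitting on a data frame also keeps the coefficient names short (`wt`, `hp`) instead of `Z[, "wt"]`.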

Prompt patterns (copy/paste)

Explain + code

“Act as a stats tutor. In RStudio with DF(col1 numeric, col2 factor), run a Welch’s t‑test comparing col1 across col2 groups, include comments, and explain the output in 2 sentences.”
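
As a rough idea of what an answer to this pattern might look like, the sketch below builds a stand-in `DF` from mtcars (mpg as `col1`, transmission type as `col2`); your own data and the assistant's actual output will differ.

```r
# Stand-in data: col1 numeric, col2 a two-level factor (am: 0 = automatic, 1 = manual)
DF <- data.frame(
  col1 = mtcars$mpg,
  col2 = factor(mtcars$am, labels = c("automatic", "manual"))
)

# Welch's t-test is the default for t.test() (var.equal = FALSE)
res <- t.test(col1 ~ col2, data = DF)
print(res)

# Pull out the pieces you would quote in a two-sentence explanation
cat(sprintf("t = %.2f, df = %.1f, p = %.4f\n",
            unname(res$statistic), unname(res$parameter), res$p.value))
```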

Plot with constraints

“Use base R only to draw side‑by‑side boxplots of y by group from DF. Add a title and axis labels. Code only.”
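
A base R answer to this prompt could look roughly like the following sketch, where `DF` is a stand-in built from mtcars (mpg as `y`, cylinders as `group`).

```r
# Stand-in data for illustration
DF <- data.frame(y = mtcars$mpg, group = factor(mtcars$cyl))

# Side-by-side boxplots; boxplot() also returns the plotted statistics invisibly
bp <- boxplot(y ~ group, data = DF,
              main = "Miles per Gallon by Cylinders",
              xlab = "Cylinders", ylab = "Miles per Gallon",
              col = "grey90")

# Group sizes, handy for a caption
print(bp$n)
```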

Reproduce & compare

“Give two solutions to plot y ~ x with a smooth: one using base R and one using ggplot2. Label axes and set a white background.”
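
The ggplot2 side of this comparison appears in task 3. For the base R side, one possible sketch (using mtcars and a lowess smoother as assumptions, since the prompt does not fix the data or the smoother) is:

```r
# White background, restored afterwards
op <- par(bg = "white")

# Scatterplot with labelled axes
plot(mtcars$wt, mtcars$mpg, pch = 19, col = "grey30",
     xlab = "Weight (1000 lbs)", ylab = "Miles per Gallon",
     main = "Fuel Efficiency vs Weight")

# lowess() returns smoothed (x, y) pairs sorted by x, ready for lines()
sm <- lowess(mtcars$wt, mtcars$mpg)
lines(sm, col = "steelblue", lwd = 2)

par(op)
```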

Troubleshoot

“I get object 'Sepal.Length' not found. Here is my code: [paste]. Explain the error and give a fixed version.”

Error‑driven workflow

Paste the exact error into your AI prompt.
Include the code (or the failing line).
State what you expected (“I wanted a plot with three groups”).
Ask for a fix + explanation.
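
To make the first two steps concrete, the sketch below reproduces the "object not found" error from the Troubleshoot pattern, captures its message as text you can paste, and then shows one likely fix. The failing call is an invented example.

```r
# Failing call: Sepal.Length is a column of iris, not a standalone object
msg <- tryCatch(
  mean(Sepal.Length),
  error = function(e) conditionMessage(e)
)

# Text to paste into the prompt: the message plus your R version
cat("Error message:", msg, "\n")
cat("R version:", R.version.string, "\n")

# One likely fix: refer to the column through its data frame
fixed_mean <- mean(iris$Sepal.Length)
cat("Fixed result:", fixed_mean, "\n")
```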

Built‑in practice datasets

iris — sepal/petal by species. Load: data(iris); head(iris)
mtcars — car performance stats. Load: data(mtcars); head(mtcars)
ToothGrowth — tooth length by supplement & dose. Load: data(ToothGrowth); head(ToothGrowth)

Exercises (prompt → code → interpret)

T‑test (ToothGrowth): Compare tooth length by supp. Plot (violin/boxplot), Welch’s t‑test, effect size, one‑sentence interpretation.
ANOVA (iris): Test whether Sepal.Length differs by Species. Diagnostics + Tukey; plot means with 95% CIs.
Regression (mtcars): Model mpg ~ wt + hp. Report standardized betas, R², and a partial‑effect plot for wt controlling for hp.

Copy‑me prompt + scaffold

Prompt to paste into AI:

I’m in RStudio. I have DF with y (numeric outcome), x (numeric predictor), grp (factor). Please produce only runnable R code that:

Removes rows with missing y or x.

Makes a scatterplot of y vs x, color by grp, with a smooth per group.

Fits y ~ x + grp and reports coefficients and R².

Uses ggplot2, includes comments, runs as‑is.

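library(ggplot2)  # load here so the scaffold runs on its own
DF <- mtcars; DF$grp <- factor(DF$cyl)  # replace with your data
DF <- DF[!is.na(DF$mpg) & !is.na(DF$wt), ]  # drop rows with missing outcome or predictor

ggplot(DF, aes(wt, mpg, color = grp)) +
  geom_point(alpha = 0.8) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +  # one smooth per group
  labs(x = "Weight (1000 lbs)", y = "Miles per Gallon", color = "Cylinders") +
  theme_minimal(base_size = 14)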

fit <- lm(mpg ~ wt + grp, data = DF)
summary(fit)

## 
## Call:
## lm(formula = mpg ~ wt + grp, data = DF)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.5890 -1.2357 -0.5159  1.3845  5.7915 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  33.9908     1.8878  18.006  < 2e-16 ***
## wt           -3.2056     0.7539  -4.252 0.000213 ***
## grp6         -4.2556     1.3861  -3.070 0.004718 ** 
## grp8         -6.0709     1.6523  -3.674 0.000999 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.557 on 28 degrees of freedom
## Multiple R-squared:  0.8374, Adjusted R-squared:   0.82 
## F-statistic: 48.08 on 3 and 28 DF,  p-value: 3.594e-11
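
If you need the coefficients and R² as values rather than printed text (for a report sentence or a table), a short sketch follows. It refits the same model so it runs on its own; `fit_grp` and the other object names are illustrative.

```r
DF <- mtcars; DF$grp <- factor(DF$cyl)
fit_grp <- lm(mpg ~ wt + grp, data = DF)
s <- summary(fit_grp)

# Coefficient table as a matrix: keep estimates and p-values
coefs <- round(coef(s)[, c("Estimate", "Pr(>|t|)")], 4)
print(coefs)

# R-squared and adjusted R-squared as plain numbers
r2     <- s$r.squared
adj_r2 <- s$adj.r.squared
cat(sprintf("R^2 = %.3f, adjusted R^2 = %.3f\n", r2, adj_r2))
```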

Mini cheat‑sheet: which test?

Two groups, numeric outcome: Welch’s t‑test.
>2 groups, numeric outcome: ANOVA (+ Tukey).
Numeric outcome + predictors: Linear regression.
Counts: Chi‑square test or Poisson GLM.
Binary outcome: Logistic regression (binomial GLM).
Assumption issues: Use non‑parametrics (Wilcoxon/Kruskal–Wallis), transforms, or appropriate GLMs.
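
For the non-parametric row, the base R calls look like this. It is a sketch on mtcars, chosen only because it is built in; `exact = FALSE` is set because mpg contains ties.

```r
# Kruskal-Wallis: rank-based alternative to one-way ANOVA (3 cylinder groups)
kw <- kruskal.test(mpg ~ factor(cyl), data = mtcars)
print(kw)

# Wilcoxon rank-sum: rank-based alternative to the two-group t-test (am has 2 levels)
wx <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)
print(wx)
```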

End.

Prompting R: Using AI + RStudio for Stats & Visualization

Your Name

2025-09-21
