This document provides the full answer key with example R code for the Voluntary Homework. It includes the correct statistical test, reasoning, and code implementation.
Goal: Determine whether the two fertilizer treatments differ in their effect on plant growth.
Test: Two-sample t-test
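Reasoning:
- Response (height) is continuous.
- Predictor (fertilizer) has two levels.
- Assumptions: normality within each group and equal variance. Note that t.test() runs Welch's test by default, which does not require equal variances.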
plants_ttest <- read.csv("fertilizer.csv")
# Explore
table(plants_ttest$fertilizer)
##
## A B
## 30 30
summary(plants_ttest$height)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.147 8.895 11.071 10.936 12.903 16.285
# Check assumptions
shapiro.test(plants_ttest$height[plants_ttest$fertilizer == "A"])
##
## Shapiro-Wilk normality test
##
## data: plants_ttest$height[plants_ttest$fertilizer == "A"]
## W = 0.96214, p-value = 0.3509
shapiro.test(plants_ttest$height[plants_ttest$fertilizer == "B"])
##
## Shapiro-Wilk normality test
##
## data: plants_ttest$height[plants_ttest$fertilizer == "B"]
## W = 0.98748, p-value = 0.9717
var.test(height ~ fertilizer, data = plants_ttest)
##
## F test to compare two variances
##
## data: height by fertilizer
## F = 0.92761, num df = 29, denom df = 29, p-value = 0.841
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
## 0.4415113 1.9489118
## sample estimates:
## ratio of variances
## 0.9276134
# Perform t-test
t.test(height ~ fertilizer, data = plants_ttest)
##
## Welch Two Sample t-test
##
## data: height by fertilizer
## t = -2.71, df = 57.918, p-value = 0.008835
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
## -2.9444805 -0.4426046
## sample estimates:
## mean in group A mean in group B
## 10.08943 11.78297
Expected Result: Fertilizer B plants are significantly taller (mean 11.78 vs. 10.09 for fertilizer A; t = -2.71, p = 0.0088).
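Because the F test above gives no evidence of unequal variances (p = 0.841), the classical pooled-variance t-test is also defensible here. The sketch below contrasts the two variants on simulated data; the data frame `sim` and its group means and standard deviation are made up for illustration and are not the homework data.

```r
# Simulated stand-in for fertilizer.csv (hypothetical values, not the real data)
set.seed(1)
sim <- data.frame(
  fertilizer = rep(c("A", "B"), each = 30),
  height     = c(rnorm(30, mean = 10, sd = 2.5), rnorm(30, mean = 12, sd = 2.5))
)

# Welch test (the default of t.test) versus the pooled-variance test
welch  <- t.test(height ~ fertilizer, data = sim)
pooled <- t.test(height ~ fertilizer, data = sim, var.equal = TRUE)

# With equal group sizes the t statistics coincide; only the df differ
c(welch = unname(welch$statistic), pooled = unname(pooled$statistic))
c(welch = unname(welch$parameter), pooled = unname(pooled$parameter))
```

With balanced groups, as in this design, the choice between the two tests affects only the degrees of freedom, so the conclusions rarely differ.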
Goal: Test whether fertilizer type is associated with plant survival.
Test: Chi-square test of independence
Reasoning:
- Both variables are categorical.
plants_chi <- read.csv("survival.csv")
# Create contingency table
tab <- table(plants_chi$fertilizer, plants_chi$survival)
# Check expected counts
chisq.test(tab)$expected
##
## Alive Dead
## A 21.5 8.5
## B 21.5 8.5
# Perform chi-square test
chisq.test(tab)
##
## Pearson's Chi-squared test with Yates' continuity correction
##
## data: tab
## X-squared = 2.9549, df = 1, p-value = 0.08562
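Expected Result: Fertilizer B is expected to increase survival rate. In the output shown, however, the association is not significant at the 0.05 level (X-squared = 2.95, df = 1, p = 0.086).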
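The expected counts above follow from the margins: each expected cell count is (row total × column total) / N. The sketch below reproduces that arithmetic by hand. The observed counts in `tab_demo` are not read from survival.csv; they are one table inferred to be consistent with the expected counts and test statistic reported above, so treat them as an assumption.

```r
# Assumed observed counts (inferred from the output, not read from the data file)
tab_demo <- matrix(c(18, 12,
                     25,  5),
                   nrow = 2, byrow = TRUE,
                   dimnames = list(fertilizer = c("A", "B"),
                                   survival   = c("Alive", "Dead")))

# Expected counts under independence: row total * column total / N
expected_by_hand <- outer(rowSums(tab_demo), colSums(tab_demo)) / sum(tab_demo)
expected_by_hand

# chisq.test() applies Yates' continuity correction to 2 x 2 tables by default
res <- chisq.test(tab_demo)
res$statistic
res$p.value
```

All expected counts exceed 5, which is the usual rule of thumb for the chi-square approximation to be acceptable.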
Goal: Assess the effects of fertilizer, sunlight, and their interaction on plant height.
Test: Two-way ANOVA / Linear regression with interaction term
plants_regression <- read.csv("plants_height.csv")
# Fit additive model
model1 <- lm(height ~ fertilizer + sunlight, data = plants_regression)
summary(model1)
##
## Call:
## lm(formula = height ~ fertilizer + sunlight, data = plants_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.1110 -1.2226 0.1222 0.9927 3.8579
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.1172 0.4213 26.385 < 2e-16 ***
## fertilizerB 1.6000 0.4865 3.289 0.00173 **
## sunlightLow -1.0563 0.4865 -2.171 0.03410 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.884 on 57 degrees of freedom
## Multiple R-squared: 0.2141, Adjusted R-squared: 0.1865
## F-statistic: 7.764 on 2 and 57 DF, p-value: 0.001042
# Add interaction term
model2 <- lm(height ~ fertilizer * sunlight, data = plants_regression)
anova(model1, model2)
## Analysis of Variance Table
##
## Model 1: height ~ fertilizer + sunlight
## Model 2: height ~ fertilizer * sunlight
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 57 202.39
## 2 56 197.04 1 5.3499 1.5205 0.2227
summary(model2)
##
## Call:
## lm(formula = height ~ fertilizer * sunlight, data = plants_regression)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.812 -1.184 0.115 1.127 4.157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 11.4158 0.4843 23.571 <2e-16 ***
## fertilizerB 1.0028 0.6849 1.464 0.1488
## sunlightLow -1.6535 0.6849 -2.414 0.0191 *
## fertilizerB:sunlightLow 1.1944 0.9686 1.233 0.2227
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.876 on 56 degrees of freedom
## Multiple R-squared: 0.2349, Adjusted R-squared: 0.1939
## F-statistic: 5.73 on 3 and 56 DF, p-value: 0.001715
# Diagnostic plot
plot(model2, which = 1)
Expected Result: Fertilizer B increases height and low sunlight reduces it; the interaction is non-significant (F = 1.52, p = 0.22), so the additive model is adequate.
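With treatment (dummy) coding, the coefficients of model2 combine into one predicted mean per fertilizer and sunlight combination. The sketch below does that arithmetic with the estimates copied from the summary above. It assumes the reference levels are fertilizer A and a sunlight level other than Low (labelled `ref` here, since its name does not appear in the output).

```r
# Estimates copied from summary(model2) above
b <- c(intercept = 11.4158, fertB = 1.0028, sunLow = -1.6535, fertB_sunLow = 1.1944)

# Predicted mean height for each treatment combination
cell_means <- c(
  A_ref = b[["intercept"]],
  B_ref = b[["intercept"]] + b[["fertB"]],
  A_Low = b[["intercept"]] + b[["sunLow"]],
  B_Low = b[["intercept"]] + b[["fertB"]] + b[["sunLow"]] + b[["fertB_sunLow"]]
)
round(cell_means, 2)

# Effect of fertilizer B within each sunlight level
c(ref = b[["fertB"]], Low = b[["fertB"]] + b[["fertB_sunLow"]])
```

The fertilizer effect is about 1.00 under the reference sunlight level and about 2.20 under low sunlight; the difference between the two is the interaction estimate, which is not distinguishable from zero here.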
Goal: Identify which predictors best explain variation in plant height.
Test: Multiple linear regression with stepwise selection
plants_stepwise <- read.csv("soil.csv")
# Fit full model
full_model <- lm(height ~ fertilizer + sunlight + nitrogen + ph + moisture, data = plants_stepwise)
summary(full_model)
##
## Call:
## lm(formula = height ~ fertilizer + sunlight + nitrogen + ph +
## moisture, data = plants_stepwise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.82290 -0.68052 -0.01444 0.56930 2.15363
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 15.33122 2.00969 7.629 3.93e-10 ***
## fertilizerB 1.46803 0.25220 5.821 3.30e-07 ***
## sunlightLow -0.89131 0.27115 -3.287 0.00178 **
## nitrogen 0.22915 0.07317 3.132 0.00281 **
## ph -0.83563 0.26329 -3.174 0.00248 **
## moisture -0.02130 0.04828 -0.441 0.66078
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9581 on 54 degrees of freedom
## Multiple R-squared: 0.5735, Adjusted R-squared: 0.534
## F-statistic: 14.52 on 5 and 54 DF, p-value: 5.207e-09
# Stepwise regression
step_model <- step(full_model)
## Start: AIC=0.55
## height ~ fertilizer + sunlight + nitrogen + ph + moisture
##
## Df Sum of Sq RSS AIC
## - moisture 1 0.1788 49.752 -1.2373
## <none> 49.574 0.5467
## - nitrogen 1 9.0027 58.576 8.5590
## - ph 1 9.2474 58.821 8.8092
## - sunlight 1 9.9200 59.494 9.4914
## - fertilizer 1 31.1046 80.678 27.7674
##
## Step: AIC=-1.24
## height ~ fertilizer + sunlight + nitrogen + ph
##
## Df Sum of Sq RSS AIC
## <none> 49.752 -1.2373
## - nitrogen 1 9.427 59.179 7.1732
## - ph 1 9.575 59.327 7.3233
## - sunlight 1 9.762 59.515 7.5127
## - fertilizer 1 32.877 82.629 27.2012
summary(step_model)
##
## Call:
## lm(formula = height ~ fertilizer + sunlight + nitrogen + ph,
## data = plants_stepwise)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.83002 -0.67736 -0.05391 0.57378 2.13231
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.95048 1.80170 8.298 2.86e-11 ***
## fertilizerB 1.48706 0.24666 6.029 1.45e-07 ***
## sunlightLow -0.88076 0.26810 -3.285 0.00178 **
## nitrogen 0.23289 0.07215 3.228 0.00210 **
## ph -0.84653 0.26020 -3.253 0.00195 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.9511 on 55 degrees of freedom
## Multiple R-squared: 0.5719, Adjusted R-squared: 0.5408
## F-statistic: 18.37 on 4 and 55 DF, p-value: 1.231e-09
# Compare models
anova(full_model, step_model)
## Analysis of Variance Table
##
## Model 1: height ~ fertilizer + sunlight + nitrogen + ph + moisture
## Model 2: height ~ fertilizer + sunlight + nitrogen + ph
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 54 49.574
## 2 55 49.752 -1 -0.17875 0.1947 0.6608
Expected Result: Fertilizer, sunlight, nitrogen, and ph remain in the final model; moisture is dropped (F = 0.19, p = 0.66).
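The AIC values printed by step() for a linear model are computed as n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients. The sketch below recomputes the two values from the RSS figures in the output above; `step_aic` is a hypothetical helper name, and n = 60 is inferred from the residual degrees of freedom (54 plus 6 coefficients in the full model).

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * k
step_aic <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 60  # inferred: 54 residual df + 6 coefficients in the full model

aic_full    <- step_aic(rss = 49.574, n = n, k = 6)  # all five predictors
aic_reduced <- step_aic(rss = 49.752, n = n, k = 5)  # moisture dropped

round(c(full = aic_full, reduced = aic_reduced), 2)
```

Dropping moisture raises the RSS only slightly while saving one parameter, so the AIC falls and step() keeps the reduced model.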
Goal: Determine if species differ in food intake (count data).
Test: GLM (Poisson regression)
animals_count <- read.csv("food.csv")
# Fit Poisson GLM
count_model <- glm(food_intake ~ species, data = animals_count, family = poisson)
summary(count_model)
##
## Call:
## glm(formula = food_intake ~ species, family = poisson, data = animals_count)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.45862 0.07625 19.13 <2e-16 ***
## speciesRat 0.32897 0.09999 3.29 0.001 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 97.362 on 79 degrees of freedom
## Residual deviance: 86.391 on 78 degrees of freedom
## AIC: 359.66
##
## Number of Fisher Scoring iterations: 4
# Check overdispersion
overdispersion <- sum(residuals(count_model, type = "pearson")^2) / count_model$df.residual
overdispersion
## [1] 1.056371
# If overdispersed, use quasi-Poisson
if (overdispersion > 2) {
  count_model <- glm(food_intake ~ species, data = animals_count, family = quasipoisson)
  summary(count_model)
}
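Expected Result: Rats consume more food than mice. Overdispersion is at most very mild (dispersion estimate 1.06, well below the threshold of 2 used in the code), so the Poisson model is retained.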
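Poisson coefficients are on the log scale, so exponentiating them gives expected counts and rate ratios. The sketch below back-transforms the estimates copied from the summary above; it assumes the reference level of species is the mouse group, as the expected result implies.

```r
# Estimates copied from summary(count_model) above (log scale)
b0 <- 1.45862  # intercept: log mean intake of the reference species
b1 <- 0.32897  # speciesRat: log rate ratio, rat vs. reference

mean_ref   <- exp(b0)        # expected intake, reference species
rate_ratio <- exp(b1)        # multiplicative change for rats
mean_rat   <- exp(b0 + b1)   # expected intake, rats

round(c(mean_ref = mean_ref, rate_ratio = rate_ratio, mean_rat = mean_rat), 2)
```

A rate ratio of about 1.39 means rats are estimated to consume roughly 39% more food than the reference species.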
Goal: Test whether response probability differs between species.
Test: Logistic regression
animals_binary <- read.csv("response.csv")
# Fit logistic model
bin_model <- glm(responded ~ species, data = animals_binary, family = binomial)
summary(bin_model)
##
## Call:
## glm(formula = responded ~ species, family = binomial, data = animals_binary)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.8473 0.3450 -2.456 0.014061 *
## speciesRat 1.6946 0.4879 3.473 0.000515 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 110.904 on 79 degrees of freedom
## Residual deviance: 97.738 on 78 degrees of freedom
## AIC: 101.74
##
## Number of Fisher Scoring iterations: 4
# Predicted probabilities
animals_binary$predicted_prob <- predict(bin_model, type = "response")
# Visualize
library(ggplot2)
ggplot(animals_binary, aes(x = species, y = predicted_prob)) +
  geom_boxplot(fill = "skyblue") +
  ylab("Predicted Probability of Response")
Expected Result: Rats have a significantly higher response probability (speciesRat = 1.69 on the log-odds scale, p = 0.0005).
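Logistic regression coefficients are log-odds, so the inverse logit turns them into probabilities and exponentiation into odds ratios. The sketch below applies this to the estimates copied from the summary above, again assuming that the reference level of species is the non-rat group.

```r
# Estimates copied from summary(bin_model) above (log-odds scale)
b0 <- -0.8473  # intercept: log-odds of responding, reference species
b1 <-  1.6946  # speciesRat: log odds ratio, rat vs. reference

p_ref      <- plogis(b0)       # inverse logit: probability for the reference species
p_rat      <- plogis(b0 + b1)  # probability for rats
odds_ratio <- exp(b1)          # odds of responding, rats relative to reference

round(c(p_ref = p_ref, p_rat = p_rat, odds_ratio = odds_ratio), 2)
```

These match the per-species values that predict(bin_model, type = "response") returns: about 0.30 for the reference species and about 0.70 for rats.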
Examples:
t.test(nitrogen ~ sunlight, data = plants_stepwise)
glm(survival ~ nitrogen, data = mydata, family = binomial)
Emphasis: Justify your test choice and interpret results clearly. Visualization and assumption checks are encouraged.