Overview

This document is the full answer key for the Voluntary Homework. For each exercise it gives the appropriate statistical test, the reasoning behind that choice, and example R code with its output.


1. Plant Growth and Fertilizer

Goal: Determine whether the two fertilizer treatments differ in their effect on plant growth.

Test: Two-sample t-test

Reasoning:

  • Response (height) is continuous.
  • Predictor (fertilizer) has two levels.
  • Assumptions: normality, equal variance.

plants_ttest <- read.csv("fertilizer.csv")
# Explore
table(plants_ttest$fertilizer)
## 
##  A  B 
## 30 30
summary(plants_ttest$height)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   5.147   8.895  11.071  10.936  12.903  16.285
# Check assumptions
shapiro.test(plants_ttest$height[plants_ttest$fertilizer == "A"])
## 
##  Shapiro-Wilk normality test
## 
## data:  plants_ttest$height[plants_ttest$fertilizer == "A"]
## W = 0.96214, p-value = 0.3509
shapiro.test(plants_ttest$height[plants_ttest$fertilizer == "B"])
## 
##  Shapiro-Wilk normality test
## 
## data:  plants_ttest$height[plants_ttest$fertilizer == "B"]
## W = 0.98748, p-value = 0.9717
var.test(height ~ fertilizer, data = plants_ttest)
## 
##  F test to compare two variances
## 
## data:  height by fertilizer
## F = 0.92761, num df = 29, denom df = 29, p-value = 0.841
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.4415113 1.9489118
## sample estimates:
## ratio of variances 
##          0.9276134
# Perform t-test
t.test(height ~ fertilizer, data = plants_ttest)
## 
##  Welch Two Sample t-test
## 
## data:  height by fertilizer
## t = -2.71, df = 57.918, p-value = 0.008835
## alternative hypothesis: true difference in means between group A and group B is not equal to 0
## 95 percent confidence interval:
##  -2.9444805 -0.4426046
## sample estimates:
## mean in group A mean in group B 
##        10.08943        11.78297

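Expected Result: Fertilizer B plants are significantly taller (mean height 11.78 vs. 10.09; t = -2.71, p = 0.009). Note that t.test() runs Welch's test by default, which does not rely on the equal-variance assumption checked with var.test().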
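
To run the classic equal-variance (Student's) t-test that the reasoning describes, var.equal = TRUE can be passed to t.test(). The sketch below compares both variants on simulated data; the data frame sim and its values are hypothetical stand-ins, not the contents of fertilizer.csv.

```r
# Simulated stand-in for fertilizer.csv (hypothetical values, not the homework data)
set.seed(42)
sim <- data.frame(
  fertilizer = rep(c("A", "B"), each = 30),
  height = c(rnorm(30, mean = 10, sd = 2.5), rnorm(30, mean = 12, sd = 2.5))
)

# Welch's test (R's default): does not assume equal variances
welch <- t.test(height ~ fertilizer, data = sim)

# Student's test: pools the two variances
student <- t.test(height ~ fertilizer, data = sim, var.equal = TRUE)

# With equal group sizes the t statistics coincide; only the df differ
c(welch = unname(welch$statistic), student = unname(student$statistic))
c(welch = unname(welch$parameter), student = unname(student$parameter))
```

With balanced groups and similar variances, as here, the two tests lead to practically the same conclusion, so keeping the Welch default is a reasonable choice.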


2. Fertilizer and Survival

Goal: Test whether fertilizer type is associated with plant survival.

Test: Chi-square test of independence

Reasoning: Both variables are categorical. All expected counts should be at least 5 for the chi-square approximation to be adequate.

plants_chi <- read.csv("survival.csv")
# Create contingency table
tab <- table(plants_chi$fertilizer, plants_chi$survival)

# Check expected counts
chisq.test(tab)$expected
##    
##     Alive Dead
##   A  21.5  8.5
##   B  21.5  8.5
# Perform chi-square test
chisq.test(tab)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab
## X-squared = 2.9549, df = 1, p-value = 0.08562

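Expected Result: Fertilizer B was expected to increase survival rate. With these data, however, the association is not significant at the 0.05 level (X-squared = 2.95, df = 1, p = 0.086), so the null hypothesis of independence cannot be rejected.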
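
For a 2x2 table, chisq.test() applies Yates' continuity correction by default, which makes the test more conservative. The sketch below compares the corrected test, the uncorrected test, and Fisher's exact test. The counts in tab_sim are hypothetical: they were chosen so that the margins match the expected counts shown above, but they are not taken from survival.csv.

```r
# Hypothetical 2x2 table (rows: fertilizer, columns: survival)
tab_sim <- matrix(c(18, 12,
                    25,  5),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(fertilizer = c("A", "B"),
                                  survival = c("Alive", "Dead")))

# Row-wise survival proportions, useful for describing the direction of an effect
prop.table(tab_sim, margin = 1)

yates  <- chisq.test(tab_sim)                   # default: with continuity correction
plain  <- chisq.test(tab_sim, correct = FALSE)  # no continuity correction
fisher <- fisher.test(tab_sim)                  # exact test, no large-sample approximation

# Compare the p-values of the three approaches
c(yates = yates$p.value, uncorrected = plain$p.value, fisher = fisher$p.value)
```

When the three p-values fall on different sides of 0.05, the result is borderline and should be reported cautiously, together with the observed proportions.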


3. Fertilizer, Sunlight, and Growth

Goal: Assess the effects of fertilizer, sunlight, and their interaction on plant height.

Test: Two-way ANOVA / Linear regression with interaction term

plants_regression <- read.csv("plants_height.csv")

# Fit additive model
model1 <- lm(height ~ fertilizer + sunlight, data = plants_regression)
summary(model1)
## 
## Call:
## lm(formula = height ~ fertilizer + sunlight, data = plants_regression)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.1110 -1.2226  0.1222  0.9927  3.8579 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.1172     0.4213  26.385  < 2e-16 ***
## fertilizerB   1.6000     0.4865   3.289  0.00173 ** 
## sunlightLow  -1.0563     0.4865  -2.171  0.03410 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.884 on 57 degrees of freedom
## Multiple R-squared:  0.2141, Adjusted R-squared:  0.1865 
## F-statistic: 7.764 on 2 and 57 DF,  p-value: 0.001042
# Add interaction term
model2 <- lm(height ~ fertilizer * sunlight, data = plants_regression)
anova(model1, model2)
## Analysis of Variance Table
## 
## Model 1: height ~ fertilizer + sunlight
## Model 2: height ~ fertilizer * sunlight
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     57 202.39                           
## 2     56 197.04  1    5.3499 1.5205 0.2227
summary(model2)
## 
## Call:
## lm(formula = height ~ fertilizer * sunlight, data = plants_regression)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -3.812 -1.184  0.115  1.127  4.157 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)              11.4158     0.4843  23.571   <2e-16 ***
## fertilizerB               1.0028     0.6849   1.464   0.1488    
## sunlightLow              -1.6535     0.6849  -2.414   0.0191 *  
## fertilizerB:sunlightLow   1.1944     0.9686   1.233   0.2227    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.876 on 56 degrees of freedom
## Multiple R-squared:  0.2349, Adjusted R-squared:  0.1939 
## F-statistic:  5.73 on 3 and 56 DF,  p-value: 0.001715
# Diagnostic plot
plot(model2, which = 1)

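Expected Result: Fertilizer B increases height (+1.60) and low sunlight decreases it (-1.06) in the additive model; the interaction is not significant (F = 1.52, p = 0.22), so the additive model is preferred.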
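
In the output above, the p-value from anova(model1, model2) (0.2227) is identical to the one for fertilizerB:sunlightLow in summary(model2). This holds whenever two nested linear models differ by a single coefficient, because then F equals the squared t value. A minimal sketch on simulated data (the data frame sim and its effect sizes are hypothetical, not plants_height.csv):

```r
# Simulated balanced 2x2 design with 15 plants per cell (hypothetical data)
set.seed(1)
sim <- expand.grid(rep = 1:15,
                   fertilizer = c("A", "B"),
                   sunlight = c("High", "Low"))
sim$height <- 11 + 1.5 * (sim$fertilizer == "B") -
  1 * (sim$sunlight == "Low") + rnorm(nrow(sim), sd = 2)

fit_add <- lm(height ~ fertilizer + sunlight, data = sim)  # additive model
fit_int <- lm(height ~ fertilizer * sunlight, data = sim)  # adds the interaction

# F test for the single extra coefficient
cmp <- anova(fit_add, fit_int)
f_int <- cmp[2, "F"]

# t value of the interaction coefficient in the larger model
t_int <- coef(summary(fit_int))["fertilizerB:sunlightLow", "t value"]

c(F = f_int, t_squared = t_int^2)
```

The model comparison becomes genuinely more informative than the coefficient table once an interaction involves factors with more than two levels, where several coefficients are tested jointly.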


4. Predicting Plant Height

Goal: Identify which predictors best explain variation in plant height.

Test: Multiple linear regression with stepwise selection

plants_stepwise <- read.csv("soil.csv")

# Fit full model
full_model <- lm(height ~ fertilizer + sunlight + nitrogen + ph + moisture, data = plants_stepwise)
summary(full_model)
## 
## Call:
## lm(formula = height ~ fertilizer + sunlight + nitrogen + ph + 
##     moisture, data = plants_stepwise)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.82290 -0.68052 -0.01444  0.56930  2.15363 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 15.33122    2.00969   7.629 3.93e-10 ***
## fertilizerB  1.46803    0.25220   5.821 3.30e-07 ***
## sunlightLow -0.89131    0.27115  -3.287  0.00178 ** 
## nitrogen     0.22915    0.07317   3.132  0.00281 ** 
## ph          -0.83563    0.26329  -3.174  0.00248 ** 
## moisture    -0.02130    0.04828  -0.441  0.66078    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9581 on 54 degrees of freedom
## Multiple R-squared:  0.5735, Adjusted R-squared:  0.534 
## F-statistic: 14.52 on 5 and 54 DF,  p-value: 5.207e-09
# Stepwise regression
step_model <- step(full_model)
## Start:  AIC=0.55
## height ~ fertilizer + sunlight + nitrogen + ph + moisture
## 
##              Df Sum of Sq    RSS     AIC
## - moisture    1    0.1788 49.752 -1.2373
## <none>                    49.574  0.5467
## - nitrogen    1    9.0027 58.576  8.5590
## - ph          1    9.2474 58.821  8.8092
## - sunlight    1    9.9200 59.494  9.4914
## - fertilizer  1   31.1046 80.678 27.7674
## 
## Step:  AIC=-1.24
## height ~ fertilizer + sunlight + nitrogen + ph
## 
##              Df Sum of Sq    RSS     AIC
## <none>                    49.752 -1.2373
## - nitrogen    1     9.427 59.179  7.1732
## - ph          1     9.575 59.327  7.3233
## - sunlight    1     9.762 59.515  7.5127
## - fertilizer  1    32.877 82.629 27.2012
summary(step_model)
## 
## Call:
## lm(formula = height ~ fertilizer + sunlight + nitrogen + ph, 
##     data = plants_stepwise)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.83002 -0.67736 -0.05391  0.57378  2.13231 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 14.95048    1.80170   8.298 2.86e-11 ***
## fertilizerB  1.48706    0.24666   6.029 1.45e-07 ***
## sunlightLow -0.88076    0.26810  -3.285  0.00178 ** 
## nitrogen     0.23289    0.07215   3.228  0.00210 ** 
## ph          -0.84653    0.26020  -3.253  0.00195 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.9511 on 55 degrees of freedom
## Multiple R-squared:  0.5719, Adjusted R-squared:  0.5408 
## F-statistic: 18.37 on 4 and 55 DF,  p-value: 1.231e-09
# Compare models
anova(full_model, step_model)
## Analysis of Variance Table
## 
## Model 1: height ~ fertilizer + sunlight + nitrogen + ph + moisture
## Model 2: height ~ fertilizer + sunlight + nitrogen + ph
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1     54 49.574                           
## 2     55 49.752 -1  -0.17875 0.1947 0.6608

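Expected Result: Fertilizer, sunlight, nitrogen, and pH remain in the final model; moisture is dropped (removing it lowers the AIC from 0.55 to -1.24, and the F test confirms no significant loss of fit, p = 0.66).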
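
The AIC values in the step() trace can be reproduced by hand. For linear models, step() uses the AIC from extractAIC(), which is n * log(RSS / n) + 2 * k up to an additive constant, with k the number of estimated coefficients. The sketch below plugs in the RSS values printed above; the helper name aic_step is hypothetical, and n = 60 is inferred from 54 residual degrees of freedom plus 6 coefficients.

```r
# AIC as used by step() for lm fits (up to an additive constant)
aic_step <- function(rss, n, k) n * log(rss / n) + 2 * k

n <- 60  # 54 residual df + 6 coefficients in the full model

# RSS values copied from the step() output above
aic_full    <- aic_step(rss = 49.574, n = n, k = 6)  # all five predictors
aic_reduced <- aic_step(rss = 49.752, n = n, k = 5)  # moisture removed

round(c(full = aic_full, reduced = aic_reduced), 2)
```

Dropping moisture barely increases the RSS but saves one parameter, so the penalty term 2 * k outweighs the loss of fit and the AIC decreases.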


5. Food Intake by Species

Goal: Determine if species differ in food intake (count data).

Test: GLM (Poisson regression)

animals_count <- read.csv("food.csv")

# Fit Poisson GLM
count_model <- glm(food_intake ~ species, data = animals_count, family = poisson)
summary(count_model)
## 
## Call:
## glm(formula = food_intake ~ species, family = poisson, data = animals_count)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  1.45862    0.07625   19.13   <2e-16 ***
## speciesRat   0.32897    0.09999    3.29    0.001 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 97.362  on 79  degrees of freedom
## Residual deviance: 86.391  on 78  degrees of freedom
## AIC: 359.66
## 
## Number of Fisher Scoring iterations: 4
# Check overdispersion
overdispersion <- sum(residuals(count_model, type = "pearson")^2) / count_model$df.residual
overdispersion
## [1] 1.056371
# If overdispersed, use quasi-Poisson
if (overdispersion > 2) {
  count_model <- glm(food_intake ~ species, data = animals_count, family = quasipoisson)
  summary(count_model)
}

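Expected Result: Rats consume significantly more food than mice (p = 0.001). The dispersion statistic of 1.06 is close to 1, so there is no meaningful overdispersion and the Poisson model is adequate.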
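
Poisson coefficients are on the log scale, so exponentiating the species coefficient gives a rate ratio that is easier to interpret. The sketch below uses the estimate and standard error printed in summary(count_model) and builds a Wald 95% confidence interval; the variable names are illustrative.

```r
# Values copied from the summary(count_model) output above
est <- 0.32897  # speciesRat coefficient (log scale)
se  <- 0.09999  # its standard error

# Rate ratio: multiplicative change in expected food intake vs. the reference species
rate_ratio <- exp(est)

# Wald 95% confidence interval, computed on the log scale and then exponentiated
wald_ci <- exp(est + c(-1, 1) * qnorm(0.975) * se)

round(c(rate_ratio = rate_ratio, lower = wald_ci[1], upper = wald_ci[2]), 2)

# With the fitted model available, the same quantities come from:
#   exp(coef(count_model))
#   exp(confint.default(count_model))
```

A confidence interval that excludes 1 corresponds to a significant difference between species at the 0.05 level.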


6. Response Probability by Species

Goal: Test whether response probability differs between species.

Test: Logistic regression

animals_binary <- read.csv("response.csv")

# Fit logistic model
bin_model <- glm(responded ~ species, data = animals_binary, family = binomial)
summary(bin_model)
## 
## Call:
## glm(formula = responded ~ species, family = binomial, data = animals_binary)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -0.8473     0.3450  -2.456 0.014061 *  
## speciesRat    1.6946     0.4879   3.473 0.000515 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 110.904  on 79  degrees of freedom
## Residual deviance:  97.738  on 78  degrees of freedom
## AIC: 101.74
## 
## Number of Fisher Scoring iterations: 4
# Predicted probabilities
animals_binary$predicted_prob <- predict(bin_model, type = "response")

# Visualize
library(ggplot2)
ggplot(animals_binary, aes(x = species, y = predicted_prob)) +
  geom_boxplot(fill = "skyblue") +
  ylab("Predicted Probability of Response")

Expected Result: Rats have a significantly higher response probability than the reference species (p < 0.001).
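
Logistic regression coefficients are log-odds, so they need to be back-transformed before they can be read as probabilities. The sketch below applies the inverse logit, plogis(), to the coefficients printed in summary(bin_model); the variable names are illustrative.

```r
# Values copied from the summary(bin_model) output above
b0 <- -0.8473  # intercept: log-odds of responding for the reference species
b1 <- 1.6946   # speciesRat: difference in log-odds for rats

# Inverse logit turns log-odds into probabilities
p_ref <- plogis(b0)
p_rat <- plogis(b0 + b1)

# Odds ratio for rats relative to the reference species
odds_ratio <- exp(b1)

round(c(p_ref = p_ref, p_rat = p_rat, odds_ratio = odds_ratio), 2)
```

Because species is the only predictor, these two probabilities are the only values that predict(bin_model, type = "response") can return, one per species.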


7. Bonus Challenge

Examples:

  • Compare mean nitrogen levels by sunlight: t.test(nitrogen ~ sunlight, data = plants_stepwise)
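  • Predict survival probability from nitrogen: glm(survival ~ nitrogen, data = mydata, family = binomial), where mydata stands for a data frame containing both columns and survival is coded 0/1 or as a factor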

Emphasis: Justify your test choice and interpret results clearly. Visualization and assumption checks are encouraged.