Generalized Linear Models

Generalized linear models are an incredibly powerful tool for data analysis. This page will take you through fitting a variety of models and interpreting their results. In the first example, let's fit the simplest of models: an ordinary least squares linear regression.

set.seed(1)
# first we simulate some data; this will be our predictor variable
x <- runif(20, min = 0, max = 100)
# this will be our response variable.
y <- rnorm(20, mean = 2*x, sd = 10)

# now we take a look at our data
plot(y~x)

# now we fit a simple linear regression
fit <- lm(y~x)
summary(fit)
## 
## Call:
## lm(formula = y ~ x)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -23.192  -3.494   1.744   5.882  10.544 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  9.41232    4.59340   2.049   0.0553 .  
## x            1.84083    0.07394  24.898 2.13e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.221 on 18 degrees of freedom
## Multiple R-squared:  0.9718, Adjusted R-squared:  0.9702 
## F-statistic: 619.9 on 1 and 18 DF,  p-value: 2.131e-15

So how do we interpret this result? The beta coefficient for the predictor variable x is the value in the Estimate column of the second row of the coefficient table: here it is about 1.84 and is highly significant. The significance makes sense when we consider how strong the correlation in the plot is. Does the beta coefficient itself make sense? With unlimited data we would expect to recover the coefficient we built in, which was 2 (you can see this on the line above where we create the variable y). We got 1.84, so pretty close.
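If you want to pull these numbers out of the fitted object rather than reading them off the summary, a quick sketch using the standard coef() and confint() functions for lm fits looks like this:

# extract the estimated intercept and slope from the fitted model
coef(fit)

# 95% confidence intervals for both coefficients; with this simulation
# the interval for x should typically contain the true slope of 2
confint(fit, level = 0.95)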

Including Continuous and Discrete Predictors

One of the real strengths of GLMs is that you can combine both discrete and continuous predictor variables. Let's look at an example of doing this.

set.seed(1)
# first we simulate some continuous data
x <- runif(200, min = 0, max = 100)
# and now we simulate some discrete data
z <- as.factor(sample(c("male","female"), 200, replace = TRUE))
# now we simulate a response variable that is a function of these two 
# predictor variables
y <- rnorm(200, mean = 2*x, sd = 10) + rnorm(200, mean=4*as.numeric(z))

# now we take a look at our data
cols <- c("red","blue")[z]
plot(y~x, col=cols, pch=16)

Here we have plotted our data with blue being males and red being females.
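Since the plot does not include a legend, one small optional addition (a sketch assuming the plot above is still the active device) is:

# add a legend mapping colors to factor levels;
# levels(z) is c("female", "male"), matching the red/blue assignment above
legend("topleft", legend = levels(z), col = c("red", "blue"), pch = 16)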

# now we fit a linear model with both the continuous and the discrete predictor
fit <- lm(y~x+z)
# now lets check out the results
summary(fit)
## 
## Call:
## lm(formula = y ~ x + z)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -29.4871  -6.1783   0.1887   6.8386  27.3306 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  3.46396    1.72123   2.012 0.045530 *  
## x            2.00413    0.02678  74.841  < 2e-16 ***
## zmale        5.38705    1.43731   3.748 0.000234 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 10.16 on 197 degrees of freedom
## Multiple R-squared:  0.9661, Adjusted R-squared:  0.9657 
## F-statistic:  2805 on 2 and 197 DF,  p-value: < 2.2e-16

Here again we recover both of the original predictors as significant. However, you can see that this time we do rather poorly in estimating the effect of our discrete predictor: the difference we built in between males and females was 4 (from the mean = 4*as.numeric(z) term), but the estimate for zmale is about 5.39.
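One way to see how uncertain that estimate is (a quick sketch, again using the standard confint() function) is to look at its confidence interval. Given the standard error of about 1.44 reported above, the interval for zmale is wide and comfortably covers the built-in difference of 4:

# 95% confidence intervals for all coefficients;
# the zmale row is the estimated male-minus-female difference
confint(fit, level = 0.95)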

Logistic Regression

Logistic regression allows us to look at models where the response variable is binary. Let's look at how these models are commonly fit and used.

# first we simulate some data
set.seed(1)

# This will be our predictor variable, where larger values mean 
# a greater chance of disease.
x <- runif(200, min = 0, max = 100)

# we will simulate outcomes as zeros and ones;
# you can think of zero as representing a healthy state 
# and one as representing a disease state. 
y <- rbinom(size = 1, n = 200, prob = x/100)

# now we take a look at our data
plot(y~x)

OK, so here you can see the pattern we typically expect with logistic regression: the probability of a given outcome increases (or decreases) as the predictor variable increases. Now let's fit the model.
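Because the raw zeros and ones overplot heavily, an optional way to see the trend more directly (a small sketch using base R's cut() and tapply()) is to bin x and plot the proportion of ones in each bin:

# bin x into 10 equal-width intervals and compute the proportion of
# disease outcomes (ones) within each bin
bins <- cut(x, breaks = 10)
prop_by_bin <- tapply(y, bins, mean)
barplot(prop_by_bin, las = 2, ylab = "proportion with outcome 1")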

# now we fit a simple logistic regression
fit <- glm(y~x, family = "binomial")
summary(fit)
## 
## Call:
## glm(formula = y ~ x, family = "binomial")
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.07566    0.46431  -6.624 3.49e-11 ***
## x            0.05864    0.00820   7.151 8.59e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 277.24  on 199  degrees of freedom
## Residual deviance: 198.25  on 198  degrees of freedom
## AIC: 202.25
## 
## Number of Fisher Scoring iterations: 4

OK, now remember that those beta coefficients are NOT so simple anymore: they are the expected change in the log odds of the outcome for a one unit change in the predictor variable. We cannot simply state the change in probability of the outcome, because the logistic function is not linear. Instead, what we can do is calculate the predicted probability across many values of the predictor variable and report that.
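If you do want a single interpretable number per predictor, exponentiating the coefficients turns them into odds ratios (this is just exp() applied to the fitted coefficients):

# odds ratios: exponentiate the log-odds coefficients;
# the value for x is the multiplicative change in the odds of disease
# per one unit increase in x
exp(coef(fit))

To see the full picture, though, let's compute the predicted probability across the whole range of x.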

# Generate a sequence of new x values for prediction
x_seq <- seq(min(x), max(x), length.out = 200)

# Get predicted probabilities from the model
predicted_prob <- predict(fit, newdata = data.frame(x = x_seq), type = 'response')

# plot the raw data
plot(y~x)
# add the model prediction
lines(x_seq, predicted_prob, col = 'goldenrod', lwd = 2)

Here we have plotted our data and added the predicted probability of outcome 1 in gold. Let's use code to identify where that probability reaches 50%.

# let's find the smallest x value where the predicted probability of disease reaches 50%
x_seq[which(predicted_prob>=0.5)[1]]
## [1] 52.50327
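
As a sanity check, we can also get this crossing point analytically: the inverse logit equals 0.5 exactly when the linear predictor is zero, so the 50% point sits at x = -intercept/slope. A short sketch using the fitted coefficients, which should land very close to the grid-based answer above:

# solve intercept + slope*x = 0 for x, the value where the
# predicted probability is exactly 0.5
b <- coef(fit)
unname(-b[1] / b[2])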