Chapter 13 GLM Summary

13.1 Overview

Before we start today's lesson, let's review the general process of modelling.

  1. Model Selection (From Question)
  2. Model Fit
  3. Model Selection (By statistical reasoning, logic, and ecology)
  4. Model Validation/Diagnostics (some, or all of)
    • Normality of Errors
    • Homogeneity of Variance
    • Fixed X
    • Independence of Samples
    • Correct Model Specification
  5. Model Visualization

13.2 Visualization

This section covers visualizations specific to GLMs, and the exploratory plots to make before modelling.

Before performing linear modeling…visualize!

With ecological questions, this exercise will help decide what type of modeling is required.

Here is what we mean, in Base R (note that scatterplot() comes from the car package):

dotchart(mydata$length, groups = mydata$stage, ylab = "Stage", xlab = "Length")

library(car) # provides scatterplot()
scatterplot(mydata$length, mydata$width, groups = mydata$stage, by.groups = TRUE, smooth = FALSE, reg.line = FALSE, las = 1)

boxplot(length ~ stage, data = mydata, las = 1, ylab = "Length", xlab = "Stage")

pairs(mydata[3:5])

Note: Visualizations using ggplot2 are similar
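As a sketch of what "similar" means, here are ggplot2 analogues of the plots above. The data frame and column names below are simulated purely for illustration, since the real mydata is not shown here.

```r
# ggplot2 analogues of the base plots above, using simulated data for illustration
library(ggplot2)

toy <- data.frame(
  length = c(rnorm(20, 10), rnorm(20, 14)),
  width  = c(rnorm(20, 4),  rnorm(20, 6)),
  stage  = factor(rep(c("juvenile", "adult"), each = 20))
)

p1 <- ggplot(toy, aes(x = length, y = stage)) + geom_point()                  # dotchart analogue
p2 <- ggplot(toy, aes(x = length, y = width, colour = stage)) + geom_point()  # grouped scatterplot
p3 <- ggplot(toy, aes(x = stage, y = length)) + geom_boxplot()                # boxplot
```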

13.3 GLM

General Linear Model

  • Can be very simple (t.test, anova)
  • Can be very complex (GLMM, ZIP, ZINB)

GLZ = Generalized Linear Model

  • Now mostly referred to as GLM

The format of all modeling in R is very similar

lmb <- lm(y ~ x1, data = mydata) # simple linear regression
lmb <- glm(y ~ x1 + x2*x3, data = mydata) # GLM with three predictors; x2*x3 adds both main effects and their interaction
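A quick way to see how similar the two interfaces are: with no family argument, glm() defaults to a Gaussian model, so it returns the same coefficients as lm(). The data below are simulated just for this sketch.

```r
# Sketch (simulated data): glm() with no family argument defaults to
# gaussian(), so it matches lm() exactly
set.seed(1)
d <- data.frame(x1 = rnorm(25))
d$y <- 1 + 2 * d$x1 + rnorm(25)

fit_lm  <- lm(y ~ x1, data = d)
fit_glm <- glm(y ~ x1, data = d)

all.equal(coef(fit_lm), coef(fit_glm))  # TRUE
```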

A typical GLM structure is formed as:

\[Y = \beta_{0} + \beta_{1}x_{1} + \beta_{2}x_{2} + \epsilon\] Where \(\epsilon\) is the error term: the variance in the data that cannot be explained by the predictors \(x_{1}\), \(x_{2}\), etc.

We use the sample residuals to estimate this error term.
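To make the residual/error distinction concrete, a residual is simply the observed value minus the fitted value. A minimal sketch, using simulated data in place of the course data:

```r
# Minimal sketch with simulated data: residuals are observed minus fitted
set.seed(1)
mydat <- data.frame(x = 1:20)
mydat$y <- 2 + 0.5 * mydat$x + rnorm(20)

lmb <- lm(y ~ x, data = mydat)

manual_resid <- mydat$y - fitted(lmb)
all.equal(unname(residuals(lmb)), unname(manual_resid))  # TRUE
```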

Assumptions vary depending on the type of GLM.

In a classic regression, these three assumptions should be met:

  1. Homogeneity of Variance
  2. Independent Observations
  3. Normally Distributed Errors

These assumptions apply to regression, and because regression is just one form of the general linear model, the same assumptions carry over to the other general linear models discussed here.

13.4 Simple Linear Model



Create plots of your data, picking 2 variables that are continuous, then create a linear model (regression) using your data.

Use this Table as a guide


Create a simple linear model using lm()

lmb <- lm(y ~ x, data = mydata)


lm() = Linear Model Function
lmb <- lm(length ~ width, data = mydata)


Use these functions to interrogate your model!

  • summary(lmb) — estimates, goodness of fit, significance
  • anova(lmb) — the familiar ANOVA table
  • str(lmb) — structure of the lmb object, i.e., what is in the object

Here is the General Output

13.5 Diagnostics

These are applicable to just about all GLMs.

Mainly plots to look at residuals.

  1. Residuals vs. Fitted Values (verify homogeneity)
  2. Q-Q Plot or Histogram of Residuals (normality of errors)
  3. Residuals vs. Explanatory (x) (Independence)
  4. Influential Observations

We will use the simple plot functions

plot(x, y)
hist(x)

13.5.0.1 Homogeneity

Tests can be used for this, but Beware!!!

# Bartlett Test of Homogeneity of Variances
bartlett.test(length ~ stage, data = mydata)

# Fligner-Killeen Test of Homogeneity of Variances
fligner.test(length ~ stage, data = mydata)

It is better to plot the fitted vs. residual values!

plot(lmb) will produce 4 diagnostic plots, but we can do this manually too!

predict <- fitted(lmb) # fitted (predicted) values
resid <- residuals(lmb) # residuals
plot(predict, resid)

13.5.0.2 Normality

Are the errors of the model normally distributed?

qqnorm(resid)
qqline(resid) # adds the reference line through the quartiles




13.5.0.3 Residual Plot

13.6 Plot Your Model

Let’s look at some examples of plotting your data using the ggeffects package!

#Plot the relationship
ggplot(mydata, aes(x = x_variable, y = y_variable)) +
  geom_point()

# linear model
M1 <- lm(y ~ x, data = mydata, na.action = na.exclude)

predict_response(M1) %>% plot(show_data = TRUE, ci_style = "errorbar", dot_size = 1.5)
predict_response(M1) %>% plot(ci_style = "errorbar", dot_size = 1.5)

#Add residuals
#add_residuals() and add_predictions() come from the modelr package
#This method works on simple models and appends residuals to mydata
mydata <- mydata %>%
  add_residuals(M1) %>%
  add_predictions(M1)

#Plot residuals
#There is a lot more to do here...this is an example
ggplot(mydata) +
  geom_histogram(aes(x = resid), fill = 'grey50', col = 'black') +
  theme_bw(22) + ylab("Frequency") + xlab("Residuals")

ggplot(mydata) +
  geom_boxplot(aes(x = common_name, y = resid), fill = 'grey50', col = 'black') +
  theme_bw(22) + ylab("Residuals") + xlab("Common Name")

ggplot(mydata) +
  geom_point(aes(x = pred, y = resid), fill = 'grey50', col = 'black') +
  theme_bw(22) + ylab("Residuals") + xlab("Predicted")

13.7 Model Specification in R

When modelling in R, glm specification can look like this:

glm(y ~ x1 * x2 + x3, data = mydata)

Here’s what these mean:

  • y = response variable
  • x1 = first explanatory/predictor variable
  • x2 = second explanatory variable
  • x3 = third explanatory variable
  • * = include main effects AND all interaction effects (for the variables joined)
  • + = include only the terms listed (no automatic interactions)
  • : = include a specific interaction effect only

Note: x1/x2 is read as x2 nested within x1
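The operators above are just shorthand: x1 * x2 expands to x1 + x2 + x1:x2, so the two spellings fit identical models. A sketch with simulated data:

```r
# Sketch showing that * expands to main effects plus the interaction
set.seed(1)
d <- data.frame(x1 = rnorm(30), x2 = rnorm(30), y = rnorm(30))

m_star <- glm(y ~ x1 * x2, data = d)
m_long <- glm(y ~ x1 + x2 + x1:x2, data = d)

# Both specifications fit the same model
all.equal(coef(m_star), coef(m_long))  # TRUE
```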

13.7.0.1 Coefficients



Coefficients must be combined to get the estimate for each group.

The intercept corresponds to the first level of each factor, i.e., stage1 of speciesmicans:

stage1, speciesmicans = 7.95 (the intercept)
stage2, speciesmicans = 7.95 + 3.43 = 11.38
stage3, speciesmicans = 7.95 + 8.87 = 16.82
stage1, speciesmutilatus = 7.95 - 0.48 = 7.47
stage2, speciesmutilatus = 7.95 - 0.48 + 3.43 = 10.90
stage3, speciesmutilatus = 7.95 - 0.48 + 8.87 = 16.34
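The same additions can be done in code: summing the relevant coefficients reproduces what predict() returns for that group. The data below are simulated (the column names mirror the example above but are assumptions):

```r
# Hedged sketch: recovering a group estimate from model coefficients
set.seed(1)
d <- data.frame(stage = factor(rep(c("1", "2", "3"), each = 10)))
d$length <- c(8, 11, 17)[as.integer(d$stage)] + rnorm(30)

mod <- lm(length ~ stage, data = d)
b <- coef(mod)  # (Intercept), stage2, stage3

# Fitted mean for stage 2 = intercept + stage2 coefficient
stage2_mean <- unname(b["(Intercept)"] + b["stage2"])
stage2_pred <- unname(predict(mod, data.frame(stage = factor("2", levels = levels(d$stage)))))
all.equal(stage2_mean, stage2_pred)  # TRUE
```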

13.8 Extending Models

Models are extended by adding more explanatory variables.

Explanatory variables are called covariates.

Recall an example:

glm(y ~ x1 + x2, data = mydata)

If x2 is a continuous variable, the model is a multiple regression.

If x2 is a categorical variable, the model has both a regression variable (x1) and a categorical variable (x2).

Let’s look at some classical model specifications!

13.8.0.1 Multiple Regression

Here width and weight are both continuous.

Finds the relationship between length and both width and weight!

# Multiple Linear Regression Example
lmc <- lm(length ~ width + weight, data = mydata)
summary(lmc) # show results

# Other useful functions
coefficients(lmc) # model coefficients
confint(lmc, level = 0.95) # CIs for model parameters
anova(lmc) # anova table
vcov(lmc) # covariance matrix for model parameters
influence(lmc) # regression diagnostics

13.8.0.2 ANOVA

What if the covariate is a category?

Q: Are species lengths different?

# One Way Anova (Completely Randomized Design)
predict <- aov(y ~ A, data = mydata) # A is a factor

# Two Way Factorial Design
predict <- aov(y ~ A + B + A:B, data = mydata) #A and B are factors
predict <- aov(y ~ A * B, data = mydata) # same thing 

Notes:

aov() is the ANOVA wrapper for lm(), and the design has to be balanced; glm() is more general, and unbalanced designs are acceptable.

A balanced design is one where each category has the same number of observations.
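A quick way to check balance is to tabulate the factor and confirm every cell has the same count. The factor below is simulated for illustration:

```r
# Hedged sketch: checking whether a design is balanced
A <- factor(rep(c("low", "med", "high"), each = 8))

counts <- table(A)
counts  # the same count in every cell means the design is balanced

is_balanced <- length(unique(as.integer(counts))) == 1
is_balanced  # TRUE
```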

Create an ANOVA using your data picking 1 variable that is continuous and 1 variable that is categorical

Use this table as a guide:

Create a simple linear model using aov()

mod1 <- aov(y ~ x, data = mydata)

13.8.0.3 ANCOVA

ANCOVA has at least one covariate that is numeric (or integer) and at least one factor.

# Analysis of Covariance
predict <- aov(y ~ A + x, data = mydata) 
# A is a factor
# x is a numeric covariate

13.9 GLM

Create a GLM using your data, picking 1 variable that is continuous and 1 variable that is categorical

Create a simple model using glm()

Try your ANOVA!

glm0 <- glm(y ~ x, data = mydata) # avoid naming the object glm, which would mask the glm() function

Create a more complex model using glm()

glm1 <- glm(y ~ x1 + x2, data = mydata)

Run diagnostics (plots etc) on your models.

13.10 Comparing Models

Great! So we can specify many different models mixing covariates of different types (e.g. categorical and continuous; there are other types too!)

How are models compared?

anova(model1, model2, model3, …, test = "F") # F-test comparison of nested models
drop1(model1) # drops each term in turn
AIC(model1, model2, model3, …)
BIC(model1, model2, model3, …)
lrtest(model1, model2) # likelihood ratio test, from the lmtest package
vuong(model1, model2) # from the pscl package

Parsimonious models are preferable to big models if both have similar predictive power.

A parsimonious model is one with the fewest predictors that still explains the data about as well as models with more predictors.

If models are not nested, you sometimes cannot use the F-test to choose between one and another. You must then rely on other sample statistics such as \(R^{2}\) and RMSE, or use the Vuong test.
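As a sketch of comparing non-nested models by RMSE and \(R^{2}\), here are two models with no predictor in common, fit to simulated data (only x1 actually drives the response):

```r
# Sketch: comparing non-nested models by RMSE and R^2, with simulated data
set.seed(1)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 + rnorm(50)

m1 <- lm(y ~ x1, data = d)  # contains the true predictor
m2 <- lm(y ~ x2, data = d)  # unrelated predictor

rmse <- function(model) sqrt(mean(residuals(model)^2))

rmse(m1); rmse(m2)                             # lower RMSE is better
summary(m1)$r.squared; summary(m2)$r.squared   # higher R^2 is better
```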

In the end, choice of any model is subjective…

13.11 Hypothesis Testing

Let’s go through an example using the ggeffects package!

#Continuous and Categorical Interaction Effects Regression Model
#Main effects only
#This means a common slope estimate across common_name
M2 <- lm(continuous_variable ~ categorical_var1 + categorical_var2, data = mydata)

# You can also try this in the model too!
#Include interaction effects (use * in place of +)
#This means a separate slope estimate for each common_name

# let's plot it!
ggpredict(M2, terms = c("categorical_var1", "categorical_var2")) %>% plot()

# Now let's test the hypothesis!!
test_predictions(M2, terms = c("categorical_var1", "categorical_var2"), test = NULL, p_adjust = "tukey")
test_predictions(M2, terms = c("categorical_var1", "categorical_var2"), p_adjust = "tukey")