Chapter 12 Intro to Modelling

12.1 Review of Linear Regression

12.1.0.1 Linear Relationships

Our goal here is to display relationships between variables, or to describe the linear relationship between x and y and to predict x from y.

Equation of a straight line: $y = \beta_{o} + \beta_{1}x$

$\beta_{o} = y_{intercept}$
$\beta_{1} = slope$
$slope = \frac{rise}{run}$

12.1.0.2 Fitting a Least Squares Regression Line

12.1.0.3 Regression Lines

Given a bivariate sample $(x_{i},y_{i}) ... (x_{n},y_{n})$

The vertical drop is given by: $y_{i} - \hat{y}_{i} = y_{i} - b_{0} - b_{1}x_{i}$
$\hat{y}_{i} = b_{0} - b_{1}x_{i}$ are the fitted values (and form the line)

The least squares line is the line that minimizes:

$\sum(y_{i} - b_{0} - b_{1}x_{i})^{2}$ or more simply
$\sum(y_{i} - \hat{y}_{i})^{2}$

Over all possible choices of $b_{0}$ and $b_{1}$

That is, if we randomly chose many $b_{0}$ and many $b_{1}$, say 100,000 we may happen upon the exact values

However, both $b_{0}$ and $b_{1}$ can be calculated!

For any $x_{i}$ there may be more than one $y_{i}$, but each $\hat{y}_{i}$ receives the same estimate!

Here’s an example!

12.1.0.4 Assessing the Fit of a LSRL

The LSRL is a good fit if the sample points $(x_{i},y_{i})$ along the LSRL defined by $\hat{y} = b_{0} - B{1}x$ is much better than $\overline{y}$ (mean value of all $y_{i}$) at predicting the observed $y_{i}$

Or (equivalently) if the total variation in the $y_{i}$ values is well exxplained by the LSRL!

Recall: A residual is also called an error with symbol e.
\[e = (y_{i} - \hat{y}_{i})\]
Summing all squared errors gives the variation unexplained by the LSRL. We call this error variation the Error of Sum of Squares (SSE)

12.1.0.5 Summary

12.2 Models Introduction

12.2.0.1 General Considerations

In general, our hypotheses are about relationships between concepts or things.

Concepts can be measured, so we can test relationships between variables

Concepts can be a target organism, environmental state, etc.
Variables represent what is being measured

Modeling is:

Describing the relationship between variables
Describing how the things are measured aka how the data is generated

A model takes both into account

Suppose we want to describe the relationship between mammal body size (in weight) and brain weight

Hypothesis: Larger mammals have larger brains

We measure body weight and brain weight
We create a simple linear model; in this case a linear regression
There is variability about the relationship

Like this:

The model, then, the total variation of brain weight consists of:

A systematic component: how brain wt varies as a function of body wt
A Stochaistic component: variability due to other causes, which we can not explain with our data

A model is a summary of the data in terms of the systematic effect + a summary of the magnitude of the unexplained or random variation!

12.2.0.2 Model Components

Systematic component

Defines the relationship between X and Y (between body wt and brain wt)
It represents the variation of body wt to explain the variation of brain wt
This idea (relationship of X and Y) is what our hypotheses are (usually) about

Stochastic component

Defines the distribution of Y
It describes the variation of brain wt
In the absence of predictors (variables that we think might explain Y), all variation in brain wt is stochastic
We specify this component by making assumptions about the statistical process that generated the values of brain wt (**In linear models it is assumed to be normal*)

Here is a visual of the different model components we just discussed:

Put it all together:

Quick Summary:

A linear model is an assumption about the nature of the relationship between body wt and brain wt
It describes how much brain wt changes on average for a unit increase in body wt
It also describes how much of the variation in brain wt is not explained by body wt

12.2.0.3 Modeling Probabilities with GLM

How can we work a binary or probability response with GLM?

We need to transform the probability of Y (i.e. the mean of Y) in a way such that it can be related to X linearly
This is done using a mathematical function called link function
The link function transforms a probability into a quantity called linear predictor
The linear predictor is the systematic component of the model, and can be modeled in the same way as in simple linear models like simple linear regression

12.2.0.4 GLMs in Brief

Specify the distribution of the dependent variable
- This is the assumption about how the data are generated
- This is the stochastic component of the model
Specify the link function
- We linearize the mean of Y by transforming it into the linear predictor
- It always has an inverse function called response function
Specify how the linear predictor relates to the independent variables
- This is done in the same way as with linear regression
- This is the systematic component of the model

12.3 GLMs in Practice

12.3.0.1 Residuals

In textbooks and online, you will note different words being used for the same items within modeling.

Here is a guideline:

Ordinary (raw) residuals = response residuals
- Observed value - fitted value
Pearson residuals
- Standardized ordinary residuals
- Ordinary residuals/sqrt(fitted mean) there are others
Deviance residuals (default in R)
- Quantify the contribution of each observation to the residual deviance
- Use for zero-inflated and binomial distributions
- When in doubt, use deviance residuals!
Studentized residuals
Normalized residuals

12.3.0.2 Summary of GLM Process

Once you create your model, you aren’t done (Sadly)! We need to go over and interprete the model, validate it, and make sure the relationships between our variables are being described appropriately and accurately.

Let’s go over the process!

12.3.0.3 1. Fitting

When you first fit your model, you need to do an initial interpretation. Then you move into validating your model!

Let’s go over the steps for Validating a GLM

Homogeneity of error variance
Independence
- Inherent dependency (E.g. repeated observations on units)
- Dependency due to model misfit (Missing a covariate)
Normality of Residuals (not your data!!!!)
Influential observations
Plot residuals vs.
- Covariates included and excluded
- Time and/or space
- Fitted values
Dispersion (overdispersion or underdispersion) Only for count models
Biological sense??

12.3.0.4 2. Visualization

Plot your model
Interpret the plot
Does it make biological sense now?? (No? Repeat)

12.3.0.5 3. Interpretation

After you go through those steps (Again, you may need to repeat the validation process a few times), you then need to finalize the model and your interpretation of it in relation to your variables!

Remember: All data is different, some models might not be the best fit for your data

We can always go back to step 1 and try a new fit, and then walk through the validation steps again. There is no limit.

12.4 GLM Demo

Let’s do an example and walk through it together!

This example is taken from Zuur et al. 2013 Beginner’s Guide to GLM and GLMM in R.

The data are fisheries data from two periods – pre-commercial, and post-commercial fishing.
Response variable = Total Abundance of all species
Explanatory variable #1 = Depth (m) (range 800-4000+ m)
Explanatory variable #2 = Period (categorical, 2 levels)

12.4.0.1 Step 1 - Fit the Model

Visualizing the relationship between Total Abundance and Depth is straightforward. Always start simply when modeling!

Here is what is would look like in R:

M0 <- lm(TotAbund ~ MeanDepth, data = Fish)
summary(M0)

This model corresponds to:

\[TotAbund = β_{1} + β_{2} * MeanDepth_{i} + \epsilon_{i}\] where the model estimates $TotAbund$ at $MeanDepth_{i}$

remember that $_{i}$ is each point in the data.

12.4.0.2 Step 2 - Validate Model

Several validation steps are done for each model. Here is the list again, in a logical order.

Check for normality of residuals.
Check for homogeneity (homoscedasticity) of residuals.
Check for independence of each value.
Check for influential observations.
Check if the model makes biological sense.

Assessing a model for biological sense is quite important.

If the model does not predict values of the response (realized values) that are biologically meaningful, then the model does not fit the data and a different model should be used.

To validate the model residuals:

Extract the residuals from the model object. In this case we will use ordinary residuals, but deviance would work well too.
Plot residual diagnostic plots.
Assess whether the residual relationship to fitted values is realistic.
- Also plot residual relationship to all covariates within the model or not in the model. (More on this later).
Assess realized values. One way to do this is to simulate Total Abundance at various (and user-selected) points of Depth
- This is somewhat complicated, but is used to demonstrate a couple of vary important points!

Let’s see how this would look!

Check for the Normality of Residuals
`E0 <- resid(M0) # Residuals` `F0 <- fitted(M0) # Fitted Values` Let’s look at the residual plot
Residual v. Fitted Heterogeneity in the residuals: i.e. higher variation in the residuals for larger fitted values Non-linear patterns in residuals: groups of residuals clustered at less than zero and also greater than zero.
Does it Make Biological Sense??
In this case, after we look at the residual plots we can go back and take another look by plotting the Total Abundance vs. Mean Depth with model fit line: Here’s is what we notice: At depths close to 5 km, fitted data becomes negative This does not make biological sense!
Let’s plot the Model fit with simulated total abundance values Black dots: observed data Grey dots: simulated data Again, negative total abundance values do not make biological sense!

12.4.0.3 Validation Summary

To sum up:

Residuals appear to be more/less normally distributed.
- Does this mean the errors are normally distributed?
Residuals were heteroscedastic (heterogeneous) and showed patterns
Even though the residuals are normally distributed about Depth points, they also predict negative Total Abundance values (negative realizations)
The model also predicts negative Total Abundance fitted values which is biologically ridiculous.

So, this model is CRAP!

What is the issue?

The response variable is count data. Count data do not follow a Normal (Guassian) distribution, instead, count data follow a Poisson distribution! Let’s try to model it!

To do this, we require:

A distribution for the response variable.
A predictor function specifying the covariates.
The link between the predictor function and the mean of the distribution. (we mentioned these earlier!)

In our example, TotAbund is the response variable. It:

Likely follows a Poisson distribution,
Is related to the covariate MeanDepth, but also could be related to other covariates (i.e. other variables), and
Likely requires a link function to prevent realized values that are negative.

12.4.0.4 Link Functions

It makes sense to use a probability distribution that fits with the data type being modeled.

The data are count data and count data often follow either a Poisson or Negative Binomial distribution. Why?

Poisson distributions have a simple property in that the mean is equal to the variance.

So, we have determined that the responses probably follow a Poisson distribution. Set that aside.

What about this ‘link’ function?

When we used the linear model, the mean TotAbund response is what is being fitted at each point of Depth using a defined predictor function $\beta_{1} + \beta_{2} * MeanDepth_{i}$ which we can write as:

\[\eta_{i} = \beta_{1} + \beta_{2} * MeanDepth_{i}\]

Where $\eta_{i}$ is our predictor function.

What we want to do is link $\mu_{i}$ (the mean) to the predictor function $\eta_{i}$ in a useful way.

Identity link

$\mu_{i} = \eta_{i}$

So, the mean response along the regression line is the estimate

$\hat{y_{i}}$ along the line (the mean at $x_{i}$).

Log link

The advantage of the log link is that values will always be positive. This is precisely what we want with this model!

$\log(\mu_{i}) = \eta_{i}$

${i} = e^{{i}} = e^{{1} + {2} * MeanDepth_{i}} $

GLM Families

Specifying a different modeling distribution is rather easy.

glm(formula, family = familytype(link = linkfunction), data = mydata)

Family	Default Link Function
`binomial`	`(link = "logit")`
`gaussian`	`(link = "identity)`
`Gamma`	`(link - "inverse")`
`inverse.gaussian`	`(link = "1/mu^2")`
`poisson`	`(link = "log")`
`quasi`	`(link = "identity", variance = "constant")`
`quasibinomial`	`(link = "logit")`
`quasipoisson`	`(link = "log")`

12.4.0.5 Poisson Distribution

\[\mathrm{P}(X = k) = \frac{\lambda^{k} 𝒆^{−\lambda}}{k!}\] \[\lambda = \mathrm{E}(X) = V(X)\] Let’s see how our model specification fits into these formulas!

\[TotAbun_{i} = \beta_{1} + \beta_{2} * MeanDepth_{i} + \varepsilon_{i}\]
\[\varepsilon_{i} \sim N(0, \sigma^{2})\]

Same Gaussian model, different specification:

\[TotAbun_{i} \sim N(\mu_{i}, \sigma^2)\] \[E(TotAbun_{i}) = \mu_{i}\] \[var(TotAbun_{i}) = \sigma^{2}\] \[\mu_{i} = \beta_{1} + \beta_{2} * MeanDpeth_{i}\]

Poisson model

\[TotAbun_{i} \sim P(\mu_{i})\] \[E(TotAbun_{i}) = var(TotAbun_{i}) = \mu_{i}\] \[\eta_{i} = \beta_{1} + \beta_{2} * MeanDepth_{i}\]

Using a link between the mean value of $\mu_{i}$ and the covariates

\[\mu_{i} = \eta_{i}\]

But, the link function can still produce negative values, so:

\[\log(\mu_{i}) = \eta_{i}\] \[\mu_{i} = e^{n_{i}} = e^{\beta_{1} + \beta_{2} * MeanDepth_{i}}\]

12.4.0.6 Step 3 - Try it Again

We have now established:

A Poisson distribution is probably useful
A log link corresponds to our model (positive responses only).

So, let’s try a new model:

M1 <- glm(TotAbund ~ MeanDepth, data = Fish, family = poisson)
summary(M1)

Note: The default link for family poisson is the log link (so we don’t need to specify it).

And now Let’s begin again with model validation (residuals, plots, etc.)

Once the validation process indicates no issues, we follow it with model interpretation.

Residual v. Fitted Heterogeneity in the residuals: Again, we see that the residual plots are not ideal!
The fitted line plot with simulated values added (just like in the linear model), shows no negative predictions, yay Black dots: observed data Grey dots: simulated data Red line: fitted values of Poisson GLM Makes biological sense! BUT

12.4.0.7 Validation 2 Summary

There is another issue:

The data (observations) are more dispersed than the model predicts!

This result is known as overdispersion, and it is a very important thing to investigate. It occurs with Poisson family distributions.

Luckily, we have created some functions to help with calculating dispersion.

To sum up:

Residuals were heterogeneous and showed patterns,
Residuals are normally distributed about Depth points, but they are more dispersed than the Poisson model predicts.

So, this model is CRAP! (again)

12.4.0.8 Dispersion

Overdispersion can be caused by:

Missing explanatory variable(s)
One or more outliers
Missing interaction term(s)
One (or more) explanatory variables on the wrong scale (need to be logged, squared, etc.)
Non-linear effects of a continuous covariate
Mis-specified model (incorrect link function)
Zero-inflation
Inherent dependency structure in the data

Investigating each of these may reduce dispersion, but what level is correct?

Very fine statisticians have simulated Poisson model outcomes, and have determined that a dispersion of 1 realizes a perfectly fitted Poisson model.

Dispersion statistic >1: overdispersed
Dispersion statistic <1: underdispersed

Therefore, let’s try to get to a dispersion of 1.

Look for outliers
Add a covariate (Period), and look at others (spatial covariate)
Try a different distribution - The Negative Binomial

Let’s Investigate

Cooks Distance
Cooks distance may be used to determine outliers! Cook’s distance values >1 may be influential to the model. But there are over 20 values >1! We cannot exclude all of these observations. Dispersion is still high when we remove largest outlier
Adding a Factor
This plot indicates that there is a Period effect on the residuals. So we should add period as a factor in our model To determine if the depth effect on abundance changes per period, we need to add an interaction between Mean Depth and Period `dispersion(M2)` `111.1873` **There is still overdispersion in this model, what now?
Adjust for sample period! `dispersion(M2)` `111.1873` The interaction factor is significant, but there is still too much overdispersion…
Coefficients
Coefficients MUST BE CALCULATED! Intercept <- Intercept (first value of each factor) i.e. $stage1 = 7.95$ $speciesmicans = 7.95$ $stage2 = 7.95 + 3.43 = 11.38$ $stage3 = 7.95 + 8.87 = 16.82$ $speciesmutilatus = 7.95 - 0.48$ $stage2 = 7.95 - 0.48 + 3.43$ $stage3 = 7.95 - 0.48 + 8.87$

12.4.0.9 Negative Binomial

Negative binomial distribution contains a second parameter called the dispersion parameter or heterogeneity parameter used to accommodate Poisson overdispersion.

The previous modelling results indicated that variance is larger than the mean, so a negative binomial distribution may fix this overdispersion.

Model Specification	$TotAbun_{i} = \beta_{1} + \beta_{2} * MeanDepth + \varepsilon_{i}$ $\varepsilon_{i} \sim N(0, \sigma^{2})$
Poisson Model	$TotAbun_{i} \sim P(\mu_{i})$ $E(TotAbun_{i}) = var(TotAbun_{i}) = \mu_{i}$ $\eta_{i} = \beta_{1} + \beta_{2} * MeanDepth_{i}$
Using a link between the mean value of $\mu_{i}$ and the covariates $\mu_{i} = \eta_{i}$ But, the link function can still produce negative values, so: $\log(\mu_{i}) = \eta_{i}$ or $\mu_{i} = e^{\eta_{i}} = e^{\beta_{1} + \beta_{2} * MeanDepth_{i}}$
Negative binomial Model Note: It includes the additional factor, Period as well	$TotAbun_{i} \sim NB(\mu_{i}, k)$ $E(TotAbun_{i}) = \mu_{i}$ and $var(TotAbun_{i}) = \mu_{i} + \frac{\mu_{i}^{2}}{k} = \mu_{i} + \sigma + \mu_{i}^{2}$ $\log(\mu_{i}) = \eta_{i}$ $\eta_{i} = \beta_{1} + \beta_{2}xMeanDepth_{i} + \beta_{3}xPeriod_{j} +\beta_{4}xMeanDepth_{i}xPeriod_{j}$
Negative Binomial Distribution `dispersion(M4, modeltype = 'nb’)` `[1] 1.001618` The dispersion statistic is essentially 1!

Model Specification	\(TotAbun_{i} = \beta_{1} + \beta_{2} * MeanDepth + \varepsilon_{i}\) \(\varepsilon_{i} \sim N(0, \sigma^{2})\)
Poisson Model	\(TotAbun_{i} \sim P(\mu_{i})\) \(E(TotAbun_{i}) = var(TotAbun_{i}) = \mu_{i}\) \(\eta_{i} = \beta_{1} + \beta_{2} * MeanDepth_{i}\)
Using a link between the mean value of \(\mu_{i}\) and the covariates \(\mu_{i} = \eta_{i}\) But, the link function can still produce negative values, so: \(\log(\mu_{i}) = \eta_{i}\) or \(\mu_{i} = e^{\eta_{i}} = e^{\beta_{1} + \beta_{2} * MeanDepth_{i}}\)
Negative binomial Model Note: It includes the additional factor, Period as well	\(TotAbun_{i} \sim NB(\mu_{i}, k)\) \(E(TotAbun_{i}) = \mu_{i}\) and \(var(TotAbun_{i}) = \mu_{i} + \frac{\mu_{i}^{2}}{k} = \mu_{i} + \sigma + \mu_{i}^{2}\) \(\log(\mu_{i}) = \eta_{i}\) \(\eta_{i} = \beta_{1} + \beta_{2}xMeanDepth_{i} + \beta_{3}xPeriod_{j} +\beta_{4}xMeanDepth_{i}xPeriod_{j}\)
Negative Binomial Distribution `dispersion(M4, modeltype = 'nb’)` `[1] 1.001618` The dispersion statistic is essentially 1!