A simple way of assessing dispersion in models for count data
It has been some time since I last updated my blog, but I have plenty of good excuses to justify my lack of activity (or so I like to tell myself). In my post on simple models for abundance, I introduced the Poisson-log GLM as the basic approach for modelling count data. The Poisson-log GLM has an important limitation when it comes to modelling ecological data: it assumes that the variance and the mean of the response (dependent) variable are the same, so it is rather inflexible for modelling response data whose variance differs from the mean. When this condition is not met, we obtain over- or underdispersion in the model. It is a frequent mistake to think that dispersion refers to the raw values of the response variable; the terms over- and underdispersion refer instead to the relationship between the mean and the variance expected under a Poisson-log GLM.
I should be more specific and technical when explaining what dispersion is: dispersion in the Poisson-log GLM applies to the portion of the variability in the count data that cannot be accounted for by modelling the mean response, as described in the post on modelling abundance. Therefore, dispersion refers to the relationship between the residual (or conditional) variance and the residual (or conditional) mean obtained after fitting a model. We talk about over- and underdispersion in the response variable after the structure in the mean response has been accounted for with covariates. In summary, relative to the assumptions of a Poisson-log model, we classify dispersion as follows:
If residual variance > residual mean: overdispersion
If residual variance < residual mean: underdispersion
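To make the distinction between raw and residual dispersion concrete, here is a minimal sketch in R. It uses hypothetical simulated data (the names x, y, fit.check and resid.disp are illustrative only and are not part of the skink example): the counts are drawn from a genuine Poisson distribution whose mean depends strongly on a covariate, so the raw counts look overdispersed even though the residual variation is not.

```r
#### Hypothetical data: genuinely Poisson counts with a strong covariate effect
set.seed(1)
n <- 500
x <- runif(n)
y <- rpois(n, exp(1 + 2 * x))

#### Raw counts: the variance exceeds the mean because the mean changes with x
raw.mean <- mean(y)
raw.var <- var(y)

#### Residual (conditional) dispersion after modelling the mean with the covariate:
#### sum of squared Pearson residuals divided by the residual degrees of freedom
fit.check <- glm(y ~ x, family = poisson())
resid.disp <- sum(residuals(fit.check, type = "pearson")^2) / df.residual(fit.check)

c(raw.mean = raw.mean, raw.var = raw.var, residual.dispersion = resid.disp)
```

With data like these, the raw variance should be clearly larger than the raw mean, while the residual dispersion should sit close to 1.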
I have come to realise that over- and underdispersion are very common in ecological models. In fact, I cannot recall the last time that I was able to use a Poisson-log GLM for my research: I almost always need to account for over- or underdispersion (more frequently for overdispersion). In my own experience, the two most common sources of over- and underdispersion in ecological modelling are:
 The presence of abundant zeroes in the response variable
 Important covariates have not been included in the model
It seems to be quite common to assume overdispersion whenever there are many zeroes in the response variable, but that is not necessarily right. Choosing the appropriate model is fundamental for obtaining reliable inferences about the system of interest, so it is important to be able to assess whether we need to account for over- or underdispersion. There are a few ways of doing it, but perhaps the simplest is to fit a quasi-Poisson GLM. The quasi-Poisson GLM is basically a Poisson-log GLM that accounts for the dispersion in the dataset by estimating an additional parameter, the dispersion parameter ϕ. This extra parameter has a straightforward interpretation:
ϕ < 1: interpret as underdispersion
ϕ = 1: interpret as residual variance equals residual mean
ϕ > 1: interpret as overdispersion
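In symbols, and as a sketch under the usual textbook formulation of a quasi-Poisson GLM with a log link and a single covariate (the notation β_0, β_1, x_i and y_i is an assumption made here for illustration; λ_i is the mean abundance in plot i), the model keeps the Poisson mean structure but lets the variance be proportional to the mean:

```latex
\begin{aligned}
\log \lambda_i &= \beta_0 + \beta_1 x_i, \\
\mathrm{E}(y_i) &= \lambda_i, \\
\mathrm{Var}(y_i) &= \phi \, \lambda_i .
\end{aligned}
```

Setting ϕ = 1 gives back the Poisson assumption that the variance equals the mean.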
As usual, I will illustrate the use of a quasi-Poisson model through an example in R. We are going to model the abundance of the Cunningham’s skink (Egernia cunninghami) in the Mt Lofty Ranges (South Australia). The species is rare and considered endangered in South Australia due to its restricted range. In this example, we will model the abundance of the Cunningham's skink as a function of vegetation cover, with the same values as in the post on modelling abundance of the huntsman spider (to make comparisons easy).
An adult Cunningham's skink photographed on Cleland Conservation Park, South Australia 
Let’s get started by simulating some data in R:
#### sample.size: number of plots surveyed for Cunningham's skink, n = 30
set.seed(1000)
sample.size <- 30
#### Simulate vegetation cover data for the plots from a Beta distribution
veg.cov <- rbeta(sample.size, 0.5, 0.5)
#### Now, let's define the parameters of the Poisson-log GLM relationship between the
#### Cunningham's skink abundance and the vegetation cover
intercept.sim <- 2 #### Intercept
slope.sim <- 3 #### Slope
#### Define the mean Cunningham's skink abundance λ_{i} as a function of the
#### vegetation cover
lambda <- exp(intercept.sim + slope.sim * veg.cov)
#### Now, we set the dispersion parameter ϕ to a value less than 1,
#### with the aim of inducing underdispersion in the response variable
phi <- 0.4
#### Finally, obtain the Cunningham's skink abundance in each and every plot by drawing
#### values from a Poisson distribution with mean ϕ * λ_{i}
#### (note: scaling the mean by ϕ is equivalent to adding log(ϕ) to the intercept)
abundance <- rpois(sample.size, phi * lambda)
#### Let's have a look at the relationship between the Cunningham's skink abundance and the
#### vegetation cover
plot(abundance ~ veg.cov, xlab = "Vegetation cover (%)", ylab = "Skink abundance", pch = 19, cex = 3)
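As a small arithmetic check on the simulation (a sketch that re-creates the objects above so that it runs on its own; log.mean.sim and log.mean.alt are hypothetical names), note that multiplying λ_{i} by ϕ inside rpois() rescales the Poisson mean. On the log scale this is the same as adding log(ϕ) to the intercept, which is worth keeping in mind when comparing fitted coefficients with the simulated values:

```r
#### Re-create the simulated covariate and parameters
set.seed(1000)
sample.size <- 30
veg.cov <- rbeta(sample.size, 0.5, 0.5)
intercept.sim <- 2
slope.sim <- 3
phi <- 0.4
lambda <- exp(intercept.sim + slope.sim * veg.cov)

#### Log of the mean that is actually passed to rpois()
log.mean.sim <- log(phi * lambda)
#### The same quantity written as a log-linear model with a shifted intercept
log.mean.alt <- (intercept.sim + log(phi)) + slope.sim * veg.cov

#### The two expressions should agree up to floating-point error
all.equal(log.mean.sim, log.mean.alt)
#### Intercept on the log scale implied by the simulation
intercept.sim + log(phi)
```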

Now, let's formally assess whether the data are over- or underdispersed by using a quasi-Poisson GLM. Fitting a quasi-Poisson GLM in R is as simple as fitting a Poisson-log GLM, and we only need one line of code.
#### Fit a quasi-Poisson GLM to the Cunningham's skink abundance (response variable) and
#### vegetation cover data (independent covariate)
fit.ud <- glm(abundance ~ veg.cov, family = quasipoisson())
#### Have a look at the summary table
summary(fit.ud)

To assess dispersion, we have to pay attention to the estimated dispersion parameter, which summary() reports in the line beginning "(Dispersion parameter for quasipoisson family taken to be". Remember the rules of thumb explained before for interpreting the dispersion parameter: in this case, the estimate is 0.53, which is below 1, so we interpret it as underdispersion. This is the direction we aimed for when setting ϕ = 0.4 in the simulation.
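The dispersion estimate reported by summary() can also be computed by hand. A minimal sketch, assuming the same simulated data as above (re-created here so that the block runs on its own; pearson.res and phi.hat are hypothetical names): the estimate is the sum of the squared Pearson residuals divided by the residual degrees of freedom.

```r
#### Re-create the simulated data and the quasi-Poisson fit
set.seed(1000)
sample.size <- 30
veg.cov <- rbeta(sample.size, 0.5, 0.5)
lambda <- exp(2 + 3 * veg.cov)
abundance <- rpois(sample.size, 0.4 * lambda)
fit.ud <- glm(abundance ~ veg.cov, family = quasipoisson())

#### Pearson residuals: (observed - fitted) / sqrt(fitted)
pearson.res <- residuals(fit.ud, type = "pearson")

#### Dispersion estimate: Pearson chi-square statistic over residual degrees of freedom
phi.hat <- sum(pearson.res^2) / df.residual(fit.ud)

#### Compare with the value stored in the model summary
phi.hat
summary(fit.ud)$dispersion
```

The same calculation applied to a fit with family = poisson() gives a quick dispersion check without refitting the model as quasi-Poisson.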
We can calculate the 95% confidence intervals to see whether the estimates of the intercept and slope from the quasi-Poisson model compare well to our simulated values.
#### Calculate the 95% confidence intervals
conf.int <- confint(fit.ud)
conf.int
95% confidence intervals for the coefficients of the quasi-Poisson model
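Interestingly, we see that our quasi-Poisson model recovers the slope relatively well but that the intercept is not adequately recovered by the fitted model (the simulated value, 2, is not within the 95% CI). Part of the explanation lies in the simulation itself: the counts were drawn with mean ϕλ_{i}, which on the log scale shifts the intercept by log(ϕ), so the intercept that the model is actually estimating is 2 + log(0.4) ≈ 1.08 rather than 2. This result still conveys an important message: you have to be cautious when presenting and interpreting results from models that account for over- and underdispersion. A quasi-Poisson GLM gives the same coefficient estimates as a Poisson-log GLM, but it rescales their standard errors by the square root of ϕ, so the confidence intervals become narrower when ϕ < 1 and wider when ϕ > 1. Remember also that models that account for over- or underdispersion need to estimate an additional parameter, implying that there are fewer data to estimate all the quantities in the model (relative to a Poisson-log GLM). With small samples, such as the 30 plots used here, this means more uncertain estimates.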
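One way to see what the quasi-Poisson fit changes relative to a Poisson fit is to compare the two directly. The sketch below assumes the same simulated data as above (re-created so that the block runs on its own; fit.pois, se.pois, se.quasi, se.ratio and phi.hat are hypothetical names). In R, the two families give the same coefficient estimates, and the quasi-Poisson standard errors are the Poisson ones multiplied by the square root of the estimated ϕ:

```r
#### Re-create the simulated data
set.seed(1000)
sample.size <- 30
veg.cov <- rbeta(sample.size, 0.5, 0.5)
abundance <- rpois(sample.size, 0.4 * exp(2 + 3 * veg.cov))

#### Fit the same mean structure under both families
fit.pois <- glm(abundance ~ veg.cov, family = poisson())
fit.ud <- glm(abundance ~ veg.cov, family = quasipoisson())

#### Point estimates side by side
cbind(poisson = coef(fit.pois), quasipoisson = coef(fit.ud))

#### Ratio of standard errors versus the square root of the dispersion estimate
se.pois <- summary(fit.pois)$coefficients[, "Std. Error"]
se.quasi <- summary(fit.ud)$coefficients[, "Std. Error"]
se.ratio <- se.quasi / se.pois
phi.hat <- summary(fit.ud)$dispersion
rbind(se.ratio = se.ratio, sqrt.phi = sqrt(phi.hat))
```

So, with a quasi-Poisson model, any mismatch between an estimate and its simulated value is inherited from the mean structure; the dispersion parameter only changes how wide the intervals around that estimate are.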
This entry presented an example of how to evaluate the presence of over- or underdispersion. There are other ways and tools for dealing with the added complication of dispersion. If you are interested in finding out more on the topic of modelling count data, I would recommend that you read the fantastic blog post by the UNSW EcoStats people and these two superb documents:
Zeileis et al. Regression models for count data in R
Potts & Elith. Comparing species abundance models
A closeup of two Cunningham's skinks on Morialta Conservation Park, South Australia 
The annotated R script for conducting the quasi-Poisson GLM can be found in my GitHub repository: https://github.com/pablogarciadiaz/CodeBlog/blob/master/CodeDispersion.R