Thursday, 14 July 2016

Assessing dispersion

A simple way of assessing dispersion in models for count data


It has been some time since I have updated my blog, but I have plenty of good excuses for justifying my lack of activity (or so I like to tell myself). In my post on simple models for abundance, I introduced the Poisson-log GLM as the basic approach for modelling count data. The Poisson-log GLM has an important limitation when it comes to modelling ecological data: it assumes that the variance and the mean of the response (dependent) variable are the same, so it is rather inflexible for modelling response data with a variability exceeding the variation in the mean. When this condition is not met, we obtain over- or under-dispersion in the model. It is important to note that the terms over- and under-dispersion do not refer to the raw data values, but to the mean and variance expected with respect to a Poisson-log GLM. It is a frequent mistake to think that dispersion refers to the raw values of the response variable.

I should be more specific and technical when explaining what dispersion is: the dispersion in the Poisson-log GLM applies to the portion of the variability in the count data that cannot be accounted for by modelling the mean response as described in the post on modelling abundance. Therefore, dispersion refers to the residual (or conditional) mean and variance obtained after fitting a model. We talk about over- and under-dispersion in the response variable after the structure in the mean response has been accounted for with covariates. In summary, relative to the assumptions of a Poisson-log model, we classify dispersion as follows:

If residual variance > residual mean: over-dispersion
  If residual variance < residual mean: under-dispersion

I have come to realise that over- and under-dispersion are very common in ecological models. In fact, I cannot recall the last time that I was able to use a Poisson-log GLM for my research: almost always I do need to account for over- or under-dispersion (more frequently for over-dispersion). The two most usual sources of over- and under-dispersion in ecological modelling are, at least in my own experience:

  •             The presence of abundant zeroes in the response variable
  •             Important covariates have not been included in the model

It seems to be quite common to assume over-dispersion whenever there are many zeroes in the response variable, but that is not necessarily right. Choosing the appropriate model is fundamental for obtaining reliable inferences about the system of interest, so it is important to be able to assess whether we need to account for over- or under-dispersion. There are a few ways for doing it, but perhaps the most simple is to fit a quasi-Poisson GLM. The quasi-Poisson GLM is basically a Poisson-log GLM that accounts for the dispersion in the dataset by estimating an additional parameter, the dispersion parameter ϕ. This extra parameter has a straightforward interpretation:

ϕ < 1: interpret as under-dispersion
ϕ = 1: interpret as residual variance equals residual mean
ϕ > 1: interpret as over-dispersion

As usual, I will illustrate the use of a quasi-Poisson model through an example in R. We are going to model the abundance of the Cunningham’s skink (Egernia cunninghami) in the Mt Lofty Ranges (South Australia). The species is rare and considered endangered in South Australia due to its restricted range. In this example, we will model the abundance of the Cunningham's skink as a function of vegetation cover, with the same values as in the post on modelling abundance of the hunstman spider (to make comparisons easy).

An adult Cunningham's skink photographed on Cleland Conservation Park, South Australia

Let’s get started by simulating some data in R:
#### sample.size: number of plots surveyed for Cunningham’s skink, n = 30



#### Simulate vegetation cover data for the plots from a Beta distribution

veg.cov<-rbeta(sample.size, 0.5, 0.5)

#### Now, let’s define the parameters of the Poisson-log GLM relationship between the 
#### Cunningham’s skink abundance and the vegetation cover

intercept.sim=-2    #### Intercept;  slope.sim=3  #### Slope

#### Define the mean Cunningham’s skink abundance λi as a function of the
#### vegetation cover


#### Now, we induce under-dispersion in the response variable by defining the
#### dispersion parameter ϕ being less than 1


#### Finally, obtain the Cunningham’s skink abundance in each and every plot by drawing 
#### values from a Poisson distribution with mean λi and dispersion parameter ϕ

abundance<-rpois(sample.size, phi*lambda)

##### Let’s have a look at the relationship between the Cunningham's skink abundance and the
#### vegetation cover

plot(abundance~veg.cov, xlab="Vegetation cover (%)", ylab="Skink abundance", pch=19, cex=3)

The relationship between the Cunningham's skink abundance and the vegetation cover

            Now, let's formally assess whether or not we have dispersion by using a quasi-Poisson GLM. Fitting a quasi-Poisson GLM in R is as simple as it was fitting a Poisson-log GLM, and we only need one line of code.

#### Fit a quasi-Poisson GLM to the Cunningham’s skink abundance (response variable) and
#### vegetation cover data (independent covariate)

fit.ud<-glm(abundance~veg.cov, family=quasipoisson())

#### Have a look at the summary Table


The summary table for the quasi-Poisson model (fit.ud), obtained by running summary(fit.ud)
To assess dispersion, we have to pay attention to the estimated dispersion parameter (highlighted in red). Remember the rules of thumb explained before for interpreting the dispersion parameter – in this case, the estimate is 0.53 and we conclude that we have under-dispersion. Of course, there is under-dispersion as we have simulated the dataset to be under-dispersed.

           We can calculate the 95% confidence intervals to see whether the estimates of the intercept and slope from the quasi-Poisson model compare well to our simulated values. Interestingly, we see that our quasi-Poisson model recovers

#### Calculate the 95% confidence intervals<-confint(fit.ud)

95% confidence intervals for the coefficients of the quasi-Poisson model

              Interestingly, we see that our quasi-Poisson model recovers the slope relatively well but that the intercept is not adequately recovered by the fitted model (the simulated value, -2, is not within the 95% CI). This result conveys an important message: you have to be cautious when presenting and interpreting results from models that account for over- and under-dispersion as the estimates might be biassed. Remember that the models that account for over- or under-dispersion need to estimate an additional parameter, implying that there is fewer data to estimate all the coefficients in the model (relative to a Poisson-log GLM). Fewer data relative to the number of coefficients to be estimated means higher chances of biassed and uncertain estimates.

                This entry presented an example of how to evaluate the presence of over- or under-dispersion. There are other ways and tools for dealing with the added complication of dispersion. If you are interested in finding out more on the topic of modelling count data, I would recommend that you read the fantastic blog post by the UNSW Eco-Stats people and these two superb documents:

A close-up of two Cunningham's skinks on Morialta Conservation Park, South Australia

The annotated R script for conducting the quasi-Poisson GLM can be found in my GitHub repository:

No comments:

Post a Comment