Sunday, 13 March 2016

Simple models for distributions

Simple models for distributions

In my last post, I said that ecology is fundamentally the study of the distribution and abundance of living things. On that last post, I introduced a very simple model for abundance, so it was just logical to follow-up with an entry on simple models for distribution. I will define distribution as presence/absence of the target species on a given site. I am going to follow the structure of the previous entry to make any comparisons easy.

Abundance and presence/absence are not independent ecological realities, rather they are different measures of the same underlying process. This is because the presence of a species on a site is just what happens when the abundance on that site is ≥ 1. If there is, at least, one individual, then the species is present. Oppositely, absence is when the abundance at a site is zero. There is an ever-growing, large, field of research on models for studying the distribution of species, so-called Species Distribution Models SDMs, but to keep it simple I am going to focus on the simplest and most frequently used approach. If you are interested in modelling distributions, I strongly suggest reading the review paper written by Gurutzeta Guillera-Arroita et al. and the blog post by Gurutzeta where she summarises the paper. 

Presence/absence data are frequently coded as 0 (absence) and 1 (presence). The Bernoulli distribution is the appropriate probability distribution for modelling this kind of data. Note that the Bernoulli distribution is a special case of the Binomial distribution, whereby n, the number of trials, is 1. We can model presence/absence data using a logistic regression (Binomial-logit GLM), and it is straightforward to do it in R. In a logistic regression, the relationship between presence and the factors of interest is linear on the logit scale, and the presence follows a Bernoulli distribution with probability of presence pi:

            logit(pi) = intercept + slope * xi
            Presence ~ Bernoulli(pi)

where xi is the factor of interest. Let’s do an example in R: in this case, we are interested in how the abundance of prey in a river section influences the distribution of the American mink. I have published a paper demonstrating that prey abundance drives the reproductive biology of American mink in Spain, so I do expect that prey abundance would also govern mink distribution.  

An American mink carefully checks me out; river Tormes Salamanca (Spain)

Let's get started by simulating some data in R:

            #### sample.size: number of river sections surveyed for mink, n = 30



            #### Simulate prey abundance data for the river sections using a Poisson distribution; 
#### the mean abundance of prey per section would be 5.

prey.ab<-rpois(sample.size, 5)

#### Now, let’s define the parameters of the logistic regression for the relationship between
#### the mink presence and the prey abundance

intercept.sim=-5    #### Intercept  
slope.sim=1.5  #### Slope

#### Define the probability of presence of mink as a function of prey abundance


#### Finally, obtain the presence of mink in each river section by drawing values from a 
#### Bernoulli distribution with probability prob.pres
#### Remember that a Bernoulli distribution is a Binomial distribution with size = 1
presence<-rbinom(sample.size, size=1, p=prob.pres)

##### Let’s have a look at the relationship between the mink presence and the
#### prey abundance

plot(presence~prey.ab, xlab="Prey abundance", ylab="Mink presence", pch=19, cex=3)

The relationship between the mink presence and the abundance of prey

Time for the formal statistical inference. We fit a logistic regression to the simulated data to assess whether the prey abundance influences the presence of mink.

#### Fit a logistic regression to the mink presence (dependent variable) and
#### prey abundance data (independent).

fit2<-glm(presence~prey.ab, family=binomial())

#### Have a look at the summary Table.


The ‘Estimate’ column shows that ‘prey.ab’ has a strong positive effect, and the column 'Pr(>|z|)’ indicates that the effect is statistically significant  (p <0.05). The analysis confirms that the probability of the presence of mink increases with increasing prey abundance.

The summary table for the logistic model (fit2), obtained by running summary(fit2)

Whenever you have simulated data, it is always a good idea to check how well your model recovers the simulated parameters. I did a quick check in the previous post, so let’s do it again for the logistic regression example. Our simulated parameters were -5 (intercept) and 1.5, and we see that the estimates from the fitted model (column ‘Estimate’) compare nicely with the simulated values. Calculating the 95% Confidence Intervals confirms that the model is doing well at recovering the original parameter values, although the confidence intervals for both parameters are quite large:

#### Calculate the 95% Confidence Intervals<-confint(fit2)
95% Confidence Intervals for the coefficients of the logistic regression

  This was a very simple and nice example to get started with modelling presence/absence data. As I stressed in the case of modelling abundance, modelling real presence/absence data is frequently a more complex and challenging task. Obviously, detection is again a key biassing agent – I have considered in this post that we were always able to detect mink wherever they were present. Therefore, zeroes are actual absences. However, if the detection is imperfect, then zeroes would be a mix of real and false absences (the species is present but was not detected), messing up the inferences that can be obtained from any statistical model. Occupancy models can successfully deal with this problem. In any case, I strongly recommend reading the review paper by Guillera-Arroita et al. to get an idea of the existing tools and complexities of modelling distributions.

Here is a very good friend of mine checking a fish trap that we had set for estimating fish abundance. This was part of the fieldwork that we did for a study on the impact of the American mink on native prey in the Natural Park Las Batuecas-Sierra de Francia (Spain)

An annotated R script for conducting the logistic regression can be found in my GitHub repository:

No comments:

Post a Comment