Simple models for distributions
In my last post, I said that ecology is
fundamentally the study of the distribution and
abundance of living things. On that last post, I introduced a very simple model for abundance, so it was just logical to follow-up with an entry on
simple models for distribution. I will define distribution as presence/absence
of the target species on a given site. I am going to follow the structure of
the previous entry to make any comparisons easy.
Abundance and presence/absence are not independent ecological realities, rather they are different measures of the same underlying process. This is because the presence of a species on a site is just what happens when the abundance on that site is ≥ 1. If there is, at least, one individual, then the species is present. Oppositely, absence is when the abundance at a site is zero. There is an ever-growing, large, field of research on models for studying the distribution of species, so-called Species Distribution Models SDMs, but to keep it simple I am going to focus on the simplest and most frequently used approach. If you are interested in modelling distributions, I strongly suggest reading the review paper written by Gurutzeta Guillera-Arroita et al. and the blog post by Gurutzeta where she summarises the paper.
Presence/absence data are frequently
coded as 0 (absence) and 1 (presence). The Bernoulli distribution is the
appropriate probability distribution for modelling this kind of data. Note that
the Bernoulli distribution is a special case of the Binomial distribution,
whereby n, the number of trials, is
1. We can model presence/absence data using a logistic regression (Binomial-logit GLM), and it is straightforward to do it in
R. In a logistic regression, the relationship between presence and the factors of
interest is linear on the logit scale, and the presence follows a Bernoulli distribution
with probability of presence pi:
logit(pi)
= intercept + slope * xi
Presence
~ Bernoulli(pi)
where xi is
the factor of interest. Let’s do an
example in R: in this case, we are
interested in how the abundance of prey in a river section influences the distribution of the
American mink. I have published a paper demonstrating that prey abundance drives
the reproductive biology of American mink in Spain, so I do expect that prey
abundance would also govern mink distribution.
An American mink carefully checks me out; river Tormes Salamanca (Spain) |
Let's get started by simulating some data in R:
#### sample.size: number of river sections surveyed for mink, n = 30
set.seed(1000)
sample.size<-30
#### Simulate prey abundance data for the river sections using a Poisson
distribution;
#### the mean abundance of prey per section would be 5.
prey.ab<-rpois(sample.size,
5)
#### Now, let’s define the parameters of the logistic
regression for the relationship between
#### the mink presence and the prey abundance
intercept.sim=-5 #### Intercept
slope.sim=1.5 #### Slope
slope.sim=1.5 #### Slope
#### Define the probability of
presence of mink as a function of prey
abundance
prob.pres<-plogis(intercept.sim+slope.sim*prey.ab)
#### Finally, obtain the presence of mink in each river
section by drawing values from a
#### Bernoulli distribution with probability prob.pres
#### Remember that a Bernoulli distribution is a Binomial distribution with size = 1
#### Remember that a Bernoulli distribution is a Binomial distribution with size = 1
presence<-rbinom(sample.size, size=1, p=prob.pres)
##### Let’s have a look at the
relationship between the mink presence and the
#### prey abundance
plot(presence~prey.ab, xlab="Prey abundance", ylab="Mink
presence", pch=19, cex=3)
![]() | |
|
Time for the formal statistical inference. We fit a
logistic regression to the simulated data to assess whether the prey abundance
influences the presence of mink.
#### Fit a logistic regression to the mink presence
(dependent variable) and
#### prey abundance data (independent).
fit2<-glm(presence~prey.ab,
family=binomial())
#### Have a look at the summary
Table.
summary(fit2)
The ‘Estimate’ column shows that ‘prey.ab’ has a strong
positive effect, and the column 'Pr(>|z|)’ indicates that the effect is
statistically significant (p <0.05).
The analysis confirms that the probability of the presence of mink increases
with increasing prey abundance.
![]() | |
|
Whenever you have simulated data,
it is always a good idea to check how well your model recovers the simulated parameters. I did a quick check in the
previous post, so let’s do it again for the logistic regression example. Our
simulated parameters were -5 (intercept)
and 1.5, and we see that the estimates from the fitted model (column ‘Estimate’)
compare nicely with the simulated values. Calculating the 95% Confidence
Intervals confirms that the model is doing well at recovering the original parameter
values, although the confidence intervals for both parameters are quite large:
#### Calculate the 95% Confidence
Intervals
conf.int<-confint(fit2)
conf.int
![]() | |
|
This was a very simple and nice example to
get started with modelling presence/absence data. As I
stressed in the case of modelling
abundance, modelling real presence/absence data is frequently a more complex and challenging
task. Obviously, detection is again a key biassing
agent – I have considered in this post that we were always able to detect mink
wherever they were present. Therefore, zeroes are actual absences. However, if
the detection is imperfect, then zeroes would be a mix of real and false
absences (the species is present but was not detected), messing up the inferences that can be obtained from any statistical model. Occupancy models can
successfully deal with this problem. In any case, I strongly recommend reading
the review paper by Guillera-Arroita et al. to get an idea of the existing
tools and complexities of modelling distributions.
Here is a very good friend of mine checking a fish trap that we had set for estimating fish abundance. This was part of the fieldwork that we did for a study on the impact of the American mink on native prey in the Natural Park Las Batuecas-Sierra de Francia (Spain) |
An annotated R script for conducting the logistic regression can be found in my GitHub repository: https://github.com/pablogarciadiaz/Code-Blog/blob/master/Logistic-Regression.R
No comments:
Post a Comment