Chapter 2 of CWP identifies three frameworks for testing hypotheses in conservation biology:
Classical statistics and null hypothesis testing.
Model selection based on information theory.
Bayesian statistics.
We’re not going into Bayesian statistics in this class (the Bayesian approach is very valuable but too complex to cover here). In my opinion, the first two approaches are not nearly as distinct as Chapter 2 suggests, and many of the weaknesses of classical statistics described in Chapter 2 are not fundamental weaknesses but rather examples of weak application of the method or poor reporting of results. You cannot do ecological research without understanding and using both approaches, but keep in mind that they will often lead you to the same fundamental inferences.
We’ll use a data set on factors that affect herd sizes of ungulates in Kenya to illustrate these methods, and how they can be used to make ecological inferences. The variables in the dataset (after some processing shown below) are:
GroupSize - size of the ungulate herd
Species - species of the ungulate herd
DistPred - distance from the herd to the nearest predator in kilometers
Pred.LT.400m - distance to predator converted to a categorical variable: Present = nearer than a threshold distance, Absent = farther than the threshold. A species-specific threshold was used for each ungulate species (see below).
PredSpecies - species of the nearest predator, LI = lion, HY = spotted hyena
BushWoodGrass - vegetation type as a categorical variable with three levels (B = bush, W = woodland, G = grassland)
HabOpen.Close - vegetation type as a simpler categorical variable with only two levels (O = open, C = closed)
DistWood.m - distance to the nearest edge between vegetation types (woodland edge), in meters, recorded in broad categories because it was visually estimated.
HerdType - single species or mixed species herd.
Often, you have a well-defined hypothesis and simply want to know whether the data support it. For example, you might hypothesize that ungulate herd sizes should increase when predators are near (because natural selection has favored behavior that reduces the risk of predation). Most simply, you can test this hypothesis with a simple regression approach, examining the relationship of herd size (y) to the proximity of predators (x) and testing whether the slope differs from zero. If you believe that other variables (like vegetation type, or distance to water) are also likely to affect herd size, you can use a multiple regression approach: in this case, you examine the relationship of herd size (y) to several predictor variables (x1 = predator proximity, x2 = vegetation type, x3 = distance to water, x4 = season, etc.). The predictors can be a combination of continuous variables (like distance to water) and categorical variables (like season or vegetation type).
As part of a multiple regression approach, you can use stepwise regression to identify which predictor variables should be included in a model of herd sizes. This can be done by backward stepwise regression, which starts with a large set of possibly important predictors and eliminates the weakest predictors one by one until only strong predictors remain, or by forward stepwise regression, which adds the strongest predictor first and keeps adding predictors until none of the remaining candidates explains much additional variation in herd size.
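If you want R to automate the stepwise search, the step() function (in the stats package that loads with base R; stepAIC() in the MASS package is similar) will do backward, forward or bidirectional selection, choosing predictors by AIC rather than by P-values. Here is a minimal sketch using simulated data, since we have not read in the real data yet (every object name in it is hypothetical, for illustration only):
# simulate a small data set with one response and three candidate predictors (illustration only)
set.seed(1)
fake.data <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
# backward stepwise: start with all predictors and drop the least useful one at a time
full.model <- lm(y ~ x1 + x2 + x3, data = fake.data)
step(full.model, direction = "backward")
# forward stepwise: start from an intercept-only model and add the strongest predictor first
null.model <- lm(y ~ 1, data = fake.data)
step(null.model, scope = ~ x1 + x2 + x3, direction = "forward")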
Examples of all of these follow.
First, load some packages that will be used below. (Make sure you have these packages installed before loading them, as discussed in the first R lab.)
library(ggplot2)
library(MASS)
library(msm)
library(sandwich)
library(ggthemes)
library(plyr)
library(digest)
Clean up the memory, read in the data, inspect it, do some simple data processing and produce some summary statistics.
You will first have to set the working directory to the location where you’ve saved the data file, using setwd() or the Set Working Directory option under the Session menu in RStudio. The syntax for setwd() is setwd("C:/Users/screel/Desktop/kenya_behavior_paper/Kenya_herd_size"). Notice the direction of the slashes (forward slashes, even on Windows).
I normally save data files to be read in the same folder as the R script that reads them, so that there is no need to change the working directory.
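If you are not sure where R is currently looking, you can check the working directory and list the files R can see there (the commented setwd() path below is just a placeholder):
getwd()       # show the current working directory
list.files()  # the data file kenyaherdsize2.txt should appear in this list
# setwd("C:/your/path/here")  # placeholder path - change it to wherever you saved the file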
rm(list=ls(all=TRUE))
kenyaherdsize2 = read.table("kenyaherdsize2.txt", header = TRUE, sep = ",", fill= TRUE, dec = ".")
kenya.herdsize <- as.data.frame(kenyaherdsize2)
attach(kenya.herdsize)
You do not have to examine the code just below in detail for now. It uses if() statements within a for() loop to step through all the rows in the data frame and insert values in a categorical variable (Pred.LT.400m) for predator presence/absence, using species-specific distance thresholds (which came from an independent analysis not shown here). We will go through ‘control structures’ like for() loops and ‘conditional execution’ statements like if() statements later in the class. (There are other ways to do this, but I want to briefly introduce these functions in a fairly simple context.)
# classify each observation as predator Present or Absent, using a species-specific threshold distance
for (i in 1:nrow(kenya.herdsize)) {
if ((kenya.herdsize$Species[i] == "Zebra") && (kenya.herdsize$DistPred[i] <= 1.5))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Zebra") && (kenya.herdsize$DistPred[i] > 1.5))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Wildbst") && (kenya.herdsize$DistPred[i] <= 0.5))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Wildbst") && (kenya.herdsize$DistPred[i] > 0.5))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Impala") && (kenya.herdsize$DistPred[i] <= 1.7))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Impala") && (kenya.herdsize$DistPred[i] > 1.7))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Grant") && (kenya.herdsize$DistPred[i] <= 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Grant") && (kenya.herdsize$DistPred[i] > 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Giraffe") && (kenya.herdsize$DistPred[i] <= 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Giraffe") && (kenya.herdsize$DistPred[i] > 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
}
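As mentioned above, there are more concise ways to get the same result. One option (just a sketch, using the same species names and thresholds as the loop above) is to build a small lookup table of species-specific thresholds, match it to each row, and use ifelse() instead of a loop:
# vectorized alternative to the loop above: same thresholds, same classification
thresholds <- data.frame(Species = c("Zebra", "Wildbst", "Impala", "Grant", "Giraffe"), cutoff = c(1.5, 0.5, 1.7, 1.0, 1.0))
# look up the threshold that applies to each row of the data, then classify in one step
cutoff.for.row <- thresholds$cutoff[match(kenya.herdsize$Species, thresholds$Species)]
kenya.herdsize$Pred.LT.400m <- ifelse(kenya.herdsize$DistPred <= cutoff.for.row, "Present", "Absent")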
Delete observations more than 2 km from predators because graphical analysis (not shown) suggests there is a lot of statistical ‘noise’ and some sign of increased antipredator response for distances beyond this point (both signatures of misclassification of predator absence when distances become too large).
reduced.data <-subset(kenya.herdsize, DistPred < 2.0, select = c(DistPred, PredSpecies, GroupSize, Species,BushWoodGrass, HabOpen.Close, Pred.LT.400m, DistWood.m., HerdType))
#convert the distance to woodland edge, in meters (DistWood.m) into a categorical factor
reduced.data$DistWood <- factor(reduced.data$DistWood.m.)
levels(reduced.data$DistWood)
## [1] "" ">300" "1-30" "101-300" "31-100"
summary(reduced.data$DistWood)
## >300 1-30 101-300 31-100
## 64 17 18 102 58
# Rename levels of the factor, one name at a time, and remove some levels that are not useful
# (there is a more concise way to do this all in one line, but I am just using copy/edit here to keep it clear)
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="1-30"] <- "Edge"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="31-100"] <- "Near"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="101-300"] <- "Far"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)==">300"] <- "Very Far"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="0"] <- NA
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)== "50"] <- NA
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)==""] <- NA
summary(reduced.data$DistWood)
## Very Far Edge Far Near NA's
## 17 18 102 58 64
Inspect the processed data and use it from this point on.
head(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass HabOpen.Close
## 1 0.883 LI 5 Giraffe B O
## 2 0.883 LI 2 Giraffe B O
## 3 0.050 LI 8 Giraffe B C
## 5 1.300 LI 6 Giraffe W C
## 7 0.175 LI 4 Giraffe G O
## 8 0.100 LI 4 Giraffe W O
## Pred.LT.400m DistWood.m. HerdType DistWood
## 1 Present Mixed <NA>
## 2 Present Mixed <NA>
## 3 Present 31-100 Single Near
## 5 Absent 31-100 Single Near
## 7 Present >300 Mixed Very Far
## 8 Present 101-300 Single Far
tail(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass HabOpen.Close
## 355 0.065 LI 6 Zebra G O
## 357 0.711 LI 13 Zebra G O
## 358 1.810 LI 10 Zebra W O
## 360 0.060 LI 4 Zebra W C
## 361 1.810 LI 8 Zebra W O
## 362 1.500 LI 6 Zebra B O
## Pred.LT.400m DistWood.m. HerdType DistWood
## 355 Present 101-300 Mixed Far
## 357 Present >300 Mixed Very Far
## 358 Absent 101-300 Single Far
## 360 Present 101-300 Single Far
## 361 Absent 101-300 Mixed Far
## 362 Present 31-100 Mixed Near
summary(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass
## Min. :0.010 HY: 72 Min. : 1.0 Giraffe:25 B: 77
## 1st Qu.:0.284 LI:187 1st Qu.: 2.0 Grant :81 G: 73
## Median :0.779 Median : 6.0 Impala :30 W:109
## Mean :0.807 Mean : 10.6 Wildbst:52
## 3rd Qu.:1.270 3rd Qu.: 13.0 Zebra :71
## Max. :1.980 Max. :118.0
## HabOpen.Close Pred.LT.400m DistWood.m. HerdType DistWood
## C: 54 : 0 : 64 Mixed :142 Very Far: 17
## O:205 Absent : 75 >300 : 17 Single:117 Edge : 18
## Present:184 1-30 : 18 Far :102
## 101-300:102 Near : 58
## 31-100 : 58 NA's : 64
##
detach(kenya.herdsize)
attach(reduced.data)
Calculate standard summary statistics for herd size - the mean, sample size, standard deviation and standard error. First, use a general method to illustrate the tapply() function. tapply() allows you to apply a calculation to a variable or set of variables, broken down by one or more grouping factors. Here, the calculations are the mean, the standard deviation and the sample size. The arguments are, in order: the variable to summarize (GroupSize), the grouping factor(s) that define the subsets (here the interaction of Species and Pred.LT.400m, with drop = TRUE removing empty combinations), and the function to apply:
means <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), mean)
SDs <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), sd)
Ns <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), length)
#get standard error of the mean from std deviation and sample size
SEs <- SDs/sqrt(Ns)
#view the summary statistics
means
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 4.600 7.586 5.250 13.970
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 10.250 3.550 8.712 9.654
## Wildbst.Present Zebra.Present
## 4.474 16.776
SDs
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 2.408 6.039 4.787 19.066
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 1.708 2.328 8.118 7.244
## Wildbst.Present Zebra.Present
## 7.912 19.977
Ns
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 5 29 4 33
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 4 20 52 26
## Wildbst.Present Zebra.Present
## 19 67
SEs
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 1.0770 1.1214 2.3936 3.3190
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 0.8539 0.5205 1.1257 1.4207
## Wildbst.Present Zebra.Present
## 1.8151 2.4406
tapply() (and the other members of the apply family, see Chapter 4 of ABGR) can be used for many purposes other than just getting summary statistics – it can apply any function, and it allows calculations on subsets of the data without much coding. However, to just get summary statistics, there are specific functions mean(), sd(), etc. Here is an example using mean(), selecting a subset of the data (Species == ‘Zebra’) with a logical statement in the index of the GroupSize vector (recall that logical statements in the index were discussed in R Exercise Two as an alternative to the subset() function).
mean.zebra <- mean(GroupSize[which(Species == 'Zebra')])
mean.zebra
## [1] 16.41
#you could do the above in two steps, which is easier to understand but less efficient
zebra.groups <- GroupSize[which(Species == 'Zebra')] #subset the data
mean.zebra <- mean(zebra.groups) #determine the mean
mean.zebra
## [1] 16.41
#same result as before
Graphs often allow better understanding of summary statistics than tables do. The code below produces a publication-quality plot of the data just tabulated above. First, use the ddply() function to calculate mean group size for each species with predators absent or present (also calculate sample size [N], standard deviation and standard error). ddply() is a more flexible alternative to tapply(), from the package plyr. It subsets a dataframe, applies a function (or multiple functions) to the data and returns the results in a dataframe.
# take the data frame 'reduced.data', subset it by species and predator presence, and summarize the data by calculating
# the mean (store in a variable means$GS), the sample size (store in means$N), the standard deviation (store in means$sd)
# and the standard error of the mean (store in means$se)
means <- ddply(reduced.data, c("Species","Pred.LT.400m"),summarise,
GS = mean(GroupSize),
N = length(GroupSize),
sd = sd(GroupSize),
se = sd(GroupSize) / sqrt(length(GroupSize)))
summary(means) #summarize the means table
## Species Pred.LT.400m GS N sd
## Giraffe:2 :0 Min. : 3.55 Min. : 4.0 Min. : 1.71
## Grant :2 Absent :5 1st Qu.: 4.76 1st Qu.: 8.5 1st Qu.: 3.00
## Impala :2 Present:5 Median : 8.15 Median :23.0 Median : 6.64
## Wildbst:2 Mean : 8.48 Mean :25.9 Mean : 7.96
## Zebra :2 3rd Qu.:10.10 3rd Qu.:32.0 3rd Qu.: 8.07
## Max. :16.78 Max. :67.0 Max. :19.98
## se
## Min. :0.521
## 1st Qu.:1.088
## Median :1.273
## Mean :1.609
## 3rd Qu.:2.249
## Max. :3.319
means #display the means table
## Species Pred.LT.400m GS N sd se
## 1 Giraffe Absent 4.600 5 2.408 1.0770
## 2 Giraffe Present 3.550 20 2.328 0.5205
## 3 Grant Absent 7.586 29 6.039 1.1214
## 4 Grant Present 8.712 52 8.118 1.1257
## 5 Impala Absent 5.250 4 4.787 2.3936
## 6 Impala Present 9.654 26 7.244 1.4207
## 7 Wildbst Absent 13.970 33 19.066 3.3190
## 8 Wildbst Present 4.474 19 7.912 1.8151
## 9 Zebra Absent 10.250 4 1.708 0.8539
## 10 Zebra Present 16.776 67 19.977 2.4406
Then make a plot with these summary statistics, using the ggplot() function of the ggplot2 package.
# plot of mean group size with error bars, observations > 2 km from predators discarded
# and species-specific threshold distances for predator presence
p = ggplot(means, aes(x= Species, y=GS, colour=Pred.LT.400m)) +
geom_point(position=position_dodge(.2)) +
geom_errorbar(aes(ymin=GS-se, ymax=GS+se), width=.2, position=position_dodge(.2)) +
theme_bw() +
scale_x_discrete(name = "", breaks=c("Giraffe", "Grant", "Impala", "Wildbst", "Zebra"),
labels=c("Giraffe", "Gazelle", "Impala", "Wildebeest", "Zebra")) +
scale_y_continuous(name="Herd Size") +
scale_colour_discrete(name="Predator is:") +
theme(legend.justification=c(0.2,0.92), legend.position=c(0.20,0.91))
p
## ymax not defined: adjusting position using y instead
Simple Regression Example
You can fit a simple regression using least squares (as Chapter 2 discusses) or by maximum likelihood (which is not unique to the model-selection approach, as Chapter 2 tends to imply).
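To see concretely that these are not competing frameworks, note that fitting a straight line by least squares and fitting the same line by maximum likelihood with a normal (gaussian) error distribution give identical coefficient estimates. A minimal sketch (the object names mod.ls and mod.ml are just for illustration; lm() and glm() are both explained below):
# the same straight-line model fit by least squares (lm) and by maximum likelihood (glm with
# normal errors and an identity link); the estimated intercept and slope should be identical
mod.ls <- lm(GroupSize ~ DistPred, data = reduced.data)
mod.ml <- glm(GroupSize ~ DistPred, family = gaussian(link = "identity"), data = reduced.data)
coef(mod.ls)
coef(mod.ml)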
Simple regression using ordinary least squares (OLS) regression
The lm() function fits a linear model by least squares regression.
the formula argument specifies that y is a function of x, with the code y ~ x
the data argument tells lm() where to find the variables in the formula
the na.action = na.omit argument tells lm() to omit cases with NA, the code for missing data
mod.ols <- lm(formula = GroupSize ~ DistPred, data = reduced.data, na.action = na.omit)
Equivalently but using shorter code:
mod.ols <- lm(GroupSize ~ DistPred, reduced.data, na.action = na.omit)
Examine the OLS simple regression results in two ways:
summary(mod.ols)
##
## Call:
## lm(formula = GroupSize ~ DistPred, data = reduced.data, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.50 -8.23 -4.08 2.19 107.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.004 1.561 6.41 7e-10 ***
## DistPred 0.761 1.606 0.47 0.64
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 257 degrees of freedom
## Multiple R-squared: 0.000872, Adjusted R-squared: -0.00302
## F-statistic: 0.224 on 1 and 257 DF, p-value: 0.636
anova(mod.ols)
## Analysis of Variance Table
##
## Response: GroupSize
## Df Sum Sq Mean Sq F value Pr(>F)
## DistPred 1 44 44 0.22 0.64
## Residuals 257 50357 196
Overall, there is very little association between herd size and distance to the closest predator when we use simple regression to examine this single predictor and pool the data across several prey species. The output shows that the regression coefficient (0.761, the slope of the line relating GroupSize to DistPred) is small relative to its standard error (1.606), so the test of whether this slope differs from zero gives a small t-statistic (0.47) with a large P-value (0.64): a slope estimate at least this far from zero would arise by chance about 64% of the time even if herd size were unrelated to predator distance.
In addition, the coefficient of determination (R-squared) is almost zero. R-squared can range from zero to one, where 1 means that 100% of the variation in GroupSize is explained by DistPred and 0 means that there is no relationship between the two variables.
You can examine whether the data meet the assumptions of OLS regression graphically, using the plot() function. plot() is a generic function: applied to an object holding output from lm(), it produces a set of four diagnostic plots:
par(mfrow= c(2,2)) # this sets the plot display window to have 2 rows and 2 columns for 4 plots
plot(mod.ols)
par(mfrow = c(1,1)) # reset plot window to single pane
The two left panels show the residuals (top) and the square root of the standardized residuals (bottom) as functions of the predicted group size; the bottom one is known as a ‘scale-location’ plot because it relates a measure of scale (spread of the residuals) to a measure of location (the fitted value). You’d like to see no slope and no obvious pattern in these plots, and no major outliers. Outliers are labeled with their row numbers.
The top right plot is a quantile-quantile or ‘Q-Q’ plot, which plots the observed (standardized) residuals against the values expected if the residuals were normally distributed. In a Q-Q plot, the points fall along the dashed y = x line for a model that fits the data well. Obvious, systematic deviation from the line indicates a problem with the model’s fit to the data… as can be seen here, for reasons addressed below.
The bottom right plot relates residuals to ‘leverage’. Leverage is a measure of how much a single point influences the slope of the regression line. In this plot, you’d like to see no overall slope and few points with both high leverage and large residuals, which would indicate that a few unusual observations have a large influence on the slope of the regression line.
This regression model nicely illustrates the value of visually inspecting data summaries. From the plot of means above, you already know that the effect of predator presence differs among prey species… a regression fit to data for all of the species combined masks those effects. You’d get a very different result if you examined each species separately.
Run the same model, just for wildebeest:
mod.ols.wildebeest <- lm(formula = GroupSize ~ DistPred, data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred)), na.action = na.omit)
summary(mod.ols.wildebeest)
##
## Call:
## lm(formula = GroupSize ~ DistPred, data = subset(reduced.data,
## Species == "Wildbst", select = c(GroupSize, DistPred)), na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.74 -8.26 -4.79 1.10 78.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.73 4.17 1.13 0.26
## DistPred 7.11 4.32 1.64 0.11
##
## Residual standard error: 16.2 on 50 degrees of freedom
## Multiple R-squared: 0.0513, Adjusted R-squared: 0.0324
## F-statistic: 2.71 on 1 and 50 DF, p-value: 0.106
A stronger relationship is now apparent. The regression coefficient (slope) is about 10 times larger than before (7.11 vs 0.76; the ‘effect size’ is roughly 10X greater), and it is larger relative to its standard error, so the t-statistic is bigger (1.64) and the P-value smaller (0.106). The result is still not “significant” if one adheres strictly to the convention that P < 0.05 identifies a statistically significant result, and R-squared is still fairly small.
GLM: Simple regression using maximum likelihood to fit a generalized linear model
The Q-Q plot for the OLS regression indicated a problem with the fit of the model to the data. OLS regression assumes that the dependent variable (more precisely, the residuals) is normally distributed, but group size is a classic example of a ‘count variable’: it can only take integer values (you’ll never see 2.71 wildebeest) and group sizes are typically small. This sort of variable usually follows a Poisson distribution rather than a normal distribution.
Generalized Linear Models (GLMs) are more flexible than OLS regression models, because:
a GLM lets you specify one of several distributions for the dependent variable
a GLM lets you specify one of several ‘link’ functions between the dependent variable (y) and the independent variable(s) (x).
For each distribution, there is a default link function that is usually most logical. You do not need to worry about changing the link function for now.
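If you are curious, you can see the default link for any distribution by printing the link element of the corresponding family object:
# each family object stores its default link function
poisson()$link   # "log" - the default for count data
gaussian()$link  # "identity" - the default for normally distributed data
binomial()$link  # "logit" - the default for presence/absence or proportion data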
This flexibility will often improve the fit of a GLM relative to an OLS regression, yielding better inferences. R Exercises 3B and 3C will provide more detail on GLMs, which are fit in R with the glm() function; it works much like lm().
mod.glm.wildebeest <- glm(formula = GroupSize ~ DistPred, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred)), na.action = na.omit)
summary(mod.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred, family = poisson(link = "log"),
## data = subset(reduced.data, Species == "Wildbst", select = c(GroupSize,
## DistPred)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.836 -3.306 -2.637 0.391 15.109
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7624 0.0905 19.48 < 2e-16 ***
## DistPred 0.6526 0.0801 8.14 3.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 862.49 on 50 degrees of freedom
## AIC: 1033
##
## Number of Fisher Scoring iterations: 6
With this more appropriate model, we detect a very strong relationship between wildebeest group sizes and distance to the nearest predator. The slope (regression coefficient = 0.65) is large in comparison to its standard error (0.08), so the z-statistic (8.14) testing whether this slope differs from zero has a P-value close to zero (3.8e-16)… there is almost no chance that we’d see a relationship this strong in our sample unless group sizes were actually related to predator proximity. Notice also that wildebeest actually form SMALLER groups when they are near predators (the slope relating group size to distance from predators is positive).
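Because the Poisson GLM uses a log link, the slope is on the log scale; exponentiating it converts it to a multiplicative effect on group size. A quick back-transformation (using the estimate from the summary above):
# back-transform the slope from the log scale to a multiplicative effect on group size
exp(coef(mod.glm.wildebeest)["DistPred"])
# roughly exp(0.65) = 1.9, i.e. expected group size nearly doubles with each
# additional kilometer of distance from the nearest predator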
Multiple Regression Example
The point just made leads nicely into multiple regression. Simple regression predicts the dependent variable (y) using just one independent variable or predictor (x). The regression coefficient from the model is the slope of the fitted relationship. In multiple regression, we predict y on the basis of more than one x. For each predictor variable, we get a partial regression coefficient, which is the relationship between y and that x-variable after controlling for the effects of the other x variables (predictors).
You can do this using the lm() function to fit an OLS regression, or with glm() to fit a generalized linear model. Given what we learned above, let’s fit a GLM with multiple predictors.
mod.multiple.glm.wildebeest <- glm(formula = GroupSize ~ DistPred + PredSpecies + HabOpen.Close, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
summary(mod.multiple.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred + PredSpecies + HabOpen.Close,
## family = poisson(link = "log"), data = subset(reduced.data,
## Species == "Wildbst", select = c(GroupSize, DistPred,
## PredSpecies, HabOpen.Close)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.60 -3.45 -2.64 1.38 14.51
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2973 0.1743 7.44 9.7e-14 ***
## DistPred 0.8232 0.0927 8.88 < 2e-16 ***
## PredSpeciesLI 0.3654 0.1102 3.31 0.00092 ***
## HabOpen.CloseO 0.0591 0.1174 0.50 0.61468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 849.96 on 48 degrees of freedom
## AIC: 1025
##
## Number of Fisher Scoring iterations: 6
Examining the model summary, we still see a strong effect of proximity to predators, and the magnitude of the effect (the value of the regression coefficient or slope, 0.82) has not changed a lot (it was 0.65 in the simple regression). Seeing that the regression coefficient for a given predictor is fairly stable from model to model increases confidence in the result. We also see that when the closest predator is a lion (PredSpecies = LI), herd size is significantly larger than when it is a hyena. Last, we see that herds are a little larger, but not significantly so, in open habitats (HabOpen.Close = O for open, relative to C for closed).
Stepwise multiple regression
The model summary suggests that our multiple regression might be more complex than necessary, because the predictor for habitat type did little to predict wildebeest herd sizes. The drop1() function implements backward elimination: it drops each predictor in turn and uses the change in deviance to calculate a likelihood ratio test (LRT). We’ll go into deviance in more detail in R Exercise 3B, but for now just recall from Chapter 2 of CWP that a low deviance means that the model fits well.
drop1(mod.multiple.glm.wildebeest, test = "LRT")
## Single term deletions
##
## Model:
## GroupSize ~ DistPred + PredSpecies + HabOpen.Close
## Df Deviance AIC LRT Pr(>Chi)
## <none> 850 1025
## DistPred 1 926 1099 75.8 < 2e-16 ***
## PredSpecies 1 861 1034 11.1 0.00088 ***
## HabOpen.Close 1 850 1023 0.3 0.61255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A LRT compares the deviances of two models. In the drop1() function, we start with the multiple regression saved in mod.multiple.glm.wildebeest, which had a residual deviance of 849.96. This model is shown in the ‘Model’ statement at the top of the output, and its deviance (rounded to 850) is listed in the top row of the table, labeled <none> because no term has been dropped. Each row below gives the deviance and AIC of the model with one predictor removed; the LRT column is the increase in deviance caused by dropping that predictor, and Pr(>Chi) tests whether an increase that large would be expected by chance if the dropped predictor had no real effect.
In this case, LRTs using drop1() suggest we should keep DistPred and PredSpecies and drop habitat from the final model.
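You can carry out the same test explicitly by fitting the model without the habitat term and comparing it to the full model with anova(); a quick sketch (the name mod.no.habitat is just for illustration, and the result should match the HabOpen.Close row of the drop1() table, up to rounding):
# drop the habitat term from the full model and compare the two fits with a likelihood ratio test
mod.no.habitat <- update(mod.multiple.glm.wildebeest, . ~ . - HabOpen.Close)
anova(mod.no.habitat, mod.multiple.glm.wildebeest, test = "LRT")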
mod.multiple.glm.wildebeest <- glm(formula = GroupSize ~ DistPred + PredSpecies, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
summary(mod.multiple.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred + PredSpecies, family = poisson(link = "log"),
## data = subset(reduced.data, Species == "Wildbst", select = c(GroupSize,
## DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.72 -3.43 -2.63 1.26 14.55
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3395 0.1519 8.82 < 2e-16 ***
## DistPred 0.8229 0.0925 8.90 < 2e-16 ***
## PredSpeciesLI 0.3755 0.1081 3.47 0.00051 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 850.22 on 49 degrees of freedom
## AIC: 1023
##
## Number of Fisher Scoring iterations: 6
From a hypothesis testing perspective, this model indicates that there is an association between wildebeest herd size and distance to predators, but not in the direction hypothesized at the outset of this exercise, and that the strength of grouping responses differs in reaction to different predators.
To show the results graphically, using ggplot() again:
#build the plot using the ggplot() convention of making the basic plot and then adding layers and modifications with +
p1 = ggplot(data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies)), aes(x= DistPred, y=GroupSize, colour=PredSpecies)) +
stat_smooth(method = 'lm') +
geom_point() +
scale_y_continuous(name="Herd Size") +
scale_x_continuous(name = "Distance to Predator (km)")
#display the plot
p1
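The straight lines above come from stat_smooth(method = 'lm'), i.e. separate ordinary least-squares fits for each predator species. If you want the plotted fits to match the Poisson GLM used above, recent versions of ggplot2 let you pass the error distribution to the smoother through method.args; a sketch (p2 is just an illustrative name):
# same plot, but with Poisson GLM fits (curves on the original scale) instead of straight OLS lines
p2 = ggplot(data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies)), aes(x= DistPred, y=GroupSize, colour=PredSpecies)) +
stat_smooth(method = 'glm', method.args = list(family = poisson(link = "log"))) +
geom_point() +
scale_y_continuous(name="Herd Size") +
scale_x_continuous(name = "Distance to Predator (km)")
p2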