Chapter 2 of CWP identifies three frameworks for testing hypotheses in conservation biology:
Classical statistics and null hypothesis testing.
Model selection based on information theory.
Bayesian statistics.
We’re not going into Bayesian statistics in this class (the Bayesian approach is very valuable but too complex to cover here). In my opinion, the first two approaches are not nearly as distinct as Chapter 2 suggests, and many of the weaknesses of classical statistics described in Chapter 2 are not fundamental weaknesses but rather examples of weak application of the method or poor reporting of results. You cannot do ecological research without understanding and using both approaches, but keep in mind that they will often lead you to the same fundamental inferences.
We’ll use a data set on factors that affect herd sizes of ungulates in Kenya to illustrate these methods, and how they can be used to make ecological inferences. The variables in the dataset (after some processing shown below) are:
GroupSize - size of the ungulate herd
Species - species of the ungulate herd
DistPred - distance from the herd to the nearest predator in kilometers
Pred.LT.400m - distance to predator converted to a categorical variable: Present = nearer than a threshold distance, Absent = farther than the threshold. A species-specific threshold was used for each ungulate species (see below).
PredSpecies - species of the nearest predator, LI = lion, HY = spotted hyena
BushWoodGrass - vegetation type as a categorical variable with three levels (B = bush, W = woodland, G = grassland)
HabOpen.Close - vegetation type as a simpler categorical variable with only two levels (O = open, C = closed)
DistWood.m - distance to the nearest edge between vegetation types (woodland edge), in meters, recorded in broad categories because it was visually estimated.
HerdType - single species or mixed species herd.
Often, you have a well-defined hypothesis and simply want to know whether the data support it. For example, you might hypothesize that ungulate herd sizes should increase when predators are near (because natural selection has favored behavior that reduces the risk of predation). Most simply, you can test this hypothesis with a simple regression approach, examining the relationship of herd size (y) to the proximity of predators (x) and testing whether the slope differs from zero. If you believe that other variables (like vegetation type, or distance to water) are also likely to affect herd size, you can use a multiple regression approach: in this case, you examine the relationship of herd size (y) to several predictor variables (x1 = predator proximity, x2 = vegetation type, x3 = distance to water, x4 = season, etc.). The predictors can be a combination of continuous variables (like distance to water) and categorical variables (like season or vegetation type).
As part of a multiple regression approach, you can use stepwise regression to identify which predictor variables should be included in a model of herd sizes. This can be done by backward stepwise regression, which starts with a large set of possibly important predictors and eliminates the weakest predictors one by one until only strong predictors remain, or by forward stepwise regression, which adds the strongest predictor first and keeps adding predictors until none of the remaining candidates explains much additional variation in herd size.
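If you want R to automate the stepwise search, the step() function (in the stats package that loads with base R; stepAIC() in the MASS package is similar) will do backward, forward or bidirectional selection, choosing predictors by AIC rather than by P-values. Here is a minimal sketch using simulated data, since we have not read in the real data yet (every object name in it is hypothetical, for illustration only):
# simulate a small data set with one response and three candidate predictors (illustration only)
set.seed(1)
fake.data <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50), x3 = rnorm(50))
# backward stepwise: start with all predictors and drop the least useful one at a time
full.model <- lm(y ~ x1 + x2 + x3, data = fake.data)
step(full.model, direction = "backward")
# forward stepwise: start from an intercept-only model and add the strongest predictor first
null.model <- lm(y ~ 1, data = fake.data)
step(null.model, scope = ~ x1 + x2 + x3, direction = "forward")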
Examples of all of these follow.
First, load some packages that will be used below. (Make sure you have these packages installed before loading them, as discussed in the first R lab.)
library(ggplot2)
library(MASS)
library(msm)
library(sandwich)
library(ggthemes)
library(plyr)
library(digest)
Clean up the memory, read in the data, inspect it, do some simple data processing and produce some summary statistics.
You will first have to set the working directory to the location where you’ve saved the data file, using setwd() or the Set Working Directory option under the Session menu in RStudio. The syntax for setwd() is setwd("C:/Users/screel/Desktop/kenya_behavior_paper/Kenya_herd_size"). Notice the direction of the slashes (forward slashes, even on Windows).
I normally save data files to be read in the same folder as the R script that reads them, so that there is no need to change the working directory.
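If you are not sure where R is currently looking, you can check the working directory and list the files R can see there (the commented setwd() path below is just a placeholder):
getwd()       # show the current working directory
list.files()  # the data file kenyaherdsize2.txt should appear in this list
# setwd("C:/your/path/here")  # placeholder path - change it to wherever you saved the file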
rm(list=ls(all=TRUE))
kenyaherdsize2 = read.table("kenyaherdsize2.txt", header = TRUE, sep = ",", fill= TRUE, dec = ".")
kenya.herdsize <- as.data.frame(kenyaherdsize2)
attach(kenya.herdsize)
You do not have to examine the code just below in detail for now. It uses if() statements within a for() loop to step through all the rows in the data frame and insert values in a categorical variable (Pred.LT.400m) for predator presence/absence, using species-specific distance thresholds (which came from an independent analysis not shown here). We will go through ‘control structures’ like for() loops and ‘conditional execution’ statements like if() statements later in the class. (There are other ways to do this, but I want to briefly introduce these functions in a fairly simple context.)
# classify each observation as predator Present or Absent, using a species-specific threshold distance
for (i in 1:nrow(kenya.herdsize)) {
if ((kenya.herdsize$Species[i] == "Zebra") && (kenya.herdsize$DistPred[i] <= 1.5))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Zebra") && (kenya.herdsize$DistPred[i] > 1.5))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Wildbst") && (kenya.herdsize$DistPred[i] <= 0.5))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Wildbst") && (kenya.herdsize$DistPred[i] > 0.5))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Impala") && (kenya.herdsize$DistPred[i] <= 1.7))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Impala") && (kenya.herdsize$DistPred[i] > 1.7))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Grant") && (kenya.herdsize$DistPred[i] <= 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Grant") && (kenya.herdsize$DistPred[i] > 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
if ((kenya.herdsize$Species[i] == "Giraffe") && (kenya.herdsize$DistPred[i] <= 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Present"
if ((kenya.herdsize$Species[i] == "Giraffe") && (kenya.herdsize$DistPred[i] > 1.0))
kenya.herdsize$Pred.LT.400m[i] = "Absent"
}
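As mentioned above, there are more concise ways to get the same result. One option (just a sketch, using the same species names and thresholds as the loop above) is to build a small lookup table of species-specific thresholds, match it to each row, and use ifelse() instead of a loop:
# vectorized alternative to the loop above: same thresholds, same classification
thresholds <- data.frame(Species = c("Zebra", "Wildbst", "Impala", "Grant", "Giraffe"), cutoff = c(1.5, 0.5, 1.7, 1.0, 1.0))
# look up the threshold that applies to each row of the data, then classify in one step
cutoff.for.row <- thresholds$cutoff[match(kenya.herdsize$Species, thresholds$Species)]
kenya.herdsize$Pred.LT.400m <- ifelse(kenya.herdsize$DistPred <= cutoff.for.row, "Present", "Absent")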
Delete observations more than 2 km from predators because graphical analysis (not shown) suggests there is a lot of statistical ‘noise’ and some sign of increased antipredator response for distances beyond this point (both signatures of misclassification of predator absence when distances become too large).
reduced.data <-subset(kenya.herdsize, DistPred < 2.0, select = c(DistPred, PredSpecies, GroupSize, Species,BushWoodGrass, HabOpen.Close, Pred.LT.400m, DistWood.m., HerdType))
#convert the distance to woodland edge, in meters (DistWood.m) into a categorical factor
reduced.data$DistWood <- factor(reduced.data$DistWood.m.)
levels(reduced.data$DistWood)
## [1] "" ">300" "1-30" "101-300" "31-100"
summary(reduced.data$DistWood)
## >300 1-30 101-300 31-100
## 64 17 18 102 58
# Rename levels of the factor, one name at a time, and remove some levels that are not useful
# (there is a more concise way to do this all in one line, but I am just using copy/edit here to keep it clear)
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="1-30"] <- "Edge"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="31-100"] <- "Near"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="101-300"] <- "Far"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)==">300"] <- "Very Far"
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)=="0"] <- NA
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)== "50"] <- NA
levels(reduced.data$DistWood)[levels(reduced.data$DistWood)==""] <- NA
summary(reduced.data$DistWood)
## Very Far Edge Far Near NA's
## 17 18 102 58 64
Inspect the processed data and use it from this point on.
head(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass HabOpen.Close
## 1 0.883 LI 5 Giraffe B O
## 2 0.883 LI 2 Giraffe B O
## 3 0.050 LI 8 Giraffe B C
## 5 1.300 LI 6 Giraffe W C
## 7 0.175 LI 4 Giraffe G O
## 8 0.100 LI 4 Giraffe W O
## Pred.LT.400m DistWood.m. HerdType DistWood
## 1 Present Mixed <NA>
## 2 Present Mixed <NA>
## 3 Present 31-100 Single Near
## 5 Absent 31-100 Single Near
## 7 Present >300 Mixed Very Far
## 8 Present 101-300 Single Far
tail(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass HabOpen.Close
## 355 0.065 LI 6 Zebra G O
## 357 0.711 LI 13 Zebra G O
## 358 1.810 LI 10 Zebra W O
## 360 0.060 LI 4 Zebra W C
## 361 1.810 LI 8 Zebra W O
## 362 1.500 LI 6 Zebra B O
## Pred.LT.400m DistWood.m. HerdType DistWood
## 355 Present 101-300 Mixed Far
## 357 Present >300 Mixed Very Far
## 358 Absent 101-300 Single Far
## 360 Present 101-300 Single Far
## 361 Absent 101-300 Mixed Far
## 362 Present 31-100 Mixed Near
summary(reduced.data)
## DistPred PredSpecies GroupSize Species BushWoodGrass
## Min. :0.010 HY: 72 Min. : 1.0 Giraffe:25 B: 77
## 1st Qu.:0.284 LI:187 1st Qu.: 2.0 Grant :81 G: 73
## Median :0.779 Median : 6.0 Impala :30 W:109
## Mean :0.807 Mean : 10.6 Wildbst:52
## 3rd Qu.:1.270 3rd Qu.: 13.0 Zebra :71
## Max. :1.980 Max. :118.0
## HabOpen.Close Pred.LT.400m DistWood.m. HerdType DistWood
## C: 54 : 0 : 64 Mixed :142 Very Far: 17
## O:205 Absent : 75 >300 : 17 Single:117 Edge : 18
## Present:184 1-30 : 18 Far :102
## 101-300:102 Near : 58
## 31-100 : 58 NA's : 64
##
detach(kenya.herdsize)
attach(reduced.data)
Calculate standard summary statistics for herd size - the mean, sample size, standard deviation and standard error. First, use a general method to illustrate the tapply() function. tapply() allows you to apply a calculation to a variable or set of variables, broken down by one or more grouping factors. Here, the calculations are the mean, the standard deviation and the sample size. The arguments are, in order: the variable to summarize (GroupSize), the grouping factor(s) that define the subsets (here the interaction of Species and Pred.LT.400m, with drop = TRUE removing empty combinations), and the function to apply:
means <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), mean)
SDs <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), sd)
Ns <- tapply(GroupSize, interaction(Species, Pred.LT.400m, drop = TRUE), length)
#get standard error of the mean from std deviation and sample size
SEs <- SDs/sqrt(Ns)
#view the summary statistics
means
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 4.600 7.586 5.250 13.970
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 10.250 3.550 8.712 9.654
## Wildbst.Present Zebra.Present
## 4.474 16.776
SDs
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 2.408 6.039 4.787 19.066
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 1.708 2.328 8.118 7.244
## Wildbst.Present Zebra.Present
## 7.912 19.977
Ns
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 5 29 4 33
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 4 20 52 26
## Wildbst.Present Zebra.Present
## 19 67
SEs
## Giraffe.Absent Grant.Absent Impala.Absent Wildbst.Absent
## 1.0770 1.1214 2.3936 3.3190
## Zebra.Absent Giraffe.Present Grant.Present Impala.Present
## 0.8539 0.5205 1.1257 1.4207
## Wildbst.Present Zebra.Present
## 1.8151 2.4406
tapply() (and the other members of the apply family, see Chapter 4 of ABGR) can be used for many purposes other than just getting summary statistics – it can apply any function, and it allows calculations on subsets of the data without much coding. However, to just get summary statistics, there are specific functions mean(), sd(), etc. Here is an example using mean(), selecting a subset of the data (Species == ‘Zebra’) with a logical statement in the index of the GroupSize vector (recall that logical statements in the index were discussed in R Exercise Two as an alternative to the subset() function).
mean.zebra <- mean(GroupSize[which(Species == 'Zebra')])
mean.zebra
## [1] 16.41
#you could do the above in two steps, which is easier to understand but less efficient
zebra.groups <- GroupSize[which(Species == 'Zebra')] #subset the data
mean.zebra <- mean(zebra.groups) #determine the mean
mean.zebra
## [1] 16.41
#same result as before
Graphs often allow better understanding of summary statistics than tables do. The code below produces a publication-quality plot of the data just tabulated above. First, use the ddply() function to calculate mean group size for each species with predators absent or present (also calculate sample size [N], standard deviation and standard error). ddply() is a more flexible alternative to tapply(), from the package plyr. It subsets a dataframe, applies a function (or multiple functions) to the data and returns the results in a dataframe.
# take the data frame 'reduced.data', subset it by species and predator presence, and summarize the data by calculating
# the mean (store in a variable means$GS), the sample size (store in means$N), the standard deviation (store in means$sd)
# and the standard error of the mean (store in means$se)
means <- ddply(reduced.data, c("Species","Pred.LT.400m"),summarise,
GS = mean(GroupSize),
N = length(GroupSize),
sd = sd(GroupSize),
se = sd(GroupSize) / sqrt(length(GroupSize)))
summary(means) #summarize the means table
## Species Pred.LT.400m GS N sd
## Giraffe:2 :0 Min. : 3.55 Min. : 4.0 Min. : 1.71
## Grant :2 Absent :5 1st Qu.: 4.76 1st Qu.: 8.5 1st Qu.: 3.00
## Impala :2 Present:5 Median : 8.15 Median :23.0 Median : 6.64
## Wildbst:2 Mean : 8.48 Mean :25.9 Mean : 7.96
## Zebra :2 3rd Qu.:10.10 3rd Qu.:32.0 3rd Qu.: 8.07
## Max. :16.78 Max. :67.0 Max. :19.98
## se
## Min. :0.521
## 1st Qu.:1.088
## Median :1.273
## Mean :1.609
## 3rd Qu.:2.249
## Max. :3.319
means #display the means table
## Species Pred.LT.400m GS N sd se
## 1 Giraffe Absent 4.600 5 2.408 1.0770
## 2 Giraffe Present 3.550 20 2.328 0.5205
## 3 Grant Absent 7.586 29 6.039 1.1214
## 4 Grant Present 8.712 52 8.118 1.1257
## 5 Impala Absent 5.250 4 4.787 2.3936
## 6 Impala Present 9.654 26 7.244 1.4207
## 7 Wildbst Absent 13.970 33 19.066 3.3190
## 8 Wildbst Present 4.474 19 7.912 1.8151
## 9 Zebra Absent 10.250 4 1.708 0.8539
## 10 Zebra Present 16.776 67 19.977 2.4406
Then make a plot with these summary statistics, using the ggplot() function of the ggplot2 package.
# plot of mean group size with error bars, observations > 2 km from predators discarded
# and species-specific threshold distances for predator presence
p = ggplot(means, aes(x= Species, y=GS, colour=Pred.LT.400m)) +
geom_point(position=position_dodge(.2)) +
geom_errorbar(aes(ymin=GS-se, ymax=GS+se), width=.2, position=position_dodge(.2)) +
theme_bw() +
scale_x_discrete(name = "", breaks=c("Giraffe", "Grant", "Impala", "Wildbst", "Zebra"),
labels=c("Giraffe", "Gazelle", "Impala", "Wildebeest", "Zebra")) +
scale_y_continuous(name="Herd Size") +
scale_colour_discrete(name="Predator is:") +
theme(legend.justification=c(0.2,0.92), legend.position=c(0.20,0.91))
p
## ymax not defined: adjusting position using y instead
Simple Regression Example
You can fit a simple regression using least squares (as Chapter 2 discusses) or by maximum likelihood (which is not unique to the model-selection approach, as Chapter 2 tends to imply).
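To see concretely that these are not competing frameworks, note that fitting a straight line by least squares and fitting the same line by maximum likelihood with a normal (gaussian) error distribution give identical coefficient estimates. A minimal sketch (the object names mod.ls and mod.ml are just for illustration; lm() and glm() are both explained below):
# the same straight-line model fit by least squares (lm) and by maximum likelihood (glm with
# normal errors and an identity link); the estimated intercept and slope should be identical
mod.ls <- lm(GroupSize ~ DistPred, data = reduced.data)
mod.ml <- glm(GroupSize ~ DistPred, family = gaussian(link = "identity"), data = reduced.data)
coef(mod.ls)
coef(mod.ml)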
Simple regression using ordinary least squares (OLS) regression
The lm() function fits a linear model by least squares regression.
the formula argument specifies that y is a function of x, with the code y ~ x
the data argument tells lm() where to find the variables in the formula
the na.action = na.omit argument tells lm() to omit cases with NA, the code for missing data
mod.ols <- lm(formula = GroupSize ~ DistPred, data = reduced.data, na.action = na.omit)
Equivalently but using shorter code:
mod.ols <- lm(GroupSize ~ DistPred, reduced.data, na.action = na.omit)
Examine the OLS simple regression results in two ways:
summary(mod.ols)
##
## Call:
## lm(formula = GroupSize ~ DistPred, data = reduced.data, na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -10.50 -8.23 -4.08 2.19 107.27
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.004 1.561 6.41 7e-10 ***
## DistPred 0.761 1.606 0.47 0.64
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 14 on 257 degrees of freedom
## Multiple R-squared: 0.000872, Adjusted R-squared: -0.00302
## F-statistic: 0.224 on 1 and 257 DF, p-value: 0.636
anova(mod.ols)
## Analysis of Variance Table
##
## Response: GroupSize
## Df Sum Sq Mean Sq F value Pr(>F)
## DistPred 1 44 44 0.22 0.64
## Residuals 257 50357 196
Overall, there is very little association between herd size and distance to the closest predator when we use simple regression to examine this single predictor and pool the data across several prey species. The output shows that the regression coefficient (0.761, the slope of the line relating GroupSize to DistPred) is small relative to its standard error (1.606), so the test of whether this slope differs from zero gives a small t-statistic (0.47) with a large P-value (0.64): a slope estimate at least this far from zero would arise by chance about 64% of the time even if herd size were unrelated to predator distance.
In addition, the coefficient of determination (R-squared) is almost zero. R-squared can range from zero to one, where 1 means that 100% of the variation in GroupSize is explained by DistPred and 0 means that there is no relationship between the two variables.
You can examine whether the data meet the assumptions of OLS regression graphically, using the plot() function. plot() is a generic function: applied to an object holding output from lm(), it produces a set of four diagnostic plots:
par(mfrow= c(2,2)) # this sets the plot display window to have 2 rows and 2 columns for 4 plots
plot(mod.ols)
par(mfrow = c(1,1)) # reset plot window to single pane
The two left panels show the residuals (top) and the square root of the standardized residuals (bottom) as functions of the predicted group size; the bottom one is known as a ‘scale-location’ plot because it relates a measure of scale (spread of the residuals) to a measure of location (the fitted value). You’d like to see no slope and no obvious pattern in these plots, and no major outliers. Outliers are labeled with their row numbers.
The top right plot is a quantile-quantile or ‘Q-Q’ plot, which plots the observed (standardized) residuals against the values expected if the residuals were normally distributed. In a Q-Q plot, the points fall along the dashed y = x line for a model that fits the data well. Obvious, systematic deviation from the line indicates a problem with the model’s fit to the data… as can be seen here, for reasons addressed below.
The bottom right plot relates residuals to ‘leverage’. Leverage is a measure of how much a single point influences the slope of the regression line. In this plot, you’d like to see no overall slope and few points with both high leverage and large residuals, which would indicate that a few unusual observations have a large influence on the slope of the regression line.
This regression model nicely illustrates the value of visually inspecting data summaries. From the plot of means above, you already know that the effect of predator presence differs among prey species… a regression fit to data for all of the species combined masks those effects. You’d get a very different result if you examined each species separately.
Run the same model, just for wildebeest:
mod.ols.wildebeest <- lm(formula = GroupSize ~ DistPred, data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred)), na.action = na.omit)
summary(mod.ols.wildebeest)
##
## Call:
## lm(formula = GroupSize ~ DistPred, data = subset(reduced.data,
## Species == "Wildbst", select = c(GroupSize, DistPred)), na.action = na.omit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.74 -8.26 -4.79 1.10 78.89
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.73 4.17 1.13 0.26
## DistPred 7.11 4.32 1.64 0.11
##
## Residual standard error: 16.2 on 50 degrees of freedom
## Multiple R-squared: 0.0513, Adjusted R-squared: 0.0324
## F-statistic: 2.71 on 1 and 50 DF, p-value: 0.106
A stronger relationship is now apparent. The regression coefficient (slope) is about 10 times larger than before (7.11 vs 0.76; the ‘effect size’ is roughly 10X greater), and it is larger relative to its standard error, so the t-statistic is bigger (1.64) and the P-value smaller (0.106). The result is still not “significant” if one adheres strictly to the convention that P < 0.05 identifies a statistically significant result, and R-squared is still fairly small.
GLM: Simple regression using maximum likelihood to fit a generalized linear model
The Q-Q plot for the OLS regression indicated a problem with the fit of the model to the data. OLS regression assumes that the dependent variable (more precisely, the residuals) is normally distributed, but group size is a classic example of a ‘count variable’: it can only take integer values (you’ll never see 2.71 wildebeest) and group sizes are typically small. This sort of variable usually follows a Poisson distribution rather than a normal distribution.
Generalized Linear Models (GLMs) are more flexible than OLS regression models, because:
a GLM lets you specify one of several distributions for the dependent variable
a GLM lets you specify one of several ‘link’ functions between the dependent variable (y) and the independent variable(s) (x).
For each distribution, there is a default link function that is usually most logical. You do not need to worry about changing the link function for now.
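If you are curious, you can see the default link for any distribution by printing the link element of the corresponding family object:
# each family object stores its default link function
poisson()$link   # "log" - the default for count data
gaussian()$link  # "identity" - the default for normally distributed data
binomial()$link  # "logit" - the default for presence/absence or proportion data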
This flexibility will often improve the fit of a GLM relative to an OLS regression, yielding better inferences. R Exercises 3B and 3C will provide more detail on GLMs, which are fit in R with the glm() function; it works much like lm().
mod.glm.wildebeest <- glm(formula = GroupSize ~ DistPred, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred)), na.action = na.omit)
summary(mod.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred, family = poisson(link = "log"),
## data = subset(reduced.data, Species == "Wildbst", select = c(GroupSize,
## DistPred)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.836 -3.306 -2.637 0.391 15.109
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.7624 0.0905 19.48 < 2e-16 ***
## DistPred 0.6526 0.0801 8.14 3.8e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 862.49 on 50 degrees of freedom
## AIC: 1033
##
## Number of Fisher Scoring iterations: 6
With this more appropriate model, we detect a very strong relationship between wildebeest group sizes and distance to the nearest predator. The slope (regression coefficient = 0.65) is large in comparison to its standard error (0.08), so the z-statistic (8.14) testing whether this slope differs from zero has a P-value close to zero (3.8e-16)… there is almost no chance that we’d see a relationship this strong in our sample unless group sizes were actually related to predator proximity. Notice also that wildebeest actually form SMALLER groups when they are near predators (the slope relating group size to distance from predators is positive).
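Because the Poisson GLM uses a log link, the slope is on the log scale; exponentiating it converts it to a multiplicative effect on group size. A quick back-transformation (using the estimate from the summary above):
# back-transform the slope from the log scale to a multiplicative effect on group size
exp(coef(mod.glm.wildebeest)["DistPred"])
# roughly exp(0.65) = 1.9, i.e. expected group size nearly doubles with each
# additional kilometer of distance from the nearest predator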
Multiple Regression Example
The point just made leads nicely into multiple regression. Simple regression predicts the dependent variable (y) using just one independent variable or predictor (x). The regression coefficient from the model is the slope of the fitted relationship. In multiple regression, we predict y on the basis of more than one x. For each predictor variable, we get a partial regression coefficient, which is the relationship between y and that x-variable after controlling for the effects of the other x variables (predictors).
You can do this using the lm() function to fit an OLS regression, or with glm() to fit a generalized linear model. Given what we learned above, let’s fit a GLM with multiple predictors.
mod.multiple.glm.wildebeest <- glm(formula = GroupSize ~ DistPred + PredSpecies + HabOpen.Close, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
summary(mod.multiple.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred + PredSpecies + HabOpen.Close,
## family = poisson(link = "log"), data = subset(reduced.data,
## Species == "Wildbst", select = c(GroupSize, DistPred,
## PredSpecies, HabOpen.Close)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.60 -3.45 -2.64 1.38 14.51
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.2973 0.1743 7.44 9.7e-14 ***
## DistPred 0.8232 0.0927 8.88 < 2e-16 ***
## PredSpeciesLI 0.3654 0.1102 3.31 0.00092 ***
## HabOpen.CloseO 0.0591 0.1174 0.50 0.61468
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 849.96 on 48 degrees of freedom
## AIC: 1025
##
## Number of Fisher Scoring iterations: 6
Examining the model summary, we still see a strong effect of proximity to predators, and the magnitude of the effect (the value of the regression coefficient or slope, 0.82) has not changed a lot (it was 0.65 in the simple regression). Seeing that the regression coefficient for a given predictor is fairly stable from model to model increases confidence in the result. We also see that when the closest predator is a lion (PredSpecies = LI), herd size is significantly larger than when it is a hyena. Last, we see that herds are a little larger, but not significantly so, in open habitats (HabOpen.Close = O for open, relative to C for closed).
Stepwise multiple regression
The model summary suggests that our multiple regression might be more complex than necessary, because the predictor for habitat type did little to predict wildebeest herd sizes. The drop1() function implements backward elimination: it drops each predictor in turn and uses the change in deviance to calculate a likelihood ratio test (LRT). We’ll go into deviance in more detail in R Exercise 3B, but for now just recall from Chapter 2 of CWP that a low deviance means that the model fits well.
drop1(mod.multiple.glm.wildebeest, test = "LRT")
## Single term deletions
##
## Model:
## GroupSize ~ DistPred + PredSpecies + HabOpen.Close
## Df Deviance AIC LRT Pr(>Chi)
## <none> 850 1025
## DistPred 1 926 1099 75.8 < 2e-16 ***
## PredSpecies 1 861 1034 11.1 0.00088 ***
## HabOpen.Close 1 850 1023 0.3 0.61255
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A LRT compares the deviances of two models. In the drop1() function, we start with the multiple regression saved in mod.multiple.glm.wildebeest, which had a residual deviance of 849.96. This model is shown in the ‘Model’ statement at the top of the output, and its deviance (rounded to 850) is listed in the top row of the table, labeled <none> because no term has been dropped. Each row below gives the deviance and AIC of the model with one predictor removed; the LRT column is the increase in deviance caused by dropping that predictor, and Pr(>Chi) tests whether an increase that large would be expected by chance if the dropped predictor had no real effect.
In this case, LRTs using drop1() suggest we should keep DistPred and PredSpecies and drop habitat from the final model.
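You can carry out the same test explicitly by fitting the model without the habitat term and comparing it to the full model with anova(); a quick sketch (the name mod.no.habitat is just for illustration, and the result should match the HabOpen.Close row of the drop1() table, up to rounding):
# drop the habitat term from the full model and compare the two fits with a likelihood ratio test
mod.no.habitat <- update(mod.multiple.glm.wildebeest, . ~ . - HabOpen.Close)
anova(mod.no.habitat, mod.multiple.glm.wildebeest, test = "LRT")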
mod.multiple.glm.wildebeest <- glm(formula = GroupSize ~ DistPred + PredSpecies, family = poisson(link = "log"), data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
summary(mod.multiple.glm.wildebeest)
##
## Call:
## glm(formula = GroupSize ~ DistPred + PredSpecies, family = poisson(link = "log"),
## data = subset(reduced.data, Species == "Wildbst", select = c(GroupSize,
## DistPred, PredSpecies, HabOpen.Close)), na.action = na.omit)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -5.72 -3.43 -2.63 1.26 14.55
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 1.3395 0.1519 8.82 < 2e-16 ***
## DistPred 0.8229 0.0925 8.90 < 2e-16 ***
## PredSpeciesLI 0.3755 0.1081 3.47 0.00051 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 928.28 on 51 degrees of freedom
## Residual deviance: 850.22 on 49 degrees of freedom
## AIC: 1023
##
## Number of Fisher Scoring iterations: 6
From a hypothesis testing perspective, this model indicates that there is an association between wildebeest herd size and distance to predators, but not in the direction hypothesized at the outset of this exercise, and that the strength of grouping responses differs in reaction to different predators.
To show the results graphically, using ggplot() again:
#build the plot using the ggplot() convention of making the basic plot and then adding layers and modifications with +
p1 = ggplot(data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies)), aes(x= DistPred, y=GroupSize, colour=PredSpecies)) +
stat_smooth(method = 'lm') +
geom_point() +
scale_y_continuous(name="Herd Size") +
scale_x_continuous(name = "Distance to Predator (km)")
#display the plot
p1
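The straight lines above come from stat_smooth(method = 'lm'), i.e. separate ordinary least-squares fits for each predator species. If you want the plotted fits to match the Poisson GLM used above, recent versions of ggplot2 let you pass the error distribution to the smoother through method.args; a sketch (p2 is just an illustrative name):
# same plot, but with Poisson GLM fits (curves on the original scale) instead of straight OLS lines
p2 = ggplot(data = subset(reduced.data, Species == 'Wildbst', select = c(GroupSize, DistPred, PredSpecies)), aes(x= DistPred, y=GroupSize, colour=PredSpecies)) +
stat_smooth(method = 'glm', method.args = list(family = poisson(link = "log"))) +
geom_point() +
scale_y_continuous(name="Herd Size") +
scale_x_continuous(name = "Distance to Predator (km)")
p2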