Ecology and conservation biology are highly quantitative fields of science, and to be a good ecologist you must be skilled at working with data, conducting statistical analyses and presenting the results in an appropriate manner. The good news is that statistical methods and the associated software to do this are well-developed. The bad news is … the statistical methods and associated software are well developed, so the learning curve is steep.
Most ecologists now use the free software package R for data manipulation, statistical analysis, graphics and simulation modelling. R has several advantages:
1. It is free. Until R emerged, it was common to spend thousands of dollars on statistical software.
2. It is platform-independent, meaning that it will run in the same manner on a Mac, a PC running Windows, or a PC running Linux. This makes it simple to share files for analysis with colleagues (and students)… if you email a file to another person, they can run it and replicate EXACTLY what you did.
3. It is not just a statistical/graphics package, it is a programming language, based on the commercial language S, which is where it got its name. Consequently, there is very little that you cannot do with R once you know the language. For example, in this course you’ll use R to manipulate and plot data, conduct statistical tests, implement highly specific models developed for ecologists (such as mark-recapture models) and construct and run fairly complex stochastic simulations of population dynamics.
4. It is code-driven rather than menu-driven. Many of you will be familiar with using menu-clicks to work with data (either to produce graphs or run statistical tests), for example in MS Excel. There is nothing wrong with this, but once a complicated, menu-driven analysis with many steps has been completed, it can be virtually impossible to replicate without detailed notes, even for the person who did it. In R, you write lines of code (a “script”) to open a data file, process it, conduct analysis and make plots. Once the script is saved, it is a permanent record of EXACTLY how the analysis was done. In practice, this is very useful because it allows (or perhaps forces!) you to:
5. Because of points 1-4, there is a lot of pre-existing R code available in two forms:
6. For ecologists in particular, most newly-developed methods are implemented in R and are provided as a package with documentation. In recent years, using these packages has usually been the most efficient and effective way to apply recently-developed ecological methods. This class will require the use of several such packages:
Along with these advantages are two principal disadvantages:
1. R has a steeper learning curve than most menu-driven software, especially at the outset, and particularly if you’ve never written computer code before.
2. As with any computer language, R does EXACTLY what you tell it to do, so minor errors of logic, syntax, spelling and capitalization will keep an otherwise functional script from running. Yuo can raed this but R cant.
All computers in MSU labs have R (and R Studio) installed. Most of you will want to use these on your own computer, and as mentioned above, R will run on a Mac (running OS X) or a PC (running either Windows or Linux). If you haven’t already, you’ll have to download and install R. R is available from CRAN, the Comprehensive R Archive Network. Click the link, then at CRAN select the appropriate version for your computer from the top box, and follow instructions to download and install it.
This step is optional… you can use R without R Studio. R provides a simple text editing window (where you can write a script, though you can also write a script with any other text editor and save it as a text file; Notepad++ is a good one) and a console (where you run all or part of a script), and automatically pops up a plot window when you run a script that makes a graph.
I use R Studio to run R, and for this class I will assume that you’re using R Studio, though it usually won’t matter. Click the link to go to the R Studio site, then click the download link and follow instructions to install it. Once installed, you just start R Studio to run R. I prefer R Studio because it provides 4 windows that help you organize your work, avoid errors, and work efficiently.
Top left window - source code editor, where you write a script and save it. Unlike the graphical user interface (or GUI) in R, this editing window:
uses color coding to distinguish numbers (blue), executable code (black), and comments (green - comments are text that is not executable code, and is identified by putting a # at the start of a line).
has auto-completion of inherently paired items like parentheses, square brackets and quotes, so it is less likely that you will forget to ‘close’ these properly, which is a common error. It also shows you (with gray highlighting) the pairing of these items (incorrect use of parentheses is also a common error in R scripts). Even with auto-completion, you have to be careful about pairing.
has a tab-completion tool. One of the most common errors in an R script is a typo in the name of a variable. R is completely literal, so a typo in the name of a variable or a function means it is simply not recognized. Tab-completion helps to avoid this. For any variable (or other item) that has been stored in memory by prior lines of code, you can begin typing the name and then press the tab key. If the characters you’ve typed identify the variable uniquely, it will be inserted without you having to type it out. If there is more than one item in memory that starts with the letters you’ve typed, a little window will pop up and you can click the one you want for auto-completion. For various reasons, variable names are often long (e.g. “lion.surv.p.dot.phi.time”), so typing them out can be slow and error prone.
F1 context-sensitive help. Put the cursor within any function in the editing window and press F1, and help for that function (including example code) pops up in the bottom right window. THIS IS VERY USEFUL WHEN LEARNING R. I USE IT OFTEN WHEN DEALING WITH A NEW PACKAGE OR FUNCTION.
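The same help that F1 opens can also be reached from code. The lines below are a minimal sketch using only functions that ship with base R; read.csv() is chosen purely as an illustration.

```r
# In an interactive session, either of these opens the same help page as F1:
#   help("read.csv")
#   ?read.csv

# args() prints a function's arguments and their default values
args(read.csv)

# formals() returns the same information as a list,
# so a single default value can be looked up by name
formals(read.csv)$header
```

Looking at the defaults this way is a quick check on what a function will do if you leave an argument out.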
Bottom right window has several tabs, including:
Bottom left window is what would be called the console in basic R. This is where lines of code are actually executed. In R Studio, you can:
position the cursor on a line of code in the editor (top left window) then press CTRL-R to execute just that line. Executing code one line at a time can be very helpful in identifying and solving problems that keep an entire script from running properly (or at all).
highlight a block of code in the editor (top left window) and press CTRL-R to execute the block.
put the cursor anywhere in the editor, press CTRL-A to select the entire script, then press CTRL-R to run it.
As the script runs, you will see each line of code (in blue) echoed on the console as it is executed, output from the script (in black) and error messages (in red). Plots will pop up in the bottom right window, and you can scroll through them once the script is done running (with the right and left arrows).
Top right window has two tabs:
Environment shows the variables and values that have been stored in memory by the script. This is helpful when debugging errors. When you’re running lines from a script, one line often depends on variables created by prior lines, so the original source of an error can be a little tricky to diagnose. A good plan for debugging a script with problems is to start with nothing in memory (use the little broom icon labelled ‘Clear’ in the top right window) and nothing in the console (put the cursor in the console window [bottom left] and press CTRL-L), execute single lines or small blocks of code in order from the start of the script, and check the console and Environment tab to see if you have error messages and if the variables have values that make sense.
History shows the lines of code that have been run. I don’t use this tab much.
The html files that explain R scripts in this class will always have the same formatting for three things.
#This is a block of R code. It has the same color coding that you'd see in R Studio's script editor.
x <- c(0,1,2,3) #this assigns some numeric values to a variable named x.
mean(x) #this calculates the mean of the values in x, and will cause the mean to be displayed in console output.
## [1] 1.5
You have to download and install a package before you can load it so that the functions within the package are available. On your computer, now install a few packages that you’ll be using later for the course.
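One way to handle installation from a script is sketched below. It checks which packages are already installed and installs only the missing ones. The vector name wanted is my own, and only ggplot2 is listed because it is the one package used later in this document; add the other course packages to the vector as needed.

```r
# Packages to check; ggplot2 is used later in this document
wanted <- c("ggplot2")

# installed.packages() returns a matrix with one row per installed package
installed <- rownames(installed.packages())

# Keep only the names that are not yet installed
to.install <- setdiff(wanted, installed)
to.install

# Download and install anything that is missing.
# This needs an internet connection, so it is only attempted in an interactive session.
if (length(to.install) > 0 && interactive()) {
  install.packages(to.install)
}
```

Installation only needs to happen once per computer, whereas library() has to be called in every script that uses the package.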
Load the ggplot2 package, which provides functions for pretty graphics, and then use the qplot function within that package to make a graph of two variables named length and height.
#load the ggplot2 package
library(ggplot2)
#create some variables and assign values
length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')
#make a graph with the qplot function within the ggplot2 package
qplot(length,height, colour = species)
Often in R people use shortcuts that make writing the script faster, when there are equivalent methods that make the logic more clear. For example, the code below produces an identical graph. By default [as in the example above] the first variable name put into the qplot function is the abscissa (x) and the second variable name specified is the ordinate (y), but you can also choose to spell this out [as in the example below]. Remember that you can copy and paste code into the script editor and run it from there with CTRL-R, and that you can put the cursor on a function in the script editor and press F1 to get context-specific help that explains the arguments (the terms within the function’s parentheses) of the function. This is a good way to learn the conventions (like the first variable being assumed to be x, in the case of qplot) and the default values that will be used if you don’t specify otherwise.
qplot(x=length,y=height,colour = species)
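To my knowledge, recent releases of ggplot2 (3.4.0 onward) mark qplot() as deprecated in favor of the more explicit ggplot() function. The sketch below builds the same scatterplot that way. The dataframe name pets is hypothetical, and the code assumes ggplot2 is installed (it does nothing if it is not).

```r
# Same values as in the qplot() example, stored together in one dataframe
pets <- data.frame(
  length  = c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
  height  = c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7),
  species = c('cat','cat','cat','cat','dog','cat','dog','cat',
              'dog','dog','dog','cat','dog','dog','dog')
)

# Only build the plot if ggplot2 is available on this computer
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  # aes() maps variables to x, y and colour; geom_point() draws the points
  p <- ggplot(pets, aes(x = length, y = height, colour = species)) +
    geom_point()
}
```

Typing p on its own line (or print(p)) then draws the graph in the plot window.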
There is often more than one way to accomplish the same thing in R. Instead of loading the ggplot2 package and using its qplot() function, I could have just used the plot() function that is already provided with base R.
plot(length,height)
R is sensitive to capitalization, so the package name MuMIn is not the same thing as mumin or MUMIN. This is also true for variable names: var1 and VAR1 are two different variables as far as R is concerned.
var1 <- 2
VAR1 <- 2000
var1
## [1] 2
VAR1
## [1] 2000
In the above code block, the <- assigns a value to a variable, and then typing the name of the variable causes its value to be displayed on the console. Alternatively, you can use = instead of <-.
var2 = 50
var2
## [1] 50
I’d recommend using <- to assign values, and thinking of it as assign and not equals. This will prevent any possible confusion with the R code for “is equal to”, which is “==”. This will be important later.
R ignores spaces between the items within a line, so it is not a problem to have extra spaces between items. You cannot have spaces within an item, however. For example, a space between the < and the - will not be recognized as an assignment statement.
var2 <- 3
var2
## [1] 3
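It is worth seeing what actually happens when a space slips into the assignment arrow, because R does not stop with an error. It reads the < as “less than” and the - as a minus sign. A small sketch (var3 and result are names I made up for the illustration):

```r
var3 <- 10

# With a space inside the arrow, this is a comparison: is var3 less than -3?
# The answer is stored in result; var3 itself is not changed.
result <- var3 < - 3

result   # FALSE
var3     # still 10
```

Because no error message appears, this kind of typo is easy to miss; checking the value of the variable afterwards is the quickest way to catch it.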
R also allows you to type a single line of code on multiple lines without creating a problem. If you run the code below (cut and paste it into the editor, then use CTRL-A & CTRL-R), you’ll see that both assignment statements work fine. If you look at the console after it assigns the values to vector2, you’ll notice that the R console displays a > at the beginning of each new line, but displays a + instead of > when it is continuing a line of code rather than starting a new one. When you are debugging code that won’t run, it’s useful to check whether the console is displaying a plus sign, which indicates that it encountered a problem within the last line of code before the error and couldn’t finish executing that command.
vector1 <- c(1,2,3,4,5)
vector2 <-
c(6,7,8,9,10)
vector1
## [1] 1 2 3 4 5
vector2
## [1] 6 7 8 9 10
R cannot deal with spaces within a variable name: it will treat the two parts as separate entities. A common convention is to use periods (dots) as a spacer within a variable name.
var.named.joe <- "Hi I'm Joe"
var.named.joe
## [1] "Hi I'm Joe"
The above command successfully assigns the character string “Hi I’m Joe” to the variable var.named.joe.
#var named joe <- "oops"
would give an error message because of the spaces within the variable name. (Here, I have this line ‘commented out’ so that this script will run without errors.) Note that spaces within the character string stored in the variable var.named.joe are treated like any other character, unlike the spaces within the variable name itself. We’ll discuss the differences between text strings and numerical values below.
Variable names:
* cannot have spaces within them.
* cannot start with a number.
* cannot contain a $ (because $ is used to separate the name of a data frame and a variable within that data frame… more on this next session).
* cannot contain any symbols used for mathematical operations in R.
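Base R has a function, make.names(), that converts arbitrary text into a valid variable name. Comparing a proposed name with its converted version is one simple way to check the rules above. This is only a sketch; the example names are invented.

```r
# Four proposed names, each breaking one of the rules above
proposed <- c("var named joe", "2fast", "a$b", "a+b")

# make.names() replaces illegal characters with dots
# and puts an X in front of a name that starts with a number
fixed <- make.names(proposed)
fixed   # "var.named.joe" "X2fast" "a.b" "a.b"

# A name is valid if make.names() leaves it unchanged
proposed == fixed                           # all FALSE
"var.named.joe" == make.names("var.named.joe")   # TRUE
```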
Putting a # at the start of a line turns it into a comment. Anything following the # will be displayed but will not be executed as code. This provides a way to annotate code or to disable a line of code without deleting it, which can be useful when debugging.
# this is a comment, explaining that the next line is a functional assignment statement that uses the function
# seq() to assign a sequence of values from 0 to 1 by units 0.1 to a variable named vector1.
vector1 <- seq(0,1,0.1)
#this is a comment noting that the next line of code is commented out and therefore doesn't run
#vector2 <- seq(0,1,0.1)
vector1
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
vector2
## [1] 6 7 8 9 10
You may have noticed that this code block re-used the variable names vector1 and vector2. vector1 used to hold the values 1,2,3,4,5, but now it holds the values 0, 0.1, 0.2 … 0.9, 1. vector2 still holds its original values (6,7,8,9,10) because the new assignment is commented out. When writing longer scripts, it’s important to remember that R stores only the last assignment of a variable. The Environment tab in the top right window is useful when you have any confusion about what values are currently stored in a variable.
In general, avoid reusing variable names within a script unless doing so for some intentional reason.
Also, do not have multiple scripts open at once unless it is for a good reason; if they have variable names in common it can create unanticipated problems. If you have two or more scripts open (perhaps to copy an example code block), recall that you can use the clear button (broom icon) in the Environment tab of the top right window to clear all memory out and work with a clean slate.
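The point that R keeps only the last assignment can be checked from the console as well as from the Environment tab. A brief sketch using the same vector1 values as above:

```r
# First assignment: five values
vector1 <- c(1,2,3,4,5)
length(vector1)   # 5

# Second assignment to the same name replaces the first; nothing is kept from it
vector1 <- seq(0,1,0.1)
length(vector1)   # 11

# str() gives a compact description of what a variable currently holds
str(vector1)
```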
Quotes are used to specify text, or character strings as text variables are called in R. It does not matter if they are single or double quotes.
# this stores the value 12 in a numeric variable
var.num <- 12
summary(var.num)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12 12 12 12 12 12
# this stores the character string "12", essentially as a word rather than a number. You can't do math on it.
var.char = "12"
summary(var.char)
## Length Class Mode
## 1 character character
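To make the difference concrete, the sketch below tries arithmetic on both variables. The try() wrapper is only there so that the error does not stop the rest of a script, and as.numeric() is the base R function for converting a character string that looks like a number into an actual number.

```r
var.num <- 12
var.char <- "12"

# Math works on the numeric variable
var.num + 1    # 13

# Math on the character string is an error; try() catches it so the script continues
res <- try(var.char + 1, silent = TRUE)
inherits(res, "try-error")    # TRUE

# Converting the string to a number makes math possible
as.numeric(var.char) + 1    # 13

# Single and double quotes produce exactly the same character string
identical('12', "12")    # TRUE
```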
Once you have R and R Studio installed, have installed some packages and know how to load them, the next step is entering data. You’ve already seen some simple examples of using assignment statements to store some values in named variables, but most often you’ll be reading a data file in from some other source. Most ecologists use a spreadsheet program (like MS Excel) and/or a database (like MS Access) for data files, though some people just use R with ASCII text files. For this course I’ll assume that you use Excel or something similar.
Getting data from a spreadsheet into R is fairly simple.
We’ll import a dataset on herd sizes and behavior of African ungulates in relation to their distance from predators.
#First, remove all of the items that might be stored in memory. This is a good first line for all scripts. It is also an example of nesting two functions. The ls() function lists the memory objects stored in the working environment. The rm() function removes a set of listed items. By nesting the two functions and specifying that all objects in memory should be listed and thus removed, this is an efficient way to start a script with a clean slate.
rm(list=ls(all=TRUE))
# Specify the location of a data file that has already been saved as comma-delimited text, and read the data file kenyaherdsize2.txt into an R dataframe stored as kenyaherdsize2.
# I've used the read.csv() function, which is specifically for comma-delimited files.
# The argument header = TRUE specifies that the first row of the file contains the names of the variables. header = FALSE assigns automatic names to each variable (V1, V2, etc.).
# There is a more general read.table() function that would also work...there is more than one way to do most things in R.
kenyaherdsize2 <- read.csv("C:/Users/screel/Desktop/KENYA/kenya_behavior_paper/Kenya_herd_size/kenyaherdsize2.txt", header=TRUE)
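The file path above only exists on my computer, so here is a self-contained sketch that you can run anywhere to see what the header argument does. It writes a tiny, made-up comma-delimited file to a temporary location and reads it back both ways. The file contents and the names csv.path, with.header and no.header are invented for the illustration.

```r
# Write a small comma-delimited text file to a temporary location
csv.path <- tempfile(fileext = ".txt")
writeLines(c("Species,GroupSize", "Giraffe,5", "Zebra,10"), csv.path)

# header = TRUE: the first row supplies the variable names
with.header <- read.csv(csv.path, header = TRUE)
names(with.header)   # "Species" "GroupSize"
nrow(with.header)    # 2

# header = FALSE: automatic names are assigned, and the first row is treated as data
no.header <- read.csv(csv.path, header = FALSE)
names(no.header)     # "V1" "V2"
nrow(no.header)      # 3
```

Getting header wrong is a common reason for a numeric column to be read in as text, because the column name ends up mixed in with the values.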
Note that the value TRUE in the above code is not a text string, it is a logical value. Logical variables are different from text, and can only take the values TRUE or FALSE. These are not equivalent to the character strings “TRUE” and “FALSE” (the quotes make these into character strings; recall that single or double quotes are both OK).
log.var = TRUE
text.var = "TRUE"
summary(log.var)
## Mode TRUE NA's
## logical 1 0
summary(text.var)
## Length Class Mode
## 1 character character
Logical variables are often used in the arguments for functions and are useful in controlling whether or not a chunk of code executes. Later, you’ll see examples of control structures such as for() loops and if() or which() statements. In the read.csv() function, the TRUE or FALSE value of the argument header determines whether or not some hidden code is run to create variable names in the new dataframe from the first row of the text file being read.
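As a small preview of how a logical variable can switch code on or off, consider the sketch below. The names run.check and message.out are hypothetical, chosen only for this illustration.

```r
log.var <- TRUE
text.var <- "TRUE"

# is.logical() tests whether a variable holds a logical value
is.logical(log.var)    # TRUE
is.logical(text.var)   # FALSE

# A logical variable used as a switch: only one branch of the if() runs
run.check <- FALSE
message.out <- if (run.check) "check was run" else "check was skipped"
message.out   # "check was skipped"
```

Changing run.check to TRUE and rerunning the last two lines runs the other branch, without editing any other code.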
Now use the attach() function to attach the dataframe so that its name does not need to be typed out whenever using the variables within it. If you do not attach the dataframe, you identify a variable using the syntax
dataframe$variable
For example, kenyaherdsize2$Species identifies the variable Species within the dataframe kenyaherdsize2. After using attach(kenyaherdsize2), you can just type the variable name, Species. Note that Species has the first letter capitalized and would not be recognized if a lower-case s was used.
#use the head() function to look at the first 6 values of a variable
head(kenyaherdsize2$Species)
## [1] Giraffe Giraffe Giraffe Giraffe Giraffe Giraffe
## Levels: Giraffe Grant Impala Wildbst Zebra
#attach the dataframe and then use head() again, this time not having to specify the dataframe that holds the variable -- compare with the line just above
attach(kenyaherdsize2)
head(Species)
## [1] Giraffe Giraffe Giraffe Giraffe Giraffe Giraffe
## Levels: Giraffe Grant Impala Wildbst Zebra
The two methods yield the same result, but the second is generally easier. If complications arise because you are working with two (or more) dataframes and they both contain a variable with the same name, it is best to not use the shortcut. You can use the function detach() to detach any dataframe you’ve attached, and then rely on the full dataframe$variable method.
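A third option, with(), sits between the two: it lets you use bare variable names for a single expression without attaching anything. The sketch below compares all three approaches on a tiny made-up dataframe (herds is a hypothetical name, not the Kenya data).

```r
# A small invented dataframe
herds <- data.frame(Species = c("Giraffe", "Zebra", "Zebra"),
                    GroupSize = c(5, 10, 9))

# 1. Full dataframe$variable syntax
mean.full <- mean(herds$GroupSize)

# 2. with(): bare variable names, but only inside this one expression
mean.with <- with(herds, mean(GroupSize))

# 3. attach(), use the bare name, then detach() when finished
attach(herds)
mean.attached <- mean(GroupSize)
detach(herds)

# All three give the same answer
c(mean.full, mean.with, mean.attached)   # 8 8 8
```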
Do a little more inspection of the imported dataframe.
#examine the first 6 rows and last 6 rows of the imported file and look at a summary of each variable, to check if everything seems correct
head(kenyaherdsize2)
## ObsID Species Pred.LT.400m DistPred PredSpecies Adult.Prop.Vig
## 1 03-Apr-10-6 Giraffe Absent 0.883 LI 0.00
## 2 03-Apr-10-7 Giraffe Absent 0.883 LI 0.00
## 3 05-Jul-10-0 Giraffe Present 0.050 LI 0.70
## 4 05-Jul-10-6 Giraffe Absent 2.660 LI 0.00
## 5 06-May-10-1 Giraffe Absent 1.300 LI 0.00
## 6 07-Jun-10-1 Giraffe Absent 3.300 LI 0.03
## Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot Adult.Prop.Feed
## 1 0 0.14 0.14 0.14
## 2 0 0.00 0.00 0.25
## 3 0 0.25 0.95 0.00
## 4 0 0.00 0.00 0.57
## 5 0 0.00 0.00 0.09
## 6 0 0.02 0.05 0.48
## ObsDate ObsStart UTM.E UTM.N Temp.C. Clouds Wind.mph. Habitat
## 1 4/3/2010 6:40 PM NA NA 29.6 NA NA OB/OW
## 2 4/3/2010 6:50 PM NA NA 29.6 NA NA OB/OW
## 3 7/5/2010 8:01 AM 177809 9790483 19.1 0.9 0.5 CB/CW
## 4 7/5/2010 9:57 AM 178081 9792998 27.0 0.0 2.4 CW
## 5 5/6/2010 8:02 AM 177170 9790670 27.1 0.0 NA IW
## 6 6/7/2010 8:09 AM NA NA 29.1 0.9 0.0 OB
## HabOpen.Close BushWoodGrass GrassHt.m. GrassColor DistWood.m. Lion1.km.
## 1 O B NA 0.88
## 2 O B NA 0.88
## 3 C B NA 31-100 0.05
## 4 C W 0.0 31-100 2.66
## 5 C W 0.0 31-100 1.30
## 6 O B 0.5 31-100 3.30
## Lion2.km. Hyena1.km. Hyena2.km. FollowID Kill HerdType C SA A SAM SAF
## 1 NA NA NA Len-020 0 Mixed 0 0 5 0 0
## 2 NA NA NA Len-020 0 Mixed 0 0 2 0 0
## 3 NA NA NA Ren-032 0 Single 0 1 7 0 0
## 4 NA NA NA Len-045 0 Single 0 0 2 0 0
## 5 NA NA NA Ren-016 0 Single 0 0 6 0 0
## 6 3.13 NA NA Ren-027 0 Mixed 0 0 10 0 0
## AM AF GroupSize Prop.A Prop.AM Prop.AF
## 1 0 0 5 1.000 0 0
## 2 0 0 2 1.000 0 0
## 3 0 0 8 0.875 0 0
## 4 0 0 2 1.000 0 0
## 5 0 0 6 1.000 0 0
## 6 0 0 10 1.000 0 0
tail(kenyaherdsize2)
## ObsID Species Pred.LT.400m DistPred PredSpecies Adult.Prop.Vig
## 489 40545-1B Grant 3010 HY NA
## 490 40545-2 Grant 3010 HY NA
## 491 40545-3 Grant 3010 HY NA
## 492 40555-1 Zebra 1140 LI NA
## 493 40559-1 Grant 2000 HY NA
## 494 40559-1A Grant 2000 HY NA
## Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot Adult.Prop.Feed
## 489 NA NA NA NA
## 490 NA NA NA NA
## 491 NA NA NA NA
## 492 NA NA NA NA
## 493 NA NA NA NA
## 494 NA NA NA NA
## ObsDate ObsStart UTM.E UTM.N Temp.C. Clouds Wind.mph. Habitat
## 489 40545 0.367951389 183160 9784087 30.0 0.3 4.0 OG
## 490 40545 0.348333333 183160 9784087 30.0 0.3 4.0 OG/sc W
## 491 40545 0.359143519 183160 9784087 30.0 0.3 4.0 OG/sc B
## 492 40555 0.344652778 177042 9785595 29.2 0.1 3.1 OW
## 493 40559 0.365289352 181098 9788293 27.9 0.0 2.6 OW
## 494 40559 0.373854167 181098 9788293 27.9 0.0 2.6 OW
## HabOpen.Close BushWoodGrass GrassHt.m. GrassColor DistWood.m.
## 489 O G 5 grass brown >300
## 490 O G 5 grass brown >300
## 491 O G 5 grass brown >300
## 492 O W 30 grass brown >300
## 493 O W 0 no grass
## 494 O W 0 no grass
## Lion1.km. Lion2.km. Hyena1.km. Hyena2.km. FollowID Kill HerdType C SA
## 489 NA NA 3010 NA 0 Single NA NA
## 490 NA NA 3010 NA 0 Single NA NA
## 491 NA NA 3010 NA 0 Single NA NA
## 492 1140 NA NA NA 0 Single NA NA
## 493 NA NA 2000 NA 0 Single NA NA
## 494 NA NA 2000 NA 0 Single NA NA
## A SAM SAF AM AF GroupSize Prop.A Prop.AM Prop.AF
## 489 NA NA NA NA NA 19 NA NA NA
## 490 NA NA NA NA NA 6 NA NA NA
## 491 NA NA NA NA NA 12 NA NA NA
## 492 NA NA NA NA NA 10 NA NA NA
## 493 NA NA NA NA NA 20 NA NA NA
## 494 NA NA NA NA NA 5 NA NA NA
summary(kenyaherdsize2)
## ObsID Species Pred.LT.400m DistPred PredSpecies
## 10-Jun-10-3: 4 Giraffe: 49 :132 Min. : 0 HY:128
## 10-Jun-10-5: 4 Grant :152 Absent :282 1st Qu.: 1 LI:366
## 12-Apr-10-4: 4 Impala : 55 Present: 80 Median : 2
## 12-Apr-10-5: 4 Wildbst: 87 Mean : 448
## 28-May-10-4: 4 Zebra :151 3rd Qu.: 49
## 28-May-10-7: 4 Max. :3890
## (Other) :470
## Adult.Prop.Vig Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot
## Min. :0.00 Min. :0.0 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.0 1st Qu.:0.00 1st Qu.:0.00
## Median :0.00 Median :0.0 Median :0.00 Median :0.00
## Mean :0.07 Mean :0.0 Mean :0.02 Mean :0.10
## 3rd Qu.:0.09 3rd Qu.:0.0 3rd Qu.:0.00 3rd Qu.:0.11
## Max. :1.00 Max. :0.5 Max. :1.00 Max. :1.00
## NA's :146 NA's :146 NA's :146 NA's :146
## Adult.Prop.Feed ObsDate ObsStart UTM.E
## Min. :0.00 4/12/2010: 16 : 14 Min. :171123
## 1st Qu.:0.02 6/10/2010: 16 6:56 AM: 10 1st Qu.:176118
## Median :0.28 7/28/2009: 16 6:40 PM: 9 Median :177109
## Mean :0.36 4/3/2010 : 14 6:07 PM: 7 Mean :177509
## 3rd Qu.:0.61 40477 : 13 7:56 AM: 7 3rd Qu.:179041
## Max. :1.00 5/28/2010: 13 6:06 PM: 6 Max. :183970
## NA's :138 (Other) :406 (Other):441 NA's :39
## UTM.N Temp.C. Clouds Wind.mph.
## Min. :9775132 Min. :19.1 Min. :0.00 Min. : 0.00
## 1st Qu.:9785680 1st Qu.:26.0 1st Qu.:0.15 1st Qu.: 0.60
## Median :9787424 Median :28.5 Median :0.40 Median : 1.20
## Mean :9787328 Mean :28.7 Mean :0.46 Mean : 1.85
## 3rd Qu.:9789825 3rd Qu.:31.5 3rd Qu.:0.80 3rd Qu.: 2.50
## Max. :9801904 Max. :38.5 Max. :1.00 Max. :14.50
## NA's :39 NA's :33 NA's :132 NA's :96
## Habitat HabOpen.Close BushWoodGrass GrassHt.m.
## OW :158 C:109 B:143 Min. : 0.0
## OG : 74 O:385 G:144 1st Qu.: 0.1
## OB : 49 W:207 Median : 0.5
## OG/sc B: 39 Mean :11.4
## OB/OW : 34 3rd Qu.:30.0
## IW : 33 Max. :60.0
## (Other):107 NA's :203
## GrassColor DistWood.m. Lion1.km. Lion2.km.
## :323 :124 Min. : 0 Min. :0.2
## grass brown: 93 >300 : 73 1st Qu.: 1 1st Qu.:1.2
## grass mixed: 25 1-30 : 52 Median : 2 Median :2.8
## green : 25 101-300:155 Mean : 463 Mean :2.5
## brown : 11 31-100 : 90 3rd Qu.: 594 3rd Qu.:3.3
## grass green: 11 Max. :3730 Max. :4.3
## (Other) : 6 NA's :128 NA's :473
## Hyena1.km. Hyena2.km. FollowID Kill HerdType
## Min. : 0 Mode:logical :141 Min. :0.000 Mixed :260
## 1st Qu.: 1 NA's:494 Len-015: 16 1st Qu.:0.000 Single:234
## Median : 2 Ren-028: 16 Median :0.000
## Mean : 329 Piz-012: 15 Mean :0.115
## 3rd Qu.: 4 Len-020: 14 3rd Qu.:0.000
## Max. :3890 Mku-007: 13 Max. :1.000
## NA's :337 (Other):279
## C SA A SAM
## Min. :0.00 Min. : 0.00 Min. : 0.00 Min. : 0.00
## 1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.00 1st Qu.: 0.00
## Median :0.00 Median : 0.00 Median : 3.00 Median : 0.00
## Mean :0.18 Mean : 0.19 Mean : 7.38 Mean : 0.39
## 3rd Qu.:0.00 3rd Qu.: 0.00 3rd Qu.: 10.00 3rd Qu.: 0.00
## Max. :6.00 Max. :13.00 Max. :104.00 Max. :20.00
## NA's :132 NA's :132 NA's :132 NA's :132
## SAF AM AF GroupSize
## Min. :0.00 Min. : 0.00 Min. : 0.0 Min. : 1.0
## 1st Qu.:0.00 1st Qu.: 0.00 1st Qu.: 0.0 1st Qu.: 2.0
## Median :0.00 Median : 0.00 Median : 0.0 Median : 6.0
## Mean :0.02 Mean : 1.18 Mean : 1.4 Mean : 10.2
## 3rd Qu.:0.00 3rd Qu.: 1.00 3rd Qu.: 1.0 3rd Qu.: 12.0
## Max. :4.00 Max. :21.00 Max. :24.0 Max. :118.0
## NA's :132 NA's :132 NA's :132
## Prop.A Prop.AM Prop.AF
## Min. :0.00 Min. :0.00 Min. :0.00
## 1st Qu.:0.00 1st Qu.:0.00 1st Qu.:0.00
## Median :0.83 Median :0.00 Median :0.00
## Mean :0.55 Mean :0.23 Mean :0.17
## 3rd Qu.:1.00 3rd Qu.:0.27 3rd Qu.:0.15
## Max. :1.00 Max. :1.00 Max. :1.00
## NA's :132 NA's :132 NA's :132
Or better, just open the entire dataframe in R Studio. You can do this by just clicking on the dataframe in the Environment tab in the top right window, or with the View() command, shown below. Either way, it will open a tab with the dataframe in the top left window, so you can easily jump between looking at the data and the script editor. You will not see the dataframe in this html page, but if you run the script in R Studio, you will.
View(kenyaherdsize2)
The data seem generally correct … Make a plot as a further check that everything seems more-or-less as expected.
library(ggplot2)
qplot(y = Adult.Prop.VigTot, x = Adult.Prop.Feed, xlab = "Proportion Foraging", ylab = "Proportion Vigilant", geom= c('jitter','smooth'))
## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.
## Warning: Removed 146 rows containing missing values (stat_smooth).
## Warning: Removed 146 rows containing missing values (geom_point).
Looks reasonable. I’d emphasize that you have to be vigilant about inspecting data sets thoroughly to make sure that little errors or omissions have not accidentally crept in. For example, if you examine this dataframe carefully (or if you’ve been looking at the warning messages in the R output) you’ll notice that some variables have missing values (which can easily be derived from other variables in the file, but one has to be aware of the problem to fix it).
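One quick way to find missing values is to count the NAs in each variable. The sketch below does this for a tiny invented dataframe so that it runs on its own; with the real data you would put kenyaherdsize2 in place of herds.

```r
# A small invented dataframe with some missing values (NA)
herds <- data.frame(GroupSize = c(5, 3, NA, 10),
                    Prop.A = c(1, NA, NA, 0.875))

# is.na() marks each missing value as TRUE; colSums() counts them by variable
na.counts <- colSums(is.na(herds))
na.counts   # GroupSize 1, Prop.A 2

# Many functions return NA if any value is missing...
mean(herds$GroupSize)                 # NA

# ...unless you tell them to remove missing values first
mean(herds$GroupSize, na.rm = TRUE)   # 6
```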
Next session - working with data in R: selecting subsets of data, creating new variables and ‘binding’ them to an existing dataframe, plotting, and some simple statistical tests, using data on human population growth.