R and RStudio - Getting Started

Ecology and conservation biology are highly quantitative fields of science, and to be a good ecologist you must be skilled at working with data, conducting statistical analyses and presenting the results in an appropriate manner. The good news is that statistical methods and the associated software to do this are well-developed. The bad news is … the statistical methods and associated software are well developed, so the learning curve is steep.

Most ecologists now use the free software package R for data manipulation, statistical analysis, graphics and simulation modelling. R has several advantages:

1. It is free. Until R emerged, it was common to spend thousands of dollars on statistical software.

2. It is platform-independent, meaning that it will run in the same manner on a Mac, a PC running Windows, or a PC running Linux. This makes it simple to share files for analysis with colleagues (and students)… if you email a file to another person, they can run it and replicate EXACTLY what you did.

3. It is not just a statistical/graphics package, it is a programming language, based on the commercial language S, which is partly where it got its name. Consequently, there is very little that you cannot do with R once you know the language. For example, in this course you’ll use R to manipulate and plot data, conduct statistical tests, implement highly specific models developed for ecologists (such as mark-recapture models) and construct and run fairly complex stochastic simulations of population dynamics.

4. It is code-driven rather than menu-driven. Many of you will be familiar with using menu-clicks to work with data (either to produce graphs or run statistical tests), for example in MS Excel. There is nothing wrong with this, but once a complicated, menu-driven analysis with many steps has been completed, it can be virtually impossible to replicate without detailed notes, even for the person who did it. In R, you write lines of code (a “script”) to open a data file, process it, conduct analysis and make plots. Once the script is saved, it is a permanent record of EXACTLY how the analysis was done. In practice, this is very useful because it allows (or perhaps forces!) you to:

write methods up for publication with no ambiguity
modify analyses
recognize and consider the assumptions of the analysis you’ve conducted, and better understand the results
replicate analyses with other data sets

5. Because of points 1-4, there is a lot of pre-existing R code available in two forms:

Packages are self-contained extensions of R that you install once and then load when you want to make use of the functions that they provide. (As an example, I used the R package markdown to create the formatted html files with R examples for this class, including the one you’re reading now.)
Example Code can often be found – more on this below. Starting from a working example and modifying the code is often the easiest way to solve a problem that is new to you, but not to others. Once you have a set of R scripts of your own, you’ll often cut and paste code then make a few modifications, rather than starting from scratch.

6. For ecologists in particular, most newly developed methods are implemented in R and provided as a package with documentation. Using these packages is usually the most efficient and effective way to apply recently developed ecological methods. This class will require the use of several such packages:

unmarked to estimate population density with distance sampling models that account for factors that affect the density of a species and the probability of detecting it.
RMark to estimate survival rates and the factors that affect them, using capture-mark-recapture methods. Using RMark is a very time-efficient way to implement models that are otherwise most readily available in the FORTRAN program MARK, which can be quite difficult to master.
popbio to derive information on population growth and the factors that affect it, and to build simulation models that estimate extinction risk (population viability analysis)
more general tools such as the lme4 package (or the glm() function that comes with base R) that can accomplish complicated but appropriate statistical analyses such as generalized linear mixed-effects models. This is not a STAT class but it’s impossible to conduct research in ecology without getting into statistics.

Along with these advantages are two principal disadvantages:

1. R has a steeper learning curve than most menu-driven software, especially at the outset, and particularly if you’ve never written computer code before.

2. As with any computer language, R does EXACTLY what you tell it to do, so minor errors of logic, syntax, spelling and capitalization will keep an otherwise functional script from running. Yuo can raed this but R cant.

Download and Install R

All computers in MSU labs have R (and R Studio) installed. Most of you will want to use these on your own computer, and as mentioned above, R will run on a Mac (running OS X) or a PC (running either Windows or Linux). If you haven’t already, you’ll have to download and install R. R is available from CRAN, the Comprehensive R Archive Network. Click the link, then at CRAN select the appropriate version for your computer from the top box, and follow instructions to download and install it.

Download and Install R Studio

This step is optional… you can use R without R Studio. R provides a simple text editing window (where you can write a script, though you can also write a script with any other text editor and save it as a text file; Notepad++ is a good one) and a console (where you run all or part of a script), and automatically pops up a plot window when you run a script that makes a graph.

I use R Studio to run R, and for this class I will assume that you’re using R Studio, though it usually won’t matter. Click the link to go to the R Studio site, then click the download link and follow instructions to install it. Once installed, you just start R Studio to run R. I prefer R Studio because it provides 4 windows that help you organize your work, avoid errors, and work efficiently.

Top left window - source code editor, where you write a script and save it. Unlike the graphical user interface (or GUI) in R, this editing window:

uses color coding to distinguish numbers (blue), executable code (black), and comments (green - comments are text that is not executable code, and is identified by putting a # at the start of a line).
has auto-completion of inherently paired items like parentheses, square brackets and quotes, so it is less likely that you will forget to ‘close’ these properly, which is a common error. It also shows you (with gray highlighting) the pairing of these items (incorrect use of parentheses is also a common error in R scripts). Even with auto-completion, you have to be careful about pairing.
has a tab-completion tool. One of the most common errors in an R script is a typo in the name of a variable. R is completely literal, so a typo in the name of a variable or a function means it is simply not recognized. Tab-completion helps to avoid this. For any variable (or other item) that has been stored in memory by prior lines of code, you can begin typing the name and then press the tab key. If the characters you’ve typed identify the variable uniquely, it will be inserted without you having to type it out. If there is more than one item in memory that starts with the letters you’ve typed, a little window will pop up and you can click the one you want for auto-completion. For various reasons, variable names are often long (e.g. “lion.surv.p.dot.phi.time”), so typing them out can be slow and error prone.
F1 context-sensitive help. Put the cursor within any function in the editing window and press F1, and help for that function (including example code) pops up in the bottom right window. THIS IS VERY USEFUL WHEN LEARNING R. I USE IT OFTEN WHEN DEALING WITH A NEW PACKAGE OR FUNCTION.

Bottom right window has several tabs, including:

As mentioned, there is a Help window. Use the F1 key in the top left editing window to get help on a specific function, or type the name in the search bar of the help window. This is very valuable even for functions you understand well, to refresh your memory of the syntax required, to learn the arguments that you can specify within a function, and to learn the default values for arguments.
Plots stores and displays graphics as they are made. Scroll through with R and L arrows, delete them individually or all at once.
A Packages window that is helpful to see what packages you’ve downloaded and installed, and to load them, though you should generally do this using the library() function within your script.
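
The information on the Help tab can also be requested from code. The lines below are a minimal sketch using only base R functions; read.csv() is chosen here simply as a convenient function to inspect, and the name header.default is introduced for illustration.

```r
# Typing ?read.csv or help("read.csv") in the console opens the same page as F1.

# args() prints a function's arguments together with their default values
args(read.csv)

# formals() returns the same information as a list, so one default can be pulled out
header.default <- formals(read.csv)$header
header.default
```

Looking at the defaults this way is a quick check when you are not sure what a function will do if you leave an argument out.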

Bottom left window is what would be called the console in basic R. This is where lines of code are actually executed. In R Studio:

position the cursor on a line of code in the editor (top left window) then press CTRL-R to execute just that line. Executing code one line at a time can be very helpful in identifying and solving problems that keep an entire script from running properly (or at all).
highlight a block of code in the editor (top left window) and press CTRL-R to execute the block.
put the cursor anywhere in the editor, press CTRL-A to select the entire script, then press CTRL-R to run it.

In current versions of R Studio the documented shortcut for running a line or selection is CTRL-ENTER; use that if CTRL-R does nothing on your installation.

As the script runs, you will see each line of code (in blue) echoed on the console as it is executed, output from the script (in black) and error messages (in red). Plots will pop up in the bottom right window, and you can scroll through them once the script is done running (with the R and L arrows).

Top right window has two tabs:

Environment shows the variables and values that have been stored in memory by the script. This is helpful when debugging errors. When you’re running lines from a script, it is often true that one line depends on variables created by prior lines, so the original source of errors can be a little tricky to diagnose. A good plan for debugging a script with problems is to start with nothing in memory (use the little broom icon labelled ‘Clear’ in the top right window) and nothing in the console (put the cursor in the console window [bottom left] and press CTRL-L), then execute single lines or small blocks of code in order from the start of the script, checking the console and Environment tab after each step to see if you have error messages and if the variables have values that make sense.
History shows the lines of code that have been run. I don’t use this tab much.

Sources of Help for R

The Cookbook for R website has a nice index of example scripts, with explanations, for all basic operations like importing data, manipulating it, making graphs, etc. This is a very useful resource for new users.
The Quick-R website is similar to the cookbook. I prefer the cookbook site but they are both good.
The R homepage has links to manuals, reference cards, webpages and user-groups. Authoritative, but not as user-friendly as the first two.
You should download a free pdf copy of the complete book “A Beginner’s Guide to R” using MSU’s SpringerLink connection, as long as you are on an MSU-domain computer. (click the ‘download book’ link to get the entire thing as a pdf and save it on a jumpstick). I’ll periodically assign reading from this. The Introduction (pages 1-27) covers the material discussed here.
For example code you can very often just use google, searching for something like “r change axis range”. Just include R and the thing you want to do and you’ll often find a good example that solves your problem, with code provided. Go ahead and google the example above and follow a few of the links. As you become more proficient, you’ll rely more heavily on this.
More specifically, you can go to stackoverflow.com and search the site with R included in your search term. Any number of serious computer programmers post on that forum and it can be very useful. Don’t use the ‘ask question’ button unless you’ve already searched for an existing answer. Virtually EVERYTHING has already been asked and answered more than once.
Specifically for statistical analysis in R, the UCLA statistical consulting department has truly excellent explanations with great example code for many common analyses.

First Steps

The html files that explain R scripts in this class will always have the same formatting for three things.

Explanatory text in these files is not boxed.
R code appears in grey blocks. You can cut and paste the code into the R Studio script editor to build your own scripts and run them. Comments (which are not executable code) are identified by a # at the start of the comment and appear in green. Lines of functional code use blue for numeric and logical (TRUE/FALSE) values, green for character strings, and black for everything else.
Output from R appears in a box with no background color, with ## at the start of the line. These boxes show you the output that R sends to the console window when you run the code box that precedes it.

#This is a block of R code. It has the same color coding that you'd see in R Studio's script editor.  



x <- c(0,1,2,3)   #this assigns some numeric values to a variable named x.

mean(x)           #this calculates the mean of the values in x, and will cause the mean to be displayed in console output.

## [1] 1.5

Installing and loading packages

You have to download and install a package before you can load it so that the functions within the package are available. On your computer, now install a few packages that you’ll be using later for the course.

In the bottom right window, click the Packages tab. The window will show a list of the packages that are installed. You can use the Update icon to ensure you have the most recent versions from CRAN.
Click the Install icon. A window pops up with three boxes. Leave the first (repository) on the default. Leave the last (directory path) on the default unless you installed R somewhere other than the default location. In the middle box, begin typing the name of the package you’d like to install, and a list of packages will appear. Select the one you want and click install, leaving the ‘install dependencies’ box checked (this ensures that packages required by the selected package will all be installed together – many R packages require functions that are defined in some other package.)
Use this process to install the packages popbio, unmarked, RMark, ggplot2, msm, lme4, MASS, MuMIn, car, class and gplots. (Capitalization matters when typing these package names.)
Once a package has been installed, you have to load it to make use of the functions in the package. You can do this in R Studio by clicking the box next to the name of an installed package, but it’s better to use the library() function within the script that will use the package.
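
The same installation can be done from a script with the base R functions installed.packages() and install.packages(). The sketch below is one possible way to do it, not a required step: the variable names course.pkgs and not.installed are introduced for illustration, and the interactive() guard is an assumption added here so that nothing is downloaded when the script runs unattended.

```r
# packages the course asks you to install (names are case-sensitive)
course.pkgs <- c("popbio", "unmarked", "RMark", "ggplot2", "msm", "lme4",
                 "MASS", "MuMIn", "car", "class", "gplots")

# which of these are not yet in the local library?
not.installed <- setdiff(course.pkgs, rownames(installed.packages()))
not.installed

# install only what is missing, and only in a live R session
if (interactive() && length(not.installed) > 0) {
  install.packages(not.installed, dependencies = TRUE)
}
```

Setting dependencies = TRUE plays the same role as leaving the ‘install dependencies’ box checked in the R Studio dialog.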

Load the ggplot2 package, which provides functions for pretty graphics, and then use the qplot function within that package to make a graph of two variables named length and height.

#load the ggplot2 package

library(ggplot2)



#create some variables and assign values

length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)

height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)

species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')



#make a graph with the qplot function within the ggplot2 package

qplot(length,height, colour = species)

Often in R people use shortcuts that make writing the script faster, when there are equivalent methods that make the logic more clear. For example, the code below produces an identical graph. By default [as in the example above] the first variable name put into the qplot function is the abscissa (x) and the second variable name specified is the ordinate (y), but you can also choose to spell this out [as in the example below]. Remember that you can copy and paste code into the script editor and run it from there with CTRL-R, and that you can put the cursor on a function in the script editor and press F1 to get context-specific help that explains the arguments (the terms within the function’s parentheses) of the function. This is a good way to learn the conventions (like the first variable being assumed to be x, in the case of qplot) and the default values that will be used if you don’t specify otherwise.

qplot(x=length,y=height,colour = species)
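
Recent ggplot2 releases (3.4.0 and later) flag qplot() as deprecated and print a warning, although the calls above still run. A roughly equivalent graph can be built with the ggplot() function, sketched below. The data frame name animals and the object name p are placeholders introduced for this example.

```r
library(ggplot2)

# same values as above, collected into a data frame (ggplot() expects one)
animals <- data.frame(
  length  = c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6),
  height  = c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7),
  species = c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')
)

# aes() maps columns to x, y and colour; geom_point() draws the scatterplot
p <- ggplot(animals, aes(x = length, y = height, colour = species)) +
  geom_point()
print(p)
```

Spelling out x, y and colour inside aes() follows the same logic as naming the arguments in the second qplot() call.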

Some General Points About R Scripts

There is often more than one way to accomplish the same thing in R. Instead of loading the ggplot2 package and using its qplot() function, I could have just used the plot() function that is already provided in the base R package.

plot(length,height)

R is sensitive to capitalization, so the package name MuMIn is not the same thing as mumin or MUMIN. This is also true for variable names: var1 and VAR1 are two different variables as far as R is concerned.

var1 <- 2

VAR1 <- 2000

var1

## [1] 2

VAR1

## [1] 2000

In the above code block, the <- assigns a value to a variable, and then typing the name of the variable causes its value to be displayed on the console. Alternatively, you can use = instead of <-.

var2 = 50

var2

## [1] 50

I’d recommend using <- to assign values, and thinking of it as assign and not equals. This will prevent any possible confusion with the R code for “is equal to”, which is “==”. This will be important later.
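
A short sketch of the difference between assigning and testing for equality (the names var3, is.five and is.six are made up for this example):

```r
var3 <- 5                 # assignment: var3 now holds the value 5

is.five <- (var3 == 5)    # comparison: asks whether var3 is equal to 5
is.six  <- (var3 == 6)    # comparison: asks whether var3 is equal to 6

is.five
is.six
```

The comparisons return logical values (TRUE or FALSE) and leave var3 unchanged.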

R ignores spaces between the items within a line, so it is not a problem to have extra spaces between items. You cannot have spaces within an item. For example, if you put a space between the < and the -, R will not recognize an assignment statement; it reads the < as “less than” and the - as a minus sign.

var2 <-              3

var2

## [1] 3
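
A space inside the arrow deserves a closer look, because R does not stop with an error in that case. The sketch below shows what happens; the name check is introduced only for illustration.

```r
var2 <- 3              # assignment: var2 holds 3

check <- var2 < - 3    # NOT an assignment: asks whether var2 is less than -3
check                  # a logical value, not a new value for var2
var2                   # var2 still holds 3
```

Because the mistyped line runs without complaint, this kind of error is easy to miss; checking the Environment tab is a good way to catch it.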

R also allows you to type a single line of code on multiple lines without creating a problem. If you run the code below (cut and paste it into the editor, then use CTRL-A and CTRL-R), you’ll see that both assignment statements work fine. If you look at the console after it assigns the values to vector2, you’ll notice that the R console displays a > at the beginning of each new line, but displays a + instead of > when it is continuing a line of code rather than starting a new one. When you are debugging code that won’t run, it’s useful to check whether the console is displaying a plus sign, which indicates that R considers the previous command incomplete (for example because of an unclosed parenthesis or quote) and is waiting for the rest of it.

vector1 <- c(1,2,3,4,5)

vector2 <-

          

      c(6,7,8,9,10)



vector1

## [1] 1 2 3 4 5

vector2

## [1]  6  7  8  9 10

R cannot deal with spaces within a variable name: it will treat the two parts as separate entities. A common convention is to use periods (dots) as a spacer within a variable name.

var.named.joe <- "Hi I'm Joe" 

var.named.joe

## [1] "Hi I'm Joe"

The above command successfully assigns the character string “Hi I’m Joe” to the variable var.named.joe.

#var named joe <- "oops"

would give an error message because of the spaces within the variable name. (Here, I have this line ‘commented out’ so that this script will run without errors.) Note that spaces within the character string stored in the variable var.named.joe are treated like any other character, unlike the spaces within the variable name itself. We’ll discuss the differences between text strings and numerical values below.

Variable names:

cannot have spaces within them.
cannot start with a number.
cannot contain a $ (because $ is used to separate the name of a data frame and a variable within that data frame… more on this next session).
cannot contain any symbols used for mathematical operations in R.
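
Base R’s make.names() function converts arbitrary text into legal variable names, so it offers a quick way to see how R treats a problem name. A small sketch (the example names are made up):

```r
# four names that break the rules above
bad.names <- c("var named joe", "2fast", "cost$", "a+b")

# make.names() replaces illegal characters with a dot and
# prefixes an X when the name starts with a number
fixed <- make.names(bad.names)
fixed
```

read.csv() cleans up column headings in much the same way by default, which is why headings with spaces or symbols show up in R with dots in them.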

Putting a # at the start of a line turns it into a comment. Anything following the # will be displayed but will not be executed as code. This provides a way to annotate code or to disable a line of code without deleting it, which can be useful when debugging.

# this is a comment, explaining that the next line is a functional assignment statement that uses the function

# seq() to assign a sequence of values from 0 to 1 by units 0.1 to a variable named vector1.



vector1 <- seq(0,1,0.1)



#this is a comment noting that the next line of code is commented out and therefore doesn't run



#vector2 <- seq(0,1,0.1)



vector1

##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

vector2

## [1]  6  7  8  9 10

You may have noticed that this code block re-used the variable names vector1 and vector2. vector1 used to hold the values 1,2,3,4,5, but now it holds the values 0, 0.1, 0.2 … 0.9, 1. vector2 still holds its original values (6,7,8,9,10) because the new assignment is commented out. When writing longer scripts, it’s important to remember that R stores only the last assignment of a variable. The Environment tab in the top right window is useful when you have any confusion about what values are currently stored in a variable.

In general, avoid reusing variable names within a script unless doing so for some intentional reason.

Also, do not have multiple scripts open at once unless it is for a good reason; if they have variable names in common it can create unanticipated problems. If you have two or more scripts open (perhaps to copy an example code block), recall that you can use the clear button (broom icon) in the Environment tab of the top right window to clear all memory out and work with a clean slate.

Quotes are used to specify text, or character strings as text variables are called in R. It does not matter if they are single or double quotes.

# this stores the value 12 in a numeric variable 

var.num <- 12      

summary(var.num)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 

##      12      12      12      12      12      12

# this stores the character string "12", essentially as a word rather than a number.  You can't do math on it.

var.char = "12"     

summary(var.char)

##    Length     Class      Mode 

##         1 character character
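
If a number has ended up stored as a character string, the base R function as.numeric() converts it back. A minimal sketch (var.from.char is a name introduced for this example):

```r
var.char <- "12"

class(var.char)                         # reports the type of the variable

var.from.char <- as.numeric(var.char)   # convert the string to a number
class(var.from.char)

var.from.char + 1                       # math now works
```

This situation comes up often when a data file has a stray text entry in a column that should be numeric.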

Data Input

Once you have R and R Studio installed, have installed some packages and know how to load them, the next step is entering data. You’ve already seen some simple examples of using assignment statements to store some values in named variables, but most often you’ll be reading a data file in from some other source. Most ecologists use a spreadsheet program (like MS Excel) and/or a database (like MS Access) for data files, though some people just use R with ASCII text files. For this course I’ll assume that you use Excel or something similar.

Getting data from a spreadsheet into R is fairly simple.

Save the file as a comma-delimited text file.
Optionally, check the file with a text editor like Notepad++ to be sure it does not have anything extraneous in it. The most common problem I’ve seen is blank lines at the end of the dataset. This can crash some packages, especially if some variables have trailing blanks and others don’t.
Read the data in and assign it to a dataframe (more on dataframes next session – a dataframe is basically similar to a spreadsheet)

We’ll import a dataset on herd sizes and behavior of African ungulates in relation to their distance from predators.

#First, remove all of the items that might be stored in memory.  This is a good first line for all scripts.  It is also an example of nesting two functions.  The ls() function lists the memory objects stored in the working environment.  The rm() function removes a set of listed items.  By nesting the two functions and specifying that all objects in memory should be listed and thus removed, this is an efficient way to start a script with a clean slate.



rm(list=ls(all=TRUE))



# specify the location of a data file that has already been saved as comma-delimited text and read the data file kenyaherdsize2.txt into an R dataframe stored as kenyaherdsize2.  I've used the read.csv() function, which is specifically for comma-delimited files.  The argument header = TRUE specifies that the first row of the file contains the names of the variables.  header = FALSE assigns automatic names to each variable (V1, V2, etc.).  There is a more general read.table() function that would also work...there is more than one way to do most things in R.



kenyaherdsize2 <- read.csv("C:/Users/screel/Desktop/KENYA/kenya_behavior_paper/Kenya_herd_size/kenyaherdsize2.txt", header=TRUE)
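
The path above points to a file on my computer, so it will not run on yours as written. To see what the header argument does without that file, here is a self-contained sketch that writes a tiny comma-delimited file to a temporary location and reads it back. The file contents and the names tmp.file, with.header and no.header are made up for this example.

```r
# write a tiny comma-delimited file to a temporary location (made-up values)
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("Species,GroupSize", "Giraffe,5", "Zebra,8"), tmp.file)

# header = TRUE: the first row supplies the variable names
with.header <- read.csv(tmp.file, header = TRUE)
names(with.header)
nrow(with.header)

# header = FALSE: the first row is treated as data and names are generated
no.header <- read.csv(tmp.file, header = FALSE)
names(no.header)
nrow(no.header)
```

One version difference to be aware of: in R 4.0.0 and later, read.csv() keeps text columns as character strings by default. Adding the argument stringsAsFactors = TRUE gives factors, which is what produces the “Levels:” lines in the output shown in this handout.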

Note that the value TRUE in the above code is not a text string, it is a logical variable. Logical variables are different than text, and can only take the values TRUE or FALSE. These are not equivalent to the character strings “TRUE” and “FALSE” (the quotes make these into character strings; recall that single or double quotes are both OK).

log.var = TRUE

text.var = "TRUE"



summary(log.var)

##    Mode    TRUE    NA's 

## logical       1       0

summary(text.var)

##    Length     Class      Mode 

##         1 character character

Logical variables are often used in the arguments for functions and are useful in controlling whether or not a chunk of code executes. Later, you’ll see examples of control structures such as for() loops and if() or which() statements. In the read.csv() function, the TRUE or FALSE value of the argument header determines whether or not some hidden code is run to create variable names in the new dataframe from the first row of the text file being read.
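
As a preview, here is a minimal sketch of a logical variable controlling whether a chunk of code executes. The names show.mean and result are introduced for illustration.

```r
x <- c(0,1,2,3)
show.mean <- TRUE     # a logical switch; change to FALSE to skip the calculation

if (show.mean) {
  result <- mean(x)   # runs only when show.mean is TRUE
} else {
  result <- NA        # runs only when show.mean is FALSE
}

result
```

Note that show.mean is compared against nothing; if() simply uses the TRUE or FALSE value it is given.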

Now use the attach() function to attach the dataframe so that its name does not need to be typed out whenever using the variables within it. If you do not attach the dataframe, you identify a variable using the syntax

dataframe$variable

For example, kenyaherdsize2$Species identifies the variable Species within the dataframe kenyaherdsize2. After using attach(kenyaherdsize2), you can just type the variable name, Species. Note that Species has the first letter capitalized and would not be recognized if a lower-case s was used.

#use the head() function to look at the first 6 values of a variable

head(kenyaherdsize2$Species)

## [1] Giraffe Giraffe Giraffe Giraffe Giraffe Giraffe

## Levels: Giraffe Grant Impala Wildbst Zebra

#attach the dataframe and then use head() again, this time not having to specify the dataframe that holds the variable -- compare with the line just above



attach(kenyaherdsize2)

head(Species)

## [1] Giraffe Giraffe Giraffe Giraffe Giraffe Giraffe

## Levels: Giraffe Grant Impala Wildbst Zebra

The two methods yield the same result, but the second is generally easier. If complications arise because you are working with two (or more) dataframes and they both contain a variable with the same name, it is best to not use the shortcut. You can use the function detach() to detach any dataframe you’ve attached, and then rely on the full dataframe$variable method.
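
A middle path between the two, not used elsewhere in this handout, is the base R function with(), which makes the variables of one dataframe available for a single expression only. The sketch below uses a small made-up dataframe named herds rather than the Kenya data.

```r
# a small made-up dataframe for illustration
herds <- data.frame(Species   = c("Giraffe", "Zebra", "Zebra"),
                    GroupSize = c(5, 8, 11))

# full dataframe$variable syntax
full.syntax <- mean(herds$GroupSize)

# with(): variable names are looked up inside herds for this one expression
with.syntax <- with(herds, mean(GroupSize))

full.syntax
with.syntax
```

Because nothing stays attached afterwards, there is no risk of confusing same-named variables from two dataframes.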

Do a little more inspection of the imported dataframe.

#examine the first 6 rows and last 6 rows of the imported file and look at a summary of each variable, to check if everything seems correct



head(kenyaherdsize2)

##         ObsID Species Pred.LT.400m DistPred PredSpecies Adult.Prop.Vig

## 1 03-Apr-10-6 Giraffe       Absent    0.883          LI           0.00

## 2 03-Apr-10-7 Giraffe       Absent    0.883          LI           0.00

## 3 05-Jul-10-0 Giraffe      Present    0.050          LI           0.70

## 4 05-Jul-10-6 Giraffe       Absent    2.660          LI           0.00

## 5 06-May-10-1 Giraffe       Absent    1.300          LI           0.00

## 6 07-Jun-10-1 Giraffe       Absent    3.300          LI           0.03

##   Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot Adult.Prop.Feed

## 1              0            0.14              0.14            0.14

## 2              0            0.00              0.00            0.25

## 3              0            0.25              0.95            0.00

## 4              0            0.00              0.00            0.57

## 5              0            0.00              0.00            0.09

## 6              0            0.02              0.05            0.48

##    ObsDate ObsStart  UTM.E   UTM.N Temp.C. Clouds Wind.mph. Habitat

## 1 4/3/2010  6:40 PM     NA      NA    29.6     NA        NA   OB/OW

## 2 4/3/2010  6:50 PM     NA      NA    29.6     NA        NA   OB/OW

## 3 7/5/2010  8:01 AM 177809 9790483    19.1    0.9       0.5   CB/CW

## 4 7/5/2010  9:57 AM 178081 9792998    27.0    0.0       2.4      CW

## 5 5/6/2010  8:02 AM 177170 9790670    27.1    0.0        NA      IW

## 6 6/7/2010  8:09 AM     NA      NA    29.1    0.9       0.0      OB

##   HabOpen.Close BushWoodGrass GrassHt.m. GrassColor DistWood.m. Lion1.km.

## 1             O             B         NA                             0.88

## 2             O             B         NA                             0.88

## 3             C             B         NA                 31-100      0.05

## 4             C             W        0.0                 31-100      2.66

## 5             C             W        0.0                 31-100      1.30

## 6             O             B        0.5                 31-100      3.30

##   Lion2.km. Hyena1.km. Hyena2.km. FollowID Kill HerdType C SA  A SAM SAF

## 1        NA         NA         NA  Len-020    0    Mixed 0  0  5   0   0

## 2        NA         NA         NA  Len-020    0    Mixed 0  0  2   0   0

## 3        NA         NA         NA  Ren-032    0   Single 0  1  7   0   0

## 4        NA         NA         NA  Len-045    0   Single 0  0  2   0   0

## 5        NA         NA         NA  Ren-016    0   Single 0  0  6   0   0

## 6      3.13         NA         NA  Ren-027    0    Mixed 0  0 10   0   0

##   AM AF GroupSize Prop.A Prop.AM Prop.AF

## 1  0  0         5  1.000       0       0

## 2  0  0         2  1.000       0       0

## 3  0  0         8  0.875       0       0

## 4  0  0         2  1.000       0       0

## 5  0  0         6  1.000       0       0

## 6  0  0        10  1.000       0       0

tail(kenyaherdsize2)

##        ObsID Species Pred.LT.400m DistPred PredSpecies Adult.Prop.Vig

## 489 40545-1B   Grant                  3010          HY             NA

## 490  40545-2   Grant                  3010          HY             NA

## 491  40545-3   Grant                  3010          HY             NA

## 492  40555-1   Zebra                  1140          LI             NA

## 493  40559-1   Grant                  2000          HY             NA

## 494 40559-1A   Grant                  2000          HY             NA

##     Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot Adult.Prop.Feed

## 489             NA              NA                NA              NA

## 490             NA              NA                NA              NA

## 491             NA              NA                NA              NA

## 492             NA              NA                NA              NA

## 493             NA              NA                NA              NA

## 494             NA              NA                NA              NA

##     ObsDate    ObsStart  UTM.E   UTM.N Temp.C. Clouds Wind.mph. Habitat

## 489   40545 0.367951389 183160 9784087    30.0    0.3       4.0      OG

## 490   40545 0.348333333 183160 9784087    30.0    0.3       4.0 OG/sc W

## 491   40545 0.359143519 183160 9784087    30.0    0.3       4.0 OG/sc B

## 492   40555 0.344652778 177042 9785595    29.2    0.1       3.1      OW

## 493   40559 0.365289352 181098 9788293    27.9    0.0       2.6      OW

## 494   40559 0.373854167 181098 9788293    27.9    0.0       2.6      OW

##     HabOpen.Close BushWoodGrass GrassHt.m.  GrassColor DistWood.m.

## 489             O             G          5 grass brown        >300

## 490             O             G          5 grass brown        >300

## 491             O             G          5 grass brown        >300

## 492             O             W         30 grass brown        >300

## 493             O             W          0    no grass            

## 494             O             W          0    no grass            

##     Lion1.km. Lion2.km. Hyena1.km. Hyena2.km. FollowID Kill HerdType  C SA

## 489        NA        NA       3010         NA             0   Single NA NA

## 490        NA        NA       3010         NA             0   Single NA NA

## 491        NA        NA       3010         NA             0   Single NA NA

## 492      1140        NA         NA         NA             0   Single NA NA

## 493        NA        NA       2000         NA             0   Single NA NA

## 494        NA        NA       2000         NA             0   Single NA NA

##      A SAM SAF AM AF GroupSize Prop.A Prop.AM Prop.AF

## 489 NA  NA  NA NA NA        19     NA      NA      NA

## 490 NA  NA  NA NA NA         6     NA      NA      NA

## 491 NA  NA  NA NA NA        12     NA      NA      NA

## 492 NA  NA  NA NA NA        10     NA      NA      NA

## 493 NA  NA  NA NA NA        20     NA      NA      NA

## 494 NA  NA  NA NA NA         5     NA      NA      NA

summary(kenyaherdsize2)

##          ObsID        Species     Pred.LT.400m    DistPred    PredSpecies

##  10-Jun-10-3:  4   Giraffe: 49          :132   Min.   :   0   HY:128     

##  10-Jun-10-5:  4   Grant  :152   Absent :282   1st Qu.:   1   LI:366     

##  12-Apr-10-4:  4   Impala : 55   Present: 80   Median :   2              

##  12-Apr-10-5:  4   Wildbst: 87                 Mean   : 448              

##  28-May-10-4:  4   Zebra  :151                 3rd Qu.:  49              

##  28-May-10-7:  4                               Max.   :3890              

##  (Other)    :470                                                         

##  Adult.Prop.Vig Adult.Prop.BVG Adult.Prop.Vgme Adult.Prop.VigTot

##  Min.   :0.00   Min.   :0.0    Min.   :0.00    Min.   :0.00     

##  1st Qu.:0.00   1st Qu.:0.0    1st Qu.:0.00    1st Qu.:0.00     

##  Median :0.00   Median :0.0    Median :0.00    Median :0.00     

##  Mean   :0.07   Mean   :0.0    Mean   :0.02    Mean   :0.10     

##  3rd Qu.:0.09   3rd Qu.:0.0    3rd Qu.:0.00    3rd Qu.:0.11     

##  Max.   :1.00   Max.   :0.5    Max.   :1.00    Max.   :1.00     

##  NA's   :146    NA's   :146    NA's   :146     NA's   :146      

##  Adult.Prop.Feed      ObsDate       ObsStart       UTM.E       

##  Min.   :0.00    4/12/2010: 16          : 14   Min.   :171123  

##  1st Qu.:0.02    6/10/2010: 16   6:56 AM: 10   1st Qu.:176118  

##  Median :0.28    7/28/2009: 16   6:40 PM:  9   Median :177109  

##  Mean   :0.36    4/3/2010 : 14   6:07 PM:  7   Mean   :177509  

##  3rd Qu.:0.61    40477    : 13   7:56 AM:  7   3rd Qu.:179041  

##  Max.   :1.00    5/28/2010: 13   6:06 PM:  6   Max.   :183970  

##  NA's   :138     (Other)  :406   (Other):441   NA's   :39      

##      UTM.N            Temp.C.         Clouds       Wind.mph.    

##  Min.   :9775132   Min.   :19.1   Min.   :0.00   Min.   : 0.00  

##  1st Qu.:9785680   1st Qu.:26.0   1st Qu.:0.15   1st Qu.: 0.60  

##  Median :9787424   Median :28.5   Median :0.40   Median : 1.20  

##  Mean   :9787328   Mean   :28.7   Mean   :0.46   Mean   : 1.85  

##  3rd Qu.:9789825   3rd Qu.:31.5   3rd Qu.:0.80   3rd Qu.: 2.50  

##  Max.   :9801904   Max.   :38.5   Max.   :1.00   Max.   :14.50  

##  NA's   :39        NA's   :33     NA's   :132    NA's   :96     

##     Habitat    HabOpen.Close BushWoodGrass   GrassHt.m.  

##  OW     :158   C:109         B:143         Min.   : 0.0  

##  OG     : 74   O:385         G:144         1st Qu.: 0.1  

##  OB     : 49                 W:207         Median : 0.5  

##  OG/sc B: 39                               Mean   :11.4  

##  OB/OW  : 34                               3rd Qu.:30.0  

##  IW     : 33                               Max.   :60.0  

##  (Other):107                               NA's   :203   

##        GrassColor   DistWood.m.    Lion1.km.      Lion2.km.  

##             :323          :124   Min.   :   0   Min.   :0.2  

##  grass brown: 93   >300   : 73   1st Qu.:   1   1st Qu.:1.2  

##  grass mixed: 25   1-30   : 52   Median :   2   Median :2.8  

##  green      : 25   101-300:155   Mean   : 463   Mean   :2.5  

##  brown      : 11   31-100 : 90   3rd Qu.: 594   3rd Qu.:3.3  

##  grass green: 11                 Max.   :3730   Max.   :4.3  

##  (Other)    :  6                 NA's   :128    NA's   :473  

##    Hyena1.km.   Hyena2.km.        FollowID        Kill         HerdType  

##  Min.   :   0   Mode:logical          :141   Min.   :0.000   Mixed :260  

##  1st Qu.:   1   NA's:494       Len-015: 16   1st Qu.:0.000   Single:234  

##  Median :   2                  Ren-028: 16   Median :0.000               

##  Mean   : 329                  Piz-012: 15   Mean   :0.115               

##  3rd Qu.:   4                  Len-020: 14   3rd Qu.:0.000               

##  Max.   :3890                  Mku-007: 13   Max.   :1.000               

##  NA's   :337                   (Other):279                               

##        C              SA              A               SAM       

##  Min.   :0.00   Min.   : 0.00   Min.   :  0.00   Min.   : 0.00  

##  1st Qu.:0.00   1st Qu.: 0.00   1st Qu.:  0.00   1st Qu.: 0.00  

##  Median :0.00   Median : 0.00   Median :  3.00   Median : 0.00  

##  Mean   :0.18   Mean   : 0.19   Mean   :  7.38   Mean   : 0.39  

##  3rd Qu.:0.00   3rd Qu.: 0.00   3rd Qu.: 10.00   3rd Qu.: 0.00  

##  Max.   :6.00   Max.   :13.00   Max.   :104.00   Max.   :20.00  

##  NA's   :132    NA's   :132     NA's   :132      NA's   :132    

##       SAF             AM              AF         GroupSize    

##  Min.   :0.00   Min.   : 0.00   Min.   : 0.0   Min.   :  1.0  

##  1st Qu.:0.00   1st Qu.: 0.00   1st Qu.: 0.0   1st Qu.:  2.0  

##  Median :0.00   Median : 0.00   Median : 0.0   Median :  6.0  

##  Mean   :0.02   Mean   : 1.18   Mean   : 1.4   Mean   : 10.2  

##  3rd Qu.:0.00   3rd Qu.: 1.00   3rd Qu.: 1.0   3rd Qu.: 12.0  

##  Max.   :4.00   Max.   :21.00   Max.   :24.0   Max.   :118.0  

##  NA's   :132    NA's   :132     NA's   :132                   

##      Prop.A        Prop.AM        Prop.AF    

##  Min.   :0.00   Min.   :0.00   Min.   :0.00  

##  1st Qu.:0.00   1st Qu.:0.00   1st Qu.:0.00  

##  Median :0.83   Median :0.00   Median :0.00  

##  Mean   :0.55   Mean   :0.23   Mean   :0.17  

##  3rd Qu.:1.00   3rd Qu.:0.27   3rd Qu.:0.15  

##  Max.   :1.00   Max.   :1.00   Max.   :1.00  

##  NA's   :132    NA's   :132    NA's   :132

Or better, just open the entire dataframe in RStudio. You can do this by clicking on the dataframe in the Environment tab in the top right window, or with the View() command, shown below. Either way, the dataframe opens in a tab in the top left window, so you can easily jump between the data and the script editor. You will not see the dataframe in this HTML page, but you will if you run the script in RStudio.

View(kenyaherdsize2)

The data seem generally correct. As a further check, make a plot to confirm that everything looks more or less as expected.

library(ggplot2)

qplot(y = Adult.Prop.VigTot, x = Adult.Prop.Feed, data = kenyaherdsize2, xlab = "Proportion Foraging", ylab = "Proportion Vigilant", geom = c('jitter', 'smooth'))

## geom_smooth: method="auto" and size of largest group is <1000, so using loess. Use 'method = x' to change the smoothing method.

## Warning: Removed 146 rows containing missing values (stat_smooth).

## Warning: Removed 146 rows containing missing values (geom_point).

[Plot: Proportion Vigilant (y) against Proportion Foraging (x), shown as jittered points with a loess smooth]

Looks reasonable. I’d emphasize that you have to inspect data sets thoroughly to make sure that small errors or omissions have not crept in. For example, if you examine this dataframe carefully (or if you’ve been reading the warning messages in the R output), you’ll notice that some variables have missing values. These can easily be derived from other variables in the file, but you have to be aware of the problem to fix it.
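One way to make that check routine is to count the missing values directly rather than reading them off the summary. The sketch below is only an illustration: it uses a small made-up dataframe called toy (a stand-in for kenyaherdsize2, borrowing three of its column names but none of its values), so it runs without the data file. Note that blank entries in a text column, like the unlabelled level of Pred.LT.400m in the summary above, are not stored as NA, so is.na() does not count them.

```r
# Toy stand-in for kenyaherdsize2; every value here is made up for illustration.
toy <- data.frame(
  Adult.Prop.VigTot = c(0.10, NA, 0.00, NA, 0.25),
  Adult.Prop.Feed   = c(0.60, NA, 0.80, 0.40, NA),
  Pred.LT.400m      = c("Absent", "", "Present", "", "Absent"),
  stringsAsFactors  = TRUE
)

# Number of NA values in each column
na_counts <- colSums(is.na(toy))

# Blank strings are not NA, so count them separately
blank_counts <- sapply(toy, function(col) sum(!is.na(col) & as.character(col) == ""))

# Rows with an NA in either plotted column (the rows a plot of the two would drop)
plot_cols <- c("Adult.Prop.VigTot", "Adult.Prop.Feed")
dropped <- sum(!complete.cases(toy[, plot_cols]))

print(na_counts)
print(blank_counts)
print(dropped)
```

Replacing toy with kenyaherdsize2 should give counts that agree with the NA's rows of the summary() output, and the last value should agree with the 146 rows reported in the plotting warnings.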

Next session - working with data in R: selecting subsets of data, creating new variables and ‘binding’ them to an existing dataframe, plotting, and some simple statistical tests, using data on human population growth.