Ecology and conservation biology are highly quantitative fields of science, and to be a good ecologist you must be skilled at working with data, conducting statistical analyses and presenting the results in an appropriate manner. The good news is that statistical methods and the associated software to do this are well-developed. The bad news is … the statistical methods and associated software are well developed, so the learning curve is steep.
Most ecologists now use the free software package R for data manipulation, statistical analysis, graphics and simulation modelling. R has several advantages:
1. It is free. Until R emerged, it was common to spend thousands of dollars on statistical software.
2. It is platform-independent, meaning that it will run in the same manner on a Mac, a PC running Windows, or a PC running Linux. This makes it simple to share files for analysis with colleagues (and students): if you email a file to another person, they can run it and replicate EXACTLY what you did.
3. It is not just a statistical/graphics package, it is a programming language, based on the commercial language S, which is where it got its name. Consequently, there is very little that you cannot do with R once you know the language. For example, in this course you’ll use R to manipulate and plot data, conduct statistical tests, implement highly specific models developed for ecologists (such as mark-recapture models) and construct and run fairly complex stochastic simulations of population dynamics.
4. It is code-driven rather than menu-driven. Many of you will be familiar with using menu-clicks to work with data (either to produce graphs or run statistical tests), for example in MS Excel. There is nothing wrong with this, but once a complicated, menu-driven analysis with many steps has been completed, it can be virtually impossible to replicate without detailed notes, even for the person who did it. In R, you write lines of code (a “script”) to open a data file, process it, conduct analysis and make plots. Once the script is saved, it is a permanent record of EXACTLY how the analysis was done. In practice, this is very useful because it allows (or perhaps forces!) you to:
5. Because of points 1 - 4, there is a lot of pre-existing R code available in two forms:
6. For ecologists in particular, most newly developed methods are implemented in R and provided as a package with documentation. Using these packages is usually the most efficient and effective way to apply recently developed ecological methods. This class will require the use of several such packages:
Along with these advantages are two principal disadvantages:
1. R has a steeper learning curve than most menu-driven software, especially at the outset, and particularly if you’ve never written computer code before.
2. As with any computer language, R does EXACTLY what you tell it to do, so minor errors of logic, syntax, spelling and capitalization will keep an otherwise functional script from running. Yuo can raed this but R cant.
All computers in MSU labs have R (and R Studio) installed. Most of you will want to use these on your own computer, and as mentioned above, R will run on a Mac (running OS X) or a PC (running either Windows or Linux). If you haven’t already, you’ll have to download and install R. R is available from CRAN, the Comprehensive R Archive Network. Click the link, then at CRAN select the appropriate version for your computer from the top box, and follow instructions to download and install it.
I use R Studio to run R, and this class will assume that you’re using it, because the authors of RFDS are two of the main developers of R Studio (and the ‘tidyverse’ R packages). Click the link to go to the R Studio site, then click the download link and follow instructions to install it. Once installed, you just start R Studio to run R. I prefer R Studio because it provides four windows that help you organize your work, avoid errors, and work efficiently.
Top left window - source code editor, where you write a script and save it. Unlike the graphical user interface (or GUI) in R, this editing window:
uses color coding to distinguish numbers (blue), executable code (black), and comments (green - comments are text that is not executable code, and is identified by putting a # at the start of a line).
has auto-completion of inherently paired items like parentheses, square brackets and quotes, so it is less likely that you will forget to ‘close’ these properly, which is a common error. It also shows you (with gray highlighting) the pairing of these items (incorrect use of parentheses is also a common error in R scripts). Even with auto-completion, you have to be careful about pairing.
has a tab-completion tool. One of the most common errors in an R script is a typo in the name of a variable. R is completely literal, so a typo in the name of a variable or a function means it is simply not recognized. Tab-completion helps to avoid this. For any variable (or other item) that has been stored in memory by prior lines of code, you can begin typing the name and then press the tab key. If the characters you’ve typed identify the variable uniquely, it will be inserted without you having to type it out. If there is more than one item in memory that starts with the letters you’ve typed, a little window will pop up and you can click the one you want for auto-completion. For various reasons, variable names are often long (e.g. “lion.surv.p.dot.phi.time”), so typing them out can be slow and error prone.
F1 context-sensitive help. Put the cursor within any function in the editing window and press F1, and help for that function (including example code) pops up in the bottom right window. THIS IS VERY USEFUL WHEN LEARNING R. I USE IT OFTEN WHEN DEALING WITH A NEW PACKAGE OR FUNCTION.
Bottom right window has several tabs, including:
Bottom left window is what would be called the console in basic R. This is where lines of code are actually executed. In R Studio:
* Position the cursor on a line of code in the editor (top left window), then press CTRL-R (or CTRL-Enter) to execute just that line. Executing code one line at a time can be very helpful in identifying and solving problems that keep an entire script from running properly (or at all).
* Highlight a block of code in the editor (top left window) and press CTRL-R to execute the block.
* Put the cursor anywhere in the editor, press CTRL-A to select the entire script, then press CTRL-R to run it.
As the script runs, you will see each line of code (in blue) echoed on the console as it is executed, output from the script (in black) and error messages (in red). Plots will pop up in the bottom right window, and you can scroll through them once the script is done running (with the left and right arrows).
Top right window has two tabs:
Environment shows the variables and values that have been stored in memory by the script. This is helpful when debugging errors. When you’re running lines from a script, one line often depends on variables created by prior lines, so the original source of an error can be a little tricky to diagnose. A good plan for debugging a script with problems is to start with nothing in memory (use the little broom icon labelled ‘Clear’ in the top right window) and nothing in the console (put the cursor in the console window [bottom left] and press CTRL-L), then execute single lines or small blocks of code in order from the start of the script, checking the console and the Environment tab after each step to see whether you have error messages and whether the variables have values that make sense.
History shows the lines of code that have been run. I don’t use this tab much.
The html files that explain R scripts in this class will always have the same formatting for three things.
#This is a block of R code. It has the same color coding that you'd see in R Studio's script editor.
x <- c(0,1,2,3) #this assigns some numeric values to a variable named x.
mean(x) #this calculates the mean of the values in x, and will cause the mean to be displayed in console output.
## [1] 1.5
You have to download and install a package before you can load it so that the functions within the package are available. On your computer, now install a few packages that you’ll be using later for the course.
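The install step can be sketched as follows. This is only a minimal sketch, not the official package list for the course: it assumes the two packages named in this handout (ggplot2 and MuMIn), and the variable names pkgs, have and to.install are made up for illustration. The install line is commented out so the block runs without an internet connection; remove the # to actually install.

```r
# packages to check (assumption: the two packages named in this handout)
pkgs <- c("ggplot2", "MuMIn")

# names of all packages already installed on this computer
have <- rownames(installed.packages())

# the packages that still need to be installed (possibly none)
to.install <- setdiff(pkgs, have)
to.install

# download and install the missing packages from CRAN (needs internet access)
# install.packages(to.install)
```

Installing is done once per computer, whereas library() has to be run in every new R session in which you want to use the package.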
Load the ggplot2 package, which provides functions for pretty graphics, and then use the qplot function within that package to make a graph of two variables named length and height.
#load the ggplot2 package
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.6.2
#create some variables and assign values
length <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')
#make a graph with the qplot function within the ggplot2 package
qplot(length,height, colour = species)
Often in R people use shortcuts that make writing the script faster, when there are equivalent methods that make the logic more clear. For example, the code below produces an identical graph. By default [as in the example above] the first variable name put into the qplot function is the abscissa (x) and the second variable name specified is the ordinate (y), but you can also choose to spell this out [as in the example below]. Remember that you can copy and paste code into the script editor and run it from there with CTRL-R, and that you can put the cursor on a function in the script editor and press F1 to get context-specific help that explains the arguments (the terms within the function’s parentheses) of the function. This is a good way to learn the conventions (like the first variable being assumed to be x, in the case of qplot) and the default values that will be used if you don’t specify otherwise.
qplot(x=length,y=height,colour = species)
There is often more than one way to accomplish the same thing in R. Instead of loading the ggplot2 package and using its qplot() function, I could have just used the plot() function that is already provided in the base R package.
plot(length,height)
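The base plot above does not distinguish the two species. As a hedged sketch (one of several ways to do this in base R), the points can be coloured by converting species to a factor and using its integer codes as colour numbers. The name species.f and the choices of pch, axis labels and legend position are illustrative assumptions, not part of the original example.

```r
# same data as in the qplot example
length  <- c(2,2,2,3,3,3,4,4,4,5,5,5,6,6,6)
height  <- c(1,2,3,1,3,4,2,2,5,4,5,6,4,6,7)
species <- c('cat','cat','cat','cat','dog','cat','dog','cat','dog','dog','dog','cat','dog','dog','dog')

# factor() turns the species names into categories; as.integer() gives each
# category a code (1 for cat, 2 for dog) that base R reads as a colour number
species.f <- factor(species)

# scatterplot with filled circles (pch = 16), coloured by species
plot(length, height, col = as.integer(species.f), pch = 16,
     xlab = "length", ylab = "height")

# legend that matches the colour numbers to the species names
legend("topleft", legend = levels(species.f),
       col = seq_along(levels(species.f)), pch = 16)
```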
R is sensitive to capitalization, so the package name MuMIn is not the same thing as mumin or MUMIN. This is also true for variable names: var1 and VAR1 are two different variables as far as R is concerned.
var1 <- 2
VAR1 <- 2000
var1
## [1] 2
VAR1
## [1] 2000
In the above code block, the <- assigns a value to a variable, and then typing the name of the variable causes its value to be displayed on the console. Alternatively, you can use = instead of <-.
var2 = 50
var2
## [1] 50
I’d recommend using <- to assign values, and thinking of it as assign and not equals. This will prevent any possible confusion with the R code for “is equal to”, which is “==”. This will be important later.
R ignores spaces between the items within a line, so it is not a problem to have extra spaces between items. You cannot have spaces within an item. For example, a space between the < and the - means that <- is no longer recognized as an assignment: R reads var2 < - 3 as a comparison (is var2 less than -3?) instead.
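A short illustration of the difference between assigning and comparing. The variable name n.lions is made up for this example.

```r
n.lions <- 12     # assign: store the value 12 in n.lions
n.lions == 12     # compare: is the value in n.lions equal to 12?
## [1] TRUE
n.lions == 20
## [1] FALSE
n.lions           # a comparison does not change the stored value
## [1] 12
```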
var2 <- 3
var2
## [1] 3
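The effect of spacing can be checked directly with a small sketch like the one below (var3 is a made-up variable name). The first two lines are both valid assignments; the third has a space inside the arrow, so R treats it as a comparison and the stored value is unchanged.

```r
var3<-3           # no spaces between items: works
var3   <-   3     # extra spaces between items: works
var3 < - 3        # space within the arrow: asks "is var3 less than -3?"
## [1] FALSE
var3
## [1] 3
```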
R also allows you to type a single line of code on multiple lines without creating a problem. If you run the code below (copy and paste it into the editor, then use CTRL-A & CTRL-R), you’ll see that both assignment statements work fine. If you look at the console after it assigns the values to vector2, you’ll notice that the R console displays a > at the beginning of each new line, but displays a + instead of > when it is continuing a line of code rather than starting a new one. When you are debugging code that won’t run, it’s useful to check whether the console is displaying a plus sign: this means R considers the last command unfinished (often because of a missing closing parenthesis or quote) and is waiting for the rest of it. Pressing the Esc key in the console cancels the unfinished command.
vector1 <- c(1,2,3,4,5)
vector2 <-
c(6,7,8,9,10)
vector1
## [1] 1 2 3 4 5
vector2
## [1] 6 7 8 9 10
R cannot deal with spaces within a variable name: it will treat the two parts as separate entities. A common convention is to use periods (dots) as a spacer within a variable name.
var.named.joe <- "Hi I'm Joe"
var.named.joe
## [1] "Hi I'm Joe"
The above command successfully assigns the character string “Hi I’m Joe” to the variable var.named.joe.
#var named joe <- "oops"
would give an error message because of the spaces within the variable name. (Here, I have this line ‘commented out’ so that this script will run without errors.) Note that spaces within the character string stored in the variable var.named.joe are treated like any other character, unlike the spaces within the variable name itself. We’ll discuss the differences between text strings and numerical values below.
Variable names:
* cannot have spaces within them.
* cannot start with a number.
* cannot contain a $ (because $ is used to separate the name of a data frame and a variable within that data frame… more on this next session).
* cannot contain any symbols used for mathematical operations in R.
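One way to check a name against the rules above is the base R function make.names(), which converts any text into a valid name; if the result differs from what you typed, the original was not valid. This is a small sketch, and the candidate names are made up for illustration.

```r
# four candidate names: only the first follows the rules
candidates <- c("lion.surv", "var named joe", "2nd.survey", "cost$")

# make.names() replaces invalid characters with dots and puts an X in front
# of names that start with a number
fixed <- make.names(candidates)
fixed
## [1] "lion.surv"     "var.named.joe" "X2nd.survey"   "cost."

# TRUE where the candidate was already a valid name
candidates == fixed
## [1]  TRUE FALSE FALSE FALSE
```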
Putting a # at the start of a line turns it into a comment. Anything following the # will be displayed but will not be executed as code. This provides a way to annotate code or to disable a line of code without deleting it, which can be useful when debugging.
# this is a comment, explaining that the next line is a functional assignment statement that uses the function
# seq() to assign a sequence of values from 0 to 1 by units 0.1 to a variable named vector1.
vector1 <- seq(0,1,0.1)
#this is a comment noting that the next line of code is commented out and therefore doesn't run
#vector2 <- seq(0,1,0.1)
vector1
## [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
vector2
## [1] 6 7 8 9 10
You may have noticed that this code block re-used the variable names vector1 and vector2. vector1 used to hold the values 1,2,3,4,5, but now it holds the values 0, 0.1, 0.2 … 0.9, 1. vector2 still holds its original values (6,7,8,9,10) because the new assignment is commented out. When writing longer scripts, it’s important to remember that R stores only the last assignment of a variable. The Environment tab in the top right window is useful when you have any confusion about what values are currently stored in a variable.
In general, avoid reusing variable names within a script unless you are doing so for some intentional reason.
Also, do not have multiple scripts open at once unless it is for a good reason; if they have variable names in common it can create unanticipated problems. If you have two or more scripts open (perhaps to copy an example code block), recall that you can use the clear button (broom icon) in the Environment tab of the top right window to clear all memory out and work with a clean slate.
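What is in memory can also be inspected and cleared from code. The sketch below uses the base R functions ls(), exists() and rm(); the variable names demo.a and demo.b are made up for illustration. The last line is commented out because it removes everything in memory, which is what the broom icon does.

```r
demo.a <- 1
demo.b <- "two"

ls()                # lists the names of everything currently in memory
exists("demo.a")    # is there a variable with this name?
## [1] TRUE

rm(demo.a)          # remove just this one variable
exists("demo.a")
## [1] FALSE
exists("demo.b")    # other variables are untouched
## [1] TRUE

# rm(list = ls())   # removes ALL variables, like the broom icon
```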
Quotes are used to specify text, or character strings as text variables are called in R. It does not matter if they are single or double quotes.
# this stores the value 12 in a numeric variable
var.num <- 12
summary(var.num)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##      12      12      12      12      12      12
# this stores the character string "12", essentially as a word rather than a number. You can't do math on it.
var.char = "12"
summary(var.char)
##    Length     Class      Mode
##         1 character character
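The difference between the two variables can be checked with class(), and a character string that looks like a number can be converted with as.numeric(). This is a brief sketch using the same two variables; the line that would cause an error is commented out so that the block runs.

```r
var.num <- 12
var.char <- "12"

class(var.num)
## [1] "numeric"
class(var.char)
## [1] "character"

var.num + 1
## [1] 13
# var.char + 1      # error: non-numeric argument to binary operator

# convert the character string to a number, then do math on it
as.numeric(var.char) + 1
## [1] 13

# single and double quotes produce the same character string
'12' == "12"
## [1] TRUE
```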