The US Census Bureau is a definitive source of comprehensive data on human population growth and the factors that affect it. The USCB has estimated near-term future trends (2015 - 2060) in national population growth, including birth rates, death rates, immigration, emigration, ‘natural increase’ (which is growth due to births and deaths in place) and population growth (accounting for immigration and emigration). These estimates are tabulated in a 2012 report. This lab will use these data to:
1. Explore projected patterns of human population growth and carrying capacity that were discussed in class, and
2. Use R for some basic data processing and graphical analysis.
Download and save the data file US_pop_projections.txt
First, clear all objects from memory and read in the data. You will need to revise the setwd() line to set the working directory to the location where you saved the data. In the read.table() function, the sep = "" argument specifies that the variables in this input file are delimited by whitespace (one or more spaces or tabs) rather than by commas. The argument header = TRUE specifies that the first row contains variable names.
rm(list = ls())
setwd("C:/Users/g23b661/Desktop/BIOE 440 2016/BIOE_440_R_Markdown/2.Manipulating_and_using_data")
us.pop.projection <- read.table("US_pop_projections.txt", sep = "", header = TRUE)
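If you want to see what sep = "" does without touching the real file, here is a minimal sketch. The three-row text below is a hand-typed excerpt (the name raw.text and the uneven spacing are made up for illustration), and the text = argument lets read.table() read from a character string instead of a file.

```r
# Hand-typed excerpt with deliberately uneven delimiters: a tab, then several spaces
raw.text <- "year\tUS_population   births\n2015\t321363   4290\n2016\t323849   4312\n2017\t326348   4333"

# sep = "" treats any run of spaces and/or tabs as a single delimiter
demo <- read.table(text = raw.text, sep = "", header = TRUE)
demo
dim(demo)   # rows, then columns
```

Because any run of whitespace counts as one delimiter, the mixed tab and spaces still produce three clean columns.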
Always inspect the data to be sure it was read in correctly and that you know what the file holds.
head(us.pop.projection)
## year US_population annual_change percent_change natural_increase births
## 1 2015 321363 2471 0.77 1677 4290
## 2 2016 323849 2486 0.77 1669 4312
## 3 2017 326348 2499 0.77 1659 4333
## 4 2018 328857 2510 0.77 1647 4351
## 5 2019 331375 2517 0.77 1631 4367
## 6 2020 333896 2521 0.76 1612 4380
## deaths net_immigration
## 1 2613 794
## 2 2643 817
## 3 2673 840
## 4 2704 863
## 5 2736 886
## 6 2768 909
tail(us.pop.projection)
## year US_population annual_change percent_change natural_increase births
## 41 2055 409873 2037 0.5 825 4879
## 42 2056 411923 2051 0.5 838 4889
## 43 2057 413989 2065 0.5 852 4899
## 44 2058 416068 2079 0.5 865 4909
## 45 2059 418161 2093 0.5 879 4920
## 46 2060 420268 2106 0.5 891 4930
## deaths net_immigration
## 41 4054 1212
## 42 4051 1213
## 43 4048 1214
## 44 4044 1214
## 45 4041 1215
## 46 4039 1215
summary(us.pop.projection)
## year US_population annual_change percent_change
## Min. :2015 Min. :321363 Min. :1967 Min. :0.5000
## 1st Qu.:2026 1st Qu.:349476 1st Qu.:2007 1st Qu.:0.5000
## Median :2038 Median :374917 Median :2100 Median :0.5550
## Mean :2038 Mean :373172 Mean :2204 Mean :0.6017
## 3rd Qu.:2049 3rd Qu.:397324 3rd Qu.:2454 3rd Qu.:0.7075
## Max. :2060 Max. :420268 Max. :2521 Max. :0.7700
## natural_increase births deaths net_immigration
## Min. : 770.0 Min. :4290 Min. :2613 Min. : 794
## 1st Qu.: 810.2 1st Qu.:4417 1st Qu.:3016 1st Qu.:1053
## Median : 916.0 Median :4556 Median :3640 Median :1165
## Mean :1093.2 Mean :4599 Mean :3506 Mean :1110
## 3rd Qu.:1400.8 3rd Qu.:4800 3rd Qu.:4026 3rd Qu.:1201
## Max. :1677.0 Max. :4930 Max. :4055 Max. :1215
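Beyond head(), tail() and summary(), a few other base R functions are handy for inspection. The sketch below rebuilds only the first three rows of four of the columns shown above (the object name demo is made up for illustration), so that it runs without the data file.

```r
# First three rows of four columns, copied from the head() output above
demo <- data.frame(
  year = c(2015, 2016, 2017),
  US_population = c(321363, 323849, 326348),
  births = c(4290, 4312, 4333),
  deaths = c(2613, 2643, 2673)
)

str(demo)              # one line per variable: its type and first few values
dim(demo)              # number of rows and number of columns
sapply(demo, class)    # storage class of each column
colSums(is.na(demo))   # count of missing values in each column
```

On the real data, you would pass us.pop.projection to these functions instead of demo.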
Looking at the data, you probably noticed that all of the values for population size, numbers of births and deaths, etc. are too small (The US had about 320,000,000 of the world’s 7.2 billion people when I wrote this). This is because the US Census Bureau provides these raw data in units of 1,000 people. (The logic is that when numbers become too large, we lose our numerical intuition.) Convert by multiplying by 1000, so that the units are individuals.
year <- us.pop.projection$year
pop <- us.pop.projection$US_population * 1000
annual.change <- us.pop.projection$annual_change * 1000
natural.increase <- us.pop.projection$natural_increase * 1000
births <- us.pop.projection$births * 1000
deaths <- us.pop.projection$deaths * 1000
net.immigration <- us.pop.projection$net_immigration * 1000
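As an aside, the same conversion can be done on several columns at once, without creating separate variables. This is only a sketch of an alternative, using a three-row stand-in (demo and count.cols are made-up names). Note that percent_change is a rate, not a count, so it is left alone.

```r
# Three-row stand-in with values copied from the head() output above
demo <- data.frame(
  year = c(2015, 2016, 2017),
  US_population = c(321363, 323849, 326348),
  percent_change = c(0.77, 0.77, 0.77),
  births = c(4290, 4312, 4333)
)

# Name only the columns that are counts in units of 1,000 people
count.cols <- c("US_population", "births")

# Multiply those columns in place; year and percent_change are untouched
demo[count.cols] <- demo[count.cols] * 1000
demo
```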
Now, re-assemble the newly created variables into a single object. This can be done with the cbind() function, which binds variables together with one column per variable. (The function rbind() would bind the variables as rows. For more complicated tasks, the base R merge() function and the join functions in add-on packages such as dplyr can combine tables by matching values.)
pop.data.new <- cbind(year, pop, annual.change, natural.increase, births, deaths, net.immigration)
View the new object to be sure it’s OK.
head(pop.data.new, 5)
## year pop annual.change natural.increase births deaths
## [1,] 2015 321363000 2471000 1677000 4290000 2613000
## [2,] 2016 323849000 2486000 1669000 4312000 2643000
## [3,] 2017 326348000 2499000 1659000 4333000 2673000
## [4,] 2018 328857000 2510000 1647000 4351000 2704000
## [5,] 2019 331375000 2517000 1631000 4367000 2736000
## net.immigration
## [1,] 794000
## [2,] 817000
## [3,] 840000
## [4,] 863000
## [5,] 886000
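To make the difference between cbind() and rbind() concrete, here is a small sketch using just three years of two variables (by.column and by.row are made-up names for illustration).

```r
year <- c(2015, 2016, 2017)
pop <- c(321363000, 323849000, 326348000)

by.column <- cbind(year, pop)   # one column per variable
by.row <- rbind(year, pop)      # one row per variable

dim(by.column)                  # 3 rows, 2 columns
dim(by.row)                     # 2 rows, 3 columns
is.matrix(by.column)            # both functions return a matrix for plain vectors
```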
There is more than one way to accomplish this task (or any task) in R. You can also use the data.frame() function to assemble the variables into a single data frame, organized by columns.
pop.data.new.2 <- data.frame(year, pop, annual.change, natural.increase, births, deaths, net.immigration)
View the new dataframe.
head(pop.data.new.2, 5)
## year pop annual.change natural.increase births deaths net.immigration
## 1 2015 321363000 2471000 1677000 4290000 2613000 794000
## 2 2016 323849000 2486000 1669000 4312000 2643000 817000
## 3 2017 326348000 2499000 1659000 4333000 2673000 840000
## 4 2018 328857000 2510000 1647000 4351000 2704000 863000
## 5 2019 331375000 2517000 1631000 4367000 2736000 886000
Look at the top right panel in RStudio and note a subtle but important difference between pop.data.new and pop.data.new.2. The first is a matrix, and the second is a data frame. A matrix in R is just like a matrix in linear algebra (and a vector in R behaves much like an N x 1 matrix, although it carries no dimension attribute). A matrix in R can contain data of only one type (numeric, character, or logical (true/false), but usually numeric).
A data frame can hold variables of more than one type. In general, it is best to have your data in a data frame, though there are some functions in R that require the data to be in a matrix. Often, the difference does not matter, but you should be aware of it. There is also a function as.data.frame() that can be used to convert a matrix to a data frame, like so:
converted <- as.data.frame(pop.data.new)
The function as.data.frame() converts a matrix column by column. Because all of the values in a matrix share one type, the conversion is most straightforward when they are all numeric, as they are here.
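The one-type rule for matrices is easiest to appreciate by breaking it on purpose. In this sketch, region is a hypothetical character variable that is not part of the Census data, and the object names are made up for illustration.

```r
year <- c(2015, 2016)
region <- c("US", "US")   # hypothetical character variable, for illustration only

# A matrix must hold one type, so cbind() silently converts the years to text
m.demo <- cbind(year, region)
typeof(m.demo)            # "character"

# A data frame keeps each column's own type
df.demo <- data.frame(year, region, stringsAsFactors = FALSE)
sapply(df.demo, class)

# An all-numeric matrix converts cleanly to a data frame
num.matrix <- cbind(year, pop = c(321363000, 323849000))
is.data.frame(as.data.frame(num.matrix))
```

If you ever find that arithmetic on a "numeric" column fails, checking typeof() for this kind of silent conversion is a good first step.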
The US Census Bureau uses age-structured population models (Leslie Matrix models, which you will learn in detail later in the course) to make projections of future population trends with defined assumptions. For this dataset the projections go out to 2060. The further into the future a projection is made, the less certainty we have about the predictions. Suppose we want to restrict our predictions to only 25 years beyond the present. To do this, we can subset the data. Two common ways to subset data are:
Using row and column index values for a vector or matrix
Using the subset() function
Here’s an example of the index method - selecting the first 25 years of the variables for year and population size.
So the code below selects the first 25 rows of the first column of pop.data.new and assigns it to a variable years, and the first 25 rows of the second column of pop.data.new and assigns it to a variable psizes.
years <- pop.data.new[1:25, 1]
psizes <- pop.data.new[1:25, 2]
#view the new variables
years
## [1] 2015 2016 2017 2018 2019 2020 2021 2022 2023 2024 2025 2026 2027 2028 2029
## [16] 2030 2031 2032 2033 2034 2035 2036 2037 2038 2039
psizes
## [1] 321363000 323849000 326348000 328857000 331375000 333896000 336416000
## [8] 338930000 341436000 343929000 346407000 348867000 351304000 353718000
## [15] 356107000 358471000 360792000 363070000 365307000 367503000 369662000
## [22] 371788000 373883000 375950000 377993000
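After selecting by position, it is worth confirming that you got the rows you intended. A minimal sketch of some quick checks follows; here years is recreated directly as 2015:2039 so that the code runs on its own.

```r
years <- 2015:2039        # stand-in for the years vector selected above

length(years)             # how many values were selected
range(years)              # first and last year selected
which(years == 2039)      # position of the last year you meant to include

# Counting both endpoints: 2039 - 2015 is 24, but the range holds 25 years
2039 - 2015 + 1
```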
Here’s an example of the subset() method. subset() works well on dataframes but will also work on a matrix. For a dataframe, the first argument is the dataframe to be subsetted, the second argument identifies what subset of the original data to retain, and the third argument specifies which variables to retain.
short.term <- subset(pop.data.new.2, year < 2040, select = c(year, pop))
short.term
## year pop
## 1 2015 321363000
## 2 2016 323849000
## 3 2017 326348000
## 4 2018 328857000
## 5 2019 331375000
## 6 2020 333896000
## 7 2021 336416000
## 8 2022 338930000
## 9 2023 341436000
## 10 2024 343929000
## 11 2025 346407000
## 12 2026 348867000
## 13 2027 351304000
## 14 2028 353718000
## 15 2029 356107000
## 16 2030 358471000
## 17 2031 360792000
## 18 2032 363070000
## 19 2033 365307000
## 20 2034 367503000
## 21 2035 369662000
## 22 2036 371788000
## 23 2037 373883000
## 24 2038 375950000
## 25 2039 377993000
USING THE subset() FUNCTION IS USUALLY LESS PRONE TO SIMPLE BUT FRUSTRATING ERRORS. The index method is trickier and more error-prone. (For example, it is very easy to be one row off by forgetting that the range 1000 - 1009 has ten entries, not nine, particularly when you’re doing the indexing as part of some more complex task.) However, the index method is very flexible, because R allows logical operations in subscripts. That is, you can use subscripts to select subsets of the data that meet a certain criterion, rather than just identify subsets of the data by row and column position. This usually makes it easier to do what you intended.
The code below identifies the maximum value in column 2 of pop.data.new.2 (this column holds population sizes) and assigns it to variable x.
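To see that a logical subscript and subset() can do the same job, here is a small sketch on a six-row stand-in data frame (demo, by.subset and by.index are made-up names; the values are the first six years of the projections).

```r
demo <- data.frame(
  year = 2015:2020,
  pop = c(321363000, 323849000, 326348000, 328857000, 331375000, 333896000)
)

# subset(): the condition and the columns are named arguments
by.subset <- subset(demo, year < 2018, select = c(year, pop))

# Logical subscript: the condition goes in the row position, the columns after the comma
by.index <- demo[demo$year < 2018, c("year", "pop")]

by.subset
by.index
all.equal(by.subset, by.index)   # compare the two results
```

Notice that neither version requires you to know which row numbers hold the years you want.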
x <- max(pop.data.new.2[,2])
x
## [1] 420268000
Statements like this can be used to perform a wide range of data manipulation or selection. For example, the code below sums column 7 (net immigration) across all rows (years) and assigns it to variable y. It then does the same for just the first 10 years, and assigns the sum to variable z. z/y is the proportion of net immigration for the whole period that occurs in the first 10 years.
y <- sum(pop.data.new[, 7])
y
## [1] 51083000
z <- sum(pop.data.new[pop.data.new[, 1] < 2025, 7]) # this will take a little inspection to understand! See below.
z
## [1] 8975000
z/y
## [1] 0.1756945
In the assignment statement for variable z, we used a logical subscript. In the statement for y, sum() is applied to all rows of column 7 (net immigration), because there is no entry for the row index before the comma. In the statement for z, the row index is the logical expression pop.data.new[,1] < 2025, which is TRUE only for rows where the value in column 1 (year) is less than 2025, so sum() adds up column 7 for those rows only.
Remember that in RStudio you can put the cursor in any function and press F1 to get context-specific help about that function.
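It can help to take a logical subscript apart and look at each piece. The sketch below does this on a six-row stand-in matrix (m and keep are made-up names; the values are the first six years of net immigration from the data).

```r
year <- 2015:2020
net.immigration <- c(794000, 817000, 840000, 863000, 886000, 909000)
m <- cbind(year, net.immigration)

# Step 1: the comparison returns one TRUE or FALSE per row
keep <- m[, 1] < 2018
keep

# Step 2: TRUE counts as 1, so sum() of a logical vector counts the matching rows
sum(keep)

# Step 3: used as a row index, the logical vector keeps only the TRUE rows
m[keep, 2]
sum(m[keep, 2])                    # 794000 + 817000 + 840000 = 2451000

# Proportion of the six-year total that falls in the selected rows
sum(m[keep, 2]) / sum(m[, 2])
```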
R has the following logical operators, which can be used in subscripts or in other ways (notably in if() statements, which you will see later).
R logical operators:
> for "greater than"
>= for "greater than or equal to"
< for "less than"
<= for "less than or equal to"
== for "equal to" (this one is a common source of confusion; it is not the same as =, which assigns a value, much like <-)
!= for "not equal to"
& for "and"
| for "or"
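Here is a short sketch of these operators in action on a small vector of years. Each comparison returns one TRUE or FALSE per element, and the result can be used directly as a subscript.

```r
year <- 2015:2020

year == 2017                         # TRUE only for the element equal to 2017
year != 2017                         # the reverse
year >= 2016 & year <= 2018          # "and": both conditions must hold
year < 2016 | year > 2019            # "or": either condition may hold

# Used as subscripts, the logical vectors select the matching elements
year[year >= 2016 & year <= 2018]
year[year < 2016 | year > 2019]
```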
You can always gain intuition about a data set by plotting it. Plotting it in multiple ways will increase your understanding. The base R installation has a plot function, though you will mainly be using the ggplot2 package for this.
#attach the new data frame to save on typing.
#recall that you can use F1 to learn about the attach() function
attach(pop.data.new.2)
## The following objects are masked _by_ .GlobalEnv:
##
## annual.change, births, deaths, natural.increase, net.immigration,
## pop, year
plot(x = year, y = pop, xlab = 'Year', ylab = 'Population Size', main = 'US Census Projections', col = 'blue', pch = 19)
Over a period of 45 years, the projected pattern of growth is not easily distinguished from linear, but you’ll explore this more carefully in the homework below.
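One simple way to probe for departures from linearity is to calculate a new variable: the change in population from each year to the next. Under strictly linear growth, this change would be the same every year. The sketch below is only one possible approach. It uses the first 25 years of population sizes printed earlier, and annual.growth is a made-up name for illustration.

```r
years <- 2015:2039
psizes <- c(321363000, 323849000, 326348000, 328857000, 331375000,
            333896000, 336416000, 338930000, 341436000, 343929000,
            346407000, 348867000, 351304000, 353718000, 356107000,
            358471000, 360792000, 363070000, 365307000, 367503000,
            369662000, 371788000, 373883000, 375950000, 377993000)

# diff() returns the difference between each value and the one before it,
# so the result has one fewer element than psizes
annual.growth <- diff(psizes)
length(annual.growth)

annual.growth[1]                       # growth from 2015 to 2016
annual.growth[length(annual.growth)]   # growth from 2038 to 2039
```

Plotting annual.growth against years[-1] (the years with the first one dropped) is one way to see whether the yearly increments hold steady or drift.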
Read RFDS 1, 2, 3.1-3.6 to get started with using the ggplot2 package to make plots.
Make a series of plots to better understand projected US human population growth. Some of these will require you to calculate a new variable before plotting it, using the simple data manipulation methods from this exercise.
Write a one paragraph summary of what these plots reveal about the pattern of growth predicted for the US for this period, and why the growth rate is expected to change. Birth rates? Death rates? Both?