# ---------------------------------------------------------------
# this is my program to say "hello world"
#
# hiworld.r
#
# january 12, 2025
# ---------------------------------------------------------------
Tutorial 1: Welcome to R
Welcome! This tutorial assumes that you’ve already done everything in Tutorial 0, “Get R Ready.” If this is not the case, do that first.
Today we are working on basic data management commands and familiarity with the R interface. Sadly, no graphs today. Today’s tutorial shows you how much more powerful statistical software is than Excel, and prepares you to make graphics next class.
A. Open RStudio
RStudio is an interface for using R.
Open up RStudio.
What you need to know about what you see
- Console : This is the window where you input R commands. You can input them one by one, or you can write an R program, which is what we will do.
- Terminal : This window is a terminal window. On a Mac, it is a Unix shell; on a PC, it is a Windows command-line interface. I don’t anticipate using this window, but you could use it to find the location of files and execute system commands.
- Environment : Ignore for now.
- History : This reports a history of your R commands. Hopefully you won’t need to refer to this window since you won’t be programming interactively.
- Connections : Ignore for now.
- Files : Ignore for now.
- Plots : Your plots will appear here. But nothing this class!
- Packages : This lists packages you have installed (they sit on your hard drive). This is probably empty, but will fill up as the class goes on.
- Help : Help for commands. Alternatively, you can type
help("command")
at the console prompt. However, I usually just google.
- Viewer : This is for seeing your data as if it were an Excel table. Using this for help when you first get started is fine. Using it as your only method of understanding the structure of your data is unwise. As your data get large, this tool works more and more poorly: you can’t see all your data, and what you see becomes less representative.
B. Hello world program
We now write a very first program. This is the same program people learn in almost any language: make the computer print the words “hello world” to the screen.
B.1. Why a program?
There are two ways to tell R how to do things. One is to program “interactively.” That means typing commands at the \(>\) prompt in the console and pressing return.
The second way to tell R how to do things is to write a program. A program is a text file that has all the commands you want R to execute, in logical order. You should always write R code (or any code) in a program. Once you’ve written the program, or part of the program, you run it, and R does all the steps you outline in the program.
There are at least two key reasons to write a program rather than to code interactively. First, if you don’t write down all the commands you’ve executed, you will lose track of them and not be sure what you did. Then, when you go to fix problems, it will be impossible for you to figure out the steps that you took to create the problem or data. Second, writing a program makes the logical order of steps very clear, and allows you to replicate the work you have already done.
Writing code – what you’ll learn to do in this class – is a major advantage over using Excel or similar programs. All your steps to your final output are clear for you or others to follow. When you program interactively, you lose this advantage.
B.2. Write an R program
Now let’s write an R program. Open RStudio, and choose File \(\rightarrow\) New File \(\rightarrow\) R Script.
You should see a new window open up. Your first job is to write two things you should have at the top of every program.
The first of these is information about what this program is, what it does and who created it. The lines are called “comments,” since they are not directions to R. Instead they are notes for the programmer. In this program, it may seem silly to write this, since your program will start by doing so little. However, it is good practice, it will pay off when you have many programs or many coders, and it is a very good habit to get into.
To write these comments, use a “#” sign in front of each line. This tells R that everything after the “#” is a comment for you and not code to evaluate.
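My comments look like this:
# ---------------------------------------------------------------
# this is my program to say "hello world"
#
# hiworld.r
#
# january 12, 2025
# ---------------------------------------------------------------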
After your comments, write a strange line of code you should put at the beginning of every program you write. This line of code gets rid of all data in R’s memory. It makes sure that you always start with a clean slate and do not use any data not directly created by this program. Right now it does nothing, since there are no pre-existing data. However, it’s good practice to always have it, so we begin this way.
###### A. remove prior content
# code to remove all objects so you start with a clean slate
rm(list = ls())
Now write the R command
###### B. say hello to world
print("Hello World!")
Save this file (all the code lines above) with a name you’ll remember in a location you’ll remember. Mine is saved as H:/pppa_data_viz/2018/assignments/lecture01_helloworld.R. (If you type the file name without an extension, RStudio will automatically add the .R extension.) This extension, .R, means it is an R program. Without the proper extension, the program will not work.
B.3. Run an R program
Now we want to run this program.
There are two ways to run this file (also known as a script):
- While in the editor window, go to the Code menu \(\rightarrow\) Run Region \(\rightarrow\) Run All (or click the “run” button at the top of the window).
- At the prompt in the console window, call source() with the full path of your file and use the option “echo=TRUE” so that R will show all the individual commands that you’re using.
source('H:/pppa_data_viz/2018/assignments/lecture01_helloworld.R', echo=TRUE)
Either way that you run the program, you should see something like the below in the Console window:
[1] "Hello World!"
That’s success! You’ve written a program (lecture01_helloworld.R) and you’ve run it. Now you’re ready to work on a more complicated program with real data.
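If you want to see source() at work without depending on my folder layout, here is a small sketch. The object name script.path and the use of a temporary file are my own choices for illustration; your real program should live in a folder you choose, as described above.

```r
# Write a one-line R program to a temporary file
# (script.path is a hypothetical name used only for this sketch)
script.path <- tempfile(fileext = ".R")
writeLines('print("Hello World!")', script.path)

# Run the whole file; echo = TRUE also shows each command as it runs
source(script.path, echo = TRUE)
```

Running a file this way is the same thing that happens when you source your own saved .R file by its full path.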
C. Loading Data
In this next step, we learn to bring data from an outside source, such as an Excel sheet or a comma-separated values file, into R so that we can use programming to analyze the data. In short, we take external data and use it to create an R dataframe.
A dataframe is the basic data structure we’ll use most of the time in this course. Think of it like an Excel table. A dataframe is a rectangle with rows and columns. We will refer to portions of the dataframe by row and column values.
We’ll do this task in a new R program. Close your prior program (File \(\rightarrow\) Close). Start a new R program (choose File \(\rightarrow\) New File \(\rightarrow\) R Script) that you’ll use for this class and for your homework. For the homework, you must turn in annotated code (that’s an R script with comments) that includes the class commands and the homework.
As we did for the hello world program, remember to start your .R file with
- comments that say what the file is, who created it and when
- code that erases all previously created objects (these are pre-existing data, for our purposes). As a reminder, the top of the file looks like this:
#------------------------------------------------------------
#
# this is code for tutorial 1
#
# january 12, 2025
#
# tutorial01_v01.R
#
#------------------------------------------------------------
# code to remove all objects so you start with a clean slate
rm(list = ls())
Use one .R file for the rest of this tutorial (and make one for each subsequent tutorial). I encourage you to do a few steps, run the program, and see how it goes. If there are errors, fix them before moving forward. Once you have no errors and understand the output, then add new commands to the .R file.
Re-run the whole file each time you add commands. Because most of the files we use (particularly at the beginning of the course) are small, re-running takes very little time and ensures that all your steps work in the order you wrote them.
Remember, do not program interactively! It is very hard to get help on your code if you program interactively.
We’ll begin by loading a dataset where each observation is a county in the Washington metropolitan area (or, in Virginia, an independent city) in a census year from 1910 to 2010. See the Washington metropolitan area here. These data come from the Decennial Census, and you can find source info in the data appendix of this paper of mine. The data have one row per jurisdiction and year.
In R, we usually read .csv (comma separated values) files, and I have prepared files in this format for today.
Download the data for this class from here. Save this file in a location that you will remember, and for which you know the path (the path is an ordered list of folders). For example, I save the file into the path h:/pppa_data_viz/2019/tutorial_data/. You need to know the full path of the directory where you saved the data, and the name under which you saved it.
We begin by loading the file called was_msas_1910_2010_20190107.csv. If you’re curious what a .csv file looks like, open it in Excel.
Here’s a simple example of a .csv file:
var1, var2, var3
"a", "b", 2
"d", "e", 2
"g", "h", 5
The first row in this file has the variable names: var1, var2, and var3. Each subsequent row in this example is an observation, which means one unit, perhaps a person or a state. Each row reports the value of each of the three variables (var1, var2, and var3) for that observation.
For example, if the observations were states, variables could be population, housing units and median income. Each row would be a state, and population, housing units and median income would be the columns.
In a comma-separated values dataset, the values of all variables (var1, var2, and var3) are separated by commas.
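To see how R turns comma-separated text into rows and columns, here is a minimal sketch that reads the tiny example above without needing a file on disk. The object names csv.text and toy are hypothetical, and passing the contents through the text argument is just a convenience for this illustration; with real data you give read.csv the path to your file.

```r
# The three-row example, typed inline instead of saved as a file
# (csv.text and toy are made-up names for this sketch)
csv.text <- 'var1,var2,var3
"a","b",2
"d","e",2
"g","h",5'

# read.csv() treats the first line as variable names
# and every later line as one observation
toy <- read.csv(text = csv.text)

print(toy)
```

The printed result has three rows (observations) and three columns (variables), matching the layout of the text.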
To load the file, we use the read.csv command. (There is another command we are not currently using called read_csv. It behaves differently, and you should be careful to note which command tutorials use.) The input to this command is the location of the file, and you create (with the assignment operator <-) a new dataframe, which I am calling was.counties, but you can call it whatever you’d like.
Make a comment that you are loading .csv data, and bring it in. I write
# load csv data
was.counties <- read.csv("h:/pppa_data_viz/2019/tutorial_data/was_msas_1910_2010_20190107.csv")
Note that R uses a forward slash to separate directories in a path, even though Windows usually uses a backslash (Macs also use a forward slash).
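If you would rather not type slashes by hand, one option is to let R assemble the path. This is only a sketch: data.dir and csv.path are hypothetical names, and the folder is my own, so substitute the folder where you saved the file.

```r
# Folder where the data live (my folder; yours will differ)
data.dir <- "h:/pppa_data_viz/2019/tutorial_data"

# file.path() joins the pieces with forward slashes
csv.path <- file.path(data.dir, "was_msas_1910_2010_20190107.csv")
csv.path

# Check that the file is really there before trying to read it
file.exists(csv.path)
```

If file.exists() reports FALSE, the path or the file name is wrong, and read.csv will fail for the same reason.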
This command creates a dataframe (R’s version of a dataset) that contains the input csv file. This dataframe sits in R’s memory as long as you have the RStudio session open. When you close the session, the dataframe disappears unless you save it.
The read.csv code is a specific example of general R syntax. The assignment operator <- assigns the thing on the right (here, the contents of the csv file) to the object on the left (here, the dataframe).
D. What’s in these data?
The rest of this class teaches you techniques for looking at and summarizing data. You might think, “why can’t I just do this in Excel?” Some of what you’re learning today may seem easier in Excel… until one of three things happens.
- You get a large dataset that you can’t use easily (or at all) in Excel. I don’t teach techniques that rely on using RStudio’s internal data viewer, since these techniques don’t work for large data. All that said, we begin by practicing with a small dataset so you can see all the data and better understand what’s going on.
- You need replicable steps. When you write the steps in your code, you know exactly what you did (sadly, not always why, unless you do a good job with comments). When you know the steps you take, and you find an error, you can easily alter the steps. When you have only the output of the steps, it is substantially more difficult to make alterations. All of this is very hard to do in Excel.
- You want to do the same thing for many groups. For example, if you’d like to know the average population by decade, you can take eleven averages in Excel, or write two lines of code in R.
Now that you’ve loaded these data, let’s start by looking at the two key elements of what defines a dataframe. How many rows and columns does this dataframe have? And what variables (columns) does it have?
To answer how “big” these data are, that is, how many rows and columns they have, we use R’s dim() command. The dim() command reports the dimensions of a dataframe:
# how big is it?
print("this is how many rows x columns the dataset has")
[1] "this is how many rows x columns the dataset has"
dim(was.counties)
[1] 246 5
The output in the console window reports the number of rows, followed by the number of columns (variables). This dataframe has 246 rows and 5 columns (or variables).
Rows and columns are the building blocks of dataframes. Checking whether the number of rows and columns you have is reasonable is always a good place to begin. There are about 20 jurisdictions in the Washington area, and the dataset has 11 years. This would mean about 20*11 rows, or 220 rows. Therefore, 246 rows seems roughly reasonable.
Next, let’s explore which variables are in this dataframe. We query the variable names with the names() command:
# what variables does it have?
print("these are the names of the variables")
[1] "these are the names of the variables"
names(was.counties)
[1] "statefips" "countyfips" "cv1" "year" "cv28"
We learn that this dataframe has five variables. States and counties are identified by FIPS (Federal Information Processing Standards) codes, which you can find at this website (and many others). State FIPS codes are always two digits, and county codes are always three digits.
Each observation in the was.counties dataframe has a state FIPS and a county FIPS code. Beware that county FIPS codes are not unique across states. In other words, two states may each have a county 003. To uniquely identify a county, you need both the state and county FIPS codes.
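One way to see why you need both codes is to glue them into a single five-digit identifier. The sketch below uses a made-up three-row dataframe (toy) and a hypothetical new column name (fips5); it is an illustration, not a step in today’s program.

```r
# Made-up rows: county 1 appears in two different states
toy <- data.frame(statefips  = c(11L, 24L, 51L),
                  countyfips = c(1L, 1L, 13L))

# Pad the state code to 2 digits and the county code to 3 digits,
# then join them into one 5-character string
toy$fips5 <- sprintf("%02d%03d", toy$statefips, toy$countyfips)

toy$fips5

# countyfips alone repeats; the combined code does not
anyDuplicated(toy$countyfips) > 0
anyDuplicated(toy$fips5) > 0
```

Here county 1 shows up twice, but the combined codes are all different, which is exactly the property a unique identifier needs.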
The remaining undefined variables here are cv1, which is population, and cv28, which is the number of housing units.
Apart from the name of the variable, it is also helpful to know whether a variable is numeric (numbers only), or a string (there are many kinds of strings in R, and we’ll hold on discussing this till a future class). You can do mathematical operations with numeric variables, but not with strings.
Use str() (for structure) to find the types of variables in this dataframe:
# what kinds of variables are these?
print("these are the types of variables")
[1] "these are the types of variables"
str(was.counties)
'data.frame': 246 obs. of 5 variables:
$ statefips : int 11 24 24 24 24 24 51 51 51 51 ...
$ countyfips: int 1 9 17 21 31 33 13 43 47 59 ...
$ cv1 : int 331069 10325 16386 52673 32089 36147 10231 7468 13472 20536 ...
$ year : int 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
$ cv28 : int NA NA NA NA NA NA NA NA NA NA ...
This output reports that this is a dataframe with 246 observations and 5 variables. R then lists the variables. For each variable, R reports the variable type (for example, chr for strings, int for integers, or num for other numbers) and prints the values of this variable for the first ten or so observations.
In this particular dataframe, all variables are of type “int” for integer. This means they are all numeric with no decimals. There are no data for housing units (cv28) for the observations listed. The value “NA” is R’s way of stating a missing value.
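Because str() shows only the first few values, it cannot tell you how many values are missing overall. A quick way to count them, sketched here on a made-up dataframe (toy is a hypothetical name), is to combine is.na() with colSums():

```r
# Made-up data with some missing values
toy <- data.frame(cv1  = c(331069L, 10325L, NA),
                  cv28 = c(NA, NA, 4000L))

# is.na() marks each missing cell as TRUE;
# colSums() then counts the TRUEs in each column
colSums(is.na(toy))
```

In this toy example cv1 has one missing value and cv28 has two.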
To get an overall sense of the magnitude of these variables, use summary(), which reports summary statistics on each variable:
# look at values
print("these are the values they take on")
[1] "these are the values they take on"
summary(was.counties)
statefips countyfips cv1 year
Min. :11.00 Min. : 1.0 Min. : 5199 Min. :1910
1st Qu.:24.00 1st Qu.: 31.0 1st Qu.: 13350 1st Qu.:1940
Median :51.00 Median : 61.0 Median : 24218 Median :1960
Mean :43.31 Mean :178.3 Mean : 119984 Mean :1962
3rd Qu.:51.00 3rd Qu.:179.0 3rd Qu.: 93974 3rd Qu.:1990
Max. :54.00 Max. :685.0 Max. :1081726 Max. :2010
NA's :6
cv28
Min. : 1739
1st Qu.: 5832
Median : 13438
Mean : 58393
3rd Qu.: 57275
Max. :407998
NA's :86
Here we learn that the statefips variable seems to take on only a limited set of values (11, 24, 51, and 54), which makes sense: there are four states (counting DC) in the greater Washington metropolitan area. The population variable (cv1) has six observations with no value (NA), and the housing variable (cv28) has 86 NAs.
While an average is a reasonable way to look at population, it makes less sense for a categorical variable like statefips (“categorical” means the variable takes on a number of discrete categories; there is no state 11.5, for example).
To look at the distribution of categorical variables, it is helpful to make a frequency table, which is easy in R using table(), combined with a reference to a specific variable. To refer to one specific variable, use the syntax dataframe$varname.
Combining these two concepts, we get
# look at non-numeric variables
print("for non-numeric variables")
[1] "for non-numeric variables"
table(was.counties$statefips)
11 24 51 54
11 55 169 11
We see that there are four unique values for statefips. State 11 (DC) has 11 observations (one for each decade). State 24 (Maryland) has 55 observations, or 5 for each year. State 51 (Virginia) has 169 observations; it has a very complicated institutional set-up with many small jurisdictions. State 54 is West Virginia, and has one county that appears 11 times.
Note that we could also have used summary(was.counties$cv1) to get descriptive statistics for population only.
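The table() command also accepts two variables at once, which is a handy way to check a claim like “a fixed number of jurisdictions per state in each year.” The sketch below uses made-up data (toy and counts are hypothetical names), not the class dataset.

```r
# Made-up data: one DC row and two Maryland rows in each of two years
toy <- data.frame(statefips = c(11, 24, 24, 11, 24, 24),
                  year      = c(1910, 1910, 1910, 1920, 1920, 1920))

# Rows of the table are states, columns are years,
# and each cell counts the observations in that combination
counts <- table(toy$statefips, toy$year)
counts
```

Each cell of the result is the number of rows with that state and year, so here state 24 shows 2 in both years.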
E. Dataframe Structure
To work with dataframes, you must know how to refer to rows and columns. We discuss how to do this in this section. Generally, you can refer to the rows and columns in a dataframe using dataframe.name[rows,columns]. This convention of rows, comma, columns is standard in R.
You can use this format to print rows to the screen. Here I print the first five:
# print some rows to the screen
print("first five rows")
[1] "first five rows"
was.counties[1:5,]
statefips countyfips cv1 year cv28
1 11 1 331069 1910 NA
2 24 9 10325 1910 NA
3 24 17 16386 1910 NA
4 24 21 52673 1910 NA
5 24 31 32089 1910 NA
This shows only the first five rows. You could print rows 20 to 30 by replacing 1:5 with 20:30:
# print rows 20 to 30 to the screen
print("rows 20 to 30")
[1] "rows 20 to 30"
was.counties[20:30,]
statefips countyfips cv1 year cv28
20 54 37 15889 1910 NA
21 11 1 437571 1920 NA
22 24 9 9744 1920 NA
23 24 17 17705 1920 NA
24 24 21 52541 1920 NA
25 24 31 34921 1920 NA
26 24 33 43347 1920 NA
27 51 13 16040 1920 NA
28 51 43 7165 1920 NA
29 51 47 13292 1920 NA
30 51 59 21943 1920 NA
You can also print just some columns:
# print some columns to the screen
print("first ten rows and two columns")
[1] "first ten rows and two columns"
was.counties[1:10,1:2]
statefips countyfips
1 11 1
2 24 9
3 24 17
4 24 21
5 24 31
6 24 33
7 51 13
8 51 43
9 51 47
10 51 59
This prints rows 1 to 10 and columns 1 and 2. To print all rows, with columns 1 and 2, you would write was.counties[,1:2] (but this prints all 246 rows and takes up too much space for this tutorial!).
Above, I used the column order to pick out columns. This is a terrible idea and you should never again do it. This is because column numbering is opaque and because the column ordering can and does change. Instead of column numbers, you should use the column names directly, as I do below. Below I tell R to take the columns named statefips, countyfips, and year. I use the notation c() to build a vector (a list of things) that R should take.
# print columns by name
print("first five rows and three columns")
[1] "first five rows and three columns"
was.counties[1:5,c("statefips","countyfips","year")]
statefips countyfips year
1 11 1 1910
2 24 9 1910
3 24 17 1910
4 24 21 1910
5 24 31 1910
This is particularly helpful as your data gets bigger and you want to check your work.
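To make the danger of column numbers concrete, here is a small sketch with made-up data. All object names (toy, reordered, by.position, by.name) are hypothetical. The same two commands give different answers once the columns are in a different order.

```r
# Made-up dataframe with three columns
toy <- data.frame(statefips  = c(11L, 24L),
                  countyfips = c(1L, 9L),
                  year       = c(1910L, 1910L))

# Suppose the column order changes somewhere upstream
reordered <- toy[, c("year", "statefips", "countyfips")]

# Picking by position now silently returns different columns...
by.position <- reordered[, 1:2]
names(by.position)

# ...while picking by name still returns what you asked for
by.name <- reordered[, c("statefips", "countyfips")]
names(by.name)
```

The position-based selection returns year and statefips, with no warning, while the name-based selection still returns statefips and countyfips.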
F. Subsetting
It is also frequently useful to work with a smaller dataset. (Perhaps not with this current dataset, but the principle we’ll learn here will be useful in the future.) There are many ways to do this in R. Here we review two methods. The first is “Base R,” and the second uses a package (more on this below).
F.1. Base R
Your R installation arrives with a number of built-in commands. We first review how to subset with these “Base R” commands. Base R commands can be very useful because they are always available, unlike commands from packages, which must be installed and loaded first and may not work in all circumstances. In addition, when we go to write functions, around Tutorial 8, Base R subsetting is much easier to put in a function.
The Base R subsetting relies on the same logic of limiting rows and columns as we did before. To make a dataframe without 1910, we tell R to take all rows where the variable year is not equal (!=) to 1910. We check the new dataframe (was.counties.no1910) using dim().
Note that we are creating a new dataframe called was.counties.no1910. This new dataframe does not replace the previous was.counties; R now holds both of them in memory and you can refer to them both. The ability to have multiple dataframes is an additional strength of R relative to Excel or Stata.
The code below
- writes a comment to explain what we are doing
- makes a subset of the Washington area counties data without 1910
- checks the size of the data without 1910 – it should be smaller than the data with 1910!
# make a dataframe without 1910
print("make a dataset w/o 1910")
[1] "make a dataset w/o 1910"
was.counties.no1910 <- was.counties[was.counties$year != 1910,]
dim(was.counties.no1910)
[1] 226 5
Recall that the original had 246 rows. Does this seem reasonable?
We can subset based on any variable. Below we omit Washington, DC from the dataframe, again using the not equals command:
# make a dataframe without washington dc
print("make a dataframe w/o washington dc")
[1] "make a dataframe w/o washington dc"
was.counties.no.dc <- was.counties[was.counties$statefips != "11",]
dim(was.counties.no.dc)
[1] 235 5
Subsetting is not just limited to rows. We can also subset to just some of the original columns. For example, I can tell R to not include any column where the column name is cv28: !(names(was.counties) %in% c("cv28")). The ! operator means “not.” The other part, names(was.counties) %in% c("cv28"), means “where the name of the column in was.counties is cv28.” The %in% operator means “matches any item in the following list.” So we could potentially expand the list to include more variables by doing, for example, %in% c("cv1","cv28"). We use c() to let R know that this is a set of values.
# make a dataframe without housing (cv28)
print("make a dataframe w/o housing variable")
[1] "make a dataframe w/o housing variable"
was.counties.no.cv28 <- was.counties[, !(names(was.counties) %in% c("cv28"))]
dim(was.counties.no.cv28)
[1] 246 4
Note that the number of rows is still 246, but the number of columns is now 4, rather than 5.
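Row and column conditions can be combined in one step. The sketch below is an illustration on made-up data (toy, keep.rows, keep.cols, and toy.small are hypothetical names): it keeps rows that are not from 1910 and are in state 51, and drops the cv28 column at the same time.

```r
# Made-up dataframe with the same column names as the class data
toy <- data.frame(statefips = c(11L, 24L, 51L, 51L),
                  year      = c(1910L, 1910L, 1920L, 1920L),
                  cv1       = c(100L, 200L, 300L, 400L),
                  cv28      = c(NA, NA, 10L, 20L))

# One TRUE/FALSE per row: not 1910 AND in state 51
keep.rows <- toy$year != 1910 & toy$statefips == 51

# One TRUE/FALSE per column: everything except cv28
keep.cols <- !(names(toy) %in% c("cv28"))

# Rows before the comma, columns after it
toy.small <- toy[keep.rows, keep.cols]
dim(toy.small)
```

Storing the conditions in their own objects first makes it easy to print them and check that they select what you expect.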
F.2. Using filter and select from the tidyverse package
Alternatively, you can achieve the same outcome using commands from a user-written package. A package means a set of commands written by someone to plug into R. The first time you use a new package you need to install it.
Installing packages is the one time I will tell you to do something by writing the command directly into the console and pressing return. Do this now:
install.packages("tidyverse")
There are a lot of commands in this package, and it may take a while to install.
This is so important that I repeat: Installing packages is the one time that you should not write a command in a program. If you put this command in your program, you would install the package every time you ran your program. This would be unnecessary and very slow. Bottom line: Do this once, and never put an install.packages() command in your program.
Then, having installed the package, you need to let R know that you want to use the package. When I load packages, I always put the library() code at the top of my program, after my introduction comments and the rm(list = ls()) code. Do this in your program and run the code. The output after the library() code is below.
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.5
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.5.1 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.1
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
This output tells us that tidyverse is really a collection of packages, including ggplot2, purrr, tibble, dplyr, and five others. The filter and select commands are from the dplyr package.
The “conflicts” portion of the output warns you that there are commands in the packages you just loaded that have the same name as commands in the stats package. For now, we don’t worry about this.
If you want to use a package in a program, you should put this library command at the top of your R script to make R load the package immediately when you start running the program. We will use two commands from the tidyverse set of packages (for now; it contains many other useful commands): filter and select.
We start with the filter command, which is a way of creating a subset of a dataframe based on characteristics of observations. The filter command takes two inputs. The first is the dataframe and the second is the subsetting condition: filter(.data = dataframe, *condition*). You can have more than one condition, linked by “and” (&) or “or” (|).
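Here is a minimal sketch of linking conditions with & and |. It assumes the dplyr package is installed (it comes with tidyverse) and uses a made-up dataframe; toy, va.1910, and dc.or.md are hypothetical names.

```r
library(dplyr)

# Made-up data with the same column names as the class data
toy <- data.frame(statefips = c(11L, 24L, 51L, 51L),
                  year      = c(1910L, 1910L, 1910L, 1920L))

# "and": both conditions must be TRUE for a row to be kept
va.1910 <- filter(.data = toy, statefips == 51 & year == 1910)
nrow(va.1910)

# "or": a row is kept if either condition is TRUE
dc.or.md <- filter(.data = toy, statefips == 11 | statefips == 24)
nrow(dc.or.md)
```

With these four rows, the “and” version keeps one row and the “or” version keeps two.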
This time, we keep only the counties (rows) where the year is 1910. Note that when we use these tidyverse commands, we name the dataframe at the beginning, and then don’t have to refer to variables by their full name with a $.
# keeps only rows where the condition evaluates to TRUE
print("use dplyr to make a dataset that is just 1910")
[1] "use dplyr to make a dataset that is just 1910"
was.counties.1910.d <- filter(.data = was.counties, year == 1910)
dim(was.counties.1910.d)
[1] 20 5
Comfortingly, these new data are smaller than the old data.
But let’s check to be sure that the new dataframe has only 1910. It is good programming practice to, as President Reagan said, “trust but verify.” Mistakes are, sadly, almost always your fault.
# check if the filter command does what we wanted
print("before filtering")
[1] "before filtering"
table(was.counties$year)
1910 1920 1930 1940 1950 1960 1970 1980 1990 2000 2010
20 20 20 20 24 24 22 24 24 24 24
print("after filtering")
[1] "after filtering"
table(was.counties.1910.d$year)
1910
20
This worked! The output above from table(was.counties.1910.d$year) tells us that the value 1910 appears 20 times in was.counties.1910.d. In addition, it tells us that the variable year in dataframe was.counties.1910.d takes on only the value 1910, as we wished.
We can also keep all states that are not DC, again using != for “not equal.” This command says “return all observations where the state ID number is not 11.”
# keep everything but washington DC
print("filter: keep not DC")
[1] "filter: keep not DC"
was.counties.no.dc.d <- filter(.data = was.counties, statefips != "11")
dim(was.counties.no.dc.d)
[1] 235 5
However, unlike base R’s square-bracket notation, filter does not also drop dataframe columns. Instead, we use the select command to choose columns. The select command takes the inputs select(.data = dataframe, *columns*), where dataframe is the input dataframe and columns are the columns to keep (no minus sign in front) or drop (put a minus sign in front, as below).
The select command below says “make a new dataframe called was.counties.no.cv28.d that has all columns from was.counties except cv28.”
# make a dataframe w/o housing variable
# try select
print("can't use filter to drop columns. use select")
[1] "can't use filter to drop columns. use select"
was.counties.no.cv28.d <- select(.data = was.counties, -c("cv28"))
names(was.counties.no.cv28.d)
[1] "statefips" "countyfips" "cv1" "year"
head(was.counties.no.cv28.d)
statefips countyfips cv1 year
1 11 1 331069 1910
2 24 9 10325 1910
3 24 17 16386 1910
4 24 21 52673 1910
5 24 31 32089 1910
6 24 33 36147 1910
We check the output with head() and names(), both of which show that the newly created dataframe does not have the cv28 variable.
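The two uses of select, keeping and dropping, can be compared side by side. This is a sketch on made-up data, assuming dplyr is installed; toy, toy.keep, and toy.drop are hypothetical names.

```r
library(dplyr)

# Made-up data with the same column names as the class data
toy <- data.frame(statefips  = c(11L, 24L),
                  countyfips = c(1L, 9L),
                  cv1        = c(331069L, 10325L),
                  cv28       = c(NA, NA))

# No minus sign: keep only the columns you list
toy.keep <- select(.data = toy, statefips, cv1)
names(toy.keep)

# Minus sign: keep everything except the column you list
toy.drop <- select(.data = toy, -cv28)
names(toy.drop)
```

Listing columns to keep is safer when the input may gain new columns later; dropping is shorter when you only want to remove one or two.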
G. Create new variables and dataframes
In this step, we create a new variable and make new dataframes of summary statistics.
G.1. Create a new variable
Creating a new variable in R is reasonably straightforward. Use the <- operator to name the output, and use the dataframe$var notation to refer to variables.
For example, suppose we’d like to know the average number of people per housing unit in each jurisdiction. To do this, we need to divide population (cv1) by housing (cv28). We do this as below:
# people per housing unit
print("make new variable with people per housing unit in each county")
[1] "make new variable with people per housing unit in each county"
was.counties$ppl.per.hu <- was.counties$cv1 / was.counties$cv28
summary(was.counties$ppl.per.hu)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.909 2.566 2.762 2.852 3.272 3.975 86
We created the number of people per housing unit, and then we used summary() to check the values. Do these values seem reasonable? I anticipate that they should be between 1 and 5. Are they?
To program successfully, you must build in checks like these. Things can and do go wrong. Without checks you don’t know when they do.
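A check can also be written into the program so that R complains when a value falls outside the range you expect. The sketch below uses made-up ratios and hypothetical names (ppl.per.hu as a standalone vector, in.range); the bounds 1 and 5 are the range I said I anticipated, not a rule.

```r
# Made-up people-per-housing-unit values, including one missing value
ppl.per.hu <- c(1.9, 2.8, NA, 3.9)

# TRUE where the value is inside the expected range, NA where missing
in.range <- ppl.per.hu >= 1 & ppl.per.hu <= 5

# Are all the non-missing values in range?
all(in.range, na.rm = TRUE)

# stopifnot() halts the program with an error if the check fails
stopifnot(all(in.range, na.rm = TRUE))
```

If a value were outside the range, the last line would stop the program at that point, which is much easier to notice than a suspicious number buried in output.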
G.2. Summarizing data to screen
The most familiar summary statistic is the mean. R can easily calculate the mean of a variable with mean(). If there are any missing values in a variable and you don’t tell R to omit these missing value observations, R will report a missing value for the mean. That is why the first mean command below reports NA, while the second gives a value. The na.rm = TRUE argument means “yes, omit missing values in calculations.”
# brute force
print("find a mean")
[1] "find a mean"
mean(was.counties$cv1)
[1] NA
mean(was.counties$cv1, na.rm = TRUE)
[1] 119983.7
Does this value seem reasonable for the mean jurisdiction population over the entire period?
It is also sometimes helpful to put the mean into its own object (maybe you’ll want to put it in a title or some such). You can do this as we do below, creating newobject.
# putting value into its own object
print("put mean value into a object")
[1] "put mean value into a object"
newobject <- mean(was.counties$cv1, na.rm = TRUE)
newobject
[1] 119983.7
This newobject is not a dataframe; it is a numeric vector of length one (a single number).
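You can ask R directly what kind of object mean() returns. The sketch below uses made-up numbers rather than the class data, and label is a hypothetical name for a piece of text you might later put in a title.

```r
# Mean of some made-up values, ignoring the missing one
newobject <- mean(c(10, 20, NA), na.rm = TRUE)

class(newobject)          # what kind of object is it?
length(newobject)         # how many values does it hold?
is.data.frame(newobject)  # is it a dataframe?

# A single number is easy to paste into text, such as a title
label <- paste("Mean population:", round(newobject))
label
```

Checking class() and length() is a quick habit whenever you are not sure what a command has handed back to you.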
There are many different statistical functions in R, so many that you’re better off googling “statistical functions in R” to find them. Most of them work just like the mean() function that we just explored.
G.3. Summarizing variables to a dataframe
We now move on to creating dataframes of summary values. Suppose we’d like to know the mean population across all jurisdictions in each year. We can do this using the commands group_by and summarize, which are part of tidyverse.
As your program should already have library(tidyverse) at the top, you don’t need to do anything else to load this package now.
This section of the tutorial is the beginning of understanding what statistical software (like R) can do that Excel cannot. We are going to take data at one level – the county-year level – and create a dataframe at a different level – the year level.
To do this, we first need to understand what “groups” we want in the final output. For a year-level dataset, we want to make groups (from the county-year data) by year.
Tell R what your groups are using the group_by command. The group_by function takes inputs group_by(.data = dataframe, [variables to group by, separated by commas]).
Only after you’ve told R how your data are grouped do you ask R to calculate summary statistics at the level of that group. R calculates grouped summary statistics with the tidyverse command summarize. Note that the dataframe we use in the summarize command below is the grouped dataframe (was.counties.grp.yr) that we create in the line before it.
This is the kind of “summarizing” of data that I am asking you to do in the policy brief. In the example below, we go from data at the county-year level to data at the year level.
In the command below we create a new dataframe, was.by.year, with the variable cv1.yr, which is the mean population for all counties in each year. Note that the group_by command doesn’t change the look of your data; it only changes how subsequent commands treat the data.
# summarize by year
print("find average by year")
[1] "find average by year"
was.counties.grp.yr <- group_by(.data = was.counties, year)
was.by.year <- summarize(.data = was.counties.grp.yr, cv1.yr=mean(cv1))
was.by.year
# A tibble: 11 × 2
year cv1.yr
<int> <dbl>
1 1910 32892.
2 1920 39282.
3 1930 44222.
4 1940 59891.
5 1950 NA
6 1960 NA
7 1970 143834.
8 1980 142777
9 1990 173222.
10 2000 201560.
11 2010 234843
This is a good point to note that all commands in this package can be used with American spelling ("summarize") or British spelling ("summarise"). I will use the American spelling throughout the course, but you may find the British spelling when you look for help online.
Looking at this new dataframe, we have one observation per year (correct!), with a population mean that is increasing over time (seems reasonable). Unfortunately, we have some missing values. We can correct this by using the same na.rm=TRUE as above:
# summarize by year w/o missings
print("find average by year w/o missing values")
[1] "find average by year w/o missing values"
was.by.year <- summarize(.data = was.counties.grp.yr, cv1.yr=mean(cv1, na.rm = TRUE))
was.by.year
# A tibble: 11 × 2
year cv1.yr
<int> <dbl>
1 1910 32892.
2 1920 39282.
3 1930 44222.
4 1940 59891.
5 1950 81966.
6 1960 110812.
7 1970 143834.
8 1980 142777
9 1990 173222.
10 2000 201560.
11 2010 234843
An improvement!
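Here is a small sketch of what na.rm = TRUE is doing inside each group, using a made-up dataframe (toy.counties and its values are hypothetical). It loads dplyr, which is part of tidyverse. Keep in mind that a group mean computed this way covers only the rows with data, so in the toy example the 1920 mean rests on a single county.

```r
library(dplyr)  # part of tidyverse; provides group_by() and summarize()

# made-up county-year data with one missing population value
toy.counties <- data.frame(
  county = c("A", "B", "A", "B"),
  year   = c(1910, 1910, 1920, 1920),
  pop    = c(10, 20, NA, 50)
)
toy.grp <- group_by(.data = toy.counties, year)

# without na.rm, any year containing a missing value gets a missing mean
toy.with.na <- summarize(.data = toy.grp, pop.yr = mean(pop))
toy.with.na

# with na.rm = TRUE, the mean uses only the non-missing rows in each year
toy.no.na <- summarize(.data = toy.grp, pop.yr = mean(pop, na.rm = TRUE))
toy.no.na
```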
This set-up is flexible. We can calculate not just the mean population, but also the total population by adding an additional function (sum()).
# summarize two variables by year
print("find two things by year")
[1] "find two things by year"
was.by.year <- summarize(.data = was.counties.grp.yr,cv1.yr=mean(cv1, na.rm = TRUE),
                         cv.yr.total = sum(cv1, na.rm = TRUE))
was.by.year
# A tibble: 11 × 3
year cv1.yr cv.yr.total
<int> <dbl> <int>
1 1910 32892. 657845
2 1920 39282. 785643
3 1930 44222. 884441
4 1940 59891. 1197826
5 1950 81966. 1721291
6 1960 110812. 2327056
7 1970 143834. 3164346
8 1980 142777 3426648
9 1990 173222. 4157327
10 2000 201560. 4837428
11 2010 234843 5636232
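The same pattern extends to as many summaries as you like: each one is a new name = function(variable) pair, separated by commas. The sketch below uses a made-up dataframe (toy.counties is hypothetical) and adds max() and n(), where n() is the dplyr function that counts the rows in each group.

```r
library(dplyr)  # part of tidyverse; provides group_by(), summarize(), n()

# made-up county-year data for illustration
toy.counties <- data.frame(
  county = c("A", "B", "C", "A", "B", "C"),
  year   = c(1910, 1910, 1910, 1920, 1920, 1920),
  pop    = c(10, 20, 30, 30, 50, 40)
)
toy.grp <- group_by(.data = toy.counties, year)

# four summaries at once, one output column per summary
toy.by.year <- summarize(.data = toy.grp,
                         pop.mean  = mean(pop, na.rm = TRUE),
                         pop.total = sum(pop, na.rm = TRUE),
                         pop.max   = max(pop, na.rm = TRUE),
                         n.rows    = n())
toy.by.year
```

Because the toy data have no missing values, pop.total divided by n.rows equals pop.mean in each year, which is a quick way to check that the summaries agree.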
Additionally, we can add a second variable to `group_by`. Instead of summaries by year, we can report data by state and year. Notice how we first define a new "grouped" dataframe.
# summarize by state and year
print("find info by state and year")
[1] "find info by state and year"
was.counties.grp.st <- group_by(.data = was.counties,year,statefips)
was.by.state.yr <- summarize(.data = was.counties.grp.st, cv.st.total = sum(cv1, na.rm = TRUE))
`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.
was.by.state.yr
# A tibble: 44 × 3
# Groups: year [11]
year statefips cv.st.total
<int> <int> <int>
1 1910 11 331069
2 1910 24 147620
3 1910 51 163267
4 1910 54 15889
5 1920 11 437571
6 1920 24 158258
7 1920 51 174085
8 1920 54 15729
9 1930 11 486869
10 1930 24 189435
# ℹ 34 more rows
Now, rather than 11 observations, we have 44. Does that make sense?
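The message about `.groups` that R printed after the summarize command is informational, not an error. When you group by two variables, summarize by default keeps the result grouped by the first one (here, year). The sketch below, on a made-up dataframe (toy.counties is hypothetical), shows the default and the .groups = "drop" option, which returns an ungrouped dataframe and silences the message. The summary values are the same either way.

```r
library(dplyr)  # part of tidyverse; provides group_by() and summarize()

# made-up county-year data with a state identifier
toy.counties <- data.frame(
  statefips = c(11, 24, 24, 11, 24, 24),
  year      = c(1910, 1910, 1910, 1920, 1920, 1920),
  pop       = c(10, 20, 30, 40, 50, 60)
)
toy.grp.st <- group_by(.data = toy.counties, year, statefips)

# default: prints the message, and the result is still grouped by year
toy.default <- summarize(.data = toy.grp.st,
                         pop.total = sum(pop, na.rm = TRUE))
group_vars(toy.default)

# .groups = "drop": no message, and the result is not grouped at all
toy.drop <- summarize(.data = toy.grp.st,
                      pop.total = sum(pop, na.rm = TRUE),
                      .groups = "drop")
group_vars(toy.drop)
```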
H. Using ChatGPT and the like for help
You are very welcome to seek help from AI with coding for this class. I have found AI very helpful in finding stupid coding errors I make. I have less experience asking it to write code directly.
If you do use AI to write code, it remains your responsibility to understand what the code is doing. If you get answers you don’t understand, you can ask the AI and you can also write in to Piazza for more human answers.
I. Problem Set 1
Now you are ready to work on your own.
I.1. What to turn in
For this and all subsequent problem sets you should turn in one pdf that includes
- Written output that directly answers the questions (not the code that finds the numbers). You can create this in a word doc, or any other output of your choice.
- R script for the tutorial and the questions below (the .R program file)
- R output (from console window; ok to paste into a separate file)
Next tutorial, I’ll show how to make good-looking output using quarto. For this first tutorial, just copy the output from the console window.
I.2. Where to turn in
All your work in this class will go into the Box folder that I link to on Blackboard. Inside this folder, you should create a new folder called lastname_firstname, inviting me (and only me) to see the folder.
Make a subfolder called “tutorials” and turn in all your tutorial work to this folder. Name your assignments [last name]_PS1.pdf.
I.3. Questions
You are welcome and encouraged to work with others on the homework. However, each of you must turn in your own homework, in your own words. All duplicate versions of a homework receive a grade of zero.
Your final submission needs R code that gets to the answers for the questions below. Please use comments to label the code for each question.
1. Why do we use table(was.counties$statefips) and summary(was.counties$cv1), and not summary(was.counties$statefips) and table(was.counties$cv1)?
2. Why does the first summary in part G.3. yield 11 observations, but the second 44?
3. Find and report the average population in Washington, DC (the city only) for the entire period 1910-2010.
4. Find the average population of Washington area counties by state for all years (so that the final output has one observation per state). Put a table with this information in your final output. Describe the results in a sentence or two.
5. For each of the four states in the Washington area, assess whether that state has more or fewer jurisdictions in 2010 than in 1910. (Hint: sum(!is.na(variable.name)) tells you the total number of non-missing observations. Observations with missing population do not exist in the year the data are missing.)
6. What is the most populous jurisdiction in the DC area in 2010? (Hint: there is a max() function that works similarly to mean().)
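If the sum(!is.na(variable.name)) hint looks cryptic, here is a small sketch of how it works, using a made-up vector (toy.pop is hypothetical, not the class data). is.na() returns TRUE for each missing value, ! flips TRUE and FALSE, and sum() counts the TRUEs.

```r
# toy.pop is a made-up vector for illustration
toy.pop <- c(100, NA, 300, NA, 500)

# TRUE where the value is missing
is.na(toy.pop)

# TRUE where the value is present
!is.na(toy.pop)

# sum() treats TRUE as 1 and FALSE as 0, so this counts non-missing values
n.present <- sum(!is.na(toy.pop))
n.present

# the same idea counts the missing values
n.missing <- sum(is.na(toy.pop))
n.missing
```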
You may find it helpful to refer to this cheat sheet for this and future classes.