Today’s tutorial focuses on line charts. Line charts, as you should know from today’s lecture, are for showing change over time.
We also discuss how to add text and line segments onto a plot using the annotate() command, and we discuss ggplot theme options. You learn how to make data long from wide and vice-versa.
We review summarizing and the use of factor variables. We also give an example of data cleaning to prepare a file to load.
Let’s begin by loading packages. The only addition from what we’ve used before is scales, which helps put commas into numbers so they are legible, among other things.
library(tidyverse)
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──
## ✔ ggplot2 3.3.6 ✔ purrr 0.3.4
## ✔ tibble 3.1.7 ✔ dplyr 1.0.9
## ✔ tidyr 1.2.0 ✔ stringr 1.4.0
## ✔ readr 2.1.2 ✔ forcats 0.5.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
library(scales) # for making numbers with commas
##
## Attaching package: 'scales'
## The following object is masked from 'package:purrr':
##
## discard
## The following object is masked from 'package:readr':
##
## col_factor
##
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
##
## group_rows
We are going to begin with a “simple” line graph. Below I give the example of a “simple” line graph that took quite a bit of work to look so simple.
Here are the sequential slides that I used in my presentation of this graph. I want the audience to know when DC’s population declined, when it rebounded, and to have a sense of the magnitude of the decline.
Now we’ll go through some code to build up to the final chart.
Begin by downloading data from here. These are county-level data on population 1910 to 2010 (among other variables). I created these data for a research project from Decennial Census data.
Use read.csv to grab these data as we’ve done before.
# load data
counties <- read.csv("h:/pppa_data_viz/2019/tutorial_data/lecture08/counties_1910to2010_20180116.csv")
Now just limit the data to DC. You could do this in the ggplot call itself. However, since we only plan to use DC, subsetting first gives us a smaller dataset to work with, which speeds processing. It also makes the coding easier, since we won’t have to subset in each graph.
Take a look at the data after we subset to DC. Does it have the right number of observations?
# get just dc
dct <- counties[which(counties$statefips == 11),]
dim(dct)
## [1] 11 68
dct[,c("year","statefips","countyfips","cv1")]
## year statefips countyfips cv1
## 285 1910 11 1 331069
## 3244 1920 11 1 437571
## 6314 1930 11 1 486869
## 9418 1940 11 1 663091
## 12520 1950 11 1 802178
## 15626 1960 11 1 763956
## 18764 1970 11 1 756510
## 21899 1980 11 1 638333
## 25039 1990 11 1 606900
## 28182 2000 11 1 572059
## 31326 2010 11 1 601723
We have only one state and one county in that state. We observe data from 1910 to 2010. This all looks good.
Now that you know many ggplot commands, it will not be a shock to hear that you make a line graph using geom_line().
As for all ggplot graphs, you should specify a dataframe and x and y variables. Below we make the simplest possible line graph.
b3 <- ggplot() +
geom_line(data = dct,
mapping = aes(x = year, y = cv1))
b3
Note that line graphs do not default to a y-axis baseline of zero.
The line graph above is great for getting a sense of the data. It’s not so good for communicating. The x-axis labels don’t line up with the years in the data. The vertical axis labels are hard to read. And the grey background does not help tell the story.
We fix the horizontal axis to put 20-year labels with
scale_x_continuous(limits= c(1910, 2010), breaks = c(seq(1910,2010,20)))
This tells R to start in 1910 and stop in 2010 (limits= c(1910, 2010)). It also tells R to make breaks on the axis at 1910 and every 20 years until 2010 (breaks = c(seq(1910,2010,20))).
We fix the vertical axis with
scale_y_continuous(labels = comma, limits = c(0, 825000), breaks = c(seq(0,800000,200000)))
This tells R to use commas in the numbers, to start at 0 and end at 825,000, and to make value labels every 200,000. The comma function passed to labels comes from the scales package we loaded at the beginning.
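To see what these two pieces do on their own, here is a small sketch you can run at the console. The values are the ones used in the scales above; the object names x_breaks and y_labels are made up for illustration.

```r
library(scales)  # provides comma()

# axis breaks: start at 1910, step by 20 years, stop at 2010
x_breaks <- seq(1910, 2010, 20)
x_breaks

# comma() turns numbers into character labels with thousands separators
y_labels <- comma(seq(0, 800000, 200000))
y_labels
```

Passing labels = comma to scale_y_continuous() applies this same formatting to whatever breaks the axis ends up with.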
Generally, ggplot line graphs are easier to read when lines are thicker than the default width. We adjust the line width with the geom_line() option of size = 1.5. Note that this goes outside of the aes() command. Things inside the aes() describe how “variables in the data are mapped to visual properties (aesthetics) of geoms” (see cite). Things outside of the aes() command are for more general settings.
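The inside/outside distinction is easiest to see side by side. The sketch below uses a made-up two-county dataframe (toy, p_mapped and p_set are hypothetical names, not part of the tutorial data): in the first plot color is mapped to a variable inside aes(), and in the second it is set to a fixed value outside aes().

```r
library(ggplot2)

# made-up data: two counties observed in two years
toy <- data.frame(year   = rep(c(2000, 2010), times = 2),
                  pop    = c(10, 12, 20, 18),
                  county = rep(c("A", "B"), each = 2))

# inside aes(): color depends on the data, so each county gets its own color
p_mapped <- ggplot(toy) +
  geom_line(aes(x = year, y = pop, color = county))

# outside aes(): color is a general setting, so every line is red
p_set <- ggplot(toy) +
  geom_line(aes(x = year, y = pop, group = county), color = "red")

p_mapped
p_set
```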
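In addition, we modify the theme to do the following:
panel.grid.major = element_blank() removes the major grid lines
panel.grid.minor = element_blank() removes the minor grid lines (panel.grid = element_blank() would remove both at once)
panel.background = element_blank() removes the grey panel background
panel.grid.major.y = element_line(color="gray") draws the horizontal major grid lines in gray
legend.position = "none" omits the legend
axis.line.x = element_line(color = "black") draws a black line along the x-axis
axis.ticks = element_blank() removes the axis tick marks
axis.text = element_text(size = 10) sets the size of the axis text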
done <-
ggplot() +
geom_line(data = dct,
mapping = aes(x=year, y=cv1), size=1.5) +
scale_y_continuous(labels = comma, limits = c(0, 825000), breaks = c(seq(0,800000,200000))) +
scale_x_continuous(limits= c(1910, 2010), breaks = c(seq(1910,2010,20))) +
labs(x="", y="") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_line(color="gray"),
legend.position = "none",
axis.line.x = element_line(color = "black"),
axis.ticks = element_blank(),
axis.text = element_text(size = 10))
done
Beware that things can look different in the plot window and the final graphics output. To judge, save the graph in the proportions you’d like for a final product and then look at it with an image viewer. Here’s code to save this one.
fn <- "H:/pppa_data_viz/2020/tutorials/tutorial_08/b4_testerv1.jpg"
ggsave(plot = done,
filename = fn,
dpi = 300,
units = c("in"),
width = 7,
height = 3.5)
Then I pull in the final image:
Different! When doing final edits, work with the properly-sized graph.
I’d call the above plot functional, but the point of this graph is to point out specific historical moments that explain the shape of the plot. To do that we add text and lines to the plot.
We add data points on the line via geom_point(). This hints to readers that the data are only actually at the points. The line between the points is really just made up – or interpolated if you’d like to be fancier.
To add “stuff” to your graph that is not data, use the annotate command. Until recently, I was under the incorrect impression that annotate(geom = "text") and geom_text() did the same thing. I recently learned that annotate() is much more efficient, as it draws just once. In contrast, geom_text() will draw as many times as you have data points – again and again in the same place.
The annotate() command has some basic options. The first is geom, which is what you want to show. Choices include, but are not limited to, “segment”, “rect”, or “text”. You specify the location with x and y for text, or, for rectangles and the like, xmin, xmax, ymin, and ymax (alternatively, x and xend, etc). You can also adjust other options such as size (size=) or justification (hjust and vjust; see here).
See how to implement these in the example below. We use both hjust = 0 (left align) and hjust = 1 (right align).
Note that I set the on-graph-text size variable at the beginning (on.g.text.size). I use this for the size of the text in each annotate command. That way if I don’t like it, I change it once, rather than eight times.
In the command below, I use two types of annotate: text and segment. If you want to shade an entire area, which can be helpful in a line chart to point out a particular set of years, use rect.
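Since the command below does not use rect, here is a minimal sketch of what shading a set of years might look like. The population values are DC figures from the table printed earlier; the shaded range (1950 to 1980), the fill and the names toy_dc and p_shaded are illustrative choices, not part of the final chart.

```r
library(ggplot2)

# a few DC population values from the table above
toy_dc <- data.frame(year = c(1940, 1950, 1960, 1970, 1980, 1990),
                     cv1  = c(663091, 802178, 763956, 756510, 638333, 606900))

p_shaded <- ggplot() +
  # the rectangle comes first so the line is drawn on top of it;
  # -Inf and Inf stretch it over the full height of the panel
  annotate(geom = "rect", xmin = 1950, xmax = 1980,
           ymin = -Inf, ymax = Inf, fill = "gray", alpha = 0.3) +
  geom_line(data = toy_dc, mapping = aes(x = year, y = cv1))

p_shaded
```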
on.g.text.size <- 3
done2 <-
ggplot(dct) +
geom_line(data = dct, mapping = aes(x=year, y=cv1), size=1.5) +
geom_point(data = dct, mapping = aes(x=year, y=cv1), size=3) +
scale_y_continuous(labels = comma, limits = c(0, 825000), breaks = c(seq(0,800000,200000))) +
scale_x_continuous(limits= c(1910, 2010), breaks = c(seq(1910,2010,20))) +
labs(x="", y="") +
theme(panel.grid.major = element_blank(),
panel.grid.minor = element_blank(),
panel.background = element_blank(),
panel.grid.major.y = element_line(color="gray"),
legend.position = "none",
axis.line.x = element_line(color = "black"),
axis.ticks.x = element_blank(),
axis.ticks.y = element_blank(),
axis.text = element_text(size = 10)) +
annotate(geom = "segment", x=1995, y=0, xend=1995, yend=220000, color="#045a8d") +
annotate(geom = "segment", x=1995, y=360000, xend=1995, yend=825000, color="#045a8d") +
annotate(geom = "segment", x=1968, y=0, xend=1968, yend=450000, color = "#2b8cbe") +
annotate(geom = "segment", x=1968, y=550000, xend=1968, yend=825000, color = "#2b8cbe") +
annotate(geom = "segment", x=1954, y=0, xend=1954, yend=460000, color = "#74a9cf") +
annotate(geom = "segment", x=1954, y=610000, xend=1954, yend=825000, color = "#74a9cf") +
annotate(geom = "text", x=1955, y=575000, label="1954:", color = "#74a9cf",
size=on.g.text.size, hjust=1) +
annotate(geom = "text", x=1955, y=535000, label="School", color = "#74a9cf",
size=on.g.text.size, hjust=1) +
annotate(geom = "text", x=1955, y=495000, label="Desegregation", color = "#74a9cf",
size=on.g.text.size, hjust=1) +
annotate(geom = "text", x=1967, y=525000, label="1968:", color = "#2b8cbe",
size=on.g.text.size, hjust=0) +
annotate(geom = "text", x=1967, y=475000, label="Civil Disturbance", color = "#2b8cbe",
size=on.g.text.size, hjust=0) +
annotate(geom = "text", x=1994, y=325000, label="1995:", color="#045a8d",
size=on.g.text.size, hjust=0) +
annotate(geom = "text", x=1994, y=285000, label="Control Board", color="#045a8d",
size=on.g.text.size, hjust=0) +
annotate(geom = "text", x=1994, y=245000, label="Takes Power", color="#045a8d",
size=on.g.text.size, hjust=0)
## save it
fn2 <- "H:/pppa_data_viz/2020/tutorials/tutorial_08/b4_testerv2.jpg"
ggsave(plot = done2,
filename = fn2,
dpi = 300,
units = c("in"),
width = 7,
height = 3.5)
Pulling in the final image:
Now we take a detour from line graphs to discuss themes.
In the previous graph, we substantially changed the look of the graph by modifying elements in the theme() portion of the command. There are many, many different elements you can change in the ggplot theme, and you can find the complete list here.
ggplot’s built-in themes
In addition to modifying individual elements of a theme, you can use ggplot’s built-in themes to modify a graph. You can see the full list here, and we’ll do examples with theme_minimal() and theme_bw().
To make sure we understand what the themes are doing, let’s return to our original, somewhat ugly, graph of DC population over time.
b3 <- ggplot() +
geom_line(data = dct,
mapping = aes(x = year, y = cv1))
b3
First, we’ll apply ggplot’s minimal theme. Notice that instead of retyping the graphing command, we just add the theme to the basic graph with a plus. You can see that many aspects of the graph’s look are changed.
c1 <- b3 + theme_minimal()
c1
We can try a different built-in theme, theme_bw(), for black and white:
c2 <- b3 + theme_bw()
c2
Finally, you can use a built-in theme and then also modify additional elements, as in
c3 <- b3 + theme_minimal() +
theme(axis.line.x = element_line(color = "red"))
c3
I don’t recommend this modification, but hopefully what it’s changing is clear.
Users have also created a variety of pre-packaged themes that you can use if you like. This webpage shows a variety of examples. To use these additional themes, you usually need to install a package.
Finally, it can be quite helpful to create your own theme if you want to make a consistent look across many graphs.
Written as below, we create a new theme of our own that is a modification of R’s default theme. Next week we will get to functions explicitly, but for now, suffice it to say that we declare a function that has no inputs, but which creates the theme theme_me. You can call this function whatever you want – for example, theme_myself. You name this function so you can use it later.
theme_me <- function(){
theme(axis.line.x = element_line(color = "red"),
axis.line.y = element_line(color = "blue"))
}
You could include one of R’s other themes below, too, if you’d prefer, as in
theme_me <- function(){
theme_minimal() +
theme(axis.line.x = element_line(color = "red"),
axis.line.y = element_line(color = "blue"))
}
Then apply your theme to the line graph:
c4 <- b3 + theme_me()
c4
Bottom line: themes modify the look and feel of a graph. You need to alter them to make decent looking graphics. Making a custom theme can help create consistency across multiple graphics.
The section above graphs just DC. In this section, we graph multiple counties and make them distinguishable.
Here we subset the counties data to just keep DC (state 11), Maryland’s Montgomery and Prince George’s counties (state 24, counties 31 and 33), and Virginia’s Arlington, Alexandria and Fairfax jurisdictions (state 51, counties 13, 510 and 59).
We also make the year variable numeric for ease of plotting. And then to save space, we get rid of the counties dataframe with rm(counties). You can use rm() for any objects that you no longer want.
dcm <- counties[which(counties$statefips == 11 |
counties$statefips == 24 & counties$countyfips %in% c(31,33) |
counties$statefips == 51 & counties$countyfips %in% c(13,510,59)),]
dcm$nyear <- as.numeric(dcm$year)
rm(counties)
Finally, to identify a county in ggplot, we need both the state and county variables together. I use paste0 to concatenate the state and county variables. “Concatenate” means stick together. The paste0 command takes as many strings as you like and puts them together.
Here is a small example of what paste0 does. First, we make a small dataframe with two character variables, s1 and s2.
ex.df <- data.frame(s1 = c("fred","ted","pj"),
s2 = c("dog","cat","pig"))
ex.df
## s1 s2
## 1 fred dog
## 2 ted cat
## 3 pj pig
Now we make some new variables from these existing variables using the paste function. The paste function sticks character variables together, and puts “separators” in between the character variables if you request. The new variables p1 and p2 put s1 and s2 together using a “XX” and then a “ is a ” separator (see sep = "XX" or sep = " is a "). It’s also sometimes helpful to have no separator at all, which is when you should use paste0.
ex.df$p1 <- paste(ex.df$s1,ex.df$s2, sep = "XX")
ex.df$p2 <- paste(ex.df$s1,ex.df$s2, sep = " is a ")
ex.df$p3 <- paste0(ex.df$s1,ex.df$s2)
ex.df
## s1 s2 p1 p2 p3
## 1 fred dog fredXXdog fred is a dog freddog
## 2 ted cat tedXXcat ted is a cat tedcat
## 3 pj pig pjXXpig pj is a pig pjpig
And here we use this paste0 command to stick the state and county identifiers together:
# make a state+county variable
dcm$stc <- paste0(dcm$statefips,dcm$countyfips)
To plot multiple counties at one time, we use the group aesthetic and tell R that the groups are defined by the stc variable. Note that the group = goes inside the aes() because it is telling R to do something based on the data. For legibility, I add commas in the y-axis values.
# all counties
ac <- ggplot() +
geom_line(data = dcm, aes(x=year, y=cv1, group = stc)) +
scale_y_continuous(labels = comma)
ac
Of course, without a legend or identifying features, this graph is very hard to interpret. We add color = stc, again inside aes(), so that we can see which counties are which on the graph.
# color by county
ac <- ggplot() +
geom_line(data = dcm, aes(x=year, y=cv1,
group = stc,
color = stc)) +
scale_y_continuous(labels = comma)
ac
There are still many improvements we could make to this graph. For basic legibility, we should make the county codes names rather than numbers. We can do this by making stc a factor variable and assigning names to its levels. (We did this in a previous tutorial.) Even better, and depending on the point of the graph, omit the legend and put the jurisdiction names directly on the graph using annotate().
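As a reminder of how the factor approach works, here is a minimal sketch on a stand-alone vector of the six codes that paste0 produces for these jurisdictions. The vector stc and the name stc_named are for illustration; in the tutorial data you would apply the same factor() call to dcm$stc.

```r
# the six state+county codes created by paste0(statefips, countyfips)
stc <- c("111", "2431", "2433", "5113", "51510", "5159")

# levels are the codes as they appear in the data;
# labels are the names we want to show, in the same order
stc_named <- factor(stc,
                    levels = c("111", "2431", "2433", "5113", "51510", "5159"),
                    labels = c("District of Columbia", "Montgomery",
                               "Prince George's", "Arlington",
                               "Alexandria", "Fairfax"))
levels(stc_named)
```

Using the named factor in color = would then put names, rather than codes, in the legend.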
Now that we have two date variables, we can make our own measure of duration and check the bikeshare’s measure.
# my duration calculation
cabi.201901$my.duration <- cabi.201901$time.stop - cabi.201901$time.start
# comparing my results to built-in results
summary(cabi.201901$my.duration)
## Length Class Mode
## 158130 difftime numeric
Wait – this variable is not returning a normal summary output. We need to tell R that it is a numeric variable in the summary() command.
summary(as.numeric(cabi.201901$my.duration))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.817 9.633 14.939 15.967 1435.000
summary(as.numeric(cabi.201901$Duration))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 60.0 349.0 577.0 895.9 957.0 86100.0
This looks similar, but our calculated duration measure is in minutes and the bikeshare’s measure is in seconds. I could divide cabi.201901$Duration by 60 to see if they are the same. I can also look at the correlation between the two measures using cor(). Here is the correlation method:
# look at the correlation -- looks like 1
cor(x = as.numeric(cabi.201901$my.duration),
y = as.numeric(cabi.201901$Duration),
method = c("pearson"))
## [1] 1
I find that the correlation between the two measures is 1. This makes me suspect that the bikeshare people calculated the duration in the exact way we did.
Now I do the other check: divide the bikeshare’s measure by 60.
cabi.201901$Duration.minutes <- cabi.201901$Duration / 60
summary(cabi.201901$Duration.minutes)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 5.817 9.617 14.931 15.950 1435.000
OK – the two measures match, apart from tiny differences in the median, mean and third quartile.
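The minutes-versus-seconds mismatch comes from how R stores time differences: subtracting two date-times gives a difftime object, and R picks its units automatically. A small sketch with two made-up timestamps (t_start, t_stop and d are hypothetical names) shows how to check the units and how to ask for specific ones:

```r
# two made-up timestamps, 9 minutes and 30 seconds apart
t_start <- as.POSIXct("2019-01-01 08:00:00", tz = "UTC")
t_stop  <- as.POSIXct("2019-01-01 08:09:30", tz = "UTC")

d <- t_stop - t_start
units(d)                              # the units R chose for this difference

d_default <- as.numeric(d)                  # value in whatever units R chose
d_secs    <- as.numeric(d, units = "secs")  # same difference, forced to seconds
d_default
d_secs
```

Requesting units = "secs" explicitly is one way to make a calculated duration directly comparable to a measure recorded in seconds.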
The trick to successfully plotting these data is to reduce their dimensionality. “Dimensionality” means size, or the number of observations by the number of variables. Our dataframe has about 150,000 observations – way too many to show on any plot.
One way to shrink what we show is to show data by hour, rather than by ride. Therefore, for each hour, we will find the total number of rides and the average duration of rides.
To do this, we first need a variable that tells us the hour of the trip. We extract the hour component from the date variable using the date notation. We can use the format() function because we already created a date variable called time.start. We write format(df$varname, "%H") to get the hour from the time variable. (This is the benefit of the time variable!) We then check the output using both summary() and table().
# get the hour out of the date variable
cabi.201901$start.hour <- as.numeric(format(cabi.201901$time.start, "%H"))
summary(cabi.201901$start.hour)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 9.00 14.00 13.63 17.00 23.00
table(cabi.201901$start.hour)
##
## 0 1 2 3 4 5 6 7 8 9 10 11 12
## 949 539 383 162 284 1369 3931 9314 15908 9409 6123 6818 8025
## 13 14 15 16 17 18 19 20 21 22 23
## 8304 8311 9728 12994 18785 13846 8570 5713 4082 2904 1679
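To see what the "%H" step returns before it is wrapped in as.numeric(), here is a one-timestamp sketch. The timestamp and the names one_time, hour_chr and hour_num are made up for illustration.

```r
# a single made-up ride start time
one_time <- as.POSIXct("2019-01-15 17:45:00", tz = "UTC")

hour_chr <- format(one_time, "%H")  # format() returns a character string
hour_num <- as.numeric(hour_chr)    # convert so we can summarize and plot it

hour_chr
hour_num
```

Because format() returns text, the as.numeric() wrapper is what lets summary() report a mean and quartiles for the hour.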
From the summary command, we see that the mean start hour is about 1:40 pm (13.6 hours). That seems fine, as does the max start hour of 11 pm (23 hours) and the min start hour of 0 (midnight).
Looking at the table output, the hour with the single largest number of rides is 17 – or 5 pm. This also seems reasonable.
Now that we are reassured the times are ok, we use group_by() and summarize() to find hourly traveling information. We calculate both the number of rides (no_rides) and the average duration of those rides (mean_dur).
# summarize to hourly data
cabi.201901 <- group_by(cabi.201901, start.hour)
cabisum <- summarize(.data = cabi.201901, no_rides = n(), mean_dur = mean(Duration))
dim(cabisum)
## [1] 24 3
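If it is not obvious why the result has 24 rows, it may help to run the same two steps on a dataframe small enough to check by hand. The five rides below are made up (rides and by_hour are hypothetical names); the column names mirror the bikeshare data.

```r
library(dplyr)

# five made-up rides: two start at 8 am, three at 5 pm; durations in seconds
rides <- data.frame(start.hour = c(8, 8, 17, 17, 17),
                    Duration   = c(300, 600, 900, 1200, 600))

# one output row per start hour: a count of rides and their mean duration
rides   <- group_by(rides, start.hour)
by_hour <- summarize(.data = rides, no_rides = n(), mean_dur = mean(Duration))
by_hour
```

The summary has one row per distinct value of the grouping variable, which is why 24 start hours give 24 rows.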
In our next class, we’ll learn how to automate writing the same type of graph twice. In this class, we’ll just make two line graphs, one for total number of rides by hour of the day, and one for the average duration of rides by hour of the day.
First, number of rides by hour of the day:
# plot total number of rides by hour of the day
e5a <- ggplot() +
geom_line(data = cabisum, mapping = aes(x = start.hour, y = no_rides)) +
labs(title = "Total number of rides by hour of the day")
e5a
And then average duration of ride by hour of the day:
# plot average duration of rides by hour of the day
e5b <- ggplot() +
geom_line(data = cabisum, mapping = aes(x = start.hour, y = mean_dur)) +
labs(title = "Average duration of ride by hour of the day")
e5b
The final type of line graph we’re trying today is a stacked line chart, which can sometimes be very helpful to convey change over time along with the relative importance of categories.
Since this is a policy class, it seems fitting to graph at least some budget data. We are introducing a new dataset: US federal budget statistics. You can find the data from the Office of Management and Budget here. Download the zip file from the top of the page and unzip it.
I am not prepping these data for you, since I want to make sure you learn how to put raw data into R. You will find that there are many small issues that cause trouble. This is not atypical, so it is helpful to show how to handle them.
Unzip the file you downloaded, and you’ll see a bunch of files in this new folder. They follow the naming convention on the page from which you downloaded. Open up Tables 1.3 (hist01z3.xls; for homework) and 2.3 (hist02z3.xls; for now) in Excel.
From Table 2.3, we want the year and columns B, C, D, G, H and I. Create a new Excel document with just this information, and make one row at top with names that you’ll understand. Keep just through 2017, and make sure that you don’t have any junk at the bottom of the table. Save this file as csv (file, save as, choose “csv” option for file type). If there are numeric variables that take the value *, make them “.”, which is code for missing.
Load the csv file you just created into R.
### makeup of receipts ####
hist02z3 <- read.csv("H:/pppa_data_viz/2018/tutorials/lecture05/omb_data/hist02z3.csv")
str(hist02z3)
## 'data.frame': 91 obs. of 7 variables:
## $ year : chr "1934" "1935" "1936" "1937" ...
## $ income.taxes: num 0.7 0.7 0.8 1.2 1.4 1.1 0.9 1.1 2.2 3.5 ...
## $ corp.taxes : num 0.6 0.8 0.9 1.2 1.4 1.2 1.2 1.8 3.2 5.2 ...
## $ social.ins : chr "." "." "0.1" "0.7" ...
## $ excise : num 2.2 2 2 2.1 2.1 2.1 2 2.2 2.3 2.2 ...
## $ other : num 1.3 1.5 1.1 0.9 0.9 0.7 0.7 0.7 0.5 0.4 ...
## $ total : num 4.8 5.1 4.9 6.1 7.5 7 6.7 7.5 9.9 13 ...
Begin by making sure that what you’ve imported into R is what you expect. We run through the problems I encountered – your problems may differ! The goal here is to give you enough tools that you know how to look for problems and how to fix them once you find them.
We’ll start with the year variable, using table().
# make sure year is always ok
table(hist02z3$year)
##
## 1934 1935 1936 1937 1938 1939 1940 1941 1942 1943 1944 1945 1946 1947 1948
## 6 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1949 1950 1951 1952 1953 1954 1955 1956 1957 1958 1959 1960 1961 1962 1963 1964
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1965 1966 1967 1968 1969 1970 1971 1972 1973 1974 1975 1976 1977 1978 1979 1980
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1997 1998 1999 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 2013 2014 2015 2016 2017 TQ
## 1 1 1 1 1 1
Notice that there are some odd things here: a few observations with no year at all, and one observation where year is “TQ”. Let’s fix this.
We subset the data to get rid of the strange years.
hist02z3 <- hist02z3[which(hist02z3$year != ""),]
hist02z3 <- hist02z3[which(hist02z3$year != "TQ"),]
Strangely, in previous years hist02z3$year loaded as a factor variable. When it did, I used this code to make it numeric – code I am not using this year:
hist02z3$nyear <- as.numeric(levels(hist02z3$year))[hist02z3$year]
summary(hist02z3$nyear)
This year, hist02z3$year loaded as a character variable. So to make a numeric year and check the values, I use as.numeric:
hist02z3$nyear <- as.numeric(hist02z3$year)
summary(hist02z3$nyear)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1934 1955 1976 1976 1996 2017
Good – the year variable now seems to have just numeric years and the values are non-crazy.
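The factor version of the code above goes through levels() for a reason: calling as.numeric() directly on a factor returns R’s internal level codes, not the years. A small sketch with a made-up three-year factor (yr_factor and the result names are hypothetical):

```r
# a made-up factor of years, like year in the older version of the file
yr_factor <- factor(c("1934", "1935", "2017"))

codes      <- as.numeric(yr_factor)                    # internal level codes
via_levels <- as.numeric(levels(yr_factor))[yr_factor] # the actual years
via_char   <- as.numeric(as.character(yr_factor))      # also the actual years

codes
via_levels
via_char
```

Either of the last two forms recovers the years; the first one silently gives the wrong answer, which is why checking with summary() afterwards is worthwhile.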
Let’s take a quick look at the data again:
head(hist02z3)
## year income.taxes corp.taxes social.ins excise other total nyear
## 1 1934 0.7 0.6 . 2.2 1.3 4.8 1934
## 2 1935 0.7 0.8 . 2.0 1.5 5.1 1935
## 3 1936 0.8 0.9 0.1 2.0 1.1 4.9 1936
## 4 1937 1.2 1.2 0.7 2.1 0.9 6.1 1937
## 5 1938 1.4 1.4 1.7 2.1 0.9 7.5 1938
## 6 1939 1.1 1.2 1.8 2.1 0.7 7.0 1939
Notice that these data have one observation per year, and that the budget data goes across in columns. This is what is known as a “wide” dataset. If we want to show multiple types of budget variables at one time (and especially if we want to make a stacked area graph, as we do below), we need the data to be long – with one row for each year and budget variable.
To understand how to make wide data long and vice versa, we begin with a little example. The data.wide dataframe printed below is a wide dataframe. The unit of observation here is the year, and we observe multiple variables for that unit. Specifically, we observe the number of two types of baked goods: muffins and sugar cookies.
data.wide <- data.frame(year = c(2001,2002,2003),
muffins = c(50,49,36),
sugar_cookies = c(100,200,300))
Here are what these wide data look like:
year | muffins | sugar_cookies |
---|---|---|
2001 | 50 | 100 |
2002 | 49 | 200 |
2003 | 36 | 300 |
What if we want to make these wide data long? Use pivot_longer(). Tell R which data you want to use (data.wide), which variables you want to make long (cols = c("muffins","sugar_cookies")), what you want the column that names the types to be called (names_to = "baked_goods_type") and what you want the column that describes the values to be called (values_to = "num_baked_goods").
data.long <- pivot_longer(data = data.wide,
cols = c("muffins","sugar_cookies"),
names_to = "baked_goods_type",
values_to = "num_baked_goods")
Here is what the long data look like:
year | baked_goods_type | num_baked_goods |
---|---|---|
2001 | muffins | 50 |
2001 | sugar_cookies | 100 |
2002 | muffins | 49 |
2002 | sugar_cookies | 200 |
2003 | muffins | 36 |
2003 | sugar_cookies | 300 |
You can then go backwards with pivot_wider():
data.wide.again <- pivot_wider(data = data.long,
id_cols = c("year"),
names_from = "baked_goods_type",
values_from = "num_baked_goods")
Here we are, back to wide data:
year | muffins | sugar_cookies |
---|---|---|
2001 | 50 | 100 |
2002 | 49 | 200 |
2003 | 36 | 300 |
Same data, different representations. Usually we prefer long data when using ggplot.
Here’s another example:
To make a stacked line (or really, any multiple line), the data need to be long, not wide. Wide data look like this, with one observation per unit:
wide <- data.frame(state = c("6","36","48"),
female_pop = c("10","12","14"),
male_pop = c("11","13","12"))
And the data look like
state | female_pop | male_pop |
---|---|---|
6 | 10 | 11 |
36 | 12 | 13 |
48 | 14 | 12 |
Long data look like this, with one observation per unit and type:
long <- data.frame(state = c("6","36","48","6","36","48"),
pop = c("10","12","14","11","13","12"),
sex = c("female","female","female","male","male","male"))
Which gives data that look like
state | pop | sex |
---|---|---|
6 | 10 | female |
36 | 12 | female |
48 | 14 | female |
6 | 11 | male |
36 | 13 | male |
48 | 12 | male |
Note how this dataset requires a variable that tells you which type of population the row contains.
Neither data format is “right.” If you were doing a regression and wanted to control for male and female population, you’d need the wide format. However, to make a line graph with multiple lines in R, you need a long dataset.
To make the data long, first I tried the code below
## make this wide dataset long
head(hist02z3)
r.long <- pivot_longer(data = hist02z3,
cols = c("income.taxes","corp.taxes","social.ins","excise","other","total"),
names_to = "revenue_type",
values_to = "revenue")
r.long[1:15,]
This code delivers this error message:
This error message is telling us that not all our revenue variables are the same type. Check this:
str(hist02z3)
## 'data.frame': 84 obs. of 8 variables:
## $ year : chr "1934" "1935" "1936" "1937" ...
## $ income.taxes: num 0.7 0.7 0.8 1.2 1.4 1.1 0.9 1.1 2.2 3.5 ...
## $ corp.taxes : num 0.6 0.8 0.9 1.2 1.4 1.2 1.2 1.8 3.2 5.2 ...
## $ social.ins : chr "." "." "0.1" "0.7" ...
## $ excise : num 2.2 2 2 2.1 2.1 2.1 2 2.2 2.3 2.2 ...
## $ other : num 1.3 1.5 1.1 0.9 0.9 0.7 0.7 0.7 0.5 0.4 ...
## $ total : num 4.8 5.1 4.9 6.1 7.5 7 6.7 7.5 9.9 13 ...
## $ nyear : num 1934 1935 1936 1937 1938 ...
Sadly, it seems that social.ins is a character variable – all other taxes are numeric. So let’s fix this.
We fix social.ins by doing
hist02z3$social.ins <- as.numeric(as.character(hist02z3$social.ins))
## Warning: NAs introduced by coercion
str(hist02z3$social.ins)
## num [1:84] NA NA 0.1 0.7 1.7 1.8 1.8 1.7 1.7 1.6 ...
The new structure says social.ins is now a numeric variable.
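An alternative, if you would rather handle the “.” codes when loading, is the na.strings argument of read.csv(). The sketch below uses a tiny inline csv rather than the real file (csv_text and toy are made-up names), so treat it as an illustration of the option, not as the code used in this tutorial.

```r
# a tiny stand-in for the csv file, with "." marking missing values
csv_text <- "year,social.ins\n1934,.\n1935,.\n1936,0.1\n1937,0.7"

# na.strings tells read.csv which text values to treat as missing
toy <- read.csv(text = csv_text, na.strings = ".")
str(toy)
```

With the missing code declared up front, the column loads as numeric and no later conversion is needed.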
Now try the pivot_longer() command again:
## make this wide dataset long
r.long <- pivot_longer(data = hist02z3,
cols = c("income.taxes","corp.taxes","social.ins","excise","other","total"),
names_to = "revenue_type",
values_to = "revenue")
r.long[1:15,]
## # A tibble: 15 × 4
## year nyear revenue_type revenue
## <chr> <dbl> <chr> <dbl>
## 1 1934 1934 income.taxes 0.7
## 2 1934 1934 corp.taxes 0.6
## 3 1934 1934 social.ins NA
## 4 1934 1934 excise 2.2
## 5 1934 1934 other 1.3
## 6 1934 1934 total 4.8
## 7 1935 1935 income.taxes 0.7
## 8 1935 1935 corp.taxes 0.8
## 9 1935 1935 social.ins NA
## 10 1935 1935 excise 2
## 11 1935 1935 other 1.5
## 12 1935 1935 total 5.1
## 13 1936 1936 income.taxes 0.8
## 14 1936 1936 corp.taxes 0.9
## 15 1936 1936 social.ins 0.1
This looks like what we want. Notice that there are NA values for social insurance receipts in 1934 and 1935. If you go back to your original download, you can see that this isn’t a mistake: the table reports no social insurance receipts for those years.
Let’s start with total tax revenue over time. As in the previous section, we need to note group=1, and recall that the total is the set of rows where r.long$revenue_type == "total".
#### line chart of total receipts
g4.1 <-
ggplot() +
geom_line(data = r.long[which(r.long$revenue_type=="total"),],
mapping = aes(x=nyear, y=revenue, group=1))
g4.1
Now we’ll modify the chart to have all the categories but the total. I do this by subsetting r.long to all revenue types that are not total revenue. In addition, I tell R that the group by which we want to make the graph is a variable called revenue_type, which R should treat as a factor. We also tell R to color the lines by revenue_type, taken as a factor.
#### line chart of total receipts by type ###
g4.2 <-
ggplot() +
geom_line(data = r.long[which(r.long$revenue_type != "total"),],
mapping = aes(x=nyear, y=revenue,
group=as.factor(revenue_type),
color=as.factor(revenue_type)))
g4.2
## Warning: Removed 2 row(s) containing missing values (geom_path).
This graph is very hard to read. There are too many lines, and we don’t get a sense of the total, which may be a key point. An alternative is a stacked line. Stacked lines highlight the total amount, and give readers some sense of the relative share of different categories.
#### stacked chart of total receipts by type ###
## without factor() this doesnt work
g4.3 <- ggplot() +
geom_area(data = r.long[which(r.long$revenue_type != "total"),],
mapping = aes(x=nyear, y=revenue,
group=factor(revenue_type),
fill=factor(revenue_type)),
position="stack") +
labs(x="", y="share of GDP")
g4.3
## Warning: Removed 2 rows containing missing values (position_stack).
If you do this, it is frequently wise to put labels on the area portions of the graph and omit the legend. If the areas are too small to label, consider whether you need them individually.
These charts have the same downsides as stacked bars: the numbers for only the bottom category are directly legible from the graph.
In my example of DC population over time in section B.1., I present the population graphic in four steps, using four separate graphics. In Section B.5, we re-create the fourth and final graphic. For this problem, modify your code to make the first three graphs I have in section B.1.
Using the bikeshare data,
Now you try to load your own budget data!
Use Table 1.3 (hist01z3.xls), from which we want the year and columns E, F, G and columns I, J and K. Create a new Excel document with just this information, and make one row at top with names that you’ll understand. Keep just through 2017, and make sure that you don’t have any junk at the bottom of the table. Save this file as csv (file, save as, choose “csv” option for file type).
Load it into R and make a stacked area graph of receipts, outlays and deficits over time.
Having done this myself, here are a few suggestions
hist01z3$b1 <- as.numeric(gsub(",", "", hist01z3$cd.receipts, fixed = TRUE))