Tutorial 2 Answers

Problem Set 2: Questions and Answers

  1. Use R programming commands to fix the one problematic observation in the student2 dataframe and make it merge properly (by first and last name) with student1.

Answer:

# pull in libraries
library(tidyverse)
── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
# load the first student data
student1 <- read.csv("H:/pppa_data_viz/2019/tutorial_data/lecture02/2019-01-27_fake_student_data.csv")

# how many students?
student1
  First_name last_name GWID gpa degree remaining_sem
1      elpis    josefa  G37 2.9    mpa             1
2    longina   nedelya  G12 3.9    mpp             4
3   richelle    bjoern  G08 3.4    mpp             1
4    mozghan      mara  G62 3.5    mpa             2
5       runa   marcelo  G14 3.8    mpp             3
dim(student1)
[1] 5 6
# load the second student data
student2 <- read.csv("H:/pppa_data_viz/2019/tutorial_data/lecture02/2019-01-27_fake_student_data_part2.csv")

# how many students?
dim(student2)
[1] 5 4
# what variables?
names(student2)
[1] "First_name" "last_name"  "GWID"       "age"       
# merge by first name / last name
student3 <- merge(x = student1,
                  y = student2,
                  by = c("last_name","First_name"),
                  all= TRUE)

student3
  last_name First_name GWID.x gpa degree remaining_sem GWID.y age
1    bjoern   richelle    G08 3.4    mpp             1    G08  40
2    josefa      elpis    G37 2.9    mpa             1    G37  22
3      mara    mozghan    G62 3.5    mpa             2    G62  25
4   marcelo       runa    G14 3.8    mpp             3   <NA>  NA
5  marcelo        runa   <NA>  NA   <NA>            NA    G14  24
6   nedelya    longina    G12 3.9    mpp             4    G12  25
# so fix the non-matcher
student2$last_name <- ifelse(test = student2$last_name == "marcelo ",
                              yes = "marcelo",
                              no = student2$last_name)

# check
student2
  First_name last_name GWID age
1      elpis    josefa  G37  22
2    longina   nedelya  G12  25
3   richelle    bjoern  G08  40
4    mozghan      mara  G62  25
5       runa   marcelo  G14  24
# re-merge
student4 <- merge(x = student1,
                  y = student2,
                  by = c("last_name","First_name"),
                  all = TRUE )
student4
  last_name First_name GWID.x gpa degree remaining_sem GWID.y age
1    bjoern   richelle    G08 3.4    mpp             1    G08  40
2    josefa      elpis    G37 2.9    mpa             1    G37  22
3      mara    mozghan    G62 3.5    mpa             2    G62  25
4   marcelo       runa    G14 3.8    mpp             3    G14  24
5   nedelya    longina    G12 3.9    mpp             4    G12  25
  1. Find a dataset with county information and merge it either with the data we used last class, or with a larger county-level dataset for all counties in the US, 1910 to 2010. This second dataset is here, and the variable definitions are here.

It is sufficient for your new dataset to have just one year of information, or information for just one state or one year.

The final dataset could be at the county level, or at some other unit of observation. (For example, if you found a dataset about power plants, which had one observation per power plant with a county ID, you would end up with a power plant-level dataset.)

Make sure you explain the following:

  • what your new data describe
  • the source of your additional data
  • how many observations the new data have
  • how many observations you expect from the merge and why; if this seems off, fix it and explain what you did

Answer

# load library
library(tidyverse)

# load the big county level data
counties <- read.csv("H:/pppa_data_viz/2018/tutorials/lecture01/counties_1910to2010_20180115.csv")

# keep just 2010
counties.2010 <- filter(.data = counties, year == 2010)

# make a marker for these data
counties.2010$in.cnty <- 1

# load -other- county level data
# i grab some IRS data from here
# https://www.irs.gov/statistics/soi-tax-stats-county-data-2016
# this command reads an excel file
library(readxl)
tax.stuff <- read_excel("H:/pppa_data_viz/2023/tutorials/answer_programs/tutorial_02/20230207_simplified_irs2016.xlsx")
tax.stuff$in.tax <- 1
head(tax.stuff)
# A tibble: 6 × 7
  state_fips state_abbrev county_fips state_name    num_returns returns_w_prepa…
       <dbl> <chr>              <dbl> <chr>               <dbl>            <dbl>
1          1 AL                     0 Alabama           2043540          1158020
2          1 AL                     1 Autauga Coun…       24090            11130
3          1 AL                     3 Baldwin Coun…       95900            48250
4          1 AL                     5 Barbour Coun…        9210             6270
5          1 AL                     7 Bibb County          7940             4590
6          1 AL                     9 Blount County       22110            14110
# … with 1 more variable: in.tax <dbl>
# get rid of rows that are state totals -- countyfips == 0
tax.stuff <- filter(.data = tax.stuff, county_fips != "0")

# merge original county data and new county data
mm <- merge(x = counties.2010,
            y = tax.stuff,
            by.x = c("statefips","countyfips"),
            by.y = c("state_fips","county_fips"),
            all = TRUE)
dim(counties.2010)
[1] 3143   69
dim(tax.stuff)
[1] 3140    7
dim(mm)
[1] 3145   74
# re-code indicators as needed
mm$in.tax <- ifelse(is.na(mm$in.tax) == TRUE,
                    yes = 0,
                    no = mm$in.tax)
mm$in.cnty <- ifelse(is.na(mm$in.cnty) == TRUE,
                    yes = 0,
                    no = mm$in.cnty)

# print problematic obs
in.tax.no.census <- filter(.data = mm, in.tax == 1 & in.cnty != 1)
print(in.tax.no.census[,c("statefips","countyfips","in.tax","in.cnty")])
  statefips countyfips in.tax in.cnty
1         2        158      1       0
2        46        102      1       0

Answering the specific questions

  • what your new data describe

A variety of information about the number of tax returns, the number of filers, etc., etc. in each county.

  • the source of your additional data

The IRS! See here.

  • how many observations the new data have

3191 – see log above

  • how many observations you expect from the merge and why; if this seems off, fix it and explain what you did

I expected 3143 and I got 3145. I’m calling that good enough. However, when I first did these answers, my tax dataframe had 3191 obs, and so there were many leftover observations. It turned out that the tax data included state-level totals. You can see in the code above that I filter these out. After I filter them out, the merge improves substantially.

  1. Make three summary statistics of your choice for these new merged data. What these output are depends on what you’re merging in.

Answer: Below I find the share of returns that have a paid preparer. With this in hand, I find the average of this value by state, as well as the minimum and maximum value by county within each state. Code below.

# first, I calculate the share of returns signed by preparers
mm$preparer.share <- mm$returns_w_preparer / mm$num_returns

# then I find, by state, the average, min and max across counties
mm <- group_by(.data = mm, statefips)
mm2 <- summarize(.data = mm,
                 mean.preparer.rate = mean(preparer.share, na.rm = TRUE),
                 min.preparer.rate = min(preparer.share),
                 max.preparer.rate = max(preparer.share))
mm2
# A tibble: 51 × 4
   statefips mean.preparer.rate min.preparer.rate max.preparer.rate
       <dbl>              <dbl>             <dbl>             <dbl>
 1         1              0.614             0.445             0.763
 2         2              0.431            NA                NA    
 3         4              0.511             0.395             0.578
 4         5              0.630             0.479             0.753
 5         6              0.581             0.496             0.701
 6         8              0.534             0.401             0.735
 7         9              0.519             0.475             0.607
 8        10              0.461             0.421             0.515
 9        11            NaN                NA                NA    
10        12              0.485             0.369             0.704
# … with 41 more rows