Use R programming commands to fix the one problematic observation in the student2 dataframe and make it merge properly (by first and last name) with student1.
# load the first student data
student1 <- read.csv("H:/pppa_data_viz/2019/tutorial_data/lecture02/2019-01-27_fake_student_data.csv")
# how many students?
student1
# load the second student data
student2 <- read.csv("H:/pppa_data_viz/2019/tutorial_data/lecture02/2019-01-27_fake_student_data_part2.csv")
# how many students?
dim(student2)
[1] 5 4
# what variables?
names(student2)
[1] "First_name" "last_name" "GWID" "age"
# merge by first name / last name
student3 <- merge(x = student1,
                  y = student2,
                  by = c("last_name", "First_name"),
                  all = TRUE)
student3
  last_name First_name GWID.x gpa degree remaining_sem GWID.y age
1    bjoern   richelle    G08 3.4    mpp             1    G08  40
2    josefa      elpis    G37 2.9    mpa             1    G37  22
3      mara    mozghan    G62 3.5    mpa             2    G62  25
4   marcelo       runa    G14 3.8    mpp             3   <NA>  NA
5   marcelo       runa   <NA>  NA   <NA>            NA    G14  24
6   nedelya    longina    G12 3.9    mpp             4    G12  25
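With all = TRUE, the merge exposes the problem: marcelo runa appears twice, once with only the student1 columns filled (row 4) and once with only the student2 columns filled (row 5). If you'd rather list non-matchers directly than eyeball the merged output, one option (a sketch using dplyr's anti_join, which is not part of the original code) is:

# rows of student2 with no match in student1
library(dplyr)
anti_join(x = student2, y = student1, by = c("last_name", "First_name"))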
# so fix the non-matcher
student2$last_name <- ifelse(test = student2$last_name == "marcelo ",
                             yes = "marcelo",
                             no = student2$last_name)
# check
student2
  First_name last_name GWID age
1      elpis    josefa  G37  22
2    longina   nedelya  G12  25
3   richelle    bjoern  G08  40
4    mozghan      mara  G62  25
5       runa   marcelo  G14  24
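The ifelse above targets the one known bad value, a trailing space in "marcelo ". If you suspect stray whitespace could affect other observations too, a more general fix (a sketch using base R's trimws) would be:

# strip leading and trailing whitespace from all last names
student2$last_name <- trimws(student2$last_name)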
Find a dataset with county information and merge it either with the data we used last class, or with a larger county-level dataset for all counties in the US, 1910 to 2010. This second dataset is here, and the variable definitions are here.
It is sufficient for your new dataset to have just one year of information, or information for just one state.
The final dataset could be at the county level, or at some other unit of observation. (For example, if you found a dataset about power plants, which had one observation per power plant with a county ID, you would end up with a power plant-level dataset.)
Make sure you explain the following:
- what your new data describe
- the source of your additional data
- how many observations the new data have
- how many observations you expect from the merge and why; if this seems off, fix it and explain what you did
Answer
# load library
library(tidyverse)
# load the big county level data
counties <- read.csv("H:/pppa_data_viz/2018/tutorials/lecture01/counties_1910to2010_20180115.csv")
# keep just 2010
counties.2010 <- filter(.data = counties, year == 2010)
# make a marker for these data
counties.2010$in.cnty <- 1
# load -other- county level data
# i grab some IRS data from here
# https://www.irs.gov/statistics/soi-tax-stats-county-data-2016
# this command reads an excel file
library(readxl)
tax.stuff <- read_excel("H:/pppa_data_viz/2023/tutorials/answer_programs/tutorial_02/20230207_simplified_irs2016.xlsx")
tax.stuff$in.tax <- 1
head(tax.stuff)
# A tibble: 6 × 7
  state_fips state_abbrev county_fips state_name    num_returns returns_w_prepa…
       <dbl> <chr>              <dbl> <chr>               <dbl>            <dbl>
1          1 AL                     0 Alabama           2043540          1158020
2          1 AL                     1 Autauga Coun…       24090            11130
3          1 AL                     3 Baldwin Coun…       95900            48250
4          1 AL                     5 Barbour Coun…        9210             6270
5          1 AL                     7 Bibb County          7940             4590
6          1 AL                     9 Blount County       22110            14110
# … with 1 more variable: in.tax <dbl>
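Before dropping the state totals in the next step, it can be worth confirming how many rows they account for. A quick check (assuming, as the filter below does, that state totals are the rows where county_fips is 0):

# how many rows are state-level totals?
sum(tax.stuff$county_fips == 0)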
# get rid of rows that are state totals -- county_fips == 0
tax.stuff <- filter(.data = tax.stuff, county_fips != "0")
# merge original county data and new county data
mm <- merge(x = counties.2010,
            y = tax.stuff,
            by.x = c("statefips", "countyfips"),
            by.y = c("state_fips", "county_fips"),
            all = TRUE)
dim(counties.2010)
how many observations you expect from the merge and why; if this seems off, fix it and explain what you did
I expected 3143 and I got 3145. I’m calling that good enough. However, when I first did these answers, my tax dataframe had 3191 obs, and so there were many leftover observations. It turned out that the tax data included state-level totals. You can see in the code above that I filter these out. After I filter them out, the merge improves substantially.
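The in.cnty and in.tax marker variables created earlier are built for exactly this kind of check. One way to use them (a sketch; this tabulation is not in the original code) is to cross-tabulate them after the merge: a 1 in both columns means a matched row, while an NA in either column flags a row that came from only one source.

# cross-tabulate the merge markers, keeping NAs visible
table(mm$in.cnty, mm$in.tax, useNA = "ifany")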
Make three summary statistics of your choice for these new merged data. What these outputs are depends on what you're merging in.
Answer: Below I find the share of returns that have a paid preparer. With this in hand, I find the average of this value by state, as well as the minimum and maximum values across counties within each state. Code below.
# first, I calculate the share of returns signed by preparers
mm$preparer.share <- mm$returns_w_preparer / mm$num_returns
# then I find, by state, the average, min and max across counties
mm <- group_by(.data = mm, statefips)
mm2 <- summarize(.data = mm,
                 mean.preparer.rate = mean(preparer.share, na.rm = TRUE),
                 min.preparer.rate = min(preparer.share),
                 max.preparer.rate = max(preparer.share))
mm2
# A tibble: 51 × 4
   statefips mean.preparer.rate min.preparer.rate max.preparer.rate
       <dbl>              <dbl>             <dbl>             <dbl>
 1         1              0.614             0.445             0.763
 2         2              0.431                NA                NA
 3         4              0.511             0.395             0.578
 4         5              0.630             0.479             0.753
 5         6              0.581             0.496             0.701
 6         8              0.534             0.401             0.735
 7         9              0.519             0.475             0.607
 8        10              0.461             0.421             0.515
 9        11                NaN                NA                NA
10        12              0.485             0.369             0.704
# … with 41 more rows
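Note the NA rows above (statefips 2, for example): I passed na.rm = TRUE to mean() but not to min() or max(), so any state with even one county missing preparer.share gets NA for its extremes. If you want the extremes to skip missing values too, the fix is:

# recompute, dropping NAs in min and max as well
mm2 <- summarize(.data = mm,
                 mean.preparer.rate = mean(preparer.share, na.rm = TRUE),
                 min.preparer.rate = min(preparer.share, na.rm = TRUE),
                 max.preparer.rate = max(preparer.share, na.rm = TRUE))

One caveat: for a state where every county is missing, like statefips 11 here, min() and max() with na.rm = TRUE return Inf and -Inf with a warning rather than NA.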