Why do we do table(was.counties$statefips) and summary(was.counties$cv1) and not vice-versa?
We use table() for categorical variables, including integer codes, and summary() for continuous variables. While statefips is stored as a number, that number has no intrinsic meaning: it is a code that represents a state. R will let you take an average of this numeric categorical variable, but the result won’t mean anything!
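To make the distinction concrete, here is a minimal sketch on a small made-up data frame. The name toy and its values are hypothetical and are not the course data.

```r
# hypothetical toy data: a numeric state code and a population count
toy <- data.frame(
  statefips = c(11, 24, 24, 51, 51, 51),
  cv1       = c(331069, 10325, 16386, 10231, 7468, 13472)
)

# table() counts how many rows fall in each category
state.counts <- table(toy$statefips)
state.counts

# summary() reports the min, quartiles, mean and max of a continuous variable
pop.summary <- summary(toy$cv1)
pop.summary

# R will compute this, but an "average state code" has no interpretation
mean(toy$statefips)
```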
Why does the first summary in part G.3. yield 11 observations, but the second 44?
'data.frame': 246 obs. of 5 variables:
$ statefips : int 11 24 24 24 24 24 51 51 51 51 ...
$ countyfips: int 1 9 17 21 31 33 13 43 47 59 ...
$ cv1 : int 331069 10325 16386 52673 32089 36147 10231 7468 13472 20536 ...
$ year : int 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
$ cv28 : int NA NA NA NA NA NA NA NA NA NA ...
Here is the first summary:
# summarize by year w/o missings
print("find average by year w/o missing values")
The first summary reports one observation per year, and there are 11 years in the data (1910-2010). The second summary reports one observation per year and state, so there are 11 years x 4 states (DC, MD, VA, WV) = 44 observations.
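The same row-count logic can be checked on a small example. This is a sketch that assumes dplyr is installed; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# hypothetical toy data: 2 years, 2 states, 2 counties per state
toy <- data.frame(
  statefips  = rep(c(24, 24, 51, 51), times = 2),
  countyfips = rep(c(9, 17, 13, 59), times = 2),
  year       = rep(c(1910, 2010), each = 4),
  cv1        = c(100, 200, 300, 400, 150, 250, 350, 450)
)

# grouping by year alone gives one row per year
by.year <- summarize(group_by(toy, year),
                     avg_pop = mean(cv1, na.rm = TRUE))
nrow(by.year)         # 2 years -> 2 rows

# grouping by year and state gives one row per year-state pair
by.year.state <- summarize(group_by(toy, year, statefips),
                           avg_pop = mean(cv1, na.rm = TRUE),
                           .groups = "drop")
nrow(by.year.state)   # 2 years x 2 states -> 4 rows
```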
Find and report the average population in DC for the entire period 1910-2010.
# keep dc only
was.counties.dc <- was.counties[which(was.counties$statefips == 11),]
# mean population for all years
mean(was.counties.dc$cv1, na.rm = TRUE)
[1] 605478.1
Find state-level (or the part of the state we observe) average population over the entire period. Put a table with this information in your final output. Describe the results in a sentence or two.
The important logic here is the following:
1. First, make a dataset at the state-year level.
2. Then, take the average by state.
If you take the average without first creating a state-year level dataframe, you are averaging across county-year observations, not across state-year observations.
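A small worked example shows how far apart the two calculations can be. This is a sketch that assumes dplyr is installed; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# hypothetical toy data: one state, two counties, two years
toy <- data.frame(
  statefips  = c(24, 24, 24, 24),
  countyfips = c(9, 17, 9, 17),
  year       = c(1910, 1910, 2010, 2010),
  cv1        = c(100, 300, 200, 400)
)

# step 1: add counties up to state-year totals (400 in 1910, 600 in 2010)
toy.state.year <- summarize(group_by(toy, statefips, year),
                            state_pop = sum(cv1, na.rm = TRUE),
                            .groups = "drop")

# step 2: average the state-year totals: (400 + 600) / 2 = 500
avg.state <- mean(toy.state.year$state_pop)

# shortcut that skips step 1: this is the average county, (100 + 300 + 200 + 400) / 4 = 250
avg.county <- mean(toy$cv1, na.rm = TRUE)

avg.state
avg.county
```

The shortcut answers a different question (how big is the typical county?), which is why the two numbers disagree.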
Here is my code to do this.
# find state-level population by year
str(was.counties)
'data.frame': 246 obs. of 5 variables:
$ statefips : int 11 24 24 24 24 24 51 51 51 51 ...
$ countyfips: int 1 9 17 21 31 33 13 43 47 59 ...
$ cv1 : int 331069 10325 16386 52673 32089 36147 10231 7468 13472 20536 ...
$ year : int 1910 1910 1910 1910 1910 1910 1910 1910 1910 1910 ...
$ cv28 : int NA NA NA NA NA NA NA NA NA NA ...
was.counties.st <- group_by(.data = was.counties, statefips, year)
# add up to state-year level
state.year <- summarize(.data = was.counties.st, state_pop = sum(cv1, na.rm = TRUE))
`summarise()` has grouped output by 'statefips'. You can override using the
`.groups` argument.
# now take a state-level average
state.year <- group_by(.data = state.year, statefips)
state.overall <- summarize(.data = state.year, state_pop_all = mean(state_pop, na.rm = TRUE))
state.overall
For each of the four states, are there more or fewer jurisdictions in this dataset now than in 1910? (Hint: sum(!is.na(variable.name)) tells you the total number of non-missing observations.)
Here I count jurisdictions by state and year, for all years.
# group at state-year level
was.counties.st <- group_by(.data = was.counties, statefips, year)
# count jurisdictions by state-year
state.year <- summarize(.data = was.counties.st, no_jurisdictions = sum(!is.na(cv1), na.rm = TRUE))
`summarise()` has grouped output by 'statefips'. You can override using the
`.groups` argument.
# just print 1910 and 2010
state.year[which(state.year$year %in% c(1910, 2010)),]
This table shows each state in each year, where no_jurisdictions reports the number of jurisdictions.
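To see how sum(!is.na(...)) drives this comparison, here is a minimal sketch on made-up data. It assumes dplyr is installed; the data frame toy and its values are hypothetical.

```r
library(dplyr)

# hypothetical toy data: one state, with a missing population in 1910
toy <- data.frame(
  statefips  = c(51, 51, 51, 51, 51),
  countyfips = c(13, 59, 13, 59, 600),
  year       = c(1910, 1910, 2010, 2010, 2010),
  cv1        = c(10, NA, 30, 40, 50)
)

# count the non-missing jurisdictions in each state-year
toy.counts <- summarize(group_by(toy, statefips, year),
                        no_jurisdictions = sum(!is.na(cv1)),
                        .groups = "drop")

# pull the two years of interest and compare them
n.1910 <- toy.counts$no_jurisdictions[toy.counts$year == 1910]
n.2010 <- toy.counts$no_jurisdictions[toy.counts$year == 2010]
n.2010 - n.1910   # positive means more jurisdictions in 2010
```

Setting .groups = "drop" returns an ungrouped result and avoids the message about grouped output.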
What is the most populous jurisdiction in the DC area in 2010?
# just limit to 2010 counties
was.counties.2010 <- was.counties[which(was.counties$year == 2010),]
# print the maximum population
max.pop <- max(was.counties.2010$cv1, na.rm = TRUE)
print("maximum population for all 2010 counties")
[1] "maximum population for all 2010 counties"
max.pop
[1] 1081726
# list all counties
was.counties.2010[, c("statefips", "countyfips", "cv1")]
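Rather than scanning the full list for the maximum, one option is to pull the matching row directly with which.max(). This is a sketch on hypothetical data; toy.2010 and its values are made up and are not the course data.

```r
# hypothetical toy data standing in for the 2010 counties
toy.2010 <- data.frame(
  statefips  = c(11, 24, 51),
  countyfips = c(1, 31, 59),
  cv1        = c(500, 900, 1000)
)

# which.max() returns the position of the largest value (missing values are skipped)
top.row <- toy.2010[which.max(toy.2010$cv1), c("statefips", "countyfips", "cv1")]
top.row
```

The statefips and countyfips codes in the returned row identify the jurisdiction.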