• Instructions
  • Theory and Concepts
    • Question 1
    • Question 2
  • R Questions
    • Question 3
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
      • Part F
      • Part G
      • Part H
      • Part I
      • Part J
      • Part K
      • Part L
      • Part M
    • Question 4
      • Part A
      • Part B
      • Part C
      • Part D
      • Part E
      • Part F
      • Part G
      • Part H
      • Part I
      • Part J

Due by Thursday, November 21, 2019

Instructions

Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.

Theory and Concepts

Question 1

In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.


Fixed effects can be used for panel data, where each observation $it$ belongs to a group $i$ and a time period $t$. Running a simple OLS model as usual, known as a pooled model, would be biased due to systematic relationships within and between groups and time periods that cause $X_{it}$ to be endogenous:

$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+u_{it}$$

Group-fixed effects ($\alpha_i$) pull out of the error term all of the factors that explain $Y_{it}$ that are stable over time within each individual group ($i$), but vary across groups. For example, if groups are U.S. states, state-fixed effects pull out of the error term all of the idiosyncrasies of each state that do not change over time, such as cultural differences, geographical differences, historical differences, population differences, etc. Group-fixed effects do not pull out factors that change over time, such as federal laws passed, recessions affecting all states, etc.

Time-fixed effects ($\theta_t$) pull out of the error term all of the factors that explain $Y_{it}$ that change over time but do not vary across groups. For example, if groups are U.S. states, and time is in years, year-fixed effects pull out of the error term all of the changes over the years that affect all U.S. states, such as recessions, inflation, national population growth, national immigration trends, or federal laws passed that affect all states. Time-fixed effects do not pull out factors that do not change over time.

$$\hat{Y}_{it}=\beta_0+\beta_1 X_{it}+\alpha_i+\theta_t+\epsilon_{it}$$

Mechanically, OLS estimates a separate constant for each group (and/or each time period), giving the expected value of $Y_{it}$ for that group or time period. This can be done by either de-meaning the data and calculating OLS coefficients by exploiting the variation within each group and/or time period (which is why fixed effects estimators are called “within” estimators), or by creating a dummy variable for each group and/or time period (and omitting one of each, to avoid the dummy variable trap).
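
The equivalence of the two methods can be checked on a small example. The sketch below uses simulated data (all values and variable names are hypothetical, not from the homework datasets) in which $X$ is correlated with a time-invariant group effect, and compares the slope from the dummy variable method with the slope from the de-meaned regression.

```r
# Simulated balanced panel (hypothetical data, for illustration only)
set.seed(1)
n_groups  <- 5
n_periods <- 6
group <- rep(seq_len(n_groups), each = n_periods)
alpha <- rep(rnorm(n_groups, sd = 5), each = n_periods)  # time-invariant group effects
x     <- 0.5 * alpha + rnorm(n_groups * n_periods)        # X is correlated with the group effect
y     <- 2 * x + alpha + rnorm(n_groups * n_periods)      # true slope on X is 2

# Method 1: a dummy variable for each group (R omits one group automatically)
beta_dummy <- coef(lm(y ~ x + factor(group)))[["x"]]

# Method 2: de-mean X and Y within each group, then run OLS without a constant
x_demeaned  <- x - ave(x, group)   # ave() returns each observation's group mean
y_demeaned  <- y - ave(y, group)
beta_within <- coef(lm(y_demeaned ~ 0 + x_demeaned))[["x_demeaned"]]

# Pooled OLS ignores the group effects entirely
beta_pooled <- coef(lm(y ~ x))[["x"]]

c(pooled = beta_pooled, dummy = beta_dummy, within = beta_within)
```

The dummy and within slopes should agree up to floating-point error, while the pooled slope generally will not match them because it absorbs the correlation between $X$ and the omitted group effects.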


Question 2

In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?


$$\hat{Y}_{it}=\beta_0+\beta_1 \text{Treated}_i+\beta_2 \text{After}_t+\beta_3 (\text{Treated}_i \times \text{After}_t)+u_{it}$$

A difference-in-difference model compares the difference before and after a treatment between a group that receives the treatment and a group that does not. By doing so, we can isolate the effect of the treatment. It is easiest to see in the following equation:

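$$\Delta \Delta Y=(\text{Treated}_{after}-\text{Treated}_{before})-(\text{Control}_{after}-\text{Control}_{before})$$

In OLS regression, we can simply make two dummy variables for each observation $it$, depending on the group and the time period that observation belongs to:

$$\text{Treated}_i=\begin{cases}1 & \text{if } i \text{ receives treatment}\\ 0 & \text{if } i \text{ does not receive treatment}\end{cases}$$

$$\text{After}_t=\begin{cases}1 & \text{if } t \text{ is after treatment period}\\ 0 & \text{if } t \text{ is before treatment period}\end{cases}$$

Lastly, we make an interaction term $\text{Treated}_i \times \text{After}_t$, which isolates the treatment effect, captured by the coefficient on this term, $\beta_3$.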
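
To see why $\beta_3$ is the difference-in-difference, it can help to write out the expected outcome for each of the four group-period combinations implied by the regression. This is a short worked derivation using only the symbols defined in the model, and assuming the error term has mean zero in each cell.

```latex
\begin{aligned}
E[Y \mid \text{Control, before}] &= \beta_0 \\
E[Y \mid \text{Control, after}]  &= \beta_0+\beta_2 \\
E[Y \mid \text{Treated, before}] &= \beta_0+\beta_1 \\
E[Y \mid \text{Treated, after}]  &= \beta_0+\beta_1+\beta_2+\beta_3 \\[6pt]
\Delta\Delta Y &= \big[(\beta_0+\beta_1+\beta_2+\beta_3)-(\beta_0+\beta_1)\big]-\big[(\beta_0+\beta_2)-\beta_0\big] \\
               &= (\beta_2+\beta_3)-\beta_2 \\
               &= \beta_3
\end{aligned}
```

So $\beta_1$ is the pre-existing gap between the groups, $\beta_2$ is the change over time common to both groups, and $\beta_3$ is the extra change experienced only by the treated group.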

Diff-in-diff models assume a counterfactual: if the treated group had not received treatment, it would have experienced the same change as the control group over the pre- to post-treatment period (the “parallel trends” assumption).

A classic example is if a state(s) passes a law and others do not. We want to compare the difference in the difference before and after the law was passed between states that passed the law and states that did not:

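$$\Delta_i \Delta_t \text{Outcome}=(\text{Passed}_{after}-\text{Passed}_{before})-(\text{Didn't}_{after}-\text{Didn't}_{before})$$

We can also use group-fixed effects in place of “$\text{Treated}_i$” and time-fixed effects in place of “$\text{After}_t$” to estimate the treatment effect (still $\beta_3$):

$$\hat{Y}_{it}=\alpha_i+\theta_t+\beta_3 (\text{Treated}_i \times \text{After}_t)+\epsilon_{it}$$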
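
As a rough check on the logic, the following sketch simulates a two-group, two-period dataset (hypothetical values, not the homework data) and confirms that the interaction coefficient from OLS equals the difference-in-difference computed directly from the four cell means.

```r
# Simulated two-group, two-period data (hypothetical, for illustration only)
set.seed(2)
n <- 400
treated <- rep(c(0, 1), each = n / 2)    # first half control, second half treated
after   <- rep(c(0, 1), times = n / 2)   # alternate before/after within each group
y <- 10 + 2 * treated + 1 * after + 3 * treated * after + rnorm(n)  # true effect is 3

# Difference-in-difference via OLS with an interaction term
did_reg <- lm(y ~ treated * after)
beta3   <- coef(did_reg)[["treated:after"]]

# Difference-in-difference via the four cell means
cell_mean <- function(tr, af) mean(y[treated == tr & after == af])
did_means <- (cell_mean(1, 1) - cell_mean(1, 0)) -
             (cell_mean(0, 1) - cell_mean(0, 0))

c(beta3 = beta3, did_means = did_means)
```

Because the regression has one parameter per cell, the two numbers should coincide up to floating-point error; both are estimates of the true effect built into the simulation.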


R Questions

Answer the following questions using R. When necessary, please write answers in the same document (knitted Rmd to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your knitted markdown) your code and the outputs of your code with the rest of your answers.

Question 3

How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:

| Variable | Description |
|---|---|
| state | U.S. State |
| year | Year |
| appspc | Applications to the Peace Corps (per capita) in State |
| unemployrate | State unemployment rate |

Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?

Part A

Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.


If joining the Peace Corps is a substitute for private sector jobs, then as unemployment rises, so too should applications. However, it is also possible that people are more willing to opt out of the workforce when the economy is strong, so we should examine this empirically to be sure.


Part B

To get the hang of the data we’re working with, count (separately) the number of states, and the number of years. Get the number of n_distinct() states and years1, as well as the distinct() values of each2.


library(tidyverse)
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
# load in data
peacecorps<-read_csv("https://metricsf19.classes.ryansafner.com/data/PeaceCorps.csv")
## Parsed with column specification:
## cols(
##   state = col_character(),
##   year = col_double(),
##   unemployrate = col_double(),
##   appspc = col_double(),
##   stcode = col_double(),
##   stateshort = col_character()
## )
# count number of states
peacecorps %>%
  count(state)
| state (chr) | n (int) |
|---|---|
| ALABAMA | 6 |
| ALASKA | 6 |
| ARIZONA | 6 |
| ARKANSAS | 6 |
| CALIFORNIA | 6 |
| COLORADO | 6 |
| CONNECTICUT | 6 |
| DELAWARE | 6 |
| DISTRICT OF COLUMBIA | 6 |
| FLORIDA | 6 |
# count number of years
peacecorps %>%
  count(year)
| year (dbl) | n (int) |
|---|---|
| 2006 | 51 |
| 2007 | 51 |
| 2008 | 51 |
| 2009 | 51 |
| 2010 | 51 |
| 2011 | 51 |
# get number of distinct states and years
peacecorps %>%
  summarize(Num_states = n_distinct(state),
            Num_years = n_distinct(year))
| Num_states (int) | Num_years (int) |
|---|---|
| 51 | 6 |
# get distinct values of states
peacecorps %>%
  distinct(state)
| state (chr) |
|---|
| ALABAMA |
| ALASKA |
| ARIZONA |
| ARKANSAS |
| CALIFORNIA |
| COLORADO |
| CONNECTICUT |
| DELAWARE |
| DISTRICT OF COLUMBIA |
| FLORIDA |
# get distinct values of years
peacecorps %>%
  distinct(year)
| year (dbl) |
|---|
| 2006 |
| 2007 |
| 2008 |
| 2009 |
| 2010 |
| 2011 |

Part C

Continuing our pre-analysis inspection, (install, and) load the plm package, and check the dimensions of the data with pdim.3


# install.packages("plm")

# load plm
library(plm)
## 
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead
pdim(peacecorps, index=c("state", "year"))
## Balanced Panel: n = 51, T = 6, N = 306

Part D

Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.


ggplot(data = peacecorps)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors

We can see there are very clear outliers at the top. Let’s plot text instead of points, using the stateshort to see which observations are which states:

ggplot(data = peacecorps)+
    aes(x = unemployrate,
        y = appspc)+
    geom_text(aes(color = as.factor(state), label = stateshort))+ # plot state abbreviations instead of points
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors

Clearly the observations labeled DIS, the District of Columbia, are the outliers. Let’s make a second scatterplot without them:

peacecorps %>%
  filter(state != "DISTRICT OF COLUMBIA") %>%
ggplot(data = .)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors


Part E

Run two pooled regressions, one with the outliers, and one without them. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.


pooled<-lm(appspc ~ unemployrate, data = peacecorps)
summary(pooled)
## 
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.616 -20.641  -9.719   5.171 310.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.2255     7.2857   6.345 8.07e-10 ***
## unemployrate   0.8353     1.0362   0.806    0.421    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.2 on 304 degrees of freedom
## Multiple R-squared:  0.002133,   Adjusted R-squared:  -0.00115 
## F-statistic: 0.6498 on 1 and 304 DF,  p-value: 0.4208

$$\widehat{\text{Apps per capita}}_{it}=46.23+0.84\,\text{Unemployment Rate}_{it}$$

For every 1 percentage point increase in unemployment, there are 0.84 more applications to the Peace Corps per capita.

peacecorps_no_outliers<-peacecorps %>%
  filter(state != "DISTRICT OF COLUMBIA")

pooled_no_outliers<-lm(appspc ~ unemployrate, data = peacecorps_no_outliers)
summary(pooled_no_outliers)
## 
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.380 -14.544  -4.190   9.278  97.997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.5891     3.5955  13.514   <2e-16 ***
## unemployrate  -0.3710     0.5133  -0.723     0.47    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.17 on 298 degrees of freedom
## Multiple R-squared:  0.001751,   Adjusted R-squared:  -0.001599 
## F-statistic: 0.5226 on 1 and 298 DF,  p-value: 0.4703

$$\widehat{\text{Apps per capita}}_{it}=48.59-0.37\,\text{Unemployment Rate}_{it}$$

For every 1 percentage point increase in unemployment, there are 0.37 fewer applications to the Peace Corps per capita.

The coefficient changes sign and shrinks in magnitude when we drop the outliers (DC)!


Part F

Now run a regression with State fixed effects using the dummy variable method.4 Interpret the marginal effect of unemployrate on appspc. How did it change?


fe_reg<-lm(appspc ~ unemployrate + factor(state), data = peacecorps)
summary(fe_reg)
## 
## Call:
## lm(formula = appspc ~ unemployrate + factor(state), data = peacecorps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.652  -3.661  -0.393   3.492  33.262 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        12.6070     3.7663   3.347 0.000939 ***
## unemployrate                        0.6565     0.2330   2.817 0.005225 ** 
## factor(state)ALASKA                52.5875     4.8452  10.854  < 2e-16 ***
## factor(state)ARIZONA               18.0456     4.8463   3.724 0.000242 ***
## factor(state)ARKANSAS               4.9117     4.8447   1.014 0.311630    
## factor(state)CALIFORNIA            22.1455     4.8692   4.548 8.39e-06 ***
## factor(state)COLORADO              69.9201     4.8452  14.431  < 2e-16 ***
## factor(state)CONNECTICUT           28.9349     4.8446   5.973 7.84e-09 ***
## factor(state)DELAWARE              21.8317     4.8489   4.502 1.02e-05 ***
## factor(state)DISTRICT OF COLUMBIA 311.7036     4.8533  64.225  < 2e-16 ***
## factor(state)FLORIDA               13.0805     4.8492   2.697 0.007456 ** 
## factor(state)GEORGIA               18.7073     4.8486   3.858 0.000145 ***
## factor(state)HAWAII                35.7883     4.8617   7.361 2.53e-12 ***
## factor(state)IDAHO                 34.4465     4.8480   7.105 1.21e-11 ***
## factor(state)ILLINOIS              27.8559     4.8503   5.743 2.65e-08 ***
## factor(state)INDIANA               21.5183     4.8478   4.439 1.35e-05 ***
## factor(state)IOWA                  28.9038     4.8614   5.946 9.06e-09 ***
## factor(state)KANSAS                28.9728     4.8507   5.973 7.83e-09 ***
## factor(state)KENTUCKY               9.6535     4.8540   1.989 0.047800 *  
## factor(state)LOUISIANA              2.5206     4.8517   0.520 0.603844    
## factor(state)MAINE                 53.3394     4.8450  11.009  < 2e-16 ***
## factor(state)MARYLAND              45.1903     4.8513   9.315  < 2e-16 ***
## factor(state)MASSACHUSETTS         44.9452     4.8450   9.277  < 2e-16 ***
## factor(state)MICHIGAN              26.8719     4.8970   5.487 9.86e-08 ***
## factor(state)MINNESOTA             45.6153     4.8476   9.410  < 2e-16 ***
## factor(state)MISSISSIPPI           -5.6364     4.8607  -1.160 0.247307    
## factor(state)MISSOURI              19.9421     4.8458   4.115 5.23e-05 ***
## factor(state)MONTANA               66.2285     4.8583  13.632  < 2e-16 ***
## factor(state)NEBRASKA              26.6913     4.8904   5.458 1.14e-07 ***
## factor(state)NEVADA                12.7169     4.8767   2.608 0.009656 ** 
## factor(state)NEW HAMPSHIRE         54.0272     4.8658  11.103  < 2e-16 ***
## factor(state)NEW JERSEY            13.4382     4.8452   2.774 0.005956 ** 
## factor(state)NEW MEXICO            28.0890     4.8503   5.791 2.06e-08 ***
## factor(state)NEW YORK              21.5156     4.8446   4.441 1.34e-05 ***
## factor(state)NORTH CAROLINA        23.2097     4.8533   4.782 2.94e-06 ***
## factor(state)NORTH DAKOTA          19.6067     4.9034   3.999 8.36e-05 ***
## factor(state)OHIO                  25.2887     4.8500   5.214 3.83e-07 ***
## factor(state)OKLAHOMA              10.4850     4.8560   2.159 0.031774 *  
## factor(state)OREGON                72.7034     4.8545  14.976  < 2e-16 ***
## factor(state)PENNSYLVANIA          21.7160     4.8449   4.482 1.12e-05 ***
## factor(state)RHODE ISLAND          34.0396     4.8654   6.996 2.33e-11 ***
## factor(state)SOUTH CAROLINA        12.0780     4.8651   2.483 0.013690 *  
## factor(state)SOUTH DAKOTA          22.9175     4.8862   4.690 4.46e-06 ***
## factor(state)TENNESSEE              9.1572     4.8498   1.888 0.060142 .  
## factor(state)TEXAS                  9.2251     4.8455   1.904 0.058062 .  
## factor(state)UTAH                  18.8667     4.8571   3.884 0.000131 ***
## factor(state)VERMONT              104.7363     4.8580  21.560  < 2e-16 ***
## factor(state)VIRGINIA              46.9318     4.8607   9.655  < 2e-16 ***
## factor(state)WASHINGTON            63.8660     4.8460  13.179  < 2e-16 ***
## factor(state)WEST VIRGINIA          3.9514     4.8461   0.815 0.415620    
## factor(state)WISCONSIN             35.8930     4.8448   7.409 1.89e-12 ***
## factor(state)WYOMING               35.3036     4.8665   7.254 4.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.391 on 254 degrees of freedom
## Multiple R-squared:  0.9713, Adjusted R-squared:  0.9655 
## F-statistic: 168.4 on 51 and 254 DF,  p-value: < 2.2e-16

For every 1 percentage point increase in the unemployment rate, there are 0.66 more applications to the Peace Corps per capita. The coefficient remains positive but shrinks relative to the original pooled regression with outliers (0.84), and it is now statistically significant.


Part G

Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?


This is just like any model with dummy variables representing different categories. The constant (12.61) represents the reference category not put in the regression (to avoid the dummy variable trap), i.e. ALABAMA.

The coefficient on Maryland is 45.19. This means that, holding the unemployment rate constant, there are on average 45.19 more applications per capita in Maryland than in Alabama over the time period in the data. To get Maryland's own constant (its predicted applications per capita at an unemployment rate of 0), we add $12.61+45.19=57.80$.

# if we want to extract it using R
library(broom)

fe_reg_tidy<-tidy(fe_reg)

AL<-fe_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

MD_AL_diff<-fe_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

AL+MD_AL_diff
## [1] 57.79731
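
An alternative way to read off each group's own constant is to drop the overall intercept, so that R keeps a dummy for every group instead of omitting a reference category. The sketch below illustrates this on simulated data (hypothetical groups A, B, and C, not the Peace Corps data) and checks that the two approaches give the same group constant.

```r
# Simulated data with three groups (hypothetical, for illustration only)
set.seed(3)
group      <- rep(c("A", "B", "C"), each = 6)
intercepts <- c(A = 10, B = 25, C = 40)
x <- rnorm(18)
y <- unname(intercepts[group]) + 0.5 * x + rnorm(18)

# With an intercept: group A is the reference category
with_ref <- coef(lm(y ~ x + factor(group)))

# Without an intercept: every group gets its own constant
no_ref <- coef(lm(y ~ 0 + x + factor(group)))

# Group B's constant, computed both ways
b_from_ref <- with_ref[["(Intercept)"]] + with_ref[["factor(group)B"]]
b_direct   <- no_ref[["factor(group)B"]]

c(from_reference = b_from_ref, direct = b_direct)
```

The slope on `x` is the same in both regressions; only the way the group constants are reported changes.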

Part H

Now try using the plm() command, which de-means the data, and make sure you get the same results as Part F.5 Do you get the same marginal effect of unemployrate on appspc?


fe_reg_alt<-plm(appspc ~ unemployrate, index = "state", model = "within", data = peacecorps)
summary(fe_reg_alt)
## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, model = "within", 
##     index = "state")
## 
## Balanced Panel: n = 51, T = 6, N = 306
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -31.65231  -3.66134  -0.39349   3.49204  33.26228 
## 
## Coefficients:
##              Estimate Std. Error t-value Pr(>|t|)   
## unemployrate  0.65651    0.23303  2.8172 0.005225 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    18443
## Residual Sum of Squares: 17884
## R-Squared:      0.0303
## Adj. R-Squared: -0.1644
## F-statistic: 7.93677 on 1 and 254 DF, p-value: 0.0052247

Yes. The coefficient on unemployrate (0.657) and its standard error (0.233) match the dummy variable regression in Part F; plm() simply does not report the constant or the state dummies.

Part I

Now include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of unemployrate on appspc. How did it change?


fe_2way_reg<-lm(appspc ~ unemployrate + factor(state) + factor(year), data = peacecorps)
summary(fe_2way_reg)
## 
## Call:
## lm(formula = appspc ~ unemployrate + factor(state) + factor(year), 
##     data = peacecorps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.7457  -3.1669   0.1384   2.7666  31.1644 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        14.48800    3.91879   3.697 0.000268 ***
## unemployrate                        0.70525    0.54733   1.289 0.198760    
## factor(state)ALASKA                52.57126    4.08265  12.877  < 2e-16 ***
## factor(state)ARIZONA               18.01801    4.09035   4.405 1.57e-05 ***
## factor(state)ARKANSAS               4.91897    4.07940   1.206 0.229036    
## factor(state)CALIFORNIA            22.04317    4.23743   5.202 4.12e-07 ***
## factor(state)COLORADO              69.93711    4.08307  17.129  < 2e-16 ***
## factor(state)CONNECTICUT           28.93000    4.07894   7.093 1.36e-11 ***
## factor(state)DELAWARE              21.87470    4.10713   5.326 2.24e-07 ***
## factor(state)DISTRICT OF COLUMBIA 311.64268    4.13555  75.357  < 2e-16 ***
## factor(state)FLORIDA               13.03582    4.10931   3.172 0.001702 ** 
## factor(state)GEORGIA               18.66589    4.10502   4.547 8.50e-06 ***
## factor(state)HAWAII                35.87361    4.18953   8.563 1.16e-15 ***
## factor(state)IDAHO                 34.48472    4.10104   8.409 3.24e-15 ***
## factor(state)ILLINOIS              27.80635    4.11635   6.755 1.00e-10 ***
## factor(state)INDIANA               21.48091    4.10010   5.239 3.44e-07 ***
## factor(state)IOWA                  28.98828    4.18745   6.923 3.75e-11 ***
## factor(state)KANSAS                29.02399    4.11886   7.047 1.79e-11 ***
## factor(state)KENTUCKY               9.59010    4.14017   2.316 0.021351 *  
## factor(state)LOUISIANA              2.57585    4.12547   0.624 0.532951    
## factor(state)MAINE                 53.35320    4.08152  13.072  < 2e-16 ***
## factor(state)MARYLAND              45.24393    4.12277  10.974  < 2e-16 ***
## factor(state)MASSACHUSETTS         44.95986    4.08187  11.015  < 2e-16 ***
## factor(state)MICHIGAN              26.72242    4.41044   6.059 5.03e-09 ***
## factor(state)MINNESOTA             45.65107    4.09827  11.139  < 2e-16 ***
## factor(state)MISSISSIPPI           -5.71925    4.18336  -1.367 0.172813    
## factor(state)MISSOURI              19.91937    4.08656   4.874 1.95e-06 ***
## factor(state)MONTANA               66.30488    4.16773  15.909  < 2e-16 ***
## factor(state)NEBRASKA              26.83099    4.36996   6.140 3.24e-09 ***
## factor(state)NEVADA                12.59989    4.28489   2.941 0.003585 ** 
## factor(state)NEW HAMPSHIRE         54.12221    4.21590  12.838  < 2e-16 ***
## factor(state)NEW JERSEY            13.42198    4.08265   3.288 0.001156 ** 
## factor(state)NEW MEXICO            28.13855    4.11635   6.836 6.25e-11 ***
## factor(state)NEW YORK              21.52042    4.07894   5.276 2.87e-07 ***
## factor(state)NORTH CAROLINA        23.14879    4.13555   5.598 5.73e-08 ***
## factor(state)NORTH DAKOTA          19.76510    4.44960   4.442 1.34e-05 ***
## factor(state)OHIO                  25.24078    4.11393   6.135 3.32e-09 ***
## factor(state)OKLAHOMA              10.55484    4.15333   2.541 0.011652 *  
## factor(state)OREGON                72.63838    4.14334  17.531  < 2e-16 ***
## factor(state)PENNSYLVANIA          21.72903    4.08118   5.324 2.26e-07 ***
## factor(state)RHODE ISLAND          33.94537    4.21360   8.056 3.28e-14 ***
## factor(state)SOUTH CAROLINA        11.98462    4.21132   2.846 0.004799 ** 
## factor(state)SOUTH DAKOTA          23.05074    4.34429   5.306 2.48e-07 ***
## factor(state)TENNESSEE              9.11011    4.11274   2.215 0.027659 *  
## factor(state)TEXAS                  9.24536    4.08494   2.263 0.024480 *  
## factor(state)UTAH                  18.93985    4.16038   4.552 8.30e-06 ***
## factor(state)VERMONT              104.81188    4.16587  25.160  < 2e-16 ***
## factor(state)VIRGINIA              47.01469    4.18336  11.239  < 2e-16 ***
## factor(state)WASHINGTON            63.84082    4.08836  15.615  < 2e-16 ***
## factor(state)WEST VIRGINIA          3.97742    4.08900   0.973 0.331641    
## factor(state)WISCONSIN             35.90359    4.08029   8.799 2.35e-16 ***
## factor(state)WYOMING               35.40021    4.22056   8.388 3.73e-15 ***
## factor(year)2007                   -5.36842    1.39953  -3.836 0.000159 ***
## factor(year)2008                    0.02018    1.48030   0.014 0.989135    
## factor(year)2009                    4.23877    2.61984   1.618 0.106939    
## factor(year)2010                   -3.25521    2.76284  -1.178 0.239836    
## factor(year)2011                   -8.88542    2.47990  -3.583 0.000409 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.064 on 249 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9756 
## F-statistic: 218.3 on 56 and 249 DF,  p-value: < 2.2e-16

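For every 1 percentage point increase in the unemployment rate, there are 0.71 more applications per capita to the Peace Corps, holding state and year effects constant. The estimate is slightly larger than in Part F (0.66), but its standard error more than doubles (0.55 versus 0.23), and it is no longer statistically significant (p ≈ 0.20).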


Part J

What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?


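$$\begin{aligned}
\widehat{\text{Apps per capita}}_{it}&=14.49+0.71\,\text{Unemployment Rate}_{it}+45.24\,\text{Maryland}_i-8.89\,\text{2011}_t\\
\widehat{\text{Apps per capita}}_{MD,2011}&=14.49+0.71(5)+45.24(1)-8.89(1)\\
\widehat{\text{Apps per capita}}_{MD,2011}&\approx 54.39
\end{aligned}$$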

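# if we want to extract it using R
# store the tidied coefficients under a new name so the model object is not overwritten
fe_2way_reg_tidy<-tidy(fe_2way_reg)

constant<-fe_2way_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

unemploy_me<-fe_2way_reg_tidy %>%
  filter(term == "unemployrate") %>%
  pull(estimate)

MD_dif<-fe_2way_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

# the prediction is for 2011, so use the 2011 year dummy
dif_2011<-fe_2way_reg_tidy %>%
  filter(term == "factor(year)2011") %>%
  pull(estimate)

constant+unemploy_me*5+MD_dif+dif_2011
# roughly 54.37 (from the coefficients printed in Part I)
# small differences from a hand calculation come from rounding the coefficients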
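
As a cross-check on the hand calculation, the same sum can be computed directly from the coefficients printed in the Part I output. This is only a sketch: the values below are copied by hand from that output (to five decimal places), and the variable names are chosen here for illustration.

```r
# Coefficients copied from the printed two-way fixed effects output in Part I
b_intercept <- 14.48800   # constant: reference state (Alabama) in the reference year (2006)
b_unemploy  <- 0.70525    # marginal effect of the unemployment rate
b_maryland  <- 45.24393   # Maryland relative to Alabama
b_2011      <- -8.88542   # 2011 relative to 2006

# Predicted applications per capita for Maryland in 2011 at 5% unemployment
pred_md_2011 <- b_intercept + b_unemploy * 5 + b_maryland + b_2011
round(pred_md_2011, 2)
```

Note that a 2011 prediction uses the 2011 year dummy; using the 2007 coefficient (-5.37) instead would give a prediction for Maryland in 2007.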

Part K

Now try using the plm() command, which de-means the data, and make sure you get the same results as Part I.6 Do you get the same marginal effect of unemployrate on appspc?


fe_2way_reg_alt<-plm(appspc ~ unemployrate, index = c("state", "year"), data = peacecorps, effect = "twoways")
summary(fe_2way_reg_alt)
## Twoways effects Within Model
## 
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, effect = "twoways", 
##     index = c("state", "year"))
## 
## Balanced Panel: n = 51, T = 6, N = 306
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -26.74570  -3.16688   0.13837   2.76660  31.16439 
## 
## Coefficients:
##              Estimate Std. Error t-value Pr(>|t|)
## unemployrate  0.70525    0.54733  1.2885   0.1988
## 
## Total Sum of Squares:    12509
## Residual Sum of Squares: 12426
## R-Squared:      0.0066237
## Adj. R-Squared: -0.21679
## F-statistic: 1.66029 on 1 and 249 DF, p-value: 0.19876

Yes. The coefficient on unemployrate (0.705) and its standard error (0.547) match the dummy variable regression in Part I.

Part L

Can there still be endogeneity in this model? Give some examples.


Yes. There may be other time-varying variables that are omitted and correlated with unemployment, but not picked up in the time-fixed effect since they are not the same in each state. Examples could be a local market bubble some years in Texas or Nevada, but not in Maine, or bad weather in Florida one year, but not in Wyoming, etc.7


Part M

Create a nice regression table (using huxtable) for comparison of the regressions in E, F, and I.


library(huxtable)
## 
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
## 
##     add_rownames
## The following object is masked from 'package:purrr':
## 
##     every
## The following object is masked from 'package:ggplot2':
## 
##     theme_grey
huxreg(pooled,
       pooled_no_outliers,
       fe_reg_alt,
       fe_2way_reg_alt,
         coefs = c("Constant" = "(Intercept)",
                 "Unemployment Rate" = "unemployrate"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
       number_format = 3) %>%

add_rows(c("Outliers", "Yes", "No DC", "Yes", "Yes"), # add row for outliers
         after = 5) %>% # insert after 5th row
add_rows(c("Fixed Effects", "None", "None", "State", "State and Year"), # add fixed effects row
         after = 6) # insert after 6th row
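| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Constant | 46.226 *** | 48.589 *** | | |
| | (7.286) | (3.595) | | |
| Unemployment Rate | 0.835 | -0.371 | 0.657 ** | 0.705 |
| | (1.036) | (0.513) | (0.233) | (0.547) |
| Outliers | Yes | No DC | Yes | Yes |
| Fixed Effects | None | None | State | State and Year |
| N | 306 | 300 | 306 | 306 |
| R-Squared | 0.002 | 0.002 | 0.030 | 0.007 |
| SER | 45.205 | 22.170 | | |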

Question 4

Are teachers paid more when school board members are elected “off cycle,” when there are no major national political elections (e.g. odd years), than when they are elected “on cycle”? The argument is that during “off” years, without attention on state or national elections, voters will pay less attention to the election, and teachers can more effectively mobilize for higher pay, versus “on” years where voters are paying more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.

From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.

| Variable | Description |
|---|---|
| LnAvgSalary | logged average salary of teachers in district |
| Year | Year |
| OnCycle | =1 if school boards elected on-cycle (e.g. same year as national and state elections), =0 if elected off-cycle |
| pol_freedom | Political freedom index score (2018) from 1 (least) to 10 (most free) |
| CycleSwitch | =1 if district switched from off- to on-cycle elections |
| AfterSwitch | =1 if year is after 2006 |

Part A

Run a pooled regression model of LnAvgSalary on OnCycle. Write the estimated regression equation, and interpret the coefficient on OnCycle. Are there any sources of bias (consider in particular the argument in the question prompt)?


# load in data
texas<-read_csv("https://metricsf19.classes.ryansafner.com/data/TexasSchools.csv")
## Parsed with column specification:
## cols(
##   DistNumber = col_double(),
##   Year = col_double(),
##   OnCycle = col_double(),
##   LnAvgSalary = col_double(),
##   CycleSwitch = col_double()
## )
pooled_reg<-lm(LnAvgSalary~OnCycle, data = texas)
summary(pooled_reg)
## 
## Call:
## lm(formula = LnAvgSalary ~ OnCycle, data = texas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28910 -0.05527 -0.00646  0.05196  0.65215 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 10.671363   0.001018 10478.815   <2e-16 ***
## OnCycle     -0.030621   0.003766    -8.131    5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08284 on 7137 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.009178,   Adjusted R-squared:  0.009039 
## F-statistic: 66.11 on 1 and 7137 DF,  p-value: 4.995e-16

$$\widehat{\log(\text{Average Salary})}_{it}=10.67-0.03\,\text{OnCycle}_{it}$$

Recognize this is a log-linear model (the outcome is logged) and $X$ is a dummy variable. School board elections that are held on-cycle (OnCycle $=1$) are associated with teacher salaries that are about 3% lower: $\hat{\beta}_1 \times 100\% = -0.03 \times 100\% = -3\%$. This is highly statistically significant.
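
The “multiply by 100” rule is an approximation that works well for small coefficients. A quick sketch of the comparison, using the OnCycle coefficient copied by hand from the regression output above (the variable names are chosen here for illustration):

```r
# Coefficient on OnCycle, copied from the pooled regression output
b_oncycle <- -0.030621

# Approximate percentage change in salary: 100 * beta
approx_pct <- 100 * b_oncycle

# Exact percentage change implied by a logged outcome: 100 * (exp(beta) - 1)
exact_pct <- 100 * (exp(b_oncycle) - 1)

round(c(approximate = approx_pct, exact = exact_pct), 2)
```

For a coefficient this close to zero the two versions differ by less than a tenth of a percentage point, so the approximation is harmless here.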

This estimate is almost certainly biased. There are probably district-level factors correlated both with whether districts vote (or switch to voting) on cycle and with teacher salaries. Perhaps some districts have strong or weak teachers unions, and that determines whether they choose to vote on or off cycle (off cycle would be preferable for teachers, and perhaps districts with strong teachers unions kept the elections off cycle).

Also, there could be changes in the entire State over time: perhaps all Texas teacher salaries were increasing or decreasing over the years.


Part B

Some schools decided to switch to an on-cycle election after 2006. Consider this switch, CycleSwitch, the “treatment.” Create a variable to indicate post-treatment years (i.e. years after 2006). Call it After. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.


# create a variable, that is a factor, using the ifelse() function
# if the year is greater than 2006, mark "After" as 1
# if the year is NOT greater than 2006, mark "After" as 0
texas<-texas %>%
  mutate(After = factor(ifelse(test = Year>2006,
                        yes = 1,
                        no = 0)))
head(texas)
DistNumber  Year      OnCycle  LnAvgSalary  CycleSwitch  After
4.8e+06     2e+03     0        10.6         0            0
4.8e+06     2e+03     0        10.6         0            0
4.8e+06     2.00e+03  0        10.6         0            0
4.8e+06     2.01e+03  0        10.6         0            1
4.8e+06     2.01e+03  0        10.6         0            1
4.8e+06     2.01e+03  0        10.6         0            1
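The question also asks for an explicit interaction variable. In Part C the formula term CycleSwitch:After builds it automatically, but it can also be created by hand. The sketch below uses base R on a small made-up data frame (not the Texas data); the name SwitchAfter is hypothetical, and After is kept numeric here so the two dummies can be multiplied.

```r
# Toy data frame with hypothetical values (not the Texas data)
toy <- data.frame(
  Year        = c(2005, 2006, 2007, 2008, 2005, 2008),
  CycleSwitch = c(0, 0, 0, 0, 1, 1)
)

# Post-treatment dummy: 1 for years after 2006, 0 otherwise
toy$After <- as.integer(toy$Year > 2006)

# Interaction dummy: 1 only for switching districts in post-treatment years
toy$SwitchAfter <- toy$CycleSwitch * toy$After

toy
```

Only the last row (a switching district observed in 2008) gets SwitchAfter equal to 1.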

Part C

Now estimate a difference-in-difference model with your variables in Part B: CycleSwitch is the treatment variable, After is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).


dnd<-lm(LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, data = texas)
summary(dnd)
## 
## Call:
## lm(formula = LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, 
##     data = texas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29811 -0.05595 -0.00734  0.05284  0.65245 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        10.671068   0.001433 7447.598  < 2e-16 ***
## CycleSwitch        -0.023982   0.003397   -7.059 1.83e-12 ***
## After1              0.009303   0.002188    4.251 2.16e-05 ***
## CycleSwitch:After1 -0.008590   0.005189   -1.655   0.0979 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08334 on 7198 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.0183, Adjusted R-squared:  0.01789 
## F-statistic: 44.72 on 3 and 7198 DF,  p-value: < 2.2e-16

$$\widehat{\log(\text{Average Salary})}_{it} = 10.6711 - 0.0240 \, \text{Cycle Switch}_i + 0.0093 \, \text{After}_t - 0.0086 \, (\text{Cycle Switch}_i \times \text{After}_t)$$


Part D

Interpret what each coefficient means from Part C.


  • $\hat{\beta}_0$: districts before 2007 that did not switch had ln(AvgSalary) of 10.6711
  • $\hat{\beta}_1$: districts before 2007 that did switch had ln(AvgSalary) 0.0240 lower than districts that did not switch
  • $\hat{\beta}_2$: all districts after the change had ln(AvgSalary) 0.0093 higher than before the change
  • $\hat{\beta}_3$: districts that made a switch had a 0.0086 lower change in ln(AvgSalary) over time than districts that did not switch

Part E

Using your regression equation in Part C, calculate the expected logged average salary (Y) for districts in Texas:

    1. Before the switch that did not switch
    2. After the switch that did not switch
    3. Before the switch that did switch
    4. After the switch that did switch

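    1. Before change and did not switch: $\hat{\beta}_0 = 10.6711$
    2. After change and did not switch: $\hat{\beta}_0 + \hat{\beta}_2 = 10.6711 + 0.0093 = 10.6804$
    3. Before change and did switch: $\hat{\beta}_0 + \hat{\beta}_1 = 10.6711 - 0.0240 = 10.6471$
    4. After change and did switch: $\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3 = 10.6711 - 0.0240 + 0.0093 - 0.0086 = 10.6478$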
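The four calculations above can also be done in one step. The sketch below plugs the rounded Part C coefficients (typed in by hand) into the regression equation for every combination of the two dummies. The object names b0 to b3 and cells are just illustrative.

```r
# Rounded coefficients copied from the Part C regression output
b0 <- 10.6711
b1 <- -0.0240
b2 <- 0.0093
b3 <- -0.0086

# All four combinations of the treatment and post-treatment dummies
cells <- expand.grid(CycleSwitch = c(0, 1), After = c(0, 1))

# Expected logged average salary for each combination
cells$predicted <- b0 + b1 * cells$CycleSwitch + b2 * cells$After +
  b3 * cells$CycleSwitch * cells$After

cells
```

Each row of cells corresponds to one of the four groups, so the predicted column should match the hand calculations.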

Part F

Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data.⁸


# avg log salary for districts BEFORE that did NOT switch
texas %>%
  filter(CycleSwitch == 0,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
average
10.7
# avg log salary for districts AFTER that did NOT switch
texas %>%
  filter(CycleSwitch == 0,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
average
10.7
# avg log salary for districts BEFORE that DID switch
texas %>%
  filter(CycleSwitch == 1,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
average
10.6
# avg log salary for districts AFTER that DID switch
texas %>%
  filter(CycleSwitch == 1,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm=T)) # need to remove NA's
average
10.6
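The printed means above appear rounded to one decimal place, which is too coarse to compare against the four-decimal values from Part E. One possible alternative, sketched below in base R on a small made-up data frame (not the Texas data), computes all four group means in a single call and prints them with more digits. The formula method of aggregate() drops rows with a missing outcome by default, which plays the role of na.rm here.

```r
# Toy data with hypothetical values (not the Texas data); one salary is missing
toy <- data.frame(
  CycleSwitch = c(0, 0, 0, 0, 1, 1, 1, 1),
  After       = c(0, 0, 1, 1, 0, 0, 1, 1),
  LnAvgSalary = c(10.60, 10.70, 10.65, 10.75, 10.55, 10.65, 10.60, NA)
)

# Mean of LnAvgSalary within each CycleSwitch x After group;
# rows with a missing LnAvgSalary are omitted before averaging
group_means <- aggregate(LnAvgSalary ~ CycleSwitch + After, data = toy, FUN = mean)

# Print with enough digits to compare against four-decimal estimates
print(group_means, digits = 6)
```

Running the same call with data = texas should give the four group means needed for the comparison, assuming After has been created as in Part B.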

Part G

Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.


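$$\begin{aligned}
\Delta\Delta \ln(\text{AvgSalary}) &= (\text{Switched}_{after} - \text{Switched}_{before}) - (\text{Didn't}_{after} - \text{Didn't}_{before}) \\
&= (10.6478 - 10.6471) - (10.6804 - 10.6711) \\
&= (0.0007) - (0.0093) \\
&= -0.0086
\end{aligned}$$

This is $\hat{\beta}_3$.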
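To see why the difference-in-difference always equals the interaction coefficient, and not just for these numbers, the four expected values from Part E can be substituted in symbolically. This is a short sketch using the same coefficient labels as Part C:

```latex
\begin{aligned}
\Delta\Delta \ln(\text{AvgSalary})
  &= \big[(\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3) - (\hat{\beta}_0 + \hat{\beta}_1)\big]
     - \big[(\hat{\beta}_0 + \hat{\beta}_2) - \hat{\beta}_0\big] \\
  &= (\hat{\beta}_2 + \hat{\beta}_3) - \hat{\beta}_2 \\
  &= \hat{\beta}_3
\end{aligned}
```

The first bracket is the change over time for districts that switched, and the second is the change over time for districts that did not.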


Part H

Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?


We can see that the districts that switched paid about 2.4% less to their teachers to begin with. $\hat{\beta}_1$ shows the difference, before treatment, between the treated (districts that switched) and control groups (districts that did not switch).

We can see that all districts saw a 0.93% increase in teacher salaries in the years after the switch. $\hat{\beta}_2$ shows the change in salaries over time (before and after the switch) that is common to both the treatment and control groups.


Part I

Now let’s generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district- and year-fixed effects and the interaction term. Interpret the coefficient on the interaction term.⁹


# make sure DistNumber and Year are factors for plm
texas<-texas %>%
  mutate_at(c("DistNumber", "Year"), factor)
head(texas) # make sure it worked
DistNumber Year OnCycle LnAvgSalary CycleSwitch After
4800001 2003 0 10.6 0 0
4800001 2004 0 10.6 0 0
4800001 2005 0 10.6 0 0
4800001 2006 0 10.6 0 0
4800001 2007 0 10.6 0 1
4800001 2008 0 10.6 0 1
dnd_fe_reg<-plm(LnAvgSalary ~ CycleSwitch*After, index=c("DistNumber", "Year"), effect = "twoways", data = texas)
summary(dnd_fe_reg)
## Twoways effects Within Model
## 
## Call:
## plm(formula = LnAvgSalary ~ CycleSwitch * After, data = texas, 
##     effect = "twoways", index = c("DistNumber", "Year"))
## 
## Unbalanced Panel: n = 1029, T = 6-7, N = 7202
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -0.33262311 -0.01530044  0.00033517  0.01551322  0.57548546 
## 
## Coefficients:
##                      Estimate Std. Error t-value  Pr(>|t|)    
## CycleSwitch:After1 -0.0085935  0.0019861 -4.3268 1.537e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    6.2924
## Residual Sum of Squares: 6.2733
## R-Squared:      0.0030269
## Adj. R-Squared: -0.16432
## F-statistic: 18.7208 on 1 and 6166 DF, p-value: 1.5371e-05

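Districts that switched to on-cycle elections saw teacher salaries decrease by about 0.86% relative to districts that did not switch. The point estimate is essentially unchanged from our original difference-in-difference model (−0.0086), but the standard error falls from 0.0052 to 0.0020, so the effect is now statistically significant at the 0.1% level.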
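The near-identical interaction coefficient is not a coincidence: district dummies absorb CycleSwitch and year dummies absorb After, leaving only the interaction to estimate. The sketch below illustrates this with the dummy variable method in base R on a simulated, balanced toy panel. All variable names and numbers are made up for illustration and none come from the Texas data. In a balanced panel the two interaction estimates coincide; in an unbalanced panel like the Texas data they can differ slightly.

```r
set.seed(1)  # simulated data only; none of these numbers come from the Texas data

# Balanced toy panel: 6 hypothetical districts observed in 2003-2009
panel <- expand.grid(District = 1:6, Year = 2003:2009)
panel$CycleSwitch <- as.integer(panel$District <= 2)  # districts 1-2 switch
panel$After       <- as.integer(panel$Year > 2006)    # post-treatment years

# Outcome with district differences, a time trend, and a treatment effect of -0.02
panel$y <- 10 + 0.05 * panel$District + 0.01 * (panel$Year - 2003) -
  0.02 * panel$CycleSwitch * panel$After + rnorm(nrow(panel), sd = 0.01)

# Simple difference-in-difference with treatment and post-treatment dummies
dnd_toy <- lm(y ~ CycleSwitch * After, data = panel)

# District and year dummies replace the two main effects; the interaction stays
fe_toy <- lm(y ~ factor(District) + factor(Year) + CycleSwitch:After, data = panel)

# Compare the interaction coefficient from the two models
c(dnd = unname(coef(dnd_toy)["CycleSwitch:After"]),
  fe  = unname(coef(fe_toy)["CycleSwitch:After"]))
```

What does change between the two models is the residual variance, since the fixed effects explain much of the variation in the outcome, and that is why the standard error on the interaction can shrink.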


Part J

Create a nice regression table (using huxtable) for comparison of the regressions in (a), (c), and (i).


huxreg(pooled_reg,
       dnd,
       dnd_fe_reg,
       coefs = c("Constant" = "(Intercept)",
                 "On Cycle" = "OnCycle",
                 "Switched" = "CycleSwitch",
                 "After Switch" = "After1",
                 "Interaction" = "CycleSwitch:After1"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
       number_format = 3) %>%
add_rows(c("Fixed Effects", "None", "None", "District and Year"), # add fixed effects row
         after = 11) # insert after 11th row
                (1)           (2)           (3)
Constant        10.671 ***    10.671 ***
                (0.001)       (0.001)
On Cycle        -0.031 ***
                (0.004)
Switched                      -0.024 ***
                              (0.003)
After Switch                  0.009 ***
                              (0.002)
Interaction                   -0.009        -0.009 ***
                              (0.005)       (0.002)
Fixed Effects   None          None          District and Year
N               7139          7202          7202
R-Squared       0.009         0.018         0.003
SER             0.083         0.083

  1. Do this inside the summarize() command.

  2. Don’t use the summarize() command for this part.

  3. Set index=c("state","year") to indicate the group and time dimensions.

  4. Ensure that state is a factor variable, and insert in the regression. You can either mutate() it into a factor beforehand, or just do as.factor(state) in the lm command.

  5. Inside plm(), set index = "state" to indicate variable, and model = "within" to indicate a fixed effects model.

  6. Inside plm(), set index = c("state", "year") to indicate both variables, and effect = "twoways" to indicate a 2-way fixed effects model.

  7. It’s better to use the plm()-generated regressions so as to avoid the multitude of coefficients for the state and year dummies! You certainly could use the dummy variable ones and manually list all of the variables to suppress in the table inside omit_coefs().

  8. Hint, filter() properly then summarize().

  9. This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using plm().