Instructions

Theory and Concepts

Question 1
Question 2

R Questions

Question 3

Part A
Part B
Part C
Part D
Part E
Part F
Part G
Part H
Part I
Part J
Part K
Part L
Part M

Question 4

Part A
Part B
Part C
Part D
Part E
Part F
Part G
Part H
Part I
Part J

Due by Thursday, November 21, 2019

Instructions

Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.

Theory and Concepts

Question 1

In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.

Fixed effects can be used for panel data, where each observation belongs to a group $i$ and a time period $t$. Running a simple OLS model as usual, known as a pooled model, would be biased due to systematic relationships within and between groups and time periods that cause $X_{it}$ to be endogenous:

$$Y_{it} = \beta_0 + \beta_1 X_{it} + \epsilon_{it}$$

Group-fixed effects ($\alpha_i$) pull out of the error term all of the factors that explain $Y_{it}$ that are stable over time within each individual group ($i$), but vary across groups. For example, if groups are U.S. states, state-fixed effects pull out of the error term all of the idiosyncrasies of each state that do not change over time, such as cultural differences, geographical differences, historical differences, population differences, etc. Group-fixed effects do not pull out factors that change over time, such as federal laws passed, recessions affecting all states, etc.

Time-fixed effects ($\theta_t$) pull out of the error term all of the factors that explain $Y_{it}$ that change over time but do not vary across groups. For example, if groups are U.S. states, and time is in years, year-fixed effects pull out of the error term all of the changes over the years that affect all U.S. states, such as recessions, inflation, national population growth, national immigration trends, or federal laws passed that affect all states. Time-fixed effects do not pull out factors that do not change over time.

Mechanically, OLS estimates a separate constant for each group (and/or each time period), giving the expected value of $Y_{it}$ for that group or time period. This can be done either by de-meaning the data and calculating OLS coefficients from the variation within each group and/or time period (which is why fixed effects estimators are called “within” estimators), or by creating a dummy variable for each group and/or time period (and omitting one of each, to avoid the dummy variable trap).
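To make the equivalence of the two methods concrete, here is a minimal sketch in base R. It uses a small simulated panel (the variable names, group effects, and the true slope of 2 are all assumptions made up for illustration, not the course data) and checks that the dummy-variable regression and the de-meaned regression give the same slope.

```r
# Simulated balanced panel (hypothetical data, not the course dataset):
# 5 groups observed over 4 periods, each group with its own intercept.
set.seed(1)
n_groups <- 5
n_periods <- 4
panel <- data.frame(
  group = rep(paste0("g", 1:n_groups), each = n_periods),
  period = rep(1:n_periods, times = n_groups)
)
group_effect <- rep(c(10, 20, 30, 40, 50), each = n_periods)
panel$x <- rnorm(nrow(panel)) + group_effect / 10  # x is correlated with the group effect
panel$y <- 2 * panel$x + group_effect + rnorm(nrow(panel))

# Method 1: dummy variables (R omits one group dummy automatically)
lsdv <- lm(y ~ x + factor(group), data = panel)

# Method 2: subtract each group's mean from y and x, then run OLS without a constant
panel$y_dm <- panel$y - ave(panel$y, panel$group)
panel$x_dm <- panel$x - ave(panel$x, panel$group)
within_fit <- lm(y_dm ~ x_dm - 1, data = panel)

slope_lsdv <- unname(coef(lsdv)["x"])
slope_within <- unname(coef(within_fit)["x_dm"])
slope_pooled <- unname(coef(lm(y ~ x, data = panel))["x"])

# The first two slopes coincide; the pooled slope generally differs
print(c(lsdv = slope_lsdv, within = slope_within, pooled = slope_pooled))
```

The two fixed-effects slopes agree, but the standard errors reported by the de-meaned `lm()` would not be adjusted for the estimated group means, which is one reason to prefer a dedicated routine such as `plm()` in practice.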

Question 2

In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?

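A difference-in-difference model compares the change from before to after a treatment in a group that receives the treatment against the same change in a group that does not. Differencing out the control group's change isolates the effect of the treatment. It is easiest to see in the following equation:

$$\Delta_i \Delta_t Y = (\text{Treated}_{\text{after}} - \text{Treated}_{\text{before}}) - (\text{Control}_{\text{after}} - \text{Control}_{\text{before}})$$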

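In OLS regression, we can simply make two dummy variables for observations depending on the group and the time period each observation is in:

$$\text{Treated}_i = \begin{cases} 1 & \text{if } i \text{ is in the treated group} \\ 0 & \text{if } i \text{ is in the control group} \end{cases} \qquad \text{After}_t = \begin{cases} 1 & \text{if } t \text{ is after the treatment} \\ 0 & \text{if } t \text{ is before the treatment} \end{cases}$$

Lastly, we make an interaction term, $\text{Treated}_i \times \text{After}_t$, which isolates the treatment effect, captured by the coefficient on this term, $\beta_3$:

$$Y_{it} = \beta_0 + \beta_1 \text{Treated}_i + \beta_2 \text{After}_t + \beta_3 (\text{Treated}_i \times \text{After}_t) + \epsilon_{it}$$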

Diff-in-diff models assume a counterfactual: if the treated group had not received treatment, it would have experienced the same change as the control group over the pre- to post-treatment period (the parallel trends assumption).

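A classic example is when one or more states pass a law and others do not. We want to compare the difference in the difference before and after the law was passed between states that passed the law and states that did not.

We can also generalize the model by using group-fixed effects ($\alpha_i$) in place of “Treatment” and time-fixed effects ($\theta_t$) in place of “After” to estimate the treatment effect (still $\beta_3$):

$$Y_{it} = \alpha_i + \theta_t + \beta_3 (\text{Treated}_i \times \text{After}_t) + \nu_{it}$$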
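As an illustration of this logic, the following sketch simulates a two-group, two-period setting in base R and checks that the coefficient on the interaction term equals the difference in differences of the four group means. The data, the variable names `treated` and `after`, and the true effect of 3 are assumptions chosen for the example.

```r
# Hypothetical 2x2 difference-in-difference data: 50 observations per cell
set.seed(42)
n <- 200
did <- data.frame(
  treated = rep(c(0, 1), each = n / 2),   # first half control, second half treated
  after = rep(c(0, 1), times = n / 2)     # alternate before/after within each group
)
# Outcome with an assumed true treatment effect of 3
did$y <- 5 + 2 * did$treated + 1 * did$after +
  3 * did$treated * did$after + rnorm(n)

# Regression version: the interaction coefficient is the treatment effect
fit <- lm(y ~ treated + after + treated:after, data = did)
beta3 <- unname(coef(fit)["treated:after"])

# Group-means version: (treated after - treated before) - (control after - control before)
m <- tapply(did$y, list(did$treated, did$after), mean)
diff_in_diff <- unname((m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"]))

print(c(beta3 = beta3, diff_in_diff = diff_in_diff))
```

Because the model with both dummies and their interaction is fully saturated, the two numbers match exactly, not just approximately.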

R Questions

Answer the following questions using R. When necessary, please write answers in the same document (knitted Rmd to html or pdf, typed .doc(x), or handwritten) as your answers to the above questions. Be sure to include (email or print an .R file, or show in your knitted markdown) your code and the outputs of your code with the rest of your answers.

Question 3

PeaceCorps.csv

How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:

Variable	Description
`state`	U.S. State
`year`	Year
`appspc`	Applications to the Peace Corps (per capita) in State
`unemployrate`	State unemployment rate

Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?

Part A

Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.

If joining the Peace Corps is a substitute for private sector jobs, then as unemployment rises, so too should applications. However, people may also be more willing to opt out of the workforce when the economy is strong, so we should examine this empirically to be sure.

Part B

To get the hang of the data we’re working with, count (separately) the number of states, and the number of years. Get the number of n_distinct() states and years¹, as well as the distinct() values of each².

library(tidyverse)

## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# load in data
peacecorps<-read_csv("https://metricsf19.classes.ryansafner.com/data/PeaceCorps.csv")

## Parsed with column specification:
## cols(
##   state = col_character(),
##   year = col_double(),
##   unemployrate = col_double(),
##   appspc = col_double(),
##   stcode = col_double(),
##   stateshort = col_character()
## )

# count number of states
peacecorps %>%
  count(state)

state <chr>	n <int>
ALABAMA	6
ALASKA	6
ARIZONA	6
ARKANSAS	6
CALIFORNIA	6
COLORADO	6
CONNECTICUT	6
DELAWARE	6
DISTRICT OF COLUMBIA	6
FLORIDA	6

# count number of years
peacecorps %>%
  count(year)

year <dbl>	n <int>
2006	51
2007	51
2008	51
2009	51
2010	51
2011	51

# get number of distinct states and years
peacecorps %>%
  summarize(Num_states = n_distinct(state),
            Num_years = n_distinct(year))

Num_states <int>	Num_years <int>
51	6

# get distinct values of states
peacecorps %>%
  distinct(state)

state <chr>
ALABAMA
ALASKA
ARIZONA
ARKANSAS
CALIFORNIA
COLORADO
CONNECTICUT
DELAWARE
DISTRICT OF COLUMBIA
FLORIDA

# get distinct values of years
peacecorps %>%
  distinct(year)

year <dbl>
2006
2007
2008
2009
2010
2011

Part C

Continuing our pre-analysis inspection, (install, and) load the plm package, and check the dimensions of the data with pdim.³

# install.packages("plm")

# load plm
library(plm)

## 
## Attaching package: 'plm'

## The following objects are masked from 'package:dplyr':
## 
##     between, lag, lead

pdim(peacecorps, index=c("state", "year"))

## Balanced Panel: n = 51, T = 6, N = 306

Part D

Create a scatterplot of appspc (Y) on unemployrate (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.

ggplot(data = peacecorps)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors

We can see there are very clear outliers at the top. Let’s plot text instead of points, using the stateshort to see which observations are which states:

ggplot(data = peacecorps)+
    aes(x = unemployrate,
        y = appspc)+
    geom_text(aes(color = as.factor(state), label = stateshort))+ #<<
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors

Clearly the outliers are the DIS observations, i.e. the District of Columbia. Let’s make a second scatterplot without them:

peacecorps %>%
  filter(state != "DISTRICT OF COLUMBIA") %>%
ggplot(data = .)+
    aes(x = unemployrate,
        y = appspc)+
    geom_point(aes(color = as.factor(state)))+
    geom_smooth(method = "lm")+
  scale_x_continuous(breaks=seq(0,20,2),
                     labels = function(x){paste(x,"%", sep="")})+
  labs(x = "Unemployment Rate",
       y = "Peace Corps Applications (per capita)")+
  theme_classic(base_family = "Fira Sans Condensed",
           base_size=18)+
  theme(legend.position = "none") # remove legend for State colors

Part E

Run two pooled regressions, one with the outliers, and one without them. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.

pooled<-lm(appspc ~ unemployrate, data = peacecorps)
summary(pooled)

## 
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -43.616 -20.641  -9.719   5.171 310.328 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   46.2255     7.2857   6.345 8.07e-10 ***
## unemployrate   0.8353     1.0362   0.806    0.421    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 45.2 on 304 degrees of freedom
## Multiple R-squared:  0.002133,   Adjusted R-squared:  -0.00115 
## F-statistic: 0.6498 on 1 and 304 DF,  p-value: 0.4208

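$$\widehat{\text{appspc}}_{it} = 46.23 + 0.84 \, \text{unemployrate}_{it}$$

For every 1 percentage point increase in unemployment, there are 0.84 more applications to the Peace Corps per capita. The estimate is not statistically significant (p = 0.42).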

peacecorps_no_outliers<-peacecorps %>%
  filter(state != "DISTRICT OF COLUMBIA")

pooled_no_outliers<-lm(appspc ~ unemployrate, data = peacecorps_no_outliers)
summary(pooled_no_outliers)

## 
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps_no_outliers)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -38.380 -14.544  -4.190   9.278  97.997 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   48.5891     3.5955  13.514   <2e-16 ***
## unemployrate  -0.3710     0.5133  -0.723     0.47    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 22.17 on 298 degrees of freedom
## Multiple R-squared:  0.001751,   Adjusted R-squared:  -0.001599 
## F-statistic: 0.5226 on 1 and 298 DF,  p-value: 0.4703

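$$\widehat{\text{appspc}}_{it} = 48.59 - 0.37 \, \text{unemployrate}_{it}$$

For every 1 percentage point increase in unemployment, there are 0.37 fewer applications to the Peace Corps per capita. This estimate is also not statistically significant (p = 0.47).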

The coefficient changes sign and shrinks in magnitude once we drop the outliers (DC)!

Part F

Now run a regression with State fixed effects using the dummy variable method.⁴ Interpret the marginal effect of unemployrate on appspc. How did it change?

fe_reg<-lm(appspc ~ unemployrate + factor(state), data = peacecorps)
summary(fe_reg)

## 
## Call:
## lm(formula = appspc ~ unemployrate + factor(state), data = peacecorps)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -31.652  -3.661  -0.393   3.492  33.262 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        12.6070     3.7663   3.347 0.000939 ***
## unemployrate                        0.6565     0.2330   2.817 0.005225 ** 
## factor(state)ALASKA                52.5875     4.8452  10.854  < 2e-16 ***
## factor(state)ARIZONA               18.0456     4.8463   3.724 0.000242 ***
## factor(state)ARKANSAS               4.9117     4.8447   1.014 0.311630    
## factor(state)CALIFORNIA            22.1455     4.8692   4.548 8.39e-06 ***
## factor(state)COLORADO              69.9201     4.8452  14.431  < 2e-16 ***
## factor(state)CONNECTICUT           28.9349     4.8446   5.973 7.84e-09 ***
## factor(state)DELAWARE              21.8317     4.8489   4.502 1.02e-05 ***
## factor(state)DISTRICT OF COLUMBIA 311.7036     4.8533  64.225  < 2e-16 ***
## factor(state)FLORIDA               13.0805     4.8492   2.697 0.007456 ** 
## factor(state)GEORGIA               18.7073     4.8486   3.858 0.000145 ***
## factor(state)HAWAII                35.7883     4.8617   7.361 2.53e-12 ***
## factor(state)IDAHO                 34.4465     4.8480   7.105 1.21e-11 ***
## factor(state)ILLINOIS              27.8559     4.8503   5.743 2.65e-08 ***
## factor(state)INDIANA               21.5183     4.8478   4.439 1.35e-05 ***
## factor(state)IOWA                  28.9038     4.8614   5.946 9.06e-09 ***
## factor(state)KANSAS                28.9728     4.8507   5.973 7.83e-09 ***
## factor(state)KENTUCKY               9.6535     4.8540   1.989 0.047800 *  
## factor(state)LOUISIANA              2.5206     4.8517   0.520 0.603844    
## factor(state)MAINE                 53.3394     4.8450  11.009  < 2e-16 ***
## factor(state)MARYLAND              45.1903     4.8513   9.315  < 2e-16 ***
## factor(state)MASSACHUSETTS         44.9452     4.8450   9.277  < 2e-16 ***
## factor(state)MICHIGAN              26.8719     4.8970   5.487 9.86e-08 ***
## factor(state)MINNESOTA             45.6153     4.8476   9.410  < 2e-16 ***
## factor(state)MISSISSIPPI           -5.6364     4.8607  -1.160 0.247307    
## factor(state)MISSOURI              19.9421     4.8458   4.115 5.23e-05 ***
## factor(state)MONTANA               66.2285     4.8583  13.632  < 2e-16 ***
## factor(state)NEBRASKA              26.6913     4.8904   5.458 1.14e-07 ***
## factor(state)NEVADA                12.7169     4.8767   2.608 0.009656 ** 
## factor(state)NEW HAMPSHIRE         54.0272     4.8658  11.103  < 2e-16 ***
## factor(state)NEW JERSEY            13.4382     4.8452   2.774 0.005956 ** 
## factor(state)NEW MEXICO            28.0890     4.8503   5.791 2.06e-08 ***
## factor(state)NEW YORK              21.5156     4.8446   4.441 1.34e-05 ***
## factor(state)NORTH CAROLINA        23.2097     4.8533   4.782 2.94e-06 ***
## factor(state)NORTH DAKOTA          19.6067     4.9034   3.999 8.36e-05 ***
## factor(state)OHIO                  25.2887     4.8500   5.214 3.83e-07 ***
## factor(state)OKLAHOMA              10.4850     4.8560   2.159 0.031774 *  
## factor(state)OREGON                72.7034     4.8545  14.976  < 2e-16 ***
## factor(state)PENNSYLVANIA          21.7160     4.8449   4.482 1.12e-05 ***
## factor(state)RHODE ISLAND          34.0396     4.8654   6.996 2.33e-11 ***
## factor(state)SOUTH CAROLINA        12.0780     4.8651   2.483 0.013690 *  
## factor(state)SOUTH DAKOTA          22.9175     4.8862   4.690 4.46e-06 ***
## factor(state)TENNESSEE              9.1572     4.8498   1.888 0.060142 .  
## factor(state)TEXAS                  9.2251     4.8455   1.904 0.058062 .  
## factor(state)UTAH                  18.8667     4.8571   3.884 0.000131 ***
## factor(state)VERMONT              104.7363     4.8580  21.560  < 2e-16 ***
## factor(state)VIRGINIA              46.9318     4.8607   9.655  < 2e-16 ***
## factor(state)WASHINGTON            63.8660     4.8460  13.179  < 2e-16 ***
## factor(state)WEST VIRGINIA          3.9514     4.8461   0.815 0.415620    
## factor(state)WISCONSIN             35.8930     4.8448   7.409 1.89e-12 ***
## factor(state)WYOMING               35.3036     4.8665   7.254 4.88e-12 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.391 on 254 degrees of freedom
## Multiple R-squared:  0.9713, Adjusted R-squared:  0.9655 
## F-statistic: 168.4 on 51 and 254 DF,  p-value: < 2.2e-16

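For every 1 percentage point increase in the unemployment rate, there are 0.66 more applications to the Peace Corps per capita, holding the state constant. The estimate remains positive and is somewhat smaller than in the original pooled regression with outliers (0.84), but it is now statistically significant at the 1% level.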

Part G

Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?

This is just like any model with dummy variables representing different categories. The constant (12.61) represents the reference category not put in the regression (to avoid the dummy variable trap), i.e. ALABAMA.

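The coefficient on Maryland is 45.19. This means that, on average, there are 45.19 more applications per capita in Maryland than in Alabama over the time period in the data, holding the unemployment rate constant. To get the average for Maryland, we add the constant and the Maryland coefficient: $12.61 + 45.19 = 57.80$.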

# if we want to extract it using R
library(broom)

fe_reg_tidy<-tidy(fe_reg)

AL<-fe_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

MD_AL_diff<-fe_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

AL+MD_AL_diff

## [1] 57.79731

Part H

Now try using the plm() command, which de-means the data, and make sure you get the same results as Part F.⁵ Do you get the same marginal effect of unemployrate on appspc?

fe_reg_alt<-plm(appspc ~ unemployrate, index = "state", model = "within", data = peacecorps)
summary(fe_reg_alt)

## Oneway (individual) effect Within Model
## 
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, model = "within", 
##     index = "state")
## 
## Balanced Panel: n = 51, T = 6, N = 306
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -31.65231  -3.66134  -0.39349   3.49204  33.26228 
## 
## Coefficients:
##              Estimate Std. Error t-value Pr(>|t|)   
## unemployrate  0.65651    0.23303  2.8172 0.005225 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    18443
## Residual Sum of Squares: 17884
## R-Squared:      0.0303
## Adj. R-Squared: -0.1644
## F-statistic: 7.93677 on 1 and 254 DF, p-value: 0.0052247

Part I

Now include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of unemployrate on appspc. How did it change?

fe_2way_reg<-lm(appspc ~ unemployrate + factor(state) + factor(year), data = peacecorps)
summary(fe_2way_reg)

## 
## Call:
## lm(formula = appspc ~ unemployrate + factor(state) + factor(year), 
##     data = peacecorps)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.7457  -3.1669   0.1384   2.7666  31.1644 
## 
## Coefficients:
##                                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                        14.48800    3.91879   3.697 0.000268 ***
## unemployrate                        0.70525    0.54733   1.289 0.198760    
## factor(state)ALASKA                52.57126    4.08265  12.877  < 2e-16 ***
## factor(state)ARIZONA               18.01801    4.09035   4.405 1.57e-05 ***
## factor(state)ARKANSAS               4.91897    4.07940   1.206 0.229036    
## factor(state)CALIFORNIA            22.04317    4.23743   5.202 4.12e-07 ***
## factor(state)COLORADO              69.93711    4.08307  17.129  < 2e-16 ***
## factor(state)CONNECTICUT           28.93000    4.07894   7.093 1.36e-11 ***
## factor(state)DELAWARE              21.87470    4.10713   5.326 2.24e-07 ***
## factor(state)DISTRICT OF COLUMBIA 311.64268    4.13555  75.357  < 2e-16 ***
## factor(state)FLORIDA               13.03582    4.10931   3.172 0.001702 ** 
## factor(state)GEORGIA               18.66589    4.10502   4.547 8.50e-06 ***
## factor(state)HAWAII                35.87361    4.18953   8.563 1.16e-15 ***
## factor(state)IDAHO                 34.48472    4.10104   8.409 3.24e-15 ***
## factor(state)ILLINOIS              27.80635    4.11635   6.755 1.00e-10 ***
## factor(state)INDIANA               21.48091    4.10010   5.239 3.44e-07 ***
## factor(state)IOWA                  28.98828    4.18745   6.923 3.75e-11 ***
## factor(state)KANSAS                29.02399    4.11886   7.047 1.79e-11 ***
## factor(state)KENTUCKY               9.59010    4.14017   2.316 0.021351 *  
## factor(state)LOUISIANA              2.57585    4.12547   0.624 0.532951    
## factor(state)MAINE                 53.35320    4.08152  13.072  < 2e-16 ***
## factor(state)MARYLAND              45.24393    4.12277  10.974  < 2e-16 ***
## factor(state)MASSACHUSETTS         44.95986    4.08187  11.015  < 2e-16 ***
## factor(state)MICHIGAN              26.72242    4.41044   6.059 5.03e-09 ***
## factor(state)MINNESOTA             45.65107    4.09827  11.139  < 2e-16 ***
## factor(state)MISSISSIPPI           -5.71925    4.18336  -1.367 0.172813    
## factor(state)MISSOURI              19.91937    4.08656   4.874 1.95e-06 ***
## factor(state)MONTANA               66.30488    4.16773  15.909  < 2e-16 ***
## factor(state)NEBRASKA              26.83099    4.36996   6.140 3.24e-09 ***
## factor(state)NEVADA                12.59989    4.28489   2.941 0.003585 ** 
## factor(state)NEW HAMPSHIRE         54.12221    4.21590  12.838  < 2e-16 ***
## factor(state)NEW JERSEY            13.42198    4.08265   3.288 0.001156 ** 
## factor(state)NEW MEXICO            28.13855    4.11635   6.836 6.25e-11 ***
## factor(state)NEW YORK              21.52042    4.07894   5.276 2.87e-07 ***
## factor(state)NORTH CAROLINA        23.14879    4.13555   5.598 5.73e-08 ***
## factor(state)NORTH DAKOTA          19.76510    4.44960   4.442 1.34e-05 ***
## factor(state)OHIO                  25.24078    4.11393   6.135 3.32e-09 ***
## factor(state)OKLAHOMA              10.55484    4.15333   2.541 0.011652 *  
## factor(state)OREGON                72.63838    4.14334  17.531  < 2e-16 ***
## factor(state)PENNSYLVANIA          21.72903    4.08118   5.324 2.26e-07 ***
## factor(state)RHODE ISLAND          33.94537    4.21360   8.056 3.28e-14 ***
## factor(state)SOUTH CAROLINA        11.98462    4.21132   2.846 0.004799 ** 
## factor(state)SOUTH DAKOTA          23.05074    4.34429   5.306 2.48e-07 ***
## factor(state)TENNESSEE              9.11011    4.11274   2.215 0.027659 *  
## factor(state)TEXAS                  9.24536    4.08494   2.263 0.024480 *  
## factor(state)UTAH                  18.93985    4.16038   4.552 8.30e-06 ***
## factor(state)VERMONT              104.81188    4.16587  25.160  < 2e-16 ***
## factor(state)VIRGINIA              47.01469    4.18336  11.239  < 2e-16 ***
## factor(state)WASHINGTON            63.84082    4.08836  15.615  < 2e-16 ***
## factor(state)WEST VIRGINIA          3.97742    4.08900   0.973 0.331641    
## factor(state)WISCONSIN             35.90359    4.08029   8.799 2.35e-16 ***
## factor(state)WYOMING               35.40021    4.22056   8.388 3.73e-15 ***
## factor(year)2007                   -5.36842    1.39953  -3.836 0.000159 ***
## factor(year)2008                    0.02018    1.48030   0.014 0.989135    
## factor(year)2009                    4.23877    2.61984   1.618 0.106939    
## factor(year)2010                   -3.25521    2.76284  -1.178 0.239836    
## factor(year)2011                   -8.88542    2.47990  -3.583 0.000409 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.064 on 249 degrees of freedom
## Multiple R-squared:   0.98,  Adjusted R-squared:  0.9756 
## F-statistic: 218.3 on 56 and 249 DF,  p-value: < 2.2e-16

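For every 1 percentage point increase in the unemployment rate, there are 0.71 more applications per capita to the Peace Corps, holding the state and year constant. The estimate is slightly larger than with state fixed effects alone (0.66), but its standard error more than doubles and it is no longer statistically significant (p = 0.20).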

Part J

What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?

# if we want to extract it using R
# store the tidied coefficients under a new name so the model object is not overwritten
fe_2way_reg_tidy<-tidy(fe_2way_reg)

constant<-fe_2way_reg_tidy %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)

unemploy_me<-fe_2way_reg_tidy %>%
  filter(term == "unemployrate") %>%
  pull(estimate)

MD_dif<-fe_2way_reg_tidy %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)

# year effect for 2011, relative to the omitted reference year (2006)
dif_2011<-fe_2way_reg_tidy %>%
  filter(term == "factor(year)2011") %>%
  pull(estimate)

constant+unemploy_me*5+MD_dif+dif_2011

# approximately 54.37, i.e. 14.488 + 0.705*5 + 45.244 - 8.885
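A hand calculation like the one above can be cross-checked with `predict()`, which adds up the same terms automatically. The sketch below uses a small simulated dataset (the object `toy`, its three states, and its coefficients are invented for illustration and are not the Peace Corps data) to show that the two approaches agree for a two-way dummy-variable model.

```r
# Hypothetical toy panel: 3 states x 6 years, made-up coefficients
set.seed(7)
toy <- expand.grid(state = c("ALABAMA", "MARYLAND", "OHIO"),
                   year = 2006:2011,
                   stringsAsFactors = FALSE)
toy$unemployrate <- runif(nrow(toy), 3, 10)
toy$appspc <- 10 + 0.7 * toy$unemployrate +
  ifelse(toy$state == "MARYLAND", 45, 0) + rnorm(nrow(toy))

# Two-way fixed effects via dummy variables, as in the regression above
toy_fit <- lm(appspc ~ unemployrate + factor(state) + factor(year), data = toy)
b <- coef(toy_fit)

# Prediction by hand: constant + slope * 5 + Maryland effect + 2011 effect
by_hand <- unname(b["(Intercept)"] + b["unemployrate"] * 5 +
                    b["factor(state)MARYLAND"] + b["factor(year)2011"])

# Same prediction from predict(), supplying the scenario as new data
scenario <- data.frame(unemployrate = 5, state = "MARYLAND", year = 2011)
by_predict <- unname(predict(toy_fit, newdata = scenario))

print(c(by_hand = by_hand, by_predict = by_predict))
```

Using `predict()` avoids picking the wrong year dummy by hand, since the year is stated once in the scenario data frame.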

Part K

Now try using the plm() command, which de-means the data, and make sure you get the same results as Part I.⁶ Do you get the same marginal effect of unemployrate on appspc?

fe_2way_reg_alt<-plm(appspc ~ unemployrate, index = c("state", "year"), data = peacecorps, effect = "twoways")
summary(fe_2way_reg_alt)

## Twoways effects Within Model
## 
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, effect = "twoways", 
##     index = c("state", "year"))
## 
## Balanced Panel: n = 51, T = 6, N = 306
## 
## Residuals:
##      Min.   1st Qu.    Median   3rd Qu.      Max. 
## -26.74570  -3.16688   0.13837   2.76660  31.16439 
## 
## Coefficients:
##              Estimate Std. Error t-value Pr(>|t|)
## unemployrate  0.70525    0.54733  1.2885   0.1988
## 
## Total Sum of Squares:    12509
## Residual Sum of Squares: 12426
## R-Squared:      0.0066237
## Adj. R-Squared: -0.21679
## F-statistic: 1.66029 on 1 and 249 DF, p-value: 0.19876

Part L

Can there still be endogeneity in this model? Give some examples.

Yes. There may be other time-varying variables that are omitted and correlated with unemployment, but not picked up in the time-fixed effect since they are not the same in each state. Examples could be a local market bubble some years in Texas or Nevada, but not in Maine, or bad weather in Florida one year, but not in Wyoming, etc.⁷

Part M

Create a nice regression table (using huxtable) for comparison of the regressions in E, F, and I.

library(huxtable)

## 
## Attaching package: 'huxtable'

## The following object is masked from 'package:dplyr':
## 
##     add_rownames

## The following object is masked from 'package:purrr':
## 
##     every

## The following object is masked from 'package:ggplot2':
## 
##     theme_grey

huxreg(pooled,
       pooled_no_outliers,
       fe_reg_alt,
       fe_2way_reg_alt,
         coefs = c("Constant" = "(Intercept)",
                 "Unemployment Rate" = "unemployrate"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
       number_format = 3) %>%

add_rows(c("Outliers", "Yes", "No DC", "Yes", "Yes"), # add row for outliers
         after = 5) %>% # insert after 5th row
add_rows(c("Fixed Effects", "None", "None", "State", "State and Year"), # add fixed effects row
         after = 6) # insert after 6th row

	(1)	(2)	(3)	(4)
Constant	46.226 ***	48.589 ***
	(7.286)	(3.595)
Unemployment Rate	0.835	-0.371	0.657 **	0.705
	(1.036)	(0.513)	(0.233)	(0.547)
Outliers	Yes	No DC	Yes	Yes
Fixed Effects	None	None	State	State and Year
N	306	300	306	306
R-Squared	0.002	0.002	0.030	0.007
SER	45.205	22.170

Question 4

TexasSchools.csv

Are teachers paid more when school board members are elected “off cycle,” when there are no major national political elections (e.g. odd years), than when they are elected “on cycle”? The argument is that during “off” years, without state or national elections drawing attention, voters pay less attention to the school board election, and teachers can more effectively mobilize for higher pay, versus “on” years when voters are paying more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.

From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.

Variable	Description
`LnAvgSalary`	logged average salary of teachers in district
`Year`	Year
`OnCycle`	`=1` if school boards elected on-cycle (e.g. same year as national and state elections), `=0` if elected off-cycle
`pol_freedom`	Political freedom index score (2018) from 1 (least) to 10 (most free)
`CycleSwitch`	`=1` if district switched from off- to on-cycle elections
`AfterSwitch`	`=1` if year is after 2006

Part A

Run a pooled regression model of LnAvgSalary on OnCycle. Write the estimated regression equation, and interpret the coefficient on OnCycle. Are there any sources of bias (consider in particular the argument in the question prompt)?

# load in data
texas<-read_csv("https://metricsf19.classes.ryansafner.com/data/TexasSchools.csv")

## Parsed with column specification:
## cols(
##   DistNumber = col_double(),
##   Year = col_double(),
##   OnCycle = col_double(),
##   LnAvgSalary = col_double(),
##   CycleSwitch = col_double()
## )

pooled_reg<-lm(LnAvgSalary~OnCycle, data = texas)
summary(pooled_reg)

## 
## Call:
## lm(formula = LnAvgSalary ~ OnCycle, data = texas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.28910 -0.05527 -0.00646  0.05196  0.65215 
## 
## Coefficients:
##              Estimate Std. Error   t value Pr(>|t|)    
## (Intercept) 10.671363   0.001018 10478.815   <2e-16 ***
## OnCycle     -0.030621   0.003766    -8.131    5e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08284 on 7137 degrees of freedom
##   (64 observations deleted due to missingness)
## Multiple R-squared:  0.009178,   Adjusted R-squared:  0.009039 
## F-statistic: 66.11 on 1 and 7137 DF,  p-value: 4.995e-16

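$$\widehat{\ln(\text{AvgSalary})}_{it} = 10.6714 - 0.0306 \, \text{OnCycle}_{it}$$

Recognize this is a log-linear model and OnCycle is a dummy variable. Districts whose school board elections are held on-cycle have average teacher salaries about 3.1% lower than districts with off-cycle elections. This is highly statistically significant.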
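Because the dependent variable is in logs, the coefficient on a dummy variable is only approximately a percent difference. A minimal sketch of the arithmetic, using the OnCycle estimate from the output above (the object names are made up for this example):

```r
# Coefficient on OnCycle, copied from the pooled regression output above
b_oncycle <- -0.030621

# Usual approximation: 100 * coefficient
approx_pct <- 100 * b_oncycle

# Exact percent difference implied by a log dependent variable: 100 * (exp(b) - 1)
exact_pct <- 100 * (exp(b_oncycle) - 1)

print(round(c(approx = approx_pct, exact = exact_pct), 2))
```

For a coefficient this small the two figures are nearly identical (both about a 3% lower salary), so the simpler approximation is fine here.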

This is almost certainly biased. There are probably district-level factors correlated both with whether a district votes (or switches to voting) on-cycle and with teacher salaries. Perhaps some districts have strong or weak teachers unions, and that determines whether they choose to vote on or off cycle (off cycle would be preferable for teachers, so perhaps districts with strong teachers unions kept their elections off cycle).

Also, there could be changes in the entire State over time: perhaps all Texas teacher salaries were increasing or decreasing over the years.

Part B

Some districts decided to switch to on-cycle elections after 2006. Consider this switch, CycleSwitch, the “treatment.” Create a variable to indicate post-treatment years (i.e. years after 2006). Call it After. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.

# create a variable, that is a factor, using the ifelse() function
# if the year is greater than 2006, mark "After" as 1
# if the year is NOT greater than 2006, mark "After" as 0
texas<-texas %>%
  mutate(After = factor(ifelse(test = Year>2006,
                        yes = 1,
                        no = 0)))
head(texas)

DistNumber	Year	OnCycle	LnAvgSalary	CycleSwitch	After
4.8e+06	2e+03	0	10.6	0	0
4.8e+06	2e+03	0	10.6	0	0
4.8e+06	2.00e+03	0	10.6	0	0
4.8e+06	2.01e+03	0	10.6	0	0
4.8e+06	2.01e+03	0	10.6	0	1
4.8e+06	2.01e+03	0	10.6	0	1

Part C

Now estimate a difference-in-difference model with your variables in Part B: CycleSwitch is the treatment variable, After is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).

dnd<-lm(LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, data = texas)
summary(dnd)

## 
## Call:
## lm(formula = LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After, 
##     data = texas)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.29811 -0.05595 -0.00734  0.05284  0.65245 
## 
## Coefficients:
##                     Estimate Std. Error  t value Pr(>|t|)    
## (Intercept)        10.671068   0.001433 7447.598  < 2e-16 ***
## CycleSwitch        -0.023982   0.003397   -7.059 1.83e-12 ***
## After1              0.009303   0.002188    4.251 2.16e-05 ***
## CycleSwitch:After1 -0.008590   0.005189   -1.655   0.0979 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.08334 on 7198 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.0183, Adjusted R-squared:  0.01789 
## F-statistic: 44.72 on 3 and 7198 DF,  p-value: < 2.2e-16
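Rounding the coefficient estimates in the output above to four decimal places, the estimated regression equation can be written as follows (the subscripts $i$ for district and $t$ for year are notation assumed here):

```latex
\widehat{\ln(\text{AvgSalary})}_{it} = 10.6711
  - 0.0240 \, \text{CycleSwitch}_{i}
  + 0.0093 \, \text{After}_{t}
  - 0.0086 \, (\text{CycleSwitch}_{i} \times \text{After}_{t})
```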

Part D

Interpret what each coefficient means from Part C.

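$\hat{\beta}_0$ (intercept): districts that did not switch had an average ln(AvgSalary) of 10.6711 before 2007
$\hat{\beta}_1$ (CycleSwitch): before 2007, districts that did switch had ln(AvgSalary) 0.0240 lower than districts that did not switch
$\hat{\beta}_2$ (After): districts that did not switch had ln(AvgSalary) 0.0093 higher after 2006 than before; this is the change over time assumed common to all districts
$\hat{\beta}_3$ (CycleSwitch:After): districts that switched had a change in ln(AvgSalary) over time that was 0.0086 lower than the change for districts that did not switch; this is the difference-in-difference estimate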

Part E

Using your regression equation in Part C, calculate the expected logged average salary for districts in Texas:

1. Before the switch that did not switch
2. After the switch that did not switch
3. Before the switch that did switch
4. After the switch that did switch

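1. Before change and did not switch: $\hat{\beta}_0 = 10.6711$
2. After change and did not switch: $\hat{\beta}_0 + \hat{\beta}_2 = 10.6711 + 0.0093 = 10.6804$
3. Before change and did switch: $\hat{\beta}_0 + \hat{\beta}_1 = 10.6711 - 0.0240 = 10.6471$
4. After change and did switch: $\hat{\beta}_0 + \hat{\beta}_1 + \hat{\beta}_2 + \hat{\beta}_3 = 10.6711 - 0.0240 + 0.0093 - 0.0086 = 10.6478$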
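The same four values can be computed in R by plugging 0s and 1s into the estimated equation. This is only a sketch: the coefficients are typed in by hand from the summary(dnd) output in Part C rather than re-estimated, and expected_ln_salary() is a hypothetical helper name.

```r
# coefficients copied from the summary(dnd) output in Part C
b <- c(intercept = 10.671068, switch = -0.023982,
       after = 0.009303, interaction = -0.008590)

# hypothetical helper: expected ln(AvgSalary) for one of the four cells
expected_ln_salary <- function(switched, after) {
  unname(b["intercept"] + b["switch"] * switched + b["after"] * after +
           b["interaction"] * switched * after)
}

cells <- c(before_no_switch = expected_ln_salary(0, 0),
           after_no_switch  = expected_ln_salary(0, 1),
           before_switch    = expected_ln_salary(1, 0),
           after_switch     = expected_ln_salary(1, 1))

round(cells, 4)
```

With the fitted model object available, predict(dnd, newdata = ...) on the four combinations of CycleSwitch and After should give the same numbers.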

Part F

Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data.⁸

# avg log salary for districts BEFORE that did NOT switch
texas %>%
  filter(CycleSwitch == 0,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE)) # need to remove NAs

average

10.7

# avg log salary for districts AFTER that did NOT switch
texas %>%
  filter(CycleSwitch == 0,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE)) # need to remove NAs

average

10.7

# avg log salary for districts BEFORE that DID switch
texas %>%
  filter(CycleSwitch == 1,
         After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE)) # need to remove NAs

average

10.6

# avg log salary for districts AFTER that DID switch
texas %>%
  filter(CycleSwitch == 1,
         After == 1) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE)) # need to remove NAs

average

10.6
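The four filter-then-summarize steps can also be collapsed into a single grouped summary. A minimal sketch on a small made-up data frame (toy and its values are hypothetical, standing in for texas):

```r
library(dplyr)

# toy data with hypothetical values (not the texas data)
toy <- tibble(
  CycleSwitch = c(0, 0, 0, 0, 1, 1, 1, 1),
  After       = factor(c(0, 0, 1, 1, 0, 0, 1, 1)),
  LnAvgSalary = c(10.6, 10.8, 10.7, NA, 10.5, 10.7, 10.6, 10.8)
)

# one row per treatment-by-period cell
cell_means <- toy %>%
  group_by(CycleSwitch, After) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE), # drop NAs
            .groups = "drop")

cell_means
```

Replacing toy with texas should return all four group means in one table, which makes them easier to compare side by side.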

Part G

Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.

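This is $\hat{\beta}_3$, the coefficient on the interaction term:

$$\Delta\Delta = (\text{Switch}_{\text{after}} - \text{Switch}_{\text{before}}) - (\text{No switch}_{\text{after}} - \text{No switch}_{\text{before}}) = (10.6478 - 10.6471) - (10.6804 - 10.6711) = 0.0007 - 0.0093 = -0.0086$$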

Part H

Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?

We can see that the districts that switched paid about 2.4% less to their teachers to begin with. $\hat{\beta}_1$ shows the difference, before treatment, between the treated group (districts that switched) and the control group (districts that did not switch).

We can see that teacher salaries rose by about 0.93% in the years after the switch, a change over time assumed common to all districts. $\hat{\beta}_2$ shows the difference in salaries over time (before and after the switch) shared by both the treatment and control groups.
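Reading a coefficient on a dummy as a percentage change is an approximation that works well when the coefficient is small, since the exact change in a log-linear model is $100 \times (e^{\hat{\beta}} - 1)$. A quick check, with the coefficients typed in by hand from the Part C output (the vector name b is an assumption for illustration):

```r
# coefficients copied from the summary(dnd) output in Part C
b <- c(switch = -0.023982, after = 0.009303, interaction = -0.008590)

approx_pct <- 100 * b             # usual "times 100" reading
exact_pct  <- 100 * (exp(b) - 1)  # exact percentage change

round(rbind(approx_pct, exact_pct), 2)
```

For coefficients this close to zero the two readings differ by only a few hundredths of a percentage point, so the approximation is harmless here.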

Part I

Now let’s generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district- and year-fixed effects and the interaction term. Interpret the coefficient on the interaction term.⁹

# make sure DistNumber and Year are factors for plm
texas<-texas %>%
  mutate_at(c("DistNumber", "Year"), factor)
head(texas) # make sure it worked

DistNumber	Year	OnCycle	LnAvgSalary	CycleSwitch	After
4800001	2003	0	10.6	0	0
4800001	2004	0	10.6	0	0
4800001	2005	0	10.6	0	0
4800001	2006	0	10.6	0	0
4800001	2007	0	10.6	0	1
4800001	2008	0	10.6	0	1

dnd_fe_reg <- plm(LnAvgSalary ~ CycleSwitch * After,
                  index = c("DistNumber", "Year"),
                  effect = "twoways",
                  data = texas)
summary(dnd_fe_reg)

## Twoways effects Within Model
## 
## Call:
## plm(formula = LnAvgSalary ~ CycleSwitch * After, data = texas, 
##     effect = "twoways", index = c("DistNumber", "Year"))
## 
## Unbalanced Panel: n = 1029, T = 6-7, N = 7202
## 
## Residuals:
##        Min.     1st Qu.      Median     3rd Qu.        Max. 
## -0.33262311 -0.01530044  0.00033517  0.01551322  0.57548546 
## 
## Coefficients:
##                      Estimate Std. Error t-value  Pr(>|t|)    
## CycleSwitch:After1 -0.0085935  0.0019861 -4.3268 1.537e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Total Sum of Squares:    6.2924
## Residual Sum of Squares: 6.2733
## R-Squared:      0.0030269
## Adj. R-Squared: -0.16432
## F-statistic: 18.7208 on 1 and 6166 DF, p-value: 1.5371e-05

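Districts that switched to on-cycle elections saw teacher salaries decrease by about 0.86% relative to districts that did not switch. The point estimate is essentially unchanged from our original difference-in-difference model (-0.0086), but the standard error is much smaller (0.0020 instead of 0.0052), so the estimate is now highly statistically significant.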
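The two-way fixed effects estimate is the same number we would get from the dummy variable method, with one dummy per district and per year. A minimal sketch on simulated data illustrates the equivalence; the panel, its variable names (dist, year, switched, post, y), and the parameter values are all made up for illustration and are not the texas data.

```r
library(plm)

set.seed(480)

# simulated balanced panel: 20 districts, 7 years (hypothetical data)
panel <- expand.grid(dist = 1:20, year = 2003:2009)
panel$switched <- as.integer(panel$dist <= 8)    # treated districts
panel$post     <- as.integer(panel$year > 2006)  # post-treatment years
panel$y <- 10 + 0.01 * panel$dist + 0.002 * (panel$year - 2003) -
  0.01 * panel$switched * panel$post + rnorm(nrow(panel), sd = 0.01)

# two-way within estimator
fe_plm <- plm(y ~ switched:post, data = panel,
              index = c("dist", "year"),
              model = "within", effect = "twoways")

# same model with explicit district and year dummies
fe_lm <- lm(y ~ switched:post + factor(dist) + factor(year), data = panel)

b_plm <- unname(coef(fe_plm)["switched:post"])
b_lm  <- unname(coef(fe_lm)["switched:post"])
all.equal(b_plm, b_lm)
```

The plm() version is usually more convenient because it does not print a coefficient for every district and year.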

Part J

Create a nice regression table (using huxtable) for comparison of the regressions in (a), (c), and (i).

huxreg(pooled_reg,
       dnd,
       dnd_fe_reg,
       coefs = c("Constant" = "(Intercept)",
                 "On Cycle" = "OnCycle",
                 "Switched" = "CycleSwitch",
                 "After Switch" = "After1",
                 "Interaction" = "CycleSwitch:After1"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
       number_format = 3) %>%
add_rows(c("Fixed Effects", "None", "None", "District and Year"), # add fixed effects row
         after = 11) # insert after 11th row

	(1)	(2)	(3)
Constant	10.671 ***	10.671 ***
	(0.001)	(0.001)
On Cycle	-0.031 ***
	(0.004)
Switched		-0.024 ***
		(0.003)
After Switch		0.009 ***
		(0.002)
Interaction		-0.009	-0.009 ***
		(0.005)	(0.002)
Fixed Effects	None	None	District and Year
N	7139	7202	7202
R-Squared	0.009	0.018	0.003
SER	0.083	0.083

1. Do this inside the summarize() command.
2. Don’t use the summarize() command for this part.
3. Set index=c("state","year") to indicate the group and time dimensions.
4. Ensure that state is a factor variable, and insert in the regression. You can either mutate() it into a factor beforehand, or just do as.factor(state) in the lm command.
5. Inside plm(), set index = "state" to indicate variable, and model = "within" to indicate a fixed effects model.
6. Inside plm(), set index = c("state", "year") to indicate both variables, and effect = "twoways" to indicate a 2-way fixed effects model.
7. It’s better to use the plm()-generated regressions so as to avoid the multitude of coefficients for the state and year dummies! You certainly could use the dummy variable ones and manually list all of the variables to suppress in the table inside omit_coefs()…
8. Hint, filter() properly then summarize().
9. This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using plm().

Problem Set 6 (Answers)

Ryan Safner

ECON 480 - Fall 2019
