Due by Thursday, November 21, 2019
Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.
In your own words, describe what fixed effects are, when we can use them, and how they remove endogeneity.
Fixed effects can be used for panel data, where each observation $it$ belongs to a group $i$ and a time period $t$. Running a simple OLS model as usual (known as a pooled model) would be biased due to systematic relationships within and between groups and time periods that cause $X_{it}$ to be endogenous:
$$\widehat{Y_{it}}=\beta_0+\beta_1 X_{it}+u_{it}$$
Group-fixed effects ($\alpha_i$) pull out of the error term all of the factors that explain $Y_{it}$ that are stable over time within each individual group ($i$), but vary across groups. For example, if groups are U.S. states, state-fixed effects pull out of the error term all of the idiosyncrasies of each state that do not change over time, such as cultural differences, geographical differences, historical differences, population differences, etc. Group-fixed effects do not pull out factors that change over time, such as federal laws passed, recessions affecting all states, etc.
Time-fixed effects ($\theta_t$) pull out of the error term all of the factors that explain $Y_{it}$ that change over time but do not vary across groups. For example, if groups are U.S. states and time is in years, year-fixed effects pull out of the error term all of the changes over the years that affect all U.S. states, such as recessions, inflation, national population growth, national immigration trends, or federal laws passed that affect all states. Time-fixed effects do not pull out factors that do not change over time.
$$\widehat{Y_{it}}=\beta_0+\beta_1 X_{it}+\alpha_i+\theta_t+\epsilon_{it}$$
Mechanically, OLS estimates a separate constant for each group (and/or each time period), giving the expected value of Yit for that group or time-period. This can be done by either de-meaning the data and calculating OLS coefficients by exploiting the variation within each group and/or time period (which is why fixed effects estimators are called “within” estimators), or by creating a dummy variable for each group and/or time period (and omitting one of each, to avoid the dummy variable trap).
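The equivalence of the de-meaned ("within") and dummy-variable estimators is easy to verify on simulated data. This is a self-contained sketch — all names and numbers here are illustrative, not from the problem set:

```r
# Sketch: the dummy-variable and de-meaned ("within") estimators give
# the same slope. Simulated panel with group fixed effects.
set.seed(42)
n_groups <- 5; n_periods <- 10
df <- data.frame(
  group = rep(1:n_groups, each = n_periods),
  x = rnorm(n_groups * n_periods)
)
alpha <- rnorm(n_groups)[df$group]            # group fixed effects
df$y <- 2 + 1.5 * df$x + alpha + rnorm(nrow(df), sd = 0.1)

# (1) dummy variable method
b_dummy <- coef(lm(y ~ x + factor(group), data = df))["x"]

# (2) within (de-meaned) method
df$x_dm <- with(df, x - ave(x, group))        # ave() subtracts group means
df$y_dm <- with(df, y - ave(y, group))
b_within <- coef(lm(y_dm ~ x_dm, data = df))["x_dm"]

all.equal(unname(b_dummy), unname(b_within))  # the slopes coincide
```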
In your own words, describe the logic of a difference-in-difference model: what is it comparing against what, and how does it estimate the effect of treatment? What assumption must be made about the treatment and control group for the model to be valid?
$$\widehat{Y_{it}}=\beta_0+\beta_1 \text{Treated}_i+\beta_2 \text{After}_t+\beta_3 (\text{Treated}_i \times \text{After}_t)+u_{it}$$
A difference-in-difference model compares the difference before and after a treatment between a group that receives the treatment and a group that does not. By doing so, we can isolate the effect of the treatment. It is easiest to see in the following equation:
$$\Delta\Delta Y=(\text{Treated}_{after}-\text{Treated}_{before})-(\text{Control}_{after}-\text{Control}_{before})$$
In OLS regression, we can simply make two dummy variables for each observation $it$, depending on the group and the time period the observation belongs to:
$$\text{Treated}_i=\begin{cases}1 & \text{if } i \text{ receives treatment}\\ 0 & \text{if } i \text{ does not receive treatment}\end{cases}$$

$$\text{After}_t=\begin{cases}1 & \text{if } t \text{ is after treatment period}\\ 0 & \text{if } t \text{ is before treatment period}\end{cases}$$
Lastly, we make an interaction term $\text{Treated}_i \times \text{After}_t$, which isolates the treatment effect, captured by the coefficient on this term, $\beta_3$.
Diff-in-diff models assume a counterfactual that if the treated group had not received treatment, it would have experienced the same change as the control group over the pre- to post-treatment period (the "parallel trends" assumption).
A classic example is when some states pass a law and others do not. We want to compare the difference in the difference before and after the law was passed between states that passed the law and states that did not:
$$\Delta_i\Delta_t \text{Outcome}=(\text{Passed}_{after}-\text{Passed}_{before})-(\text{Didn't}_{after}-\text{Didn't}_{before})$$
We can equivalently use group-fixed effects (which subsume $\text{Treated}_i$) and time-fixed effects (which subsume $\text{After}_t$) to estimate the treatment effect (still $\beta_3$):

$$\widehat{Y_{it}}=\alpha_i+\theta_t+\beta_3(\text{Treated}_i \times \text{After}_t)+\epsilon_{it}$$
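On simulated data (a self-contained sketch, not the homework data), we can verify that the OLS interaction coefficient is exactly the manual double difference of group means:

```r
# Toy check: with binary treated/after dummies and a full interaction,
# the interaction coefficient equals the difference-in-differences.
set.seed(1)
d <- expand.grid(id = 1:100, after = 0:1)
d$treated <- as.numeric(d$id <= 50)
effect <- 3  # the "true" treatment effect in this simulation
d$y <- 5 + 2 * d$treated + 1 * d$after +
  effect * d$treated * d$after + rnorm(nrow(d))

coef(lm(y ~ treated * after, data = d))["treated:after"]

# compare with the manual double difference of the four cell means:
m <- with(d, tapply(y, list(treated, after), mean))
(m["1", "1"] - m["1", "0"]) - (m["0", "1"] - m["0", "0"])
```

Both lines return the same number, which will be close to the true effect of 3.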
Answer the following questions using R. When necessary, please write answers in the same document (knitted `Rmd` to `html` or `pdf`, typed `.doc(x)`, or handwritten) as your answers to the above questions. Be sure to include (email or print an `.R` file, or show in your knitted `markdown`) your code and the outputs of your code with the rest of your answers.
How do people respond to changes in economic conditions? Are they more likely to pursue public service when private sector jobs are scarce? This dataset contains variables at the U.S. State (& D.C.) level:
| Variable | Description |
|---|---|
| `state` | U.S. State |
| `year` | Year |
| `appspc` | Applications to the Peace Corps (per capita) in State |
| `unemployrate` | State unemployment rate |
Do more people apply to the Peace Corps when unemployment increases (and reduces other opportunities)?
Before looking at the data, what does your economic intuition tell you? Explain your hypothesis.
If joining the Peace Corps is a substitute for private sector jobs, then as unemployment rises, so too should applications. However, it could also be that people are more willing to opt out of the private workforce when the economy is strong, so we should examine this empirically to be sure.
To get the hang of the data we're working with, `count` (separately) the number of `state`s and the number of `year`s. Get the number of `n_distinct()` `state`s and `year`s[^1], as well as the `distinct()` values of each[^2].
## ── Attaching packages ────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ───────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## Parsed with column specification:
## cols(
## state = col_character(),
## year = col_double(),
## unemployrate = col_double(),
## appspc = col_double(),
## stcode = col_double(),
## stateshort = col_character()
## )
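The code that produced the counts below is not echoed in the output above. Assuming the data frame is named `peacecorps` (the name used in the later code chunks), it would look something like:

```r
# count observations per state and per year
peacecorps %>% count(state)
peacecorps %>% count(year)

# number of distinct states and years (inside summarize, per the hint)
peacecorps %>%
  summarize(Num_states = n_distinct(state),
            Num_years = n_distinct(year))

# the distinct values themselves
peacecorps %>% distinct(state)
peacecorps %>% distinct(year)
```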
| state | n |
|---|---|
| ALABAMA | 6 |
| ALASKA | 6 |
| ARIZONA | 6 |
| ARKANSAS | 6 |
| CALIFORNIA | 6 |
| COLORADO | 6 |
| CONNECTICUT | 6 |
| DELAWARE | 6 |
| DISTRICT OF COLUMBIA | 6 |
| FLORIDA | 6 |

*(first 10 of 51 rows shown)*
| year | n |
|---|---|
| 2006 | 51 |
| 2007 | 51 |
| 2008 | 51 |
| 2009 | 51 |
| 2010 | 51 |
| 2011 | 51 |
| Num_states | Num_years |
|---|---|
| 51 | 6 |
| state |
|---|
| ALABAMA |
| ALASKA |
| ARIZONA |
| ARKANSAS |
| CALIFORNIA |
| COLORADO |
| CONNECTICUT |
| DELAWARE |
| DISTRICT OF COLUMBIA |
| FLORIDA |

*(first 10 of 51 values shown)*
| year |
|---|
| 2006 |
| 2007 |
| 2008 |
| 2009 |
| 2010 |
| 2011 |
Continuing our pre-analysis inspection, (install, and) load the `plm` package, and check the dimensions of the data with `pdim`.[^3]
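The corresponding (unechoed) code is presumably along these lines, again assuming the data frame is `peacecorps`:

```r
# install.packages("plm")  # if not yet installed
library(plm)
pdim(peacecorps, index = c("state", "year"))  # panel dimensions
```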
##
## Attaching package: 'plm'
## The following objects are masked from 'package:dplyr':
##
## between, lag, lead
## Balanced Panel: n = 51, T = 6, N = 306
Create a scatterplot of `appspc` (Y) on `unemployrate` (X). Which State is an outlier? How would this affect the pooled regression estimates? Create a second scatterplot that does not include this State.
ggplot(data = peacecorps)+
aes(x = unemployrate,
y = appspc)+
geom_point(aes(color = as.factor(state)))+
geom_smooth(method = "lm")+
scale_x_continuous(breaks=seq(0,20,2),
labels = function(x){paste(x,"%", sep="")})+
labs(x = "Unemployment Rate",
y = "Peace Corps Applications (per capita)")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=18)+
theme(legend.position = "none") # remove legend for State colors
We can see there are very clear outliers at the top. Let's plot `text` instead of `point`s, using `stateshort` to see which observations are which states:
ggplot(data = peacecorps)+
aes(x = unemployrate,
y = appspc)+
geom_text(aes(color = as.factor(state), label = stateshort))+ #<<
geom_smooth(method = "lm")+
scale_x_continuous(breaks=seq(0,20,2),
labels = function(x){paste(x,"%", sep="")})+
labs(x = "Unemployment Rate",
y = "Peace Corps Applications (per capita)")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=18)+
theme(legend.position = "none") # remove legend for State colors
Clearly the `DIS` observations, which are the District of Columbia, are the outliers. Let's make a second scatterplot without them:
peacecorps %>%
filter(state != "DISTRICT OF COLUMBIA") %>%
ggplot(data = .)+
aes(x = unemployrate,
y = appspc)+
geom_point(aes(color = as.factor(state)))+
geom_smooth(method = "lm")+
scale_x_continuous(breaks=seq(0,20,2),
labels = function(x){paste(x,"%", sep="")})+
labs(x = "Unemployment Rate",
y = "Peace Corps Applications (per capita)")+
theme_classic(base_family = "Fira Sans Condensed",
base_size=18)+
theme(legend.position = "none") # remove legend for State colors
Run two pooled regressions, one with the outliers, and one without them. Write out the estimated regression equation for each. Interpret the coefficient, and comment on how it changes between the two regressions.
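A sketch of the two regressions — the object names `pooled` and `pooled_no_outliers` are taken from the `huxreg()` call later in the document, and `peacecorps_no_outliers` drops D.C. as above:

```r
pooled <- lm(appspc ~ unemployrate, data = peacecorps)
summary(pooled)

peacecorps_no_outliers <- peacecorps %>%
  filter(state != "DISTRICT OF COLUMBIA")
pooled_no_outliers <- lm(appspc ~ unemployrate,
                         data = peacecorps_no_outliers)
summary(pooled_no_outliers)
```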
##
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -43.616 -20.641 -9.719 5.171 310.328
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.2255 7.2857 6.345 8.07e-10 ***
## unemployrate 0.8353 1.0362 0.806 0.421
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 45.2 on 304 degrees of freedom
## Multiple R-squared: 0.002133, Adjusted R-squared: -0.00115
## F-statistic: 0.6498 on 1 and 304 DF, p-value: 0.4208
$$\widehat{\text{Apps per capita}}_{it}=46.23+0.84\,\text{Unemployment Rate}_{it}$$
For every 1 percentage point increase in the unemployment rate, there are 0.84 more applications to the Peace Corps per capita (though this estimate is not statistically significant).
##
## Call:
## lm(formula = appspc ~ unemployrate, data = peacecorps_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -38.380 -14.544 -4.190 9.278 97.997
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 48.5891 3.5955 13.514 <2e-16 ***
## unemployrate -0.3710 0.5133 -0.723 0.47
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.17 on 298 degrees of freedom
## Multiple R-squared: 0.001751, Adjusted R-squared: -0.001599
## F-statistic: 0.5226 on 1 and 298 DF, p-value: 0.4703
$$\widehat{\text{Apps per capita}}_{it}=48.59-0.37\,\text{Unemployment Rate}_{it}$$
For every 1 percentage point increase in unemployment, there are 0.37 fewer applications to the Peace Corps per capita.
The coefficient changes sign and shrinks in magnitude when we drop the outliers (D.C.)!
Now run a regression with State fixed effects using the dummy variable method.[^4] Interpret the marginal effect of `unemployrate` on `appspc`. How did it change?
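One way to run this — the object name `fe_reg_alt` is a guess, matching the later `huxreg()` call:

```r
fe_reg_alt <- lm(appspc ~ unemployrate + factor(state),
                 data = peacecorps)
summary(fe_reg_alt)
```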
##
## Call:
## lm(formula = appspc ~ unemployrate + factor(state), data = peacecorps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -31.652 -3.661 -0.393 3.492 33.262
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.6070 3.7663 3.347 0.000939 ***
## unemployrate 0.6565 0.2330 2.817 0.005225 **
## factor(state)ALASKA 52.5875 4.8452 10.854 < 2e-16 ***
## factor(state)ARIZONA 18.0456 4.8463 3.724 0.000242 ***
## factor(state)ARKANSAS 4.9117 4.8447 1.014 0.311630
## factor(state)CALIFORNIA 22.1455 4.8692 4.548 8.39e-06 ***
## factor(state)COLORADO 69.9201 4.8452 14.431 < 2e-16 ***
## factor(state)CONNECTICUT 28.9349 4.8446 5.973 7.84e-09 ***
## factor(state)DELAWARE 21.8317 4.8489 4.502 1.02e-05 ***
## factor(state)DISTRICT OF COLUMBIA 311.7036 4.8533 64.225 < 2e-16 ***
## factor(state)FLORIDA 13.0805 4.8492 2.697 0.007456 **
## factor(state)GEORGIA 18.7073 4.8486 3.858 0.000145 ***
## factor(state)HAWAII 35.7883 4.8617 7.361 2.53e-12 ***
## factor(state)IDAHO 34.4465 4.8480 7.105 1.21e-11 ***
## factor(state)ILLINOIS 27.8559 4.8503 5.743 2.65e-08 ***
## factor(state)INDIANA 21.5183 4.8478 4.439 1.35e-05 ***
## factor(state)IOWA 28.9038 4.8614 5.946 9.06e-09 ***
## factor(state)KANSAS 28.9728 4.8507 5.973 7.83e-09 ***
## factor(state)KENTUCKY 9.6535 4.8540 1.989 0.047800 *
## factor(state)LOUISIANA 2.5206 4.8517 0.520 0.603844
## factor(state)MAINE 53.3394 4.8450 11.009 < 2e-16 ***
## factor(state)MARYLAND 45.1903 4.8513 9.315 < 2e-16 ***
## factor(state)MASSACHUSETTS 44.9452 4.8450 9.277 < 2e-16 ***
## factor(state)MICHIGAN 26.8719 4.8970 5.487 9.86e-08 ***
## factor(state)MINNESOTA 45.6153 4.8476 9.410 < 2e-16 ***
## factor(state)MISSISSIPPI -5.6364 4.8607 -1.160 0.247307
## factor(state)MISSOURI 19.9421 4.8458 4.115 5.23e-05 ***
## factor(state)MONTANA 66.2285 4.8583 13.632 < 2e-16 ***
## factor(state)NEBRASKA 26.6913 4.8904 5.458 1.14e-07 ***
## factor(state)NEVADA 12.7169 4.8767 2.608 0.009656 **
## factor(state)NEW HAMPSHIRE 54.0272 4.8658 11.103 < 2e-16 ***
## factor(state)NEW JERSEY 13.4382 4.8452 2.774 0.005956 **
## factor(state)NEW MEXICO 28.0890 4.8503 5.791 2.06e-08 ***
## factor(state)NEW YORK 21.5156 4.8446 4.441 1.34e-05 ***
## factor(state)NORTH CAROLINA 23.2097 4.8533 4.782 2.94e-06 ***
## factor(state)NORTH DAKOTA 19.6067 4.9034 3.999 8.36e-05 ***
## factor(state)OHIO 25.2887 4.8500 5.214 3.83e-07 ***
## factor(state)OKLAHOMA 10.4850 4.8560 2.159 0.031774 *
## factor(state)OREGON 72.7034 4.8545 14.976 < 2e-16 ***
## factor(state)PENNSYLVANIA 21.7160 4.8449 4.482 1.12e-05 ***
## factor(state)RHODE ISLAND 34.0396 4.8654 6.996 2.33e-11 ***
## factor(state)SOUTH CAROLINA 12.0780 4.8651 2.483 0.013690 *
## factor(state)SOUTH DAKOTA 22.9175 4.8862 4.690 4.46e-06 ***
## factor(state)TENNESSEE 9.1572 4.8498 1.888 0.060142 .
## factor(state)TEXAS 9.2251 4.8455 1.904 0.058062 .
## factor(state)UTAH 18.8667 4.8571 3.884 0.000131 ***
## factor(state)VERMONT 104.7363 4.8580 21.560 < 2e-16 ***
## factor(state)VIRGINIA 46.9318 4.8607 9.655 < 2e-16 ***
## factor(state)WASHINGTON 63.8660 4.8460 13.179 < 2e-16 ***
## factor(state)WEST VIRGINIA 3.9514 4.8461 0.815 0.415620
## factor(state)WISCONSIN 35.8930 4.8448 7.409 1.89e-12 ***
## factor(state)WYOMING 35.3036 4.8665 7.254 4.88e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.391 on 254 degrees of freedom
## Multiple R-squared: 0.9713, Adjusted R-squared: 0.9655
## F-statistic: 168.4 on 51 and 254 DF, p-value: < 2.2e-16
For every 1 percentage point increase in the unemployment rate, there are 0.66 more applications to the Peace Corps per capita. The effect remained positive and shrank slightly in size relative to the original pooled regression (with outliers), and is now statistically significant.
Find the coefficient for Maryland and interpret it. How many applications per capita does Maryland have?
This is just like any model with dummy variables representing different categories. The constant (12.61) represents the reference category not put in the regression (to avoid the dummy variable trap), i.e. ALABAMA.
The coefficient on Maryland is 45.19. This means that, on average, there are 45.19 more applications per capita in Maryland than Alabama over the time period in the data. To get the average for Maryland, we add 12.61+45.19=57.80.
## [1] 57.79731
Now try using the `plm()` command, which de-means the data, and make sure you get the same results as Part F.[^5] Do you get the same marginal effect of `unemployrate` on `appspc`?
## Oneway (individual) effect Within Model
##
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, model = "within",
## index = "state")
##
## Balanced Panel: n = 51, T = 6, N = 306
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -31.65231 -3.66134 -0.39349 3.49204 33.26228
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## unemployrate 0.65651 0.23303 2.8172 0.005225 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 18443
## Residual Sum of Squares: 17884
## R-Squared: 0.0303
## Adj. R-Squared: -0.1644
## F-statistic: 7.93677 on 1 and 254 DF, p-value: 0.0052247
Now include year fixed effects in your regression, using the dummy variable method. Interpret the marginal effect of `unemployrate` on `appspc`. How did it change?
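A sketch of the two-way dummy regression — the object name `fe_2way_reg` matches the coefficient-extraction code later in this section:

```r
fe_2way_reg <- lm(appspc ~ unemployrate + factor(state) + factor(year),
                  data = peacecorps)
summary(fe_2way_reg)
```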
##
## Call:
## lm(formula = appspc ~ unemployrate + factor(state) + factor(year),
## data = peacecorps)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.7457 -3.1669 0.1384 2.7666 31.1644
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 14.48800 3.91879 3.697 0.000268 ***
## unemployrate 0.70525 0.54733 1.289 0.198760
## factor(state)ALASKA 52.57126 4.08265 12.877 < 2e-16 ***
## factor(state)ARIZONA 18.01801 4.09035 4.405 1.57e-05 ***
## factor(state)ARKANSAS 4.91897 4.07940 1.206 0.229036
## factor(state)CALIFORNIA 22.04317 4.23743 5.202 4.12e-07 ***
## factor(state)COLORADO 69.93711 4.08307 17.129 < 2e-16 ***
## factor(state)CONNECTICUT 28.93000 4.07894 7.093 1.36e-11 ***
## factor(state)DELAWARE 21.87470 4.10713 5.326 2.24e-07 ***
## factor(state)DISTRICT OF COLUMBIA 311.64268 4.13555 75.357 < 2e-16 ***
## factor(state)FLORIDA 13.03582 4.10931 3.172 0.001702 **
## factor(state)GEORGIA 18.66589 4.10502 4.547 8.50e-06 ***
## factor(state)HAWAII 35.87361 4.18953 8.563 1.16e-15 ***
## factor(state)IDAHO 34.48472 4.10104 8.409 3.24e-15 ***
## factor(state)ILLINOIS 27.80635 4.11635 6.755 1.00e-10 ***
## factor(state)INDIANA 21.48091 4.10010 5.239 3.44e-07 ***
## factor(state)IOWA 28.98828 4.18745 6.923 3.75e-11 ***
## factor(state)KANSAS 29.02399 4.11886 7.047 1.79e-11 ***
## factor(state)KENTUCKY 9.59010 4.14017 2.316 0.021351 *
## factor(state)LOUISIANA 2.57585 4.12547 0.624 0.532951
## factor(state)MAINE 53.35320 4.08152 13.072 < 2e-16 ***
## factor(state)MARYLAND 45.24393 4.12277 10.974 < 2e-16 ***
## factor(state)MASSACHUSETTS 44.95986 4.08187 11.015 < 2e-16 ***
## factor(state)MICHIGAN 26.72242 4.41044 6.059 5.03e-09 ***
## factor(state)MINNESOTA 45.65107 4.09827 11.139 < 2e-16 ***
## factor(state)MISSISSIPPI -5.71925 4.18336 -1.367 0.172813
## factor(state)MISSOURI 19.91937 4.08656 4.874 1.95e-06 ***
## factor(state)MONTANA 66.30488 4.16773 15.909 < 2e-16 ***
## factor(state)NEBRASKA 26.83099 4.36996 6.140 3.24e-09 ***
## factor(state)NEVADA 12.59989 4.28489 2.941 0.003585 **
## factor(state)NEW HAMPSHIRE 54.12221 4.21590 12.838 < 2e-16 ***
## factor(state)NEW JERSEY 13.42198 4.08265 3.288 0.001156 **
## factor(state)NEW MEXICO 28.13855 4.11635 6.836 6.25e-11 ***
## factor(state)NEW YORK 21.52042 4.07894 5.276 2.87e-07 ***
## factor(state)NORTH CAROLINA 23.14879 4.13555 5.598 5.73e-08 ***
## factor(state)NORTH DAKOTA 19.76510 4.44960 4.442 1.34e-05 ***
## factor(state)OHIO 25.24078 4.11393 6.135 3.32e-09 ***
## factor(state)OKLAHOMA 10.55484 4.15333 2.541 0.011652 *
## factor(state)OREGON 72.63838 4.14334 17.531 < 2e-16 ***
## factor(state)PENNSYLVANIA 21.72903 4.08118 5.324 2.26e-07 ***
## factor(state)RHODE ISLAND 33.94537 4.21360 8.056 3.28e-14 ***
## factor(state)SOUTH CAROLINA 11.98462 4.21132 2.846 0.004799 **
## factor(state)SOUTH DAKOTA 23.05074 4.34429 5.306 2.48e-07 ***
## factor(state)TENNESSEE 9.11011 4.11274 2.215 0.027659 *
## factor(state)TEXAS 9.24536 4.08494 2.263 0.024480 *
## factor(state)UTAH 18.93985 4.16038 4.552 8.30e-06 ***
## factor(state)VERMONT 104.81188 4.16587 25.160 < 2e-16 ***
## factor(state)VIRGINIA 47.01469 4.18336 11.239 < 2e-16 ***
## factor(state)WASHINGTON 63.84082 4.08836 15.615 < 2e-16 ***
## factor(state)WEST VIRGINIA 3.97742 4.08900 0.973 0.331641
## factor(state)WISCONSIN 35.90359 4.08029 8.799 2.35e-16 ***
## factor(state)WYOMING 35.40021 4.22056 8.388 3.73e-15 ***
## factor(year)2007 -5.36842 1.39953 -3.836 0.000159 ***
## factor(year)2008 0.02018 1.48030 0.014 0.989135
## factor(year)2009 4.23877 2.61984 1.618 0.106939
## factor(year)2010 -3.25521 2.76284 -1.178 0.239836
## factor(year)2011 -8.88542 2.47990 -3.583 0.000409 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.064 on 249 degrees of freedom
## Multiple R-squared: 0.98, Adjusted R-squared: 0.9756
## F-statistic: 218.3 on 56 and 249 DF, p-value: < 2.2e-16
For every 1 percentage point increase in the unemployment rate, there are 0.71 more applications to the Peace Corps per capita, though the effect is no longer statistically significant once year fixed effects are included.
What would be the predicted number of applications in Maryland in 2011 at an unemployment rate of 5%?
$$\widehat{\text{Apps per capita}}_{it}=14.49+0.71\,\text{Unemployment Rate}_{it}+45.24\,\text{Maryland}_i-8.89\,\text{Year 2011}_t$$
$$\widehat{\text{Apps per capita}}_{MD,2011}=14.49+0.71(5)+45.24(1)-8.89(1)$$
$$\widehat{\text{Apps per capita}}_{MD,2011}\approx 54.39$$

(Note $-8.89$ is the coefficient on the 2011 year dummy, since we are predicting for 2011.)
# if we want to extract it using R (tidy() is from the broom package)
library(broom)
fe_2way_coefs <- tidy(fe_2way_reg)
constant <- fe_2way_coefs %>%
  filter(term == "(Intercept)") %>%
  pull(estimate)
unemploy_me <- fe_2way_coefs %>%
  filter(term == "unemployrate") %>%
  pull(estimate)
MD_dif <- fe_2way_coefs %>%
  filter(term == "factor(state)MARYLAND") %>%
  pull(estimate)
dif_2011 <- fe_2way_coefs %>%
  filter(term == "factor(year)2011") %>% # the 2011 dummy, since we predict for 2011
  pull(estimate)
constant + unemploy_me * 5 + MD_dif + dif_2011
## [1] 54.37276
Now try using the `plm()` command, which de-means the data, and make sure you get the same results as Part I.[^6] Do you get the same marginal effect of `unemployrate` on `appspc`?
## Twoways effects Within Model
##
## Call:
## plm(formula = appspc ~ unemployrate, data = peacecorps, effect = "twoways",
## index = c("state", "year"))
##
## Balanced Panel: n = 51, T = 6, N = 306
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -26.74570 -3.16688 0.13837 2.76660 31.16439
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## unemployrate 0.70525 0.54733 1.2885 0.1988
##
## Total Sum of Squares: 12509
## Residual Sum of Squares: 12426
## R-Squared: 0.0066237
## Adj. R-Squared: -0.21679
## F-statistic: 1.66029 on 1 and 249 DF, p-value: 0.19876
Can there still be endogeneity in this model? Give some examples.
Yes. There may be other time-varying omitted variables that are correlated with unemployment but not picked up by the time-fixed effects, since they are not the same in each state. Examples could be a local market bubble in some years in Texas or Nevada but not in Maine, or bad weather in Florida one year but not in Wyoming, etc.[^7]
Create a nice regression table (using `huxtable`) for comparison of the regressions in E, G, and I.
##
## Attaching package: 'huxtable'
## The following object is masked from 'package:dplyr':
##
## add_rownames
## The following object is masked from 'package:purrr':
##
## every
## The following object is masked from 'package:ggplot2':
##
## theme_grey
huxreg(pooled,
pooled_no_outliers,
fe_reg_alt,
fe_2way_reg_alt,
coefs = c("Constant" = "(Intercept)",
"Unemployment Rate" = "unemployrate"),
statistics = c("N" = "nobs",
"R-Squared" = "r.squared",
"SER" = "sigma"),
note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
number_format = 3) %>%
add_rows(c("Outliers", "Yes", "No DC", "Yes", "Yes"), # add row for outliers
after = 5) %>% # insert after 5th row
add_rows(c("Fixed Effects", "None", "None", "State", "State and Year"), # add fixed effects row
after = 6) # insert after 6th row
| | (1) | (2) | (3) | (4) |
|---|---|---|---|---|
| Constant | 46.226 *** | 48.589 *** | | |
| | (7.286) | (3.595) | | |
| Unemployment Rate | 0.835 | -0.371 | 0.657 ** | 0.705 |
| | (1.036) | (0.513) | (0.233) | (0.547) |
| Outliers | Yes | No DC | Yes | Yes |
| Fixed Effects | None | None | State | State and Year |
| N | 306 | 300 | 306 | 306 |
| R-Squared | 0.002 | 0.002 | 0.030 | 0.007 |
| SER | 45.205 | 22.170 | | |
Are teachers paid more when school board members are elected “off cycle” when there are not major national political elections (e.g. odd years) than “on cycle?” The argument is that during “off” years, without attention on state or national elections, voters will pay less attention to the election, and teachers can more effectively mobilize for higher pay, versus “on” years where voters are paying more attention. This data comes from Anzia, Sarah (2012), “The Election Timing Effect: Evidence from a Policy Intervention in Texas.” Quarterly Journal of Political Science 7(3): 277-297, and follows 1,020 Texas school board districts from 2003–2009.
From 2003–2006, all districts elected their school board members off-cycle. A change in Texas policy in 2006 led some, but not all, districts to elect their school board members on-cycle from 2007 onwards.
| Variable | Description |
|---|---|
| `LnAvgSalary` | logged average salary of teachers in district |
| `Year` | Year |
| `OnCycle` | =1 if school boards elected on-cycle (e.g. same year as national and state elections), =0 if elected off-cycle |
| `pol_freedom` | Political freedom index score (2018) from 1 (least) to 10 (most free) |
| `CycleSwitch` | =1 if district switched from off- to on-cycle elections |
| `AfterSwitch` | =1 if year is after 2006 |
Run a pooled regression model of `LnAvgSalary` on `OnCycle`. Write the estimated regression equation, and interpret the coefficient on `OnCycle`. Are there any sources of bias (consider in particular the argument in the question prompt)?
## Parsed with column specification:
## cols(
## DistNumber = col_double(),
## Year = col_double(),
## OnCycle = col_double(),
## LnAvgSalary = col_double(),
## CycleSwitch = col_double()
## )
##
## Call:
## lm(formula = LnAvgSalary ~ OnCycle, data = texas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.28910 -0.05527 -0.00646 0.05196 0.65215
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.671363 0.001018 10478.815 <2e-16 ***
## OnCycle -0.030621 0.003766 -8.131 5e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08284 on 7137 degrees of freedom
## (64 observations deleted due to missingness)
## Multiple R-squared: 0.009178, Adjusted R-squared: 0.009039
## F-statistic: 66.11 on 1 and 7137 DF, p-value: 4.995e-16
$$\widehat{\ln(\text{Average Salary})}_{it}=10.67-0.03\,\text{OnCycle}_{it}$$
Recognize this is a log-linear model and X is a dummy variable. School board elections that are held `OnCycle` (=1) are associated with $\hat{\beta_1} \times 100\% = -0.03 \times 100\% = -3\%$ lower teacher salaries. This is highly statistically significant.
This is almost certainly biased. There are probably things correlated at the district level between both whether or not districts vote (or switch to voting) on cycle/off cycle and with teacher salaries. Perhaps some districts have strong/weak teachers unions and that determines whether they choose to vote on or off cycle (off cycle would be preferable, and perhaps districts with strong teachers unions kept the elections off cycle).
Also, there could be changes in the entire State over time; perhaps all Texas teacher salaries were increasing or decreasing over the years.
Some schools decided to switch to an on-cycle election after 2006. Consider this, `CycleSwitch`, the "treatment." Create a variable to indicate post-treatment years (i.e. years after 2006); call it `After`. Create a second, interaction variable to capture the interaction effect between those districts that switched, and after the treatment.
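A sketch of the variable creation, assuming the data frame is named `texas` (wrapping `After` in `factor()` would explain the `After1` coefficient labels in the regression output below):

```r
texas <- texas %>%
  mutate(After = factor(ifelse(Year > 2006, 1, 0)))
# the interaction can be created explicitly as a new column, or simply
# specified as CycleSwitch:After inside lm() (as done in the next part)
```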
| DistNumber | Year | OnCycle | LnAvgSalary | CycleSwitch | After |
|---|---|---|---|---|---|
| 4800001 | 2003 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2004 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2005 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2006 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2007 | 0 | 10.6 | 0 | 1 |
| 4800001 | 2008 | 0 | 10.6 | 0 | 1 |
Now estimate a difference-in-difference model with your variables in Part B: `CycleSwitch` is the treatment variable, `After` is your post-treatment indicator, and add an interaction variable to capture the interaction effect between those districts that switched, and after the treatment. Write down the estimated regression equation (to four decimal places).
##
## Call:
## lm(formula = LnAvgSalary ~ CycleSwitch + After + CycleSwitch:After,
## data = texas)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.29811 -0.05595 -0.00734 0.05284 0.65245
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.671068 0.001433 7447.598 < 2e-16 ***
## CycleSwitch -0.023982 0.003397 -7.059 1.83e-12 ***
## After1 0.009303 0.002188 4.251 2.16e-05 ***
## CycleSwitch:After1 -0.008590 0.005189 -1.655 0.0979 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.08334 on 7198 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.0183, Adjusted R-squared: 0.01789
## F-statistic: 44.72 on 3 and 7198 DF, p-value: < 2.2e-16
$$\widehat{\ln(\text{Average Salary})}_{it}=10.6711-0.0240\,\text{Cycle Switch}_i+0.0093\,\text{After}_t-0.0086\,(\text{Cycle Switch}_i \times \text{After}_t)$$
Interpret what each coefficient means from Part C.

- $\hat{\beta_0}$: Before the switch, districts that did not switch had an average `ln(AvgSalary)` of 10.6711
- $\hat{\beta_1}$: Before the switch, districts that (later) switched had an average `ln(AvgSalary)` 0.0240 lower than districts that did not switch
- $\hat{\beta_2}$: After the change, districts had an average `ln(AvgSalary)` 0.0093 higher than before the change
- $\hat{\beta_3}$: Districts that switched saw `ln(AvgSalary)` rise 0.0086 less over time than districts that did not switch

Using your regression equation in Part C, calculate the expected logged average salary (Y) for each of the four groups of districts in Texas (switched vs. didn't switch, before vs. after):
Confirm your estimates in Part E by finding the mean logged average salary for each of those four groups in the data.[^8]
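Per the hint, each group mean can be computed by filtering and then summarizing. For example, for districts that did not switch, before the switch:

```r
texas %>%
  filter(CycleSwitch == 0, After == 0) %>%
  summarize(average = mean(LnAvgSalary, na.rm = TRUE))
# repeat for the other three CycleSwitch/After combinations
```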
| Group | average |
|---|---|
| Didn't switch, before | 10.7 |
| Didn't switch, after | 10.7 |
| Switched, before | 10.6 |
| Switched, after | 10.6 |

(Values are rounded in the displayed output; the unrounded means appear in the next part.)
Write out the difference-in-difference equation, and calculate the difference-in-difference. Make sure it matches your estimate from the regression.
$$\begin{aligned}\Delta\Delta \ln(\text{AvgSalary})&=(\text{Switched}_{after}-\text{Switched}_{before})-(\text{Didn't}_{after}-\text{Didn't}_{before})\\ &=(10.6478-10.6471)-(10.6804-10.6711)\\ &=(0.0007)-(0.0093)\\ &=-0.0086\end{aligned}$$
This is β3.
Can we say anything about the types of districts that switched? Can we say anything about all salaries in the districts in the years after the switch?
We can see that the districts that switched paid about 2.4% less to their teachers to begin with. $\hat{\beta_1}$ shows the difference, before treatment, between the treated group (districts that switched) and the control group (districts that did not switch).
We can see that all districts saw a 0.93% increase in teacher salaries in the years after the switch. $\hat{\beta_2}$ shows the change over time (before and after the switch) common to both the treatment and control groups.
Now let's generalize the diff-in-diff model. Instead of the treatment and post-treatment dummies, use district- and year-fixed effects and the interaction term. Interpret the coefficient on the interaction term.[^9]
| DistNumber | Year | OnCycle | LnAvgSalary | CycleSwitch | After |
|---|---|---|---|---|---|
| 4800001 | 2003 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2004 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2005 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2006 | 0 | 10.6 | 0 | 0 |
| 4800001 | 2007 | 0 | 10.6 | 0 | 1 |
| 4800001 | 2008 | 0 | 10.6 | 0 | 1 |
## Twoways effects Within Model
##
## Call:
## plm(formula = LnAvgSalary ~ CycleSwitch * After, data = texas,
## effect = "twoways", index = c("DistNumber", "Year"))
##
## Unbalanced Panel: n = 1029, T = 6-7, N = 7202
##
## Residuals:
## Min. 1st Qu. Median 3rd Qu. Max.
## -0.33262311 -0.01530044 0.00033517 0.01551322 0.57548546
##
## Coefficients:
## Estimate Std. Error t-value Pr(>|t|)
## CycleSwitch:After1 -0.0085935 0.0019861 -4.3268 1.537e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Total Sum of Squares: 6.2924
## Residual Sum of Squares: 6.2733
## R-Squared: 0.0030269
## Adj. R-Squared: -0.16432
## F-statistic: 18.7208 on 1 and 6166 DF, p-value: 1.5371e-05
Districts that switched to on-cycle elections saw teacher salaries decrease by about 0.86% ($-0.0086 \times 100\%$). The estimate is unchanged from our original diff-in-diff model, though it is now statistically significant!
Create a nice regression table (using `huxtable`) for comparison of the regressions in (a), (c), and (i).
huxreg(pooled_reg,
dnd,
dnd_fe_reg,
coefs = c("Constant" = "(Intercept)",
"On Cycle" = "OnCycle",
"Switched" = "CycleSwitch",
"After Switch" = "After1",
"Interaction" = "CycleSwitch:After1"),
statistics = c("N" = "nobs",
"R-Squared" = "r.squared",
"SER" = "sigma"),
note = NULL, # suppress footnote for stars, to insert Fixed Effects Row below
number_format = 3) %>%
add_rows(c("Fixed Effects", "None", "None", "District and Year"), # add fixed effects row
after = 11) # insert after 11th row
| | (1) | (2) | (3) |
|---|---|---|---|
| Constant | 10.671 *** | 10.671 *** | |
| | (0.001) | (0.001) | |
| On Cycle | -0.031 *** | | |
| | (0.004) | | |
| Switched | | -0.024 *** | |
| | | (0.003) | |
| After Switch | | 0.009 *** | |
| | | (0.002) | |
| Interaction | | -0.009 | -0.009 *** |
| | | (0.005) | (0.002) |
| Fixed Effects | None | None | District and Year |
| N | 7139 | 7202 | 7202 |
| R-Squared | 0.009 | 0.018 | 0.003 |
| SER | 0.083 | 0.083 | |
[^1]: Do this inside the `summarize()` command
[^2]: Don't use the `summarize()` command for this part
[^3]: Set `index=c("state","year")` to indicate the group and time dimensions.
[^4]: Ensure that `state` is a factor variable, and insert it in the regression. You can either `mutate()` it into a `factor` beforehand, or just do `as.factor(state)` in the `lm` command.
[^5]: Inside `plm()`, set `index = "state"` to indicate the group variable, and `model = "within"` to indicate a fixed effects model.
[^6]: Inside `plm()`, set `index = c("state", "year")` to indicate both variables, and `effect = "twoways"` to indicate a 2-way fixed effects model.
[^7]: It's better to use the `plm()`-generated regressions so as to avoid the multitude of coefficients for the state and year dummies! You certainly could use the dummy variable ones and manually list all of the variables to suppress in the table inside `omit_coefs()`…
[^8]: Hint: `filter()` properly, then `summarize()`.
[^9]: This is doable with the dummy variable method, but there will be a lot of dummies! I suggest using `plm()`.