The big challenge in applied econometrics is choosing how to specify a model to regress
Every dataset is different, every study has a different goal
But here are some helpful tips and frequent problems (and solutions)
1. Identify your question of interest: what do you want to know? What marginal effect(s) do you want to estimate?
1. Identify your question of interest: what do you want to know? What marginal effect(s) do you want to estimate?
2. Think about possible sources of endogeneity: what other variables would cause omitted variable bias if we left them out? Can we get data on them too?
1. Identify your question of interest: what do you want to know? What marginal effect(s) do you want to estimate?
2. Think about possible sources of endogeneity: what other variables would cause omitted variable bias if we left them out? Can we get data on them too?
3. Run multiple models and check the robustness of your results: does the size (or direction) of your marginal effect(s) of interest change as you change your model (i.e. add more variables)?
1. Identify your question of interest: what do you want to know? What marginal effect(s) do you want to estimate?
2. Think about possible sources of endogeneity: what other variables would cause omitted variable bias if we left them out? Can we get data on them too?
3. Run multiple models and check the robustness of your results: does the size (or direction) of your marginal effect(s) of interest change as you change your model (i.e. add more variables)?
4. Interpret your results
Ideally, we would want a randomized control experiment to assign individuals to treatment
But with observational data, ui depends on other factors
Ideally, we would want a randomized control experiment to assign individuals to treatment
But with observational data, ui depends on other factors
Example: Consider test scores and class sizes again. What about learning opportunities outside of school?
Example: Consider test scores and class sizes again. What about learning opportunities outside of school?
Example: Consider test scores and class sizes again. What about learning opportunities outside of school?
Probably a bias-causing omitted variable (affects test score and correlated with class size) but we can't measure it!
Suppose we can measure a variable V, and significantly, corr(V,Z)≠0
meal_pct
)Example: Consider test scores and class sizes again. What about learning opportunities outside of school?
Probably a bias-causing omitted variable (affects test score and correlated with class size) but we can't measure it!
Suppose we can measure a variable V, and significantly, corr(V,Z)≠0
meal_pct
)This is a good proxy for income-determined learning opportunities outside of school
meal_pct
is correlated with Income (which is likely correlated with both class size and test score)meal_pct
to be strongly negatively correlated with incomeWe assumed we don't have data on average district income, we would expect meal_pct
to be strongly negatively correlated with income
Just kidding, we do have data on avginc
, but we'll only use it to confirm our suspicion:
We assumed we don't have data on average district income, we would expect meal_pct
to be strongly negatively correlated with income
Just kidding, we do have data on avginc
, but we'll only use it to confirm our suspicion:
CASchool %>% select(testscr, str, el_pct, avginc,meal_pct) %>% cor()
## testscr str el_pct avginc meal_pct## testscr 1.0000000 -0.2263628 -0.6441237 0.7124308 -0.8687720## str -0.2263628 1.0000000 0.1876424 -0.2321937 0.1352034## el_pct -0.6441237 0.1876424 1.0000000 -0.3074195 0.6530607## avginc 0.7124308 -0.2321937 -0.3074195 1.0000000 -0.6844396## meal_pct -0.8687720 0.1352034 0.6530607 -0.6844396 1.0000000
We assumed we don't have data on average district income, we would expect meal_pct
to be strongly negatively correlated with income
Just kidding, we do have data on avginc
, but we'll only use it to confirm our suspicion:
CASchool %>% select(testscr, str, el_pct, avginc,meal_pct) %>% cor()
## testscr str el_pct avginc meal_pct## testscr 1.0000000 -0.2263628 -0.6441237 0.7124308 -0.8687720## str -0.2263628 1.0000000 0.1876424 -0.2321937 0.1352034## el_pct -0.6441237 0.1876424 1.0000000 -0.3074195 0.6530607## avginc 0.7124308 -0.2321937 -0.3074195 1.0000000 -0.6844396## meal_pct -0.8687720 0.1352034 0.6530607 -0.6844396 1.0000000
meal_pct
is strongly (negatively) correlated with income, as expectedproxyreg<-lm(testscr~str+el_pct+meal_pct, data=CASchool)summary(proxyreg)
## ## Call:## lm(formula = testscr ~ str + el_pct + meal_pct, data = CASchool)## ## Residuals:## Min 1Q Median 3Q Max ## -32.849 -5.151 -0.308 5.243 31.501 ## ## Coefficients:## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 700.14996 4.68569 149.423 < 2e-16 ***## str -0.99831 0.23875 -4.181 3.54e-05 ***## el_pct -0.12157 0.03232 -3.762 0.000193 ***## meal_pct -0.54735 0.02160 -25.341 < 2e-16 ***## ---## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1## ## Residual standard error: 9.08 on 416 degrees of freedom## Multiple R-squared: 0.7745, Adjusted R-squared: 0.7729 ## F-statistic: 476.3 on 3 and 416 DF, p-value: < 2.2e-16
^Test Scorei=700.15−1.00STRi−0.122%ELi−0.547meal%i
^Test Scorei=700.15−1.00STRi−0.122%ELi−0.547meal%i
meal%
causal? ^Test Scorei=700.15−1.00STRi−0.122%ELi−0.547meal%i
Is meal%
causal?
Getting rid of programs in districts where a large percentage of students need them would boost test scores A LOT! (So probably not causal...)
^Test Scorei=700.15−1.00STRi−0.122%ELi−0.547meal%i
Is meal%
causal?
Getting rid of programs in districts where a large percentage of students need them would boost test scores A LOT! (So probably not causal...)
meal%
likely correlated with other things in ui (like outside learning opportunities!).
^Test Scorei=700.15−1.00STRi−0.122%ELi−0.547meal%i
Is meal%
causal?
Getting rid of programs in districts where a large percentage of students need them would boost test scores A LOT! (So probably not causal...)
meal%
likely correlated with other things in ui (like outside learning opportunities!).
meal%
to be unbiased!str
to be unbiased!A control variable is a regressor variable note of interest, but included to hold factors constant that, if omitted, would bias the causal effect of interest
The control variable may still be correlated with omitted causal factors in u
A control variable is a regressor variable note of interest, but included to hold factors constant that, if omitted, would bias the causal effect of interest
The control variable may still be correlated with omitted causal factors in u
Sometimes it is helpful to make a coefficient plot to visualize the different marginal effects
Take broom
and tidy()
your regression with conf.int=TRUE
(and save as an object)
# run regressionreg<-lm(testscr~str+el_pct+meal_pct, data=CASchool)# load broomlibrary(broom)# tidy regression with confidence intervals# save as "reg_tidy"reg_tidy <- tidy(reg, conf.int = TRUE)
# look at itreg_tidy
ABCDEFGHIJ0123456789 |
term <chr> | estimate <dbl> | std.error <dbl> | statistic <dbl> | p.value <dbl> | conf.low <dbl> | conf.high <dbl> |
---|---|---|---|---|---|---|
(Intercept) | 700.1499647 | 4.68568657 | 149.423132 | 0.000000e+00 | 690.9393907 | 709.36053869 |
str | -0.9983092 | 0.23875427 | -4.181325 | 3.535856e-05 | -1.4676244 | -0.52899403 |
el_pct | -0.1215733 | 0.03231728 | -3.761867 | 1.928408e-04 | -0.1850988 | -0.05804777 |
meal_pct | -0.5473456 | 0.02159885 | -25.341424 | 2.302903e-86 | -0.5898020 | -0.50488907 |
# intercept is too big relative to others! let's leave it outreg_tidy_no_int<-reg_tidy %>% filter(term!="(Intercept)")
ggplot(data = reg_tidy_no_int)+ aes(x = estimate, y = term, color = term)+ geom_point()+ # point for estimate # now make "error bars" using conf. int. geom_segment(aes(x = conf.low, xend = conf.high, y=term, yend=term))+ # add line at 0 geom_vline(xintercept=0, size=1, color="black", linetype="dashed")+ scale_x_continuous(breaks=seq(-1.5,0.5,0.5), limits=c(-1.5,0.5))+ labs(x = expression(paste("Marginal effect, ", hat(beta[j]))), y = "Variable")+ guides(color=F)+ # hide legend theme_classic(base_family = "Fira Sans Condensed", base_size=20)
Do NOT just try to maximize R2 or ˉR2
A high R2 or ˉR2 means that the regressors explain variation in Y
A high R2 or ˉR2 does NOT mean you have eliminated omitted variable bias
A high R2 or ˉR2 does NOT mean you have an unbiased estimate of a causal effect
A high R2 or ˉR2 does NOT mean included variables are statistically significant
We care mostly about measuring a causal effect
This involves making the respective ^βj's as unbiased as possible (because βj is the causal effect of Xj→Y!
You will not get high R2's in social science
The big challenge in applied econometrics is choosing how to specify a model to regress
Every dataset is different, every study has a different goal
But here are some helpful tips and frequent problems (and solutions)
Keyboard shortcuts
↑, ←, Pg Up, k | Go to previous slide |
↓, →, Pg Dn, Space, j | Go to next slide |
Home | Go to first slide |
End | Go to last slide |
Number + Return | Go to specific slide |
b / m / f | Toggle blackout / mirrored / fullscreen mode |
c | Clone slideshow |
p | Toggle presenter mode |
t | Restart the presentation timer |
?, h | Toggle this help |
Esc | Back to slideshow |