To minimize confusion, I suggest creating a new R Project (e.g. `regression_practice`) and storing any data in that folder on your computer.
Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).
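For each of the following DAGs:[1]

- Write out all of the causal pathways from X (treatment of interest) to Y (outcome of interest).
- Identify which variable(s) need to be controlled for to estimate the causal effect of X on Y.

Z needs to be controlled for, since it opens a backdoor path.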
$$Y = \beta_0 + \beta_1 X + \beta_2 Z$$
Nothing should be controlled for, since M is a mediator and therefore carries part of the effect of X on Y.
$$Y = \beta_0 + \beta_1 X$$
Backdoor path 3 is closed by the collider at Z. Backdoor path 2 remains open, so we need to control for A. (If we blocked Z to close path 2, that would open up backdoor path 3!) Only A should be controlled for.
$$Y = \beta_0 + \beta_1 X + \beta_2 A$$
Path 2 is a front door path we want to leave open. Backdoor path 3 is closed by the collider at Z. Nothing needs to be controlled for!
$$Y = \beta_0 + \beta_1 X$$
Path 2 is a front door path we want to leave open. Backdoor path 3 is closed by the collider at Z. Backdoor path 4 is closed by the collider at Z. We don’t want to control for anything!
$$Y = \beta_0 + \beta_1 X$$
Install the `wooldridge` package if you do not already have it installed.[2] We will use the `bwght` data from `wooldridge`.[3]
Let’s just look at the following variables:
| Variable | Description |
|---|---|
| `bwght` | Birth weight (ounces) |
| `cigs` | Cigarettes smoked per day while pregnant (1988) |
| `motheduc` | Mother’s education (number of years) |
| `cigprice` | Price of cigarette pack (1988) |
| `faminc` | Family’s income in $1,000s (1988) |
We want to explore how a mother smoking during pregnancy affects the baby’s birthweight (which may have strong effects on outcomes over the child’s life).
Just to be explicit about it, assign this as some dataframe (feel free to change the name), i.e.:
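A minimal sketch of that assignment, assuming the `wooldridge` package is installed. The object name `bwght` is chosen to match the `data = bwght` argument that appears in the regression output in this document; any other name would work just as well.

```r
# install.packages("wooldridge")  # run once if the package is not installed
library(wooldridge)

# Copy the packaged dataset into an object in our environment
bwght <- wooldridge::bwght

# Quick check of the size and of the variables we care about
dim(bwght)
head(bwght[, c("bwght", "cigs", "motheduc", "cigprice", "faminc")])
```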
Make a correlation table for our variables listed above.[4]
## bwght cigs motheduc cigprice faminc
## bwght 1.00000000 -0.15076180 0.06912704 0.04918790 0.10893684
## cigs -0.15076180 1.00000000 -0.21386510 0.00970419 -0.17304493
## motheduc 0.06912704 -0.21386510 1.00000000 0.07086587 0.45592970
## cigprice 0.04918790 0.00970419 0.07086587 1.00000000 0.09545581
## faminc 0.10893684 -0.17304493 0.45592970 0.09545581 1.00000000
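One way the table above could be produced, following the hint about `select()` and `cor()`. This is a sketch that assumes `dplyr` is installed and that the data are stored in an object named `bwght`; the name `cor_table` is only illustrative.

```r
library(dplyr)

bwght <- wooldridge::bwght

# Keep only the five variables of interest, then correlate every pair,
# using all observations available for each pair (motheduc has a missing value)
cor_table <- bwght %>%
  select(bwght, cigs, motheduc, cigprice, faminc) %>%
  cor(use = "pairwise.complete.obs")

cor_table
```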
Consider the following causal model:
Note what we are hypothesizing:
- `bwght` is caused by `cigs` and `inc`
- `cigs` are caused by `price`, `educ`, and `inc`
- `inc` is caused by `educ`
See also how this is written into the notation in R to make the DAG.
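One way this DAG could be written in R, assuming the `ggdag` package is installed. The node names follow the hypotheses listed above; the object names `dag` and `dag_plot` are illustrative, and the author's own code may differ.

```r
library(ggdag)

# Each formula reads "left-hand side is caused by the right-hand side"
dag <- dagify(
  bwght ~ cigs + inc,
  cigs ~ price + educ + inc,
  inc ~ educ,
  exposure = "cigs",   # treatment of interest
  outcome = "bwght"    # outcome of interest
)

# Build the plot; print dag_plot to draw it
dag_plot <- ggdag(dag) + theme_dag()
```

Node positions are chosen automatically here, so the layout will likely not match a hand-drawn version of the same graph.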
Create this model on dagitty.net.
What does `dagitty` tell us are the testable implications of this causal model?
See the middle box on the right on dagitty:
- $\text{bwght} \perp \text{price} \mid \text{cigs}, \text{inc}$: birthweight is independent of price, controlling for cigarettes and income
- $\text{bwght} \perp \text{educ} \mid \text{cigs}, \text{inc}$: birthweight is independent of education, controlling for cigarettes and income
- $\text{inc} \perp \text{price}$: income is independent of price
- $\text{price} \perp \text{educ}$: cigarette price is independent of education
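The same implications can also be listed from R rather than read off the website. This is a sketch assuming the `dagitty` R package is installed; the object names `g` and `imp` are illustrative.

```r
library(dagitty)

# The causal model, written in dagitty's own syntax
g <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# One conditional independence statement per missing edge in the graph
imp <- impliedConditionalIndependencies(g)
imp
```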
Test each implication given to you by dagitty.
For each, test against the null hypothesis that the relevant coefficient ($\hat{\beta}_1$) is 0 (i.e. $x$ and $y$ are indeed independent).
Does this causal model hold up well?
If we run a regression of `bwght` on `cigprice`, including `cigs` and `faminc` as controls, there should not be a statistically significant coefficient on `cigprice` (i.e. there is no relationship between `cigprice` and `bwght`, holding `cigs` and `faminc` constant):
##
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.746 -11.516 0.784 13.175 149.956
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 106.02164 6.88671 15.395 < 2e-16 ***
## cigprice 0.08499 0.05282 1.609 0.1078
## cigs -0.46735 0.09156 -5.104 3.78e-07 ***
## faminc 0.08811 0.02931 3.006 0.0027 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.05 on 1384 degrees of freedom
## Multiple R-squared: 0.03162, Adjusted R-squared: 0.02952
## F-statistic: 15.06 on 3 and 1384 DF, p-value: 1.193e-09
The coefficient on `cigprice` is small and not statistically significant. This implication holds up well.
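A sketch of the regression behind this test, using the formula shown in the `Call:` line of the output. It assumes the data are stored as `bwght`; the names `imp1` and `p_cigprice` are illustrative.

```r
bwght <- wooldridge::bwght

# Implication 1: bwght should be unrelated to cigprice once cigs and faminc are held constant
imp1 <- lm(bwght ~ cigprice + cigs + faminc, data = bwght)
summary(imp1)

# Pull out just the p-value for the coefficient we are testing
p_cigprice <- summary(imp1)$coefficients["cigprice", "Pr(>|t|)"]
p_cigprice
```

The second implication can be checked the same way by swapping `motheduc` in for `cigprice`.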
If we run a regression of `bwght` on `motheduc`, including `cigs` and `faminc` as controls, there should not be a statistically significant coefficient on `motheduc` (i.e. there is no relationship between `motheduc` and `bwght`, holding `cigs` and `faminc` constant):
##
## Call:
## lm(formula = bwght ~ motheduc + cigs + faminc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.064 -11.585 0.668 13.154 150.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.83485 3.13778 37.235 < 2e-16 ***
## motheduc 0.01426 0.25799 0.055 0.9559
## cigs -0.46335 0.09275 -4.996 6.61e-07 ***
## faminc 0.09147 0.03246 2.818 0.0049 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 1383 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02977, Adjusted R-squared: 0.02767
## F-statistic: 14.15 on 3 and 1383 DF, p-value: 4.385e-09
The coefficient on `motheduc` is very small and not statistically significant (p ≈ 0.96). This implication holds up well.
The model implies simply that there is no significant correlation between `faminc` and `cigprice`:
## faminc cigprice
## faminc 1.00000000 0.09545581
## cigprice 0.09545581 1.00000000
The correlation is fairly weak (about 0.10). This implication mostly holds up.
The model implies simply that there is no significant correlation between `cigprice` and `motheduc`:
## cigprice motheduc
## cigprice 1.00000000 0.07086587
## motheduc 0.07086587 1.00000000
This is an even weaker correlation. This implication seems to hold up.
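To go beyond eyeballing the two correlations, each can be tested formally against the null of zero correlation. The sketch below uses base R's `cor.test()`, which gives the same t-statistic as testing the slope in a bivariate regression of one variable on the other. The object names are illustrative.

```r
bwght <- wooldridge::bwght

# Implication 3: income should be uncorrelated with cigarette price
test_inc_price <- cor.test(bwght$faminc, bwght$cigprice)
test_inc_price

# Implication 4: cigarette price should be uncorrelated with mother's education
# (pairs with a missing value are dropped)
test_price_educ <- cor.test(bwght$cigprice, bwght$motheduc)
test_price_educ
```

With well over a thousand observations, even a weak correlation can be statistically distinguishable from zero, so it is worth reading the p-value alongside the size of the correlation.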
List all of the possible pathways from `cigs` to `bwght`. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?
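1. `cigs` → `bwght` (front door)
2. `cigs` ← `inc` → `bwght` (back door)
3. `cigs` ← `educ` → `inc` → `bwght` (back door)

There are no colliders on any path.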
What is the minimal sufficient set of variables we need to control in order to causally identify the effect of `cigs` on `bwght`?
We simply need to control for `faminc`. This blocks the back door for both path 2 and path 3.
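The paths and the adjustment set can be cross-checked in R. This is a sketch assuming the `dagitty` R package; `g`, `p`, and `adj` are illustrative names, and `inc` in the DAG corresponds to `faminc` in the data.

```r
library(dagitty)

g <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Every path between treatment and outcome, and whether each is open
# when nothing is controlled for
p <- paths(g, from = "cigs", to = "bwght")
data.frame(path = p$paths, open = p$open)

# Minimal set(s) of controls that close all back-door paths
adj <- adjustmentSets(g, exposure = "cigs", outcome = "bwght")
adj
```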
Estimate the causal effect by running the appropriate regression.[5]
We need to control only for `faminc`, so we put it into the regression to estimate:
$$\text{bwght}_i = \beta_0 + \beta_1 \text{cigs}_i + \beta_2 \text{faminc}_i$$
##
## Call:
## lm(formula = bwght ~ cigs + faminc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.061 -11.543 0.638 13.126 150.083
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.97413 1.04898 111.512 < 2e-16 ***
## cigs -0.46341 0.09158 -5.060 4.75e-07 ***
## faminc 0.09276 0.02919 3.178 0.00151 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.06 on 1385 degrees of freedom
## Multiple R-squared: 0.0298, Adjusted R-squared: 0.0284
## F-statistic: 21.27 on 2 and 1385 DF, p-value: 7.942e-10
Controlling for income, each additional cigarette smoked per day while pregnant is estimated to lower the baby's birthweight by about 0.46 ounces.
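A sketch of the regression that produces this estimate, with the formula taken from the `Call:` line in the output. The object name `effect_model` is illustrative.

```r
bwght <- wooldridge::bwght

# Effect of cigs on bwght, closing the back-door paths by controlling for faminc
effect_model <- lm(bwght ~ cigs + faminc, data = bwght)
summary(effect_model)

# Point estimate and 95% confidence interval for the coefficient on cigs
coef(effect_model)["cigs"]
confint(effect_model, "cigs")
```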
We saw some effect between `faminc` and `cigprice`. Perhaps there are unobserved factors (such as the economy’s performance) that affect both. Add an unobserved factor `u1` to your `dagitty` model.[6]
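For anyone working in R instead of on the website, an unobserved variable can be declared with `[latent]` in the `dagitty` syntax. The sketch below assumes `u1` points into both `inc` and `price`, as the question describes; the object names are illustrative.

```r
library(dagitty)

g_u <- dagitty("dag {
  u1 [latent]
  u1 -> inc
  u1 -> price
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Confirm which variables are treated as unobserved
latents(g_u)

# The controls needed for the effect of cigs on bwght
adj_u <- adjustmentSets(g_u, exposure = "cigs", outcome = "bwght")
adj_u
```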
Perhaps our model is poorly specified. Maybe `motheduc` actually has a causal effect on `bwght`? Tweak your model from Question 9 on `dagitty` to add this potential relationship. What testable implications does this new model imply?
See the middle box on the right on dagitty:
- $\text{bwght} \perp \text{price} \mid \text{cigs}, \text{inc}, \text{educ}$: birthweight is independent of price, controlling for cigarettes, income, and education
- $\text{price} \perp \text{educ}$: cigarette price is independent of education
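A sketch of the same query in R, assuming the `dagitty` package and assuming the unobserved `u1` from Question 9 is kept in the model (which is consistent with income and price no longer appearing as an implied independence). The names `g_new` and `imp_new` are illustrative.

```r
library(dagitty)

g_new <- dagitty("dag {
  u1 [latent]
  u1 -> inc
  u1 -> price
  cigs -> bwght
  inc -> bwght
  educ -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Testable implications of the revised model
imp_new <- impliedConditionalIndependencies(g_new)
imp_new
```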
Test these implications appropriately, like you did in Question 5. Does this model hold up well?
If we run a regression of `bwght` on `cigprice`, including `cigs`, `faminc`, and `motheduc` as controls, there should not be a statistically significant coefficient on `cigprice` (i.e. there is no relationship between `cigprice` and `bwght`, holding `cigs`, `faminc`, and `motheduc` constant):
##
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc + motheduc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -95.760 -11.519 0.803 13.175 149.954
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.061e+02 7.407e+00 14.321 < 2e-16 ***
## cigprice 8.478e-02 5.288e-02 1.603 0.10912
## cigs -4.681e-01 9.274e-02 -5.047 5.08e-07 ***
## faminc 8.765e-02 3.253e-02 2.694 0.00714 **
## motheduc -4.448e-04 2.580e-01 -0.002 0.99862
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.06 on 1382 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.03158, Adjusted R-squared: 0.02877
## F-statistic: 11.26 on 4 and 1382 DF, p-value: 5.368e-09
The coefficient on `cigprice` is small and not statistically significant. This implication holds up well.
This is the same as implication 4 from question 5. Again, this holds up reasonably well.
Under this new causal model, list all of the possible pathways from `cigs` to `bwght`. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?
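1. `cigs` → `bwght` (front door)
2. `cigs` ← `inc` → `bwght` (back door)
3. `cigs` ← `educ` → `bwght` (back door)
4. `cigs` ← `educ` → `inc` → `bwght` (back door)
5. `cigs` ← `inc` ← `educ` → `bwght` (back door)

There are no colliders on any path.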
Under this new causal model, what is the minimal sufficient set of variables we need to control in order to causally identify the effect of `cigs` on `bwght`?
We need to control for `faminc` and `motheduc`. This blocks the back door for paths 2, 3, 4, and 5.
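A sketch of how this could be checked in R with the `dagitty` package. It assumes the revised model is written among the observed variables only (leaving out `u1`), which matches the five paths referred to above; `g2`, `p2`, and `adj2` are illustrative names.

```r
library(dagitty)

g2 <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  educ -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# All paths between treatment and outcome, and whether each is open
# when nothing is controlled for
p2 <- paths(g2, from = "cigs", to = "bwght")
data.frame(path = p2$paths, open = p2$open)

# Minimal set(s) of controls that close every back-door path
adj2 <- adjustmentSets(g2, exposure = "cigs", outcome = "bwght")
adj2
```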
Estimate the causal effect in this new model by running the appropriate regression. Compare your answers to those in question 8.
We need to control for `faminc` and `motheduc`, so we put them into the regression to estimate:
$$\text{bwght}_i = \beta_0 + \beta_1 \text{cigs}_i + \beta_2 \text{faminc}_i + \beta_3 \text{motheduc}_i$$
##
## Call:
## lm(formula = bwght ~ cigs + faminc + motheduc, data = bwght)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.064 -11.585 0.668 13.154 150.078
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 116.83485 3.13778 37.235 < 2e-16 ***
## cigs -0.46335 0.09275 -4.996 6.61e-07 ***
## faminc 0.09147 0.03246 2.818 0.0049 **
## motheduc 0.01426 0.25799 0.055 0.9559
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.08 on 1383 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.02977, Adjusted R-squared: 0.02767
## F-statistic: 14.15 on 3 and 1383 DF, p-value: 4.385e-09
Controlling for income and education, each additional cigarette smoked per day while pregnant is estimated to lower the baby's birthweight by about 0.46 ounces. It turns out there was no noticeable difference when we included education!
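A sketch that puts the two estimates side by side, using the formulas from the two `Call:` lines. The object names are illustrative. Note that the second model is fit on one fewer observation, since one value of `motheduc` is missing.

```r
bwght <- wooldridge::bwght

# Model from question 8: control for income only
m_inc <- lm(bwght ~ cigs + faminc, data = bwght)

# Model from this question: control for income and education
m_inc_educ <- lm(bwght ~ cigs + faminc + motheduc, data = bwght)

# Coefficient on cigs and sample size for each model
comparison <- rbind(
  income_only     = c(cigs = unname(coef(m_inc)["cigs"]), n = nobs(m_inc)),
  income_and_educ = c(cigs = unname(coef(m_inc_educ)["cigs"]), n = nobs(m_inc_educ))
)
comparison
```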
Try out drawing this model using the `ggdag` package in R. See my DAG in question 3 for an example.
[1] You can use dagitty.net to help you, but you should start trying to recognize these on your own!

[2] This package contains datasets used in Jeffrey Wooldridge’s Introductory Econometrics: A Modern Approach (the textbook that I used in my econometrics classes years ago!)

[3] Which comes from the 1988 National Health Interview Survey, used in J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79: 586-593.

[4] Hints: `select()` these variables and then pipe them into `cor(., use="pairwise.complete.obs")` to use only observations for which there are data on each variable (to avoid `NA`’s).

[5] Note, on `dagitty`, you can change a variable on the diagram to be “adjusted” (controlled for) by double-clicking it and then hitting the `A` key.

[6] Note, on `dagitty`, you can make a variable be “unobserved” by double-clicking it and then hitting the `U` key.