Set Up

Part A DAG Practice

Example A
Example B
Example C
Example D
Example E

Part B Empirical Example

Question 1
Question 2
Question 3
Question 4
Question 5

Implication 1:
Implication 2:
Implication 3:
Implication 4:

Question 6
Question 7
Question 8
Question 9
Question 10
Question 11

Implication 1:
Implication 2:

Question 12
Question 13
Question 14
Question 15

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in R Studio Cloud that you can use (and not worry about trading room computer limitations), with the data already inside (you will still need to assign it to an object).

Part A DAG Practice

For each of the following DAGs:¹

1. Write out all of the pathways (causal and non-causal) from X (treatment of interest) to Y (outcome of interest).
2. Identify which variable(s) need to be controlled for to estimate the causal effect of X on Y.
3. Write out the regression equation (abstractly) for properly identifying the causal effect, based on step 2.
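
The three steps above can also be checked in R with the dagitty package. The sketch below assumes a hypothetical DAG with a single confounder Z; it is for illustration only and is not necessarily any of Examples A through E.

```r
library(dagitty)

# Hypothetical DAG for illustration only: Z is a common cause of X and Y
g <- dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
}")

# Step 1: list every path between X and Y and whether each one is open
p <- paths(g, from = "X", to = "Y")
p

# Step 2: minimal set(s) of variables that close all back doors
s <- adjustmentSets(g, exposure = "X", outcome = "Y")
s

# Step 3: the matching regression would be Y = b0 + b1*X + b2*Z + u
```

Working the paths out by hand first and using the software only as a check is probably the better way to build intuition.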

Example A

Pathways:

(causal, front door)
(not causal, back door)

needs to be controlled for, since it is opening a backdoor path.

Example B

Pathways:

(causal, front door)
(causal, front door)

Nothing should be controlled for, since the intermediate variable is a mediator and carries part of the effect of X on Y.

Example C

Pathways:

(causal, front door)
(not causal, back door)
(not causal, back door)

Backdoor path 3 is closed by the collider at . Backdoor path 2 remains open, so we need to control for . (If we blocked to close path 2, that would open up backdoor path 3!) Only should be controlled for.

Example D

Pathways:

(causal, front door)
(causal, front door)
(not causal, back door)

Path 2 is a front door path we want to leave open. Backdoor path 3 is closed by the collider at . Nothing needs to be controlled for!

Example E

Pathways:

(causal, front door)
(causal, front door)
(not causal, back door)
(not causal, back door)

Path 2 is a front door path we want to leave open. Backdoor path 3 is closed by the collider at . Backdoor path 4 is closed by the collider at . We don’t want to control for anything!

Part B Empirical Example

Question 1

Install the wooldridge package if you do not already have it installed.² We will use the bwght data from wooldridge.³

Let’s just look at the following variables:

Variable	Description
`bwght`	Birth Weight (ounces)
`cigs`	Cigarettes smoked per day while pregnant (1988)
`motheduc`	Mother’s education (number of years)
`cigprice`	Price of cigarette pack (1988)
`faminc`	Family’s income in $1,000s (1988)

We want to explore how a mother smoking during pregnancy affects the baby’s birthweight (which may have strong effects on outcomes over the child’s life).

Just to be explicit about it, assign this as some dataframe (feel free to change the name), i.e.:

# install.packages("wooldridge")
library(wooldridge)
bwght<-wooldridge::bwght

Question 2

Make a correlation table for our variables listed above.⁴

library(dplyr) # provides %>% and select()
bwght %>%
  select(bwght, cigs, motheduc, cigprice, faminc) %>%
  cor(use = "pairwise.complete.obs")

##                bwght        cigs    motheduc   cigprice      faminc
## bwght     1.00000000 -0.15076180  0.06912704 0.04918790  0.10893684
## cigs     -0.15076180  1.00000000 -0.21386510 0.00970419 -0.17304493
## motheduc  0.06912704 -0.21386510  1.00000000 0.07086587  0.45592970
## cigprice  0.04918790  0.00970419  0.07086587 1.00000000  0.09545581
## faminc    0.10893684 -0.17304493  0.45592970 0.09545581  1.00000000

Question 3

Consider the following causal model:

library(ggdag)
dagify(bwght~cigs+inc,
       cigs~price+educ+inc,
       inc~educ,
       exposure = "cigs",
       outcome = "bwght") %>% 
  ggdag(stylized = FALSE, seed=1)+theme_dag_blank()+theme(legend.position = "none")

Note what we are hypothesizing:

bwght is caused by cigs and inc
cigs are caused by price, educ, and inc
inc is caused by educ

See also how this is written into the notation in R to make the DAG.

Create this model on dagitty.net.

Question 4

What does dagitty tell us are the testable implications of this causal model?

See the middle box on the right on dagitty:

bwght ⊥ price | cigs, inc: birthweight is independent of price, controlling for cigarettes and income
bwght ⊥ educ | cigs, inc: birthweight is independent of education, controlling for cigarettes and income
inc ⊥ price: income is independent of price
price ⊥ educ: cigarette price is independent of education
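
The same list can be produced in R rather than on the website. This is a minimal sketch that assumes the dagitty package is installed and re-enters the DAG from Question 3 by hand, using the DAG's short variable names (inc, educ, price) and not the dataset's column names.

```r
library(dagitty)

# The causal model from Question 3, written in dagitty syntax
g <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Conditional independencies implied by the missing arrows in the DAG
ci <- impliedConditionalIndependencies(g)
ci
```

Each entry corresponds to a pair of variables with no arrow between them, which is why a DAG with more arrows has fewer testable implications.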

Question 5

Test each implication given to you by dagitty.

For independencies X ⊥ Y: run a regression of Y on X.
For conditional independencies X ⊥ Y | Z: run a regression of Y on X and Z.

For each, test against the null hypothesis that the relevant coefficient is 0 (i.e. X and Y are indeed independent).

Does this causal model hold up well?
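
One way to keep these tests tidy is a small helper that returns only the estimate and p-value of the coefficient of interest. The function test_implication() below is a hypothetical name introduced here for illustration; it is not part of the original exercise or of any package.

```r
# Hypothetical helper (not part of the original exercise): return the
# estimate and p-value on one regressor from an lm() fit
test_implication <- function(formula, term, data) {
  fit <- lm(formula, data = data)
  coef(summary(fit))[term, c("Estimate", "Pr(>|t|)")]
}

d <- wooldridge::bwght

# Independence (X _||_ Y): regress Y on X
test_implication(faminc ~ cigprice, "cigprice", d)

# Conditional independence (X _||_ Y | Z): regress Y on X and Z
test_implication(bwght ~ cigprice + cigs + faminc, "cigprice", d)
```

A large p-value means we fail to reject the null of no relationship, which is consistent with the implication but does not prove it.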

Implication 1:

If we run a regression of bwght on cigprice, including cigs and faminc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs and faminc constant):

lm(bwght~cigprice+cigs+faminc, data = bwght) %>% summary()

## 
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.746 -11.516   0.784  13.175 149.956 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.02164    6.88671  15.395  < 2e-16 ***
## cigprice      0.08499    0.05282   1.609   0.1078    
## cigs         -0.46735    0.09156  -5.104 3.78e-07 ***
## faminc        0.08811    0.02931   3.006   0.0027 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.05 on 1384 degrees of freedom
## Multiple R-squared:  0.03162,    Adjusted R-squared:  0.02952 
## F-statistic: 15.06 on 3 and 1384 DF,  p-value: 1.193e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2:

If we run a regression of bwght on motheduc, including cigs and faminc as controls, there should not be a statistically significant coefficient on motheduc (i.e. there is no relationship between motheduc and bwght holding cigs and faminc constant):

lm(bwght~motheduc+cigs+faminc, data = bwght) %>% summary()

## 
## Call:
## lm(formula = bwght ~ motheduc + cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.064 -11.585   0.668  13.154 150.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
## motheduc      0.01426    0.25799   0.055   0.9559    
## cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
## faminc        0.09147    0.03246   2.818   0.0049 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 1383 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02977,    Adjusted R-squared:  0.02767 
## F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

The coefficient on motheduc is small and not statistically significant. This implication holds up well.

Implication 3:

The model implies simply that there is no significant correlation between faminc and cigprice.

bwght %>%
  select(faminc, cigprice) %>%
  cor()

##              faminc   cigprice
## faminc   1.00000000 0.09545581
## cigprice 0.09545581 1.00000000

There is a fairly weak correlation. This implication mostly holds up.

Implication 4:

The model implies simply that there is no significant correlation between cigprice and motheduc.

bwght %>%
  select(cigprice, motheduc) %>%
  cor(use="pairwise.complete.obs")

##            cigprice   motheduc
## cigprice 1.00000000 0.07086587
## motheduc 0.07086587 1.00000000

This is an even weaker correlation. This implication seems to hold up.
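
The correlation tables above show magnitude only. In a sample of this size (about 1,388 births), even a small correlation can be statistically distinguishable from zero, so it may be worth checking the p-value as well. A minimal sketch using base R's cor.test(), with d as an assumed name for the data:

```r
d <- wooldridge::bwght

# Implication 3: faminc vs. cigprice
ct3 <- cor.test(d$faminc, d$cigprice)
ct3

# Implication 4: cigprice vs. motheduc (incomplete pairs are dropped)
ct4 <- cor.test(d$cigprice, d$motheduc)
ct4
```

This is equivalent to the t-test on the slope in a simple regression of one variable on the other.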

Question 6

List all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?

cigs → bwght (causal, front door)
cigs ← inc → bwght (non-causal, back door)
cigs ← educ → inc → bwght (non-causal, back door)

There are no colliders on any path.

Question 7

What is the minimal sufficient set of variables we need to control in order to causally identify the effect of cigs on bwght?

We simply need to control for faminc. This blocks the back door for both path 2 and path 3.
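
The answers to Questions 6 and 7 can be double-checked with the dagitty R package. The sketch below assumes the package is installed and re-enters the Question 3 DAG with its short variable names.

```r
library(dagitty)

# The causal model from Question 3
g <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Question 6: every path between cigs and bwght, and whether it is open
p <- paths(g, from = "cigs", to = "bwght")
p

# Question 7: minimal sufficient adjustment set(s)
s <- adjustmentSets(g, exposure = "cigs", outcome = "bwght")
s
```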

Question 8

Estimate the causal effect by running the appropriate regression.⁵

We need to control only for faminc, so we put it into the regression to estimate:

lm(bwght~cigs+faminc, data = bwght) %>% summary()

## 
## Call:
## lm(formula = bwght ~ cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.061 -11.543   0.638  13.126 150.083 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.97413    1.04898 111.512  < 2e-16 ***
## cigs         -0.46341    0.09158  -5.060 4.75e-07 ***
## faminc        0.09276    0.02919   3.178  0.00151 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.06 on 1385 degrees of freedom
## Multiple R-squared:  0.0298, Adjusted R-squared:  0.0284 
## F-statistic: 21.27 on 2 and 1385 DF,  p-value: 7.942e-10

Controlling for income, each additional cigarette smoked per day while pregnant decreases birthweight by about 0.46 ounces.
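
To get a feel for the size and precision of this estimate, we can look at a confidence interval and scale the coefficient up. The sketch below assumes a pack contains 20 cigarettes, which is an assumption for illustration and not something stated in the data.

```r
fit <- lm(bwght ~ cigs + faminc, data = wooldridge::bwght)

# 95% confidence interval for the effect of one more cigarette per day
confint(fit, "cigs", level = 0.95)

# Implied effect of a pack a day, assuming 20 cigarettes per pack
# (roughly -9.3 ounces, given the coefficient of about -0.463)
pack_effect <- coef(fit)[["cigs"]] * 20
pack_effect
```

The scaling treats the effect as linear in the number of cigarettes, which is what this regression specification assumes.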

Question 9

We saw some effect between faminc and cigprice. Perhaps there are unobserved factors (such as the economy’s performance) that affect both. Add an unobserved factor u1 to your dagitty model.⁶

Question 10

Perhaps our model is poorly specified. Maybe motheduc actually has a causal effect on bwght? Tweak your model from Question 9 on dagitty to add this potential relationship. What testable implications does this new model imply?

See the middle box on the right on dagitty:

bwght ⊥ price | cigs, inc, educ: birthweight is independent of price, controlling for cigarettes, income, and education
price ⊥ educ: cigarette price is independent of education

Question 11

Test these implications appropriately, like you did in Question 5. Does this model hold up well?

Implication 1:

If we run a regression of bwght on cigprice, including cigs, faminc, and motheduc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs, faminc, and motheduc constant):

lm(bwght~cigprice+cigs+faminc+motheduc, data = bwght) %>% summary()

## 
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc + motheduc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.760 -11.519   0.803  13.175 149.954 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.061e+02  7.407e+00  14.321  < 2e-16 ***
## cigprice     8.478e-02  5.288e-02   1.603  0.10912    
## cigs        -4.681e-01  9.274e-02  -5.047 5.08e-07 ***
## faminc       8.765e-02  3.253e-02   2.694  0.00714 ** 
## motheduc    -4.448e-04  2.580e-01  -0.002  0.99862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.06 on 1382 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03158,    Adjusted R-squared:  0.02877 
## F-statistic: 11.26 on 4 and 1382 DF,  p-value: 5.368e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2:

This is the same as implication 4 from question 5. Again, this holds up reasonably well.

Question 12

Under this new causal model, list all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?

(causal, front door)
(non-causal, back door)
(non-causal, back door)
(non-causal, back door)
(non-causal, back door)

There are no colliders on any path.

Question 13

Under this new causal model, what is the minimal sufficient set of variables we need to control in order to causally identify the effect of cigs on bwght?

We need to control for faminc and motheduc. This blocks the back door for paths 2, 3, 4, and 5.
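
As with the first model, the revised model can be checked in R. This sketch assumes the dagitty package is installed, uses the DAG's short variable names, and marks u1 as unobserved with latents(). Note that paths() lists every path it finds, including any that run through price and u1, and reports whether each one is open.

```r
library(dagitty)

# Revised model: educ -> bwght added, plus u1 affecting inc and price
g2 <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  educ -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
  u1 -> inc
  u1 -> price
}")
latents(g2) <- "u1" # u1 is unobserved, so it cannot be controlled for

# Question 10: testable implications among the observed variables
impliedConditionalIndependencies(g2)

# Question 12: all paths between cigs and bwght
paths(g2, from = "cigs", to = "bwght")

# Question 13: minimal sufficient adjustment set(s)
s2 <- adjustmentSets(g2, exposure = "cigs", outcome = "bwght")
s2
```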

Question 14

Estimate the causal effect in this new model by running the appropriate regression. Compare your answers to those in question 8.

We need to control for faminc and motheduc, so we put them into the regression to estimate:

lm(bwght~cigs+faminc+motheduc, data = bwght) %>% summary()

## 
## Call:
## lm(formula = bwght ~ cigs + faminc + motheduc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.064 -11.585   0.668  13.154 150.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
## cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
## faminc        0.09147    0.03246   2.818   0.0049 ** 
## motheduc      0.01426    0.25799   0.055   0.9559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 1383 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02977,    Adjusted R-squared:  0.02767 
## F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

Controlling for income and education, each additional cigarette smoked per day while pregnant decreases birthweight by about 0.46 ounces. It turns out there was no noticeable difference when we included education!
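
For a direct comparison with Question 8, the two estimates can be placed side by side. A minimal sketch; the object names m_q8 and m_q14 are assumptions introduced here.

```r
d <- wooldridge::bwght

m_q8  <- lm(bwght ~ cigs + faminc, data = d)            # Question 8
m_q14 <- lm(bwght ~ cigs + faminc + motheduc, data = d) # Question 14

# Coefficient on cigs under each model
est <- c(q8 = coef(m_q8)[["cigs"]], q14 = coef(m_q14)[["cigs"]])
est
```

Keep in mind that the second model drops the one observation with a missing motheduc, so the two samples are not quite identical.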

Question 15

Try out drawing this model using the ggdag package in R. See my DAG in question 3 for an example.

library(ggdag)
dagify(bwght~cigs+inc+educ,
       cigs~price+educ+inc,
       inc~educ+u1,
       price~u1,
       exposure = "cigs",
       outcome = "bwght") %>% 
  ggdag(stylized = FALSE, seed=1)+theme_dag_blank()+theme(legend.position = "none")


¹ You can use dagitty.net to help you, but you should start trying to recognize these on your own!
² This package contains datasets used in Jeffrey Wooldridge’s Introductory Econometrics: A Modern Approach (the textbook that I used in my econometrics classes years ago!)
³ Which comes from the 1988 National Health Interview Survey, used in J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79: 586-593.
⁴ Hint: select() these variables and then pipe them into cor(., use="pairwise.complete.obs") to use only observations for which there are data on each variable (to avoid NAs).
⁵ Note: on dagitty, you can change a variable on the diagram to be “adjusted” (controlled for) by double-clicking it and then hitting the A key.
⁶ Note: on dagitty, you can make a variable “unobserved” by double-clicking it and then hitting the U key.

3.5: Causal Inference and DAGs - R Practice (Answers)

Ryan Safner

2019-11-12
