• Set Up
  • Part A DAG Practice
    • Example A
    • Example B
    • Example C
    • Example D
    • Example E
  • Part B Empirical Example
    • Question 1
    • Question 2
    • Question 3
    • Question 4
    • Question 5
      • Implication 1:
      • Implication 2:
      • Implication 3:
      • Implication 4:
    • Question 6
    • Question 7
    • Question 8
    • Question 9
    • Question 10
    • Question 11
      • Implication 1:
      • Implication 2:
    • Question 12
    • Question 13
    • Question 14
    • Question 15

Set Up

To minimize confusion, I suggest creating a new R Project (e.g. regression_practice) and storing any data in that folder on your computer.

Alternatively, I have made a project in RStudio Cloud that you can use (so you need not worry about the limitations of the trading room computers), with the data already inside (you will still need to assign it to an object).

Part A DAG Practice

For each of the following DAGs:[1]

    i. Write out all of the causal pathways from X (treatment of interest) to Y (outcome of interest).
    ii. Identify which variable(s) need to be controlled for to estimate the causal effect of X on Y.
    iii. Write out the regression equation (abstractly) for properly identifying the causal effect, based on part ii.

Example A


i. Pathways:
  1. X → Y (causal, front door)
  2. X ← Z → Y (not causal, back door)
ii. Z needs to be controlled for, since it opens a back-door path.

iii. $Y = \beta_0 + \beta_1 X + \beta_2 Z$
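
This answer can also be checked in R rather than by eye. The following is a minimal sketch using the dagitty package, assuming the edges X → Y, Z → X, and Z → Y (the figure is not reproduced here, so these directions are inferred from the answer above). The object names are only labels for illustration.

```r
# install.packages("dagitty")
library(dagitty)

# Assumed structure for Example A: Z is a common cause of X and Y
example_a <- dagitty("dag {
  X -> Y
  Z -> X
  Z -> Y
}")

# All paths between X and Y, with a flag for whether each is open
example_a_paths <- paths(example_a, from = "X", to = "Y")
example_a_paths

# Minimal set(s) of controls that identify the total effect of X on Y
example_a_sets <- adjustmentSets(example_a, exposure = "X", outcome = "Y")
example_a_sets
```

Under these assumed edges, both paths are open when nothing is controlled for, and the only minimal adjustment set is Z, which matches the regression in part iii.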


Example B


i. Pathways:
  1. X → Y (causal, front door)
  2. X → M → Y (causal, front door)
ii. Nothing should be controlled for, since M is a mediator and therefore part of the effect of X on Y.

iii. $Y = \beta_0 + \beta_1 X$


Example C


i. Pathways:
  1. X → Y (causal, front door)
  2. X ← A → Z → Y (not causal, back door)
  3. X ← A → Z ← B → Y (not causal, back door)
ii. Back-door path 3 is closed by the collider at Z. Back-door path 2 remains open, so we need to control for A. (If we controlled for Z to close path 2, that would open up back-door path 3!) Only A should be controlled for.

iii. $Y = \beta_0 + \beta_1 X + \beta_2 A$
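
The warning about controlling for a collider can be seen directly with dagitty. This is a sketch under assumed edge directions (A → X, A → Z, B → Z, B → Y, Z → Y, X → Y), inferred from the answer above because the figure is not reproduced here. The helper `path_via_b_open()` is a hypothetical name introduced only for this illustration.

```r
library(dagitty)

# Assumed structure for Example C: A -> Z <- B makes Z a collider
example_c <- dagitty("dag {
  X -> Y
  A -> X
  A -> Z
  B -> Z
  B -> Y
  Z -> Y
}")

# Hypothetical helper: is the path that runs through B open,
# given a set of controlled variables?
path_via_b_open <- function(controls) {
  p <- paths(example_c, from = "X", to = "Y", Z = controls)
  p$open[grepl("B", p$paths)]
}

path_via_b_open(list()) # no controls: the collider at Z keeps the path closed
path_via_b_open("Z")    # controlling for the collider opens the path
path_via_b_open("A")    # controlling for A keeps the path closed
```

The design choice here is to query one path at a time, so that the effect of each candidate control on path 3 is visible in isolation.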


Example D


i. Pathways:
  1. X → Y (causal, front door)
  2. X → C → Y (causal, front door)
  3. X ← A → Z ← B → Y (not causal, back door)
ii. Path 2 is a front-door path we want to leave open. Back-door path 3 is closed by the collider at Z. Nothing needs to be controlled for!

iii. $Y = \beta_0 + \beta_1 X$


Example E


i. Pathways:
  1. X → Y (causal, front door)
  2. X → Z → Y (causal, front door)
  3. X → Z ← A → Y (not causal, back door)
  4. XZBAY (not causal, back door)
ii. Path 2 is a front-door path we want to leave open. Paths 3 and 4 are both closed by the collider at Z. We don’t want to control for anything!

iii. $Y = \beta_0 + \beta_1 X$


Part B Empirical Example

Question 1

Install the wooldridge package if you do not already have it installed.[2] We will use the bwght data from wooldridge.[3]

Let’s just look at the following variables:

| Variable | Description |
|----------|-------------|
| bwght | Birth weight (ounces) |
| cigs | Cigarettes smoked per day while pregnant (1988) |
| motheduc | Mother’s education (number of years) |
| cigprice | Price of cigarette pack (1988) |
| faminc | Family’s income in $1,000s (1988) |

We want to explore how a mother smoking during pregnancy affects the baby’s birthweight (which may have strong effects on outcomes over the child’s life).

Just to be explicit about it, assign this as some dataframe (feel free to change the name), i.e.:

# install.packages("wooldridge")
library(wooldridge)
bwght<-wooldridge::bwght

Question 2

Make a correlation table for our variables listed above.[4]


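library(dplyr) # provides the %>% pipe and select()

# Keep only the variables of interest, then correlate every pair
bwght %>%
  select(bwght, cigs, motheduc, cigprice, faminc) %>%
  cor(use = "pairwise.complete.obs")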
##                bwght        cigs    motheduc   cigprice      faminc
## bwght     1.00000000 -0.15076180  0.06912704 0.04918790  0.10893684
## cigs     -0.15076180  1.00000000 -0.21386510 0.00970419 -0.17304493
## motheduc  0.06912704 -0.21386510  1.00000000 0.07086587  0.45592970
## cigprice  0.04918790  0.00970419  0.07086587 1.00000000  0.09545581
## faminc    0.10893684 -0.17304493  0.45592970 0.09545581  1.00000000

Question 3

Consider the following causal model:

library(ggdag)
dagify(bwght~cigs+inc,
       cigs~price+educ+inc,
       inc~educ,
       exposure = "cigs",
       outcome = "bwght") %>% 
  ggdag(stylized = FALSE, seed=1)+theme_dag_blank()+theme(legend.position = "none")

Note what we are hypothesizing:

  • bwght is caused by cigs and inc
  • cigs are caused by price, educ, and inc
  • inc is caused by educ

See also how this is written into the notation in R to make the DAG.

Create this model on dagitty.net.



Question 4

What does dagitty tell us are the testable implications of this causal model?


See the middle box on the right on dagitty:

  1. bwght ⊥ price | cigs, inc: birthweight is independent of price, controlling for cigarettes and income

  2. bwght ⊥ educ | cigs, inc: birthweight is independent of education, controlling for cigarettes and income

  3. inc ⊥ price: income is independent of price

  4. price ⊥ educ: cigarette price is independent of education
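
The same list can be produced in R instead of on the website. This is a minimal sketch with the dagitty package, using the short variable names from the DAG in Question 3 (inc, educ, price); the object names `smoking_dag` and `implications` are illustrative only.

```r
library(dagitty)

# The causal model from Question 3, written in dagitty syntax
smoking_dag <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Each element is one (conditional) independence implied by the model
implications <- impliedConditionalIndependencies(smoking_dag)
implications
```

There is one implication per missing arrow: the model has five variables (ten possible pairs) and six arrows, leaving four pairs that must be independent, possibly after conditioning.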


Question 5

Test each implication given to you by dagitty.

  • For independencies (x ⊥ y): run a regression of y on x.
  • For conditional independencies (x ⊥ y | z, a): run a regression of y on x, z, a.

For each, test against the null hypothesis that the relevant coefficient ($\hat{\beta}_1$) is 0 (i.e. x and y are indeed independent).

Does this causal model hold up well?


Implication 1:

If we run a regression of bwght on cigprice, including cigs and faminc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs and faminc constant):

lm(bwght~cigprice+cigs+faminc, data = bwght) %>% summary()
## 
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.746 -11.516   0.784  13.175 149.956 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 106.02164    6.88671  15.395  < 2e-16 ***
## cigprice      0.08499    0.05282   1.609   0.1078    
## cigs         -0.46735    0.09156  -5.104 3.78e-07 ***
## faminc        0.08811    0.02931   3.006   0.0027 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.05 on 1384 degrees of freedom
## Multiple R-squared:  0.03162,    Adjusted R-squared:  0.02952 
## F-statistic: 15.06 on 3 and 1384 DF,  p-value: 1.193e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2:

If we run a regression of bwght on motheduc, including cigs and faminc as controls, there should not be a statistically significant coefficient on motheduc (i.e. there is no relationship between motheduc and bwght holding cigs and faminc constant):

lm(bwght~motheduc+cigs+faminc, data = bwght) %>% summary()
## 
## Call:
## lm(formula = bwght ~ motheduc + cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.064 -11.585   0.668  13.154 150.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
## motheduc      0.01426    0.25799   0.055   0.9559    
## cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
## faminc        0.09147    0.03246   2.818   0.0049 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 1383 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02977,    Adjusted R-squared:  0.02767 
## F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

The coefficient on motheduc is very small and not statistically significant (p ≈ 0.96). This implication holds up well.

Implication 3:

The model simply implies that there is no significant correlation between faminc and cigprice:

bwght %>%
  select(faminc, cigprice) %>%
  cor()
##              faminc   cigprice
## faminc   1.00000000 0.09545581
## cigprice 0.09545581 1.00000000

There is a fairly weak correlation. This implication mostly holds up.
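
A correlation coefficient on its own does not say whether the relationship is statistically distinguishable from zero. The sketch below runs the regression that the question's instructions call for, alongside the equivalent test on the correlation. It assumes the wooldridge package is installed; `implication_3` is an illustrative object name.

```r
library(wooldridge)

bwght <- wooldridge::bwght

# Implication 3 as a regression: under the model, the slope on cigprice is 0
implication_3 <- lm(faminc ~ cigprice, data = bwght)
summary(implication_3)$coefficients

# The same hypothesis, phrased as a test on the correlation
cor.test(bwght$faminc, bwght$cigprice)
```

With well over a thousand observations, even a correlation near 0.1 can be statistically distinguishable from zero, so the p-value may tell a different story than the size of the correlation suggests.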

Implication 4:

The model simply implies that there is no significant correlation between cigprice and motheduc:

bwght %>%
  select(cigprice, motheduc) %>%
  cor(use="pairwise.complete.obs")
##            cigprice   motheduc
## cigprice 1.00000000 0.07086587
## motheduc 0.07086587 1.00000000

This is an even weaker correlation. This implication seems to hold up.


Question 6

List all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?


  1. cigs → bwght (causal, front door)
  2. cigs ← faminc → bwght (non-causal, back door)
  3. cigs ← motheduc → faminc → bwght (non-causal, back door)

There are no colliders on any path.


Question 7

What is the minimal sufficient set of variables we need to control in order to causally identify the effect of cigs on bwght?


We simply need to control for faminc. This blocks the back door for both path 2 and path 3.
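
This can be confirmed in R as well as on dagitty.net. The sketch below assumes the dagitty package and uses the short names from the DAG in Question 3, where inc stands for faminc, educ for motheduc, and price for cigprice; the object names are illustrative.

```r
library(dagitty)

# The causal model from Question 3
smoking_dag <- dagitty("dag {
  cigs -> bwght
  inc -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
}")

# Minimal sufficient set(s) of controls for the total effect of cigs on bwght
smoking_sets <- adjustmentSets(smoking_dag, exposure = "cigs", outcome = "bwght")
smoking_sets
```

Both back-door paths pass through inc as a non-collider, which is why controlling for that single variable is enough.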


Question 8

Estimate the causal effect by running the appropriate regression.[5]


We need to control only for faminc, so we put it into the regression to estimate:

$\text{bwght}_i = \beta_0 + \beta_1 \text{cigs}_i + \beta_2 \text{faminc}_i$

lm(bwght~cigs+faminc, data = bwght) %>% summary()
## 
## Call:
## lm(formula = bwght ~ cigs + faminc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.061 -11.543   0.638  13.126 150.083 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.97413    1.04898 111.512  < 2e-16 ***
## cigs         -0.46341    0.09158  -5.060 4.75e-07 ***
## faminc        0.09276    0.02919   3.178  0.00151 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.06 on 1385 degrees of freedom
## Multiple R-squared:  0.0298, Adjusted R-squared:  0.0284 
## F-statistic: 21.27 on 2 and 1385 DF,  p-value: 7.942e-10

Controlling for income, each additional cigarette smoked per day while pregnant causes birthweight to decrease by about 0.46 ounces, on average.
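
To get a feel for the size and precision of this estimate, the arithmetic below uses the coefficient and standard error copied from the regression output. It is only a sketch: the 1.96 critical value is the usual normal approximation, and the 20-cigarette scenario is a hypothetical that assumes the effect is linear in the number of cigarettes.

```r
# Estimate and standard error for cigs, copied from the regression output
beta_cigs <- -0.46341
se_cigs <- 0.09158

# Approximate 95% confidence interval (normal critical value of 1.96)
ci_cigs <- beta_cigs + c(-1.96, 1.96) * se_cigs
ci_cigs

# Implied change in birthweight (ounces) for a hypothetical mother smoking
# 20 cigarettes per day, assuming the effect is linear in cigarettes
pack_effect <- 20 * beta_cigs
pack_effect
```

The interval excludes zero, which agrees with the small p-value reported for cigs in the output.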


Question 9

We saw some effect between faminc and cigprice. Perhaps there are unobserved factors (such as the economy’s performance) that affect both. Add an unobserved factor u1 to your dagitty model.[6]



Question 10

Perhaps our model is poorly specified. Maybe motheduc actually has a causal effect on bwght? Tweak your model from Question 9 on dagitty to add this potential relationship. What testable implications does this new model imply?


See the middle box on the right on dagitty:

  1. bwght ⊥ price | cigs, inc, educ: birthweight is independent of price, controlling for cigarettes, income, and education

  2. price ⊥ educ: cigarette price is independent of education


Question 11

Test these implications appropriately, like you did in Question 5. Does this model hold up well?


Implication 1:

If we run a regression of bwght on cigprice, including cigs, faminc, and motheduc as controls, there should not be a statistically significant coefficient on cigprice (i.e. there is no relationship between cigprice and bwght holding cigs, faminc, and motheduc constant):

lm(bwght~cigprice+cigs+faminc+motheduc, data = bwght) %>% summary()
## 
## Call:
## lm(formula = bwght ~ cigprice + cigs + faminc + motheduc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -95.760 -11.519   0.803  13.175 149.954 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.061e+02  7.407e+00  14.321  < 2e-16 ***
## cigprice     8.478e-02  5.288e-02   1.603  0.10912    
## cigs        -4.681e-01  9.274e-02  -5.047 5.08e-07 ***
## faminc       8.765e-02  3.253e-02   2.694  0.00714 ** 
## motheduc    -4.448e-04  2.580e-01  -0.002  0.99862    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.06 on 1382 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.03158,    Adjusted R-squared:  0.02877 
## F-statistic: 11.26 on 4 and 1382 DF,  p-value: 5.368e-09

The coefficient on cigprice is small and not statistically significant. This implication holds up well.

Implication 2:

This is the same as Implication 4 from Question 5. Again, this holds up reasonably well.


Question 12

Under this new causal model, list all of the possible pathways from cigs to bwght. Which are “front-doors” and which are “back-doors?” Are any blocked by colliders?


  1. cigs → bwght (causal, front door)
  2. cigs ← faminc → bwght (non-causal, back door)
  3. cigs ← cigprice ← u1 → faminc → bwght (non-causal, back door)
  4. cigs ← motheduc → bwght (non-causal, back door)
  5. cigs ← motheduc → faminc → bwght (non-causal, back door)

There are no colliders on any path.


Question 13

Under this new causal model, what is the minimal sufficient set of variables we need to control in order to causally identify the effect of cigs on bwght?


We need to control for faminc and motheduc. This blocks the back door for paths 2, 3, 4, and 5.
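
The revised model can be checked the same way in R. This sketch assumes the dagitty package, marks u1 as latent (unobserved) so that it cannot be offered as a control, and uses the short names inc, educ, and price for faminc, motheduc, and cigprice. The object names are illustrative.

```r
library(dagitty)

# Revised model: educ now affects bwght directly, and an
# unobserved factor u1 affects both inc and price
revised_dag <- dagitty("dag {
  u1 [latent]
  cigs -> bwght
  inc -> bwght
  educ -> bwght
  price -> cigs
  educ -> cigs
  inc -> cigs
  educ -> inc
  u1 -> inc
  u1 -> price
}")

# Minimal sufficient set(s) of observed controls
revised_sets <- adjustmentSets(revised_dag, exposure = "cigs", outcome = "bwght")
revised_sets
```

Controlling for inc closes paths 2, 3, and 5, and educ is needed in addition to close path 4.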


Question 14

Estimate the causal effect in this new model by running the appropriate regression. Compare your answers to those in question 8.


We need to control for faminc and motheduc, so we put them into the regression to estimate:

$\text{bwght}_i = \beta_0 + \beta_1 \text{cigs}_i + \beta_2 \text{faminc}_i + \beta_3 \text{motheduc}_i$

lm(bwght~cigs+faminc+motheduc, data = bwght) %>% summary()
## 
## Call:
## lm(formula = bwght ~ cigs + faminc + motheduc, data = bwght)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -96.064 -11.585   0.668  13.154 150.078 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 116.83485    3.13778  37.235  < 2e-16 ***
## cigs         -0.46335    0.09275  -4.996 6.61e-07 ***
## faminc        0.09147    0.03246   2.818   0.0049 ** 
## motheduc      0.01426    0.25799   0.055   0.9559    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 20.08 on 1383 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.02977,    Adjusted R-squared:  0.02767 
## F-statistic: 14.15 on 3 and 1383 DF,  p-value: 4.385e-09

Controlling for income and education, each additional cigarette smoked per day while pregnant causes birthweight to decrease by about 0.46 ounces, on average. It turns out there was no noticeable difference when we included education!


Question 15

Try out drawing this model using the ggdag package in R. See my DAG in question 3 for an example.


library(ggdag)
dagify(bwght~cigs+inc+educ,
       cigs~price+educ+inc,
       inc~educ+u1,
       price~u1,
       exposure = "cigs",
       outcome = "bwght") %>% 
  ggdag(stylized = FALSE, seed=1)+theme_dag_blank()+theme(legend.position = "none")



  1. You can use dagitty.net to help you, but you should start trying to recognize these on your own!

  2. This package contains datasets used in Jeffrey Wooldridge’s Introductory Econometrics: A Modern Approach (the textbook that I used in my econometrics classes years ago!)

  3. Which comes from the 1988 National Health Interview Survey, used in J. Mullahy (1997), “Instrumental-Variable Estimation of Count Data Models: Applications to Models of Cigarette Smoking Behavior,” Review of Economics and Statistics 79: 586-593.

  4. Hint: select() these variables and then pipe them into cor(., use="pairwise.complete.obs") so that each correlation uses only the observations with data on both variables (to avoid NAs).

  5. Note, on dagitty, you can change a variable on the diagram to be “adjusted” (controlled for) by double-clicking it and then hitting the A key.

  6. Note, on dagitty, you can make a variable be “unobserved” by double-clicking it and then hitting the U key.