The statistics profession is obstinate that we cannot say anything about causality
But you have to! It's how the human brain works!
We can't conceive of (spurious) correlation without some causation
Angus Deaton
Economics Nobel 2015
The RCT is a useful tool, but I think it is a mistake to put method ahead of substance. I have written papers using RCTs. Like other methods of investigation, they are often useful, and, like other methods, they have dangers and drawbacks. Methodological prejudice can only tie our hands... It is not true that an RCT, when feasible, will always do better than an observational study. This should not be controversial [but] might still make some uncomfortable, particularly the second [statement]: (a) RCTs are affected by the same problems of inference and estimation that economists have faced using other methods, and (b) no RCT can ever legitimately claim to have established causality. My theme is that RCTs have no special status, they have no exemption from the problems of inference that econometricians have always wrestled with, and there is nothing that they, and only they, can accomplish.
Deaton, Angus, 2019, "Randomization in the Tropics Revisited: A Theme and Eleven Variations," Working Paper
Lant Pritchett
People keep saying that the recent Nobelists "studied global poverty." This is exactly wrong. They made a commitment to a method, not a subject, and their commitment to method prevented them from studying global poverty.
At a conference at Brookings in 2008, Paul Romer (last year's Nobelist) said: "You guys are like going to a doctor who says you have an allergy and you have cancer. With the skin rash we can divide your skin into areas and test a variety of substances and identify with precision and some certainty the cause. Cancer we have some ideas how to treat it but there are a variety of approaches and since we cannot be sure and precise about which is best for you, we will ignore the cancer and not treat it."
Angus Deaton
Economics Nobel 2015
"Lant Pritchett is so fun to listen to, sometimes you could forget that he is completely full of shit."
Programs randomly assign treatment to different individuals and measure the causal effect of treatment
RAND Health Insurance Study
Oregon Medicaid Expansion
HUD's Moving to Opportunity
Tennessee STAR
Source: British Medical Journal
Correlation:
Causation:
| John | Maria |
|---|---|
| $Y^0_J = 3$ | $Y^0_M = 5$ |
| $Y^1_J = 4$ | $Y^1_M = 5$ |
| $Y^1_J - Y^0_J = 1$ | $Y^1_M - Y^0_M = 0$ |
| $Y_J = Y^1_J = 4$ | $Y_M = Y^0_M = 5$ |
Recall example from class 1.2
John will choose to buy health insurance
Maria will choose to not buy health insurance
Health insurance improves John's score by 1, has no effect on Maria's score
Note, all we can observe in the data are their health outcomes after they have chosen (not) to buy health insurance: $Y_J = 4$ and $Y_M = 5$
Observed difference between John and Maria: $Y_J - Y_M = -1$
| John | Maria |
|---|---|
| $Y_J = 4$ | $Y_M = 5$ |
This is all the data we actually observe
Observed difference between John and Maria: $Y_J - Y_M = Y^1_J - Y^0_M = -1$
Recall: the observed outcomes are $Y_J = Y^1_J = 4$ and $Y_M = Y^0_M = 5$
We don't see the counterfactuals: $Y^0_J = 3$ or $Y^1_M = 5$
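To make the bookkeeping concrete, here is the example worked in R (a small sketch; the numbers come straight from the table above):

```r
# John and Maria's potential outcomes (numbers from the table above)
Y0 <- c(John = 3, Maria = 5)  # health score without insurance
Y1 <- c(John = 4, Maria = 5)  # health score with insurance

Y1 - Y0  # individual treatment effects: John = 1, Maria = 0

# But John buys insurance and Maria doesn't, so we only observe:
Y_obs <- c(John = Y1[["John"]], Maria = Y0[["Maria"]])
Y_obs[["John"]] - Y_obs[["Maria"]]  # -1: not anyone's treatment effect!
```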
We will seek to understand what causality is and how we can approach finding it
We will also explore the different common research designs meant to identify causal relationships
These skills, more than supply & demand, constrained optimization models, IS-LM, etc., are the tools and comparative advantage of a modern research economist
Example
If X is a light switch, and Y is a light: flipping the switch (X) causes the light (Y) to turn on or off
If person i who received treatment had not received the treatment, we can predict what his outcome would have been
If person j who did not receive treatment had received treatment, we can predict what her outcome would have been
A surprisingly simple, yet rigorous and powerful method of modeling is using a causal diagram or Directed Acyclic Graph (DAG)
A simple series of nodes (variables) connected by arrows that indicate causal effects
Causal diagram (DAG) is the model of what you think the data-generating process is
Useful to help figure out how to identify particular relationships of interest
Requires some common sense/economic intuition
Remember, all models are wrong, we just need them to be useful!
Suppose we have data on three variables:

- `IP`: how much a firm spends on IP lawsuits
- `tech`: whether a firm is in the tech industry
- `profit`: firm profits

They are all correlated with each other, but what are the causal relationships?
We need our own causal model (from theory, intuition, etc.) to sort them out
Consider all the variables likely to be important to the data-generating process (including variables we can't observe!)
For simplicity, combine some similar ones together or prune those that aren't very important
Consider which variables are likely to affect others, and draw arrows connecting them
Test some testable implications of the model (to see if we have a correct one!)
Drawing an arrow requires a direction - making a statement about causality!
Omitting an arrow makes an equally important statement too!
If two variables are correlated, but neither causes the other, it's likely they are both caused by another (perhaps unobserved) variable - add it!
There should be no cycles or loops (if so, there's probably another missing variable, such as time)
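To make these steps concrete, here is a minimal ggdag sketch of one possible model for the three-variable example above (the arrows encode one illustrative assumption, not the "right" answer):

```r
library(ggdag)

# One hypothesized data-generating process (an assumption for illustration):
# being a tech firm raises both IP-lawsuit spending and profits,
# and IP spending itself also affects profits
dagify(profit ~ tech + IP,
       IP ~ tech) %>%
  ggdag() +
  theme_dag()
```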
Example: what is the effect of education on wages?
Education ("treatment" or "exposure")
Wages ("outcome" or "response")
In social science and complex systems, thousands of variables could plausibly be in the DAG!
So simplify:
Background, year of birth, location, and compulsory schooling all cause education
Background, year of birth, location, and job connections probably cause wages
Job connections, in fact, are probably caused by education!
Location and background probably both caused by unobserved factor (u1)
This is messy, but we have a causal model!
Makes our assumptions explicit, and many of them are testable
The DAG suggests certain relationships that will not exist: all paths between `laws` and `conx` go through `educ`
So if we control for `educ`, then cor(laws, conx) should be zero!
Dagitty.net is a great tool to make these and give you testable implications
Click Model -> New Model
Name your "exposure" variable (X of interest) and "outcome" variable (Y)
Click and drag to move nodes around
Add a new variable by double-clicking
Add an arrow by double-clicking one variable and then double-clicking on the target (do again to remove arrow)
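The same model can also be checked programmatically with the dagitty R package; a sketch (the DAG string below encodes the model from the previous slides):

```r
library(dagitty)

# Encode the education-wages model as a dagitty DAG
g <- dagitty("dag {
  background -> educ ; background -> wage
  year -> educ       ; year -> wage
  location -> educ   ; location -> wage
  laws -> educ
  educ -> job_connections -> wage
  educ -> wage
  u1 -> background ; u1 -> location
}")

adjustmentSets(g, exposure = "educ", outcome = "wage")  # what to control for
impliedConditionalIndependencies(g)                     # testable implications
```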
Minimal sufficient adjustment sets for estimating the total effect of educ on wage: {background, location, year}
Our DAG also implies a testable relationship: `job_connections` ⊥ `year` | `educ`
Conditional on `educ`, there should be no correlation between `job_connections` and `year` — we can test this with data (a sketch below)!
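Here is what that test might look like with simulated data consistent with the DAG (all coefficients are assumed for illustration):

```r
# The DAG says job_connections should be independent of year, given educ
set.seed(42)
n <- 10000
year <- rnorm(n)
educ <- 0.5 * year + rnorm(n)              # year of birth affects education
job_connections <- 0.7 * educ + rnorm(n)   # connections caused only by educ

coef(lm(job_connections ~ year))["year"]         # nonzero unconditionally
coef(lm(job_connections ~ year + educ))["year"]  # ~0 once we control for educ
```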
In order to identify the effect of education on wages, our model implies we need to control for background, location, and year
How to control for them?
$$wages = \beta_0 + \beta_1 \, educ + \beta_2 \, background + \beta_3 \, location + \beta_4 \, year$$
By controlling for `background`, `location`, and `year`, we can identify the causal effect of `educ` → `wage`.
In `ggdag`, `Y ~ X` means "`Y` is caused by `X`":

```r
library(ggdag)

dagify(wage ~ educ + conx + year + bckg + loc,
       educ ~ bckg + year + loc + laws,
       conx ~ educ,
       bckg ~ u1,
       loc ~ u1) %>%
  ggdag() +
  theme_dag()
```
Typical notation:
X is the independent variable of interest
Y is the dependent/"response" variable
Other variables use other letters
Arrows indicate causal effect (in proper direction)
Two types of causal effect:
Direct effects: X→Y
Indirect effects: X→M→Y (through a mediator M)
You of course might have both!
Z is a confounder of X→Y, Z affects both X and Y
cor(X,Y) is made up of two parts: the causal effect (X→Y) and the spurious correlation through Z (X←Z→Y)
Failing to control for Z will bias our estimate of the causal effect of X→Y!
Confounders are the DAG-equivalent of omitted variable bias
A biased regression $Y = \beta_0 + \beta_1 X$ leaving out $Z$:
$\hat{\beta}_1$ picks up both:
A causal "front-door" path: X→Y
A non-causal "back-door" path¹: X←Z→Y
¹ Regardless of the directions of the arrows!
If we can control for Z, we can block the back-door path X←Z→Y
This would only leave the front-door, X→Y
How to "control for" Z? We want to remove its influence over determining the values of X and Y
Multiple methods exist, but we've done this with multivariate regression
At the simplest level, consider only looking at values of X and Y for all observations that have the same value of Z
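A minimal simulation sketch of this logic (all coefficients are assumed for illustration; the true effect of X on Y is set to 2):

```r
# Z is a confounder: it causes both X and Y
set.seed(42)
n <- 10000
Z <- rnorm(n)
X <- 0.5 * Z + rnorm(n)        # back-door: Z -> X
Y <- 2 * X + 3 * Z + rnorm(n)  # front-door X -> Y, plus Z -> Y

coef(lm(Y ~ X))["X"]      # biased: front-door AND back-door mixed together
coef(lm(Y ~ X + Z))["X"]  # ~2: controlling for Z blocks the back-door
```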
Controlling for a single variable along a path, no matter how long the path, is sufficient to block that path!
Causal path: X→Y
Backdoor path: X←A→B→C→Y
It is sufficient to block this backdoor by controlling for either A, B, or C!
To achieve proper identification of the causal effect:
The "back-door criterion": control for the minimal amount of variables sufficient to ensure that no back-door exists between X and Y
Example:
X←A→B→Y (back-door)
Need only control for A or B to block the one back-door path
Example:
X←A→B←C→Y (back-door, but blocked by collider!)
B is a collider because both A and C point to it — colliders automatically block a path, but controlling for a collider opens it!
Example:
Hos: being in the hospital
Both Flu and Bus send you to Hos (arrows)
Knowing Flu doesn't tell us about Bus (no arrow)
Conditional on being in Hos, negative association between Flu and Bus (spurious!)
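A quick simulation sketch of the hospital example (the probabilities are assumed for illustration):

```r
# Flu and bus accidents are independent, but either lands you in the hospital
set.seed(42)
n <- 100000
flu <- rbinom(n, 1, 0.10)
bus <- rbinom(n, 1, 0.05)
hos <- as.integer(flu == 1 | bus == 1)  # hospital is a collider

cor(flu, bus)                      # ~0: independent in the population
cor(flu[hos == 1], bus[hos == 1])  # negative: spurious, conditional on hos
```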
Example: How much of wage disparities are caused by gender-based discrimination?
Should we control for `occup`?
If we control for occupation, we open the collider path, creating a spurious cor(abil, discr) where there wasn't one!
Can't identify the causal effect controlling for `occup` alone!
Would need to control for both `occup` and `abil` (but perhaps `abil` is unobserved)
Example:
X→M→Y (front-door, through mediator M)
X←B→Y (back-door)
Should we control for M?
If we control for M, this would block the front-door!
If we can estimate X→M and M→Y (note: no back-doors to either of these!), we can estimate X→Y
The tobacco industry claimed that cor(smoking, cancer) could be spurious due to a confounding `gene` that affects both!
`gene` is unobservable
Suppose smoking causes `tar` buildup in lungs, which causes cancer
We should not control for `tar`, it's on the front-door path!
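In a toy linear model (all coefficients assumed for illustration), we can check the front-door logic by simulation: estimate smoking→tar and tar→cancer separately, then multiply:

```r
# Front-door sketch: gene is unobserved; smoking -> tar -> cancer
set.seed(42)
n <- 100000
gene <- rnorm(n)                           # unobservable confounder
smoking <- gene + rnorm(n)
tar <- 0.8 * smoking + rnorm(n)            # caused only by smoking
cancer <- 1.5 * tar + 2 * gene + rnorm(n)  # true smoking effect: 0.8 * 1.5 = 1.2

coef(lm(cancer ~ smoking))["smoking"]         # biased by the gene back-door
b1 <- coef(lm(tar ~ smoking))["smoking"]      # smoking -> tar: no open back-door
b2 <- coef(lm(cancer ~ tar + smoking))["tar"] # tar -> cancer: control smoking
unname(b1 * b2)                               # ~1.2: causal effect recovered
```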
Thus, to achieve causal identification, control for the minimal set of variables such that:
No back-door path remains open
No front-door path is closed