^β1 ∼ N(E[^β1], σ^β1)

E[^β1]: the center of the distribution (2 classes ago)
σ^β1: how precise is our estimate? (last class)

1 Under the 4 assumptions about u (particularly, cor(X, u) = 0).
Problem for identification: endogeneity
Problem for inference: randomness
OLS estimators (^β0 and ^β1) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Modeled randomness: u includes all factors affecting Y other than X
Sampling randomness: which particular observations end up in our specific sample
Inferential statistics analyzes a sample to make inferences about a much larger (unobservable) population
Population: all possible individuals that match some well-defined criterion of interest (people, firms, cities, etc)
Sample: some portion of the population of interest to represent the whole
Sample → (statistical inference) → Population → (causal identification) → Unobserved Parameters
We want to identify causal relationships between population variables
We'll use sample statistics to infer something about population parameters
Estimation: use our sample data to construct a point estimate of a population parameter and subject it to a hypothesis test
Confidence interval: use our sample data to construct a range for the population parameter
First method is more common, but second is still widely acknowledged
Both will give you similar results
Note: statistical inference is different from causal inference!
We have already used statistics to estimate a relationship between X and Y
We want to test whether these estimates are statistically significant and describe the population
Examples:
Note, we can test a lot of hypotheses about a lot of population parameters, e.g.
We will focus only on hypotheses about the population regression slope (β1), i.e. the causal effect1 of X on Y
1 With a model this simple, it's almost certainly not causal, but this is the ultimate direction we are heading...
A null hypothesis, H0
An alternative hypothesis, Ha
A test statistic to determine if we reject H0 when the statistic reaches a "critical value"
A conclusion whether or not to reject H0 in favor of Ha
Any sample statistic (e.g. ^β1) will rarely be exactly equal to the hypothesized population parameter (e.g. β1)
Difference between observed statistic and true parameter could be because:
Parameter is not the hypothesized value (H0 is false)
Parameter is truly the hypothesized value (H0 is true) but sampling variability gave us a different estimate
Type I error (false positive): rejecting H0 when it is in fact true
Type II error (false negative): failing to reject H0 when it is in fact false
Judgment \ Truth | Null is True | Null is False
---|---|---
Reject Null | TYPE I ERROR (False +) | CORRECT (True +)
Don't Reject Null | CORRECT (True -) | TYPE II ERROR (False -)
Judgment \ Truth | Defendant is Innocent | Defendant is Guilty
---|---|---
Convict | TYPE I ERROR (False +) | CORRECT (True +)
Acquit | CORRECT (True -) | TYPE II ERROR (False -)

Anglo-American common law presumes the defendant is innocent: H0
Jury judges whether the evidence presented against the defendant is plausible assuming the defendant were in fact innocent
If highly improbable: sufficient evidence to reject H0 and convict
William Blackstone
(1723-1780)
"It is better that ten guilty persons escape than that one innocent suffer."
Blackstone, William, 1765-1770, Commentaries on the Laws of England
The probability of a Type I error is defined as α:
α = P(Reject H0 | H0 is true)
The confidence level is defined as (1−α)
The probability of a Type II error is defined as β:
β=P(Don't reject H0|H0 is false)
Judgment \ Truth | Null is True | Null is False
---|---|---
Reject Null | TYPE I ERROR (α) | CORRECT (1−β)
Don't Reject Null | CORRECT (1−α) | TYPE II ERROR (β)
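Since α is a conditional probability, a quick simulation can make it concrete. This is my own minimal sketch (not from the lecture): repeatedly test a true null hypothesis with base R's `t.test()` and count false positives; the sample size of 30 and the 5,000 repetitions are arbitrary choices.

```r
# Simulate the Type I error rate: draw many samples from a world where
# H0 (true mean = 0) is TRUE, and count how often a t-test rejects at alpha = 0.05
set.seed(1)  # for reproducible draws
rejections <- replicate(5000, t.test(rnorm(n = 30, mean = 0))$p.value < 0.05)
mean(rejections)  # proportion of false positives; should be close to alpha = 0.05
```

Even though the null is true in every repetition, roughly 5% of tests reject it, which is exactly what α = 0.05 promises.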
Power=1−β=P(Reject H0|H0 is false)
p-value: p(δ ≥ δi | H0 is true)
After running our test, we need to make a decision between the competing hypotheses
Compare p-value with pre-determined α (commonly, α=0.05, 95% confidence level)
Sir Ronald A. Fisher
(1890—1962)
"The null hypothesis is never proved or established, but is possibly disproved, in the course of experimentation. Every experiment may be said to exist only in order to give the facts a chance of disproving the null hypothesis."
1935, The Design of Experiments
Modern philosophy of science is largely based on hypothesis testing and falsifiability, which form the "Scientific Method"1
For something to be "scientific", it must be falsifiable, or at least testable
Hypotheses can be corroborated with evidence, but remain tentative until falsified by data suggesting an alternative hypothesis
"All swans are white" is a hypothesis rejected upon discovery of a single black swan
A rigorous statistics course (ECMG 212 or MATH 112) will spend weeks going through different types of tests:
See today's class notes page for more
We will instead use an R package called infer
Calculate a statistic, δi1, from a sample of data
Simulate a world where δ is null (H0)
Examine the distribution of δ across the null world
Calculate the probability that δi could exist in the null world
Decide if δi is statistically significant
1 δ can stand in for any test-statistic in any hypothesis test! For our purposes, δ is the slope of our regression sample, ˆβ1.
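The five steps above map directly onto the infer verbs introduced on the following slides. As a sketch, assuming the CASchool data (with testscr and str) and a saved sample slope called sample_slope, both of which appear later in these slides:

```r
library(dplyr)
library(infer)

null_distribution <- CASchool %>%
  specify(testscr ~ str) %>%                  # 1. the statistic comes from this model
  hypothesize(null = "independence") %>%      # 2. a null world where beta_1 = 0
  generate(reps = 1000, type = "permute") %>% # 3. simulate that null world
  calculate(stat = "slope")                   # distribution of slopes under H0

# 4. probability our observed slope could exist in the null world
null_distribution %>%
  get_p_value(obs_stat = sample_slope, direction = "both")
# 5. if the p-value falls below alpha (e.g. 0.05), call the slope statistically significant
```

Each verb is unpacked one at a time below.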
lm() automatically runs a hypothesis test on our slope:

H0: β1 = 0
Ha: β1 ≠ 0

The infer package allows you to run through these steps manually to understand the process:

specify() a model
hypothesize() the null
generate() simulations of the null world
calculate() the p-value
visualize() with a histogram (optional)
Test statistic (δ): measures how far what we observed in our sample (^β1) is from what we would expect if the null hypothesis were true (β1=0)
Rejection region: if the test statistic reaches a "critical value" of δ, then we reject the null hypothesis
1 Again, see today's class notes for more on the t-distribution. k is the number of independent variables our model has, in this case, with just one X, k=1. We use two degrees of freedom to calculate ^β0 and ^β1, hence we have n−2 df.
Our world, and a world where β1=0 by assumption.
term | estimate | std.error
---|---|---
(Intercept) | 698.932952 | 9.4674914
str | -2.279808 | 0.4798256
term | estimate | std.error
---|---|---
(Intercept) | 647.8027952 | 9.7147718
str | 0.3235038 | 0.4923581
# save as sample_slope
sample_slope <- school_reg_tidy %>% # this is the regression tidied with broom
  filter(term == "str") %>%
  pull(estimate)

# confirm what it is
sample_slope

## [1] -2.279808
data %>% specify(y ~ x)

We start with the specify() function, which is essentially a lm() function for regression (for our purposes):

CASchool %>%
  specify(testscr ~ str)
testscr | str
---|---
690.8 | 17.88991
661.2 | 21.52466
643.6 | 18.69723
%>% hypothesize(null = "independence")

In infer's language, we are hypothesizing that str and testscr are independent (β1 = 0)1:

CASchool %>%
  specify(testscr ~ str) %>%
  hypothesize(null = "independence")
testscr | str
---|---
690.8 | 17.88991
661.2 | 21.52466
643.6 | 18.69723
1 type can be either "point" (for specific point estimates for a single variable, such as a sample mean, x̄) or "independence" (for hypotheses about two samples or a relationship between variables). See more here.
%>% generate(reps = n, type = "permute")

We set the number of reps and set the type equal to "permute":

i %>%
  generate(reps = 1000, type = "permute")

1 Note: for spacing on the slide, I saved the previous code as i and pipe it into the remainder.
%>% generate(reps = n, type = "permute")

"bootstrap" takes a random draw of our existing sample's observations (of the same number of observations) with replacement
"permute" is a bootstrap without replacement1

1 You can do either of these in base R with sample(), which has 3 arguments: a vector to sample from, size (number of obs), and replace equal to TRUE or FALSE. See more for infer here.
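To make the footnote concrete, here is a small base-R sketch contrasting the two resampling types (my own toy vector, not the class data):

```r
set.seed(42)  # for reproducible draws
x <- c(10, 20, 30, 40, 50)

# "permute": reshuffle without replacement -- the same five values, new order
sample(x, size = length(x), replace = FALSE)

# "bootstrap": draw with replacement -- some values repeat, others drop out
sample(x, size = length(x), replace = TRUE)
```

Permuting breaks the pairing between X and Y while keeping each variable's values intact, which is exactly what a β1 = 0 null world requires.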
%>% calculate(stat = "")

We calculate sample statistics for each of the 1,000 replicate samples
In our case, calculate the slope (^β1) for each replicate:

i %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "slope")

Other stats available for calculation: "mean", "median", "prop", "diff in means", "diff in props", etc. (see package information)
%>% get_p_value(obs_stat = "", direction = "both")

We can calculate the p-value of our sample_slope (-2.28) in our simulated null distribution
For our two-sided alternative Ha: β1 ≠ 0, we double the raw p-value:

simulations %>%
  get_p_value(obs_stat = sample_slope, direction = "both")

p_value
---
0

Note: here I saved the results of our previous code as simulations for spacing.
%>% get_ci(level = 0.95, type = "se", point_estimate = "")

We use our sample_slope (^β1 of -2.28) as the point_estimate:

simulations %>%
  get_confidence_interval(level = 0.95, type = "se", point_estimate = sample_slope)

lower | upper
---|---
-3.234823 | -1.324793
%>% visualize()
simulations %>% visualize()
%>% visualize()

We can add our sample_slope to show our finding on the null distribution:

simulations %>%
  visualize(obs_stat = sample_slope)
%>% visualize() + shade_p_value()

We can add shade_p_value to see what p is:

simulations %>%
  visualize(obs_stat = sample_slope) +
  shade_p_value(obs_stat = sample_slope, direction = "two_sided")
%>% visualize() + shade_ci()

We can also shade the confidence interval; I saved the tibble of CI values from 4 slides ago as ci_values:

simulations %>%
  visualize(obs_stat = sample_slope) +
  shade_confidence_interval(ci_values)
infer's visualize() function is just a wrapper function for ggplot()
We can take our simulations tibble and just ggplot a normal histogram:

simulations %>%
  ggplot(data = .) +
  aes(x = stat) +
  geom_histogram(color = "white", fill = "indianred") +
  geom_vline(xintercept = sample_slope, color = "blue", size = 2, linetype = "dashed") +
  labs(x = expression(paste("Distribution of ", hat(beta[1]), " under ", H[0], " that ", beta[1]==0)),
       y = "Samples") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20)
R does things the old-fashioned way, using a theoretical null distribution instead of simulation
A t-distribution with n−k−1 df1
Calculate a t-statistic for ^β1:
test statistic = (estimate − null hypothesis value) / (standard error of estimate)
1 k is the number of X variables.
test statistic = (estimate − null hypothesis value) / (standard error of estimate)
t has the same interpretation as Z, number of std. dev. away from the distribution's center1
Compares to a critical value of t∗ (determined by α & n−k−1)
1 Think of our simulated distribution, the center was 0.
2 The 68-95-99.7% empirical rule!
t = (^β1 − β1,0) / se(^β1)
t = (−2.28 − 0) / 0.48
t = −4.75
Our sample slope is 4.75 standard deviations below the mean under H0
p-value: prob. of a test statistic at least as large (in magnitude) as ours if the null hypothesis were true1
1 Think of our simulated distribution, the center was 0.
Ha:β1<0
p-value: Prob(t<ti)
Ha:β1>0
p-value: Prob(t>ti)
Ha:β1≠0
p-value: 2×Prob(t>|ti|)
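These tail probabilities can be checked by hand with R's t-distribution functions. A sketch using the slope estimate and standard error from the regression output below (420 observations, so n − k − 1 = 418 df):

```r
beta_hat <- -2.279808   # slope estimate from the regression
se_beta  <- 0.4798256   # its standard error
t_stat   <- (beta_hat - 0) / se_beta  # about -4.75

# two-sided p-value: 2 x Prob(t > |t_i|) on 418 degrees of freedom
p_value <- 2 * pt(abs(t_stat), df = 418, lower.tail = FALSE)
p_value  # matches the ~2.78e-06 reported in the lm() summary
```

This is exactly the calculation lm() performs for the Pr(>|t|) column.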
summary(school_reg)
## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124, Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
Using broom's tidy() (with confidence intervals):

tidy(school_reg, conf.int = TRUE)

term | estimate | std.error | statistic | p.value | conf.low | conf.high
---|---|---|---|---|---|---
(Intercept) | 698.932952 | 9.4674914 | 73.824514 | 6.569925e-242 | 680.32313 | 717.542779
str | -2.279808 | 0.4798256 | -4.751327 | 2.783307e-06 | -3.22298 | -1.336637
H0: β1 = 0
Ha: β1 ≠ 0

Because the hypothesis test's p-value < α (0.05)...
We have sufficient evidence to reject H0 in favor of our alternative hypothesis. Our sample suggests that there is a relationship between class size and test scores.
Using the confidence intervals:
We are 95% confident that the true marginal effect of class size on test scores is between −3.22 and −1.34.
Confidence intervals are all two-sided by nature:
CI0.95 = (^β1 − 2 × se(^β1), ^β1 + 2 × se(^β1))

Hypothesis test (t-test) of H0: β1 = 0 computes a t-value of1:
t = ^β1 / se(^β1)

If a confidence interval contains the H0 value (i.e. 0, for our test), then we fail to reject H0.

1 Since our null hypothesis is that β1,0 = 0, the test statistic simplifies to this neat fraction.
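The interval can be rebuilt by hand with qt(); the "2" in the formula above approximates the exact t critical value:

```r
beta_hat <- -2.279808   # slope estimate
se_beta  <- 0.4798256   # its standard error
t_crit   <- qt(0.975, df = 418)  # about 1.97 for a 95% interval with 418 df

c(lower = beta_hat - t_crit * se_beta,
  upper = beta_hat + t_crit * se_beta)
# close to tidy()'s conf.low / conf.high of -3.22 and -1.34
```

Since 0 lies well outside this interval, the CI and the t-test agree: reject H0.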
Common misinterpretations (what p is NOT):
p is the probability that the alternative hypothesis is false
p is the probability that the null hypothesis is true
p is the probability that our observed effects were produced purely by random chance
p tells us how significant our finding is
"The widespread use of 'statistical significance' (generally interpreted as (p≤0.05) as a license for making a claim of a scientific finding (or implied truth) leads to considerable distortion of the scientific process."
Wasserstein, Ronald L. and Nicole A. Lazar (2016), "The ASA's Statement on p-Values: Context, Process, and Purpose," The American Statistician 70(2): 129-133
Again, p-value is the probability that, assuming the null hypothesis is true, we obtain (by pure random chance) a test statistic at least as extreme as the one we estimated for our sample
A low p-value means either (and we can't distinguish which):
The parameter is not the hypothesized value (H0 is false), or
H0 is true, but sampling variability gave us an unusually extreme sample
 | Test Score
---|---
Intercept | 698.93 *** (9.47)
STR | -2.28 *** (0.48)
N | 420
R-Squared | 0.05
SER | 18.58

*** p < 0.001; ** p < 0.01; * p < 0.05
Statistical significance is shown by asterisks, common (but not always!) standard:
Rare, but sometimes regression tables include p-values for estimates