"All models are wrong. But some are useful." - George Box
"All models are wrong. But some are useful." - George Box
$$Observed_i = Model_i + Error_i$$
How well does a line fit data? How tightly clustered around the line are the data points?
Quantify how much variation in \(Y_i\) is "explained" by the model
$$\underbrace{Y_i}_{Observed}=\underbrace{\widehat{Y_i}}_{Model}+\underbrace{\hat{u_i}}_{Error}$$
$$R^2 = \frac{\text{ variation in }\widehat{Y_i}}{\text{variation in }Y_i}$$
1 Sometimes called the "coefficient of determination"
$$R^2 =\frac{ESS}{TSS}$$
$$ESS= \sum^n_{i=1}(\hat{Y_i}-\bar{Y})^2 \quad \text{(Explained Sum of Squares)}$$
$$TSS= \sum^n_{i=1}(Y_i-\bar{Y})^2 \quad \text{(Total Sum of Squares)}$$
1 Sometimes called Model Sum of Squares (MSS) or Regression Sum of Squares (RSS) in other textbooks
2 It can be shown that \(\bar{\hat{Y_i}}=\bar{Y}\)
Equivalently, since \(TSS=ESS+SSE\), where \(SSE=\sum^n_{i=1}\hat{u_i}^2\) is the sum of squared residuals:
$$R^2=1-\frac{SSE}{TSS}$$
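As a quick check, both formulas give the same number. A minimal sketch in base R, assuming the school_reg model (testscr on str) and the CASchool data from earlier slides:

# compute R^2 as ESS/TSS and as 1 - SSE/TSS
y_hat <- fitted(school_reg)                                 # predicted test scores
ESS   <- sum((y_hat - mean(CASchool$testscr))^2)            # explained sum of squares
SSE   <- sum(residuals(school_reg)^2)                       # sum of squared residuals
TSS   <- sum((CASchool$testscr - mean(CASchool$testscr))^2) # total sum of squares
ESS / TSS     # R^2 = ESS/TSS
1 - SSE / TSS # R^2 = 1 - SSE/TSS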
$$R^2=(r_{X,Y})^2$$
# as squared correlation coefficient
# Base R
cor(CASchool$testscr, CASchool$str)^2
## [1] 0.0512401
# dplyr
CASchool %>% summarize(r_sq = cor(testscr, str)^2)

## # A tibble: 1 x 1
##     r_sq
##    <dbl>
## 1 0.0512
broom's augment() command makes a lot of new regression-based values, such as:
.fitted: predicted values \((\hat{Y_i})\)
.resid: residuals \((\hat{u_i})\)

library(broom)
school_reg %>% augment() %>% head(., n = 5) # show first 5 values

## # A tibble: 5 x 9
##   testscr   str .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
##     <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
## 1    691.  17.9    658.    1.24   32.7 0.00442   18.5 0.00689       1.76
## 2    661.  21.5    650.    1.28   11.3 0.00475   18.6 0.000893      0.612
## 3    644.  18.7    656.    1.01  -12.7 0.00297   18.6 0.000700     -0.685
## 4    648.  17.4    659.    1.42  -11.7 0.00586   18.6 0.00117      -0.629
## 5    641.  18.7    656.    1.02  -15.5 0.00301   18.6 0.00105      -0.836
Calculate \(R^2\) as the ratio of variances in model vs. actual (i.e. akin to \(\frac{ESS}{TSS}\)):

# as ratio of variances
school_reg %>% augment() %>%
  summarize(r_sq = var(.fitted) / var(testscr)) # var. of *predicted* testscr over var. of *actual* testscr

## # A tibble: 1 x 1
##     r_sq
##    <dbl>
## 1 0.0512
The Standard Error of the Regression (SER), \(\hat{\sigma_u}\), measures the typical size of the residuals (in units of \(Y\)):
$$\hat{\sigma_u}=\sqrt{\frac{SSE}{n-2}}$$
school_reg %>% augment() %>% summarize(SSE = sum(.resid^2), df = n()-1-1, SER = sqrt(SSE/df))
## # A tibble: 1 x 3
##       SSE    df   SER
##     <dbl> <dbl> <dbl>
## 1 144315.   418  18.6
In large samples (where \(n-2\) is very close to \(n\)), SER converges to just the standard deviation of the residuals
school_reg %>% augment() %>% summarize(sd_resid = sd(.resid))
## # A tibble: 1 x 1
##   sd_resid
##      <dbl>
## 1     18.6
The summary() command in Base R gives:
Multiple R-squared
Residual standard error (SER)

# Base R
summary(school_reg)

## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
# using broom
library(broom)
glance(school_reg)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0512        0.0490  18.6      22.6  2.78e-6     2 -1822. 3650. 3663.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
r.squared is 0.05 \(\implies\) about 5% of the variation in testscr is explained by our model
sigma (SER) is 18.6 \(\implies\) on average, test scores are about 18.6 points above/below our model's prediction
# extract it if you want with $r.squared
glance(school_reg)$r.squared
## [1] 0.0512401
Problem for identification: endogeneity
Problem for inference: randomness
OLS estimators \((\hat{\beta_0}\) and \(\hat{\beta_1})\) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Modeled randomness: \(u\) includes all factors affecting \(Y\) other than \(X\)
Sampling randomness: different samples will generate different OLS estimators
Inferential statistics analyzes a sample to make inferences about a much larger (unobservable) population
Population: all possible individuals that match some well-defined criterion of interest (people, firms, cities, etc)
Sample: some portion of the population of interest to represent the whole
We almost never can directly study the population, so we model it with our samples
Example: Suppose you randomly select 100 people and ask how many hours they spend on the internet each day. You take the mean of your sample, and it comes out to 5.4 hours.
We are more interested in the corresponding parameter of the relevant population (e.g. all Americans)
If we take another sample of \(n=100\) people, would we get the same number?
Roughly, but probably not exactly
Sampling variability describes the effect of a statistic varying somewhat from sample to sample
If we collect many samples, and each sample is randomly drawn from the population (and then replaced), then the distribution of samples is said to be independently and identically distributed (i.i.d.)
Each sample is independent of each other sample (due to replacement)
Each sample comes from the identical underlying population distribution
Calculating OLS estimators for a sample makes the OLS estimators themselves random variables:
Draw of \(i\) is random \(\implies\) value of each \((X_i,Y_i)\) is random \(\implies\) \(\hat{\beta_0},\hat{\beta_1}\) are random
Taking different samples will create different values of \(\hat{\beta_0},\hat{\beta_1}\)
Therefore, \(\hat{\beta_0},\hat{\beta_1}\) each have a sampling distribution across different samples
If neither of these are true, we have other methods (coming shortly!)
The Central Limit Theorem (CLT) is one of the most fundamental principles in all of statistics
Allows for virtually all testing of statistical hypotheses \(\rightarrow\) estimating probabilities of values on a normal distribution
The CLT allows us to approximate the sampling distributions of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) as normal
We care about \(\hat{\beta_1}\) (slope) since it has economic meaning, rarely about \(\hat{\beta_0}\) (intercept)
$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$
\(E[\hat{\beta_1}]\); what is the center of the distribution? (today)
\(\sigma_{\hat{\beta_1}}\); how precise is our estimate? (next class)
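To see both questions concretely, here is a minimal simulation sketch with hypothetical data (not the class's CASchool data): draw many samples from the same population, estimate the slope in each, and look at where the estimates center and how much they spread.

# simulate the sampling distribution of the OLS slope (hypothetical DGP)
set.seed(20)
slope_estimates <- replicate(1000, {
  x <- rnorm(100, mean = 10, sd = 2) # exogenous X
  u <- rnorm(100, mean = 0, sd = 5)  # error, independent of X
  y <- 2 + 3 * x + u                 # true beta_0 = 2, beta_1 = 3
  coef(lm(y ~ x))[2]                 # the slope estimate from this sample
})
mean(slope_estimates) # centers near the true beta_1 = 3
sd(slope_estimates)   # spread: how precise the estimator is
hist(slope_estimates) # roughly bell-shaped, per the CLT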
In order to talk about \(E[\hat{\beta_1}]\), we need to talk about \(u\)
Recall: \(u\) is a random variable, and we can never measure the error term
The expected value of the errors is 0: $$E[u]=0$$
The variance of the errors over \(X\) is constant: $$var(u|X)=\sigma^2_{u}$$
Errors are not correlated across observations: $$cor(u_i,u_j)=0 \quad \forall i \neq j$$
There is no correlation between \(X\) and the error term: $$cor(X, u)=0 \text{ or } E[u|X]=0$$
The variance of the errors over \(X\) is constant: $$var(u|X)=\sigma^2_{u}$$
Assumption 2 implies that errors are "homoskedastic": they have the same variance across \(X\)
Often this assumption is violated: errors may be "heteroskedastic," meaning they do not have the same variance across \(X\) (see the sketch below)
This is a problem for inference, but we have a simple fix for this (coming shortly)
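A minimal simulated sketch of the difference, with made-up data purely for illustration: homoskedastic errors have the same spread at every \(X\), while heteroskedastic errors fan out as \(X\) grows.

# simulate homoskedastic vs. heteroskedastic errors (made-up data)
set.seed(20)
x        <- runif(500, min = 0, max = 10)
u_homo   <- rnorm(500, mean = 0, sd = 2)       # same spread at every X
u_hetero <- rnorm(500, mean = 0, sd = 0.5 * x) # spread grows with X
par(mfrow = c(1, 2))
plot(x, u_homo,   main = "Homoskedastic errors")
plot(x, u_hetero, main = "Heteroskedastic errors")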
Errors are not correlated across observations: $$cor(u_i,u_j)=0 \quad \forall i \neq j$$
For simple cross-sectional data, this is rarely an issue
Time-series & panel data nearly always contain serial correlation or autocorrelation between errors
e.g. "this week's sales look a lot like last weel's sales, which look like...etc"
There are fixes to deal with autocorrelation (coming much later)
There is no correlation between \(X\) and the error term: $$cor(X, u)=0$$
This is the absolute killer assumption, because it assumes exogeneity
Often called the Zero Conditional Mean assumption: $$E[u|X]=0$$
"Does knowing \(X\) give me any useful information about \(u\)?"
$$E[\hat{\beta_1}]=\beta_1$$
Does not mean any sample gives us \(\hat{\beta_1}=\beta_1\), only that the estimation procedure will, on average, yield the correct value
Random errors above and below the true value cancel out (so that on average, \(E[\hat{u}|X]=0)\)
Example: We want to estimate the average height (H) of U.S. adults (population) and have a random sample of 100 adults.
Calculate the mean height of our sample \((\bar{H})\) to estimate the true mean height of the population \((\mu_H)\)
\(\bar{H}\) is an estimator of \(\mu_H\)
There are many estimators we could use to estimate \(\mu_H\)
Unbiasedness: does the estimator give us the true parameter on average?
Efficiency: an estimator with a smaller variance is better
1 Technically, we also care about consistency: as the sample size grows, the estimator converges to the true parameter value. The Law of Large Numbers, similar to the CLT, ensures this. We don't need to get too advanced about probability in this class.
\(\mathbf{\hat{\beta_1}}\) is the Best Linear Unbiased Estimator (BLUE) of \(\mathbf{\beta_1}\) when \(X\) is exogenous1
No systematic difference, on average, between sample values of \(\hat{\beta_1}\) and the true population \(\beta_1\):
$$E[\hat{\beta_1}]=\beta_1$$
1 The proof for this is known as the Gauss-Markov Theorem. See today's class notes for a simplified proof.
$$cor(X,u)=0$$
$$E(u|X)=0$$
For any known value of \(X\), the expected value of \(u\) is 0
Knowing the value of \(X\) must tell us nothing about the value of \(u\) (anything else relevant to \(Y\) other than \(X\))
We can then confidently assert causation: \(X \rightarrow Y\)
Example: Suppose we estimate the following relationship:
$$\text{Violent crimes}_t=\beta_0+\beta_1\text{Ice cream sales}_t+u_t$$
We find \(\hat{\beta_1}>0\)
Does this mean Ice cream sales \(\rightarrow\) Violent crimes?
$$E[\hat{\beta_1}]=\beta_1+cor(X,u)\frac{\sigma_u}{\sigma_X}$$
If \(X\) is exogenous: \(cor(X,u)=0\), we're just left with \(\beta_1\)
The larger \(cor(X,u)\) is, the larger the bias: \(\left(E[\hat{\beta_1}]-\beta_1 \right)\)
We can "sign" the direction of the bias based on \(cor(X,u)\)
1 See today's class notes for proof.
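To illustrate the formula, a minimal simulation sketch with hypothetical data (not from the class examples): \(X\) and \(u\) share a common factor, so \(cor(X,u)>0\) and the estimated slope is biased upward.

# simulate endogeneity bias: X is correlated with u (hypothetical DGP)
set.seed(20)
biased_slopes <- replicate(1000, {
  z <- rnorm(100)              # common factor driving both X and u
  x <- 10 + 2 * z + rnorm(100) # X is endogenous: it moves with z
  u <- 5 * z + rnorm(100)      # the error also contains z, so cor(X, u) > 0
  y <- 2 + 3 * x + u           # true beta_1 = 3
  coef(lm(y ~ x))[2]
})
mean(biased_slopes) # systematically above the true beta_1 = 3: upward bias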
Example: $$wages_i=\beta_0+\beta_1 education_i+u$$
Is this an accurate reflection of \(educ \rightarrow wages\)?
Does \(E[u|education]=0\)?
What would \(E[u|education]>0\) mean?
Example: $$\text{per capita cigarette consumption}=\beta_0+\beta_1 \text{State cig tax rate}+u $$
Is this an accurate reflection of \(tax \rightarrow cons\)?
Does \(E[u|tax]=0\)?
What would \(E[u|tax]>0\) mean?
Think about an idealized randomized controlled experiment
Subjects randomly assigned to treatment or control group
Implies knowing whether someone is treated \((X)\) tells us nothing about their personal characteristics \((u)\)
Random assignment makes \(u\) independent of \(X\), so
$$cor(X,u)=0\text{ and } E[u|X]=0$$