"All models are wrong. But some are useful." - George Box
"All models are wrong. But some are useful." - George Box
$$Observed_i = Model_i + Error_i$$
How well does a line fit data? How tightly clustered around the line are the data points?
Quantify how much variation in \(Y_i\) is "explained" by the model
$$\underbrace{Y_i}_{Observed}=\underbrace{\widehat{Y_i}}_{Model}+\underbrace{\hat{u_i}}_{Error}$$
$$R^2 = \frac{\text{ variation in }\widehat{Y_i}}{\text{variation in }Y_i}$$
1 Sometimes called the "coefficient of determination"
$$R^2 =\frac{ESS}{TSS}$$
$$ESS= \sum^n_{i=1}(\hat{Y_i}-\bar{Y})^2 \quad \text{(Explained Sum of Squares)}$$
$$TSS= \sum^n_{i=1}(Y_i-\bar{Y})^2 \quad \text{(Total Sum of Squares)}$$
1 Sometimes called Model Sum of Squares (MSS) or Regression Sum of Squares (RSS) in other textbooks
2 It can be shown that \(\bar{\hat{Y_i}}=\bar{Y}\)
Equivalently, since \(TSS=ESS+SSE\), where \(SSE=\sum^n_{i=1}\hat{u_i}^2\) is the sum of squared residuals:
$$R^2=1-\frac{SSE}{TSS}$$
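As a quick check, both formulas give the same number. A minimal sketch in base R, assuming the school_reg model (testscr on str) and the CASchool data from earlier slides:

# compute R^2 as ESS/TSS and as 1 - SSE/TSS
y_hat <- fitted(school_reg)                                 # predicted test scores
ESS   <- sum((y_hat - mean(CASchool$testscr))^2)            # explained sum of squares
SSE   <- sum(residuals(school_reg)^2)                       # sum of squared residuals
TSS   <- sum((CASchool$testscr - mean(CASchool$testscr))^2) # total sum of squares
ESS / TSS     # R^2 = ESS/TSS
1 - SSE / TSS # R^2 = 1 - SSE/TSS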
$$R^2=(r_{X,Y})^2$$
# as squared correlation coefficient
# Base R
cor(CASchool$testscr, CASchool$str)^2
## [1] 0.0512401
# dplyr
CASchool %>% summarize(r_sq = cor(testscr, str)^2)

## # A tibble: 1 x 1
##     r_sq
##    <dbl>
## 1 0.0512
broom's augment() command makes a lot of new regression-based values, such as:
.fitted: predicted values \((\hat{Y_i})\)
.resid: residuals \((\hat{u_i})\)

library(broom)
school_reg %>% augment() %>% head(., n = 5) # show first 5 values

## # A tibble: 5 x 9
##   testscr   str .fitted .se.fit .resid    .hat .sigma  .cooksd .std.resid
##     <dbl> <dbl>   <dbl>   <dbl>  <dbl>   <dbl>  <dbl>    <dbl>      <dbl>
## 1    691.  17.9    658.    1.24   32.7 0.00442   18.5 0.00689       1.76
## 2    661.  21.5    650.    1.28   11.3 0.00475   18.6 0.000893      0.612
## 3    644.  18.7    656.    1.01  -12.7 0.00297   18.6 0.000700     -0.685
## 4    648.  17.4    659.    1.42  -11.7 0.00586   18.6 0.00117      -0.629
## 5    641.  18.7    656.    1.02  -15.5 0.00301   18.6 0.00105      -0.836
Calculate \(R^2\) as the ratio of variances in model vs. actual (i.e. akin to \(\frac{ESS}{TSS}\)):

# as ratio of variances
school_reg %>% augment() %>%
  summarize(r_sq = var(.fitted) / var(testscr)) # var. of *predicted* testscr over var. of *actual* testscr

## # A tibble: 1 x 1
##     r_sq
##    <dbl>
## 1 0.0512
The Standard Error of the Regression (SER), \(\hat{\sigma_u}\), measures the typical size of the residuals (in units of \(Y\)):
$$\hat{\sigma_u}=\sqrt{\frac{SSE}{n-2}}$$
school_reg %>% augment() %>% summarize(SSE = sum(.resid^2), df = n()-1-1, SER = sqrt(SSE/df))
## # A tibble: 1 x 3
##       SSE    df   SER
##     <dbl> <dbl> <dbl>
## 1 144315.   418  18.6
In large samples (where \(n-2\) is very close to \(n\)), SER converges to just the standard deviation of the residuals
school_reg %>% augment() %>% summarize(sd_resid = sd(.resid))
## # A tibble: 1 x 1
##   sd_resid
##      <dbl>
## 1     18.6
The summary() command in Base R gives:
Multiple R-squared
Residual standard error (SER)

# Base R
summary(school_reg)

## 
## Call:
## lm(formula = testscr ~ str, data = CASchool)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -47.727 -14.251   0.483  12.822  48.540 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 698.9330     9.4675  73.825  < 2e-16 ***
## str          -2.2798     0.4798  -4.751 2.78e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared:  0.05124,    Adjusted R-squared:  0.04897 
## F-statistic: 22.58 on 1 and 418 DF,  p-value: 2.783e-06
# using broom
library(broom)
glance(school_reg)

## # A tibble: 1 x 11
##   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>    <dbl> <int>  <dbl> <dbl> <dbl>
## 1    0.0512        0.0490  18.6      22.6  2.78e-6     2 -1822. 3650. 3663.
## # … with 2 more variables: deviance <dbl>, df.residual <int>
r.squared is 0.05 \(\implies\) about 5% of the variation in testscr is explained by our model
sigma (SER) is 18.6 \(\implies\) on average, test scores are about 18.6 points above/below our model's prediction
# extract it if you want with $r.squared
glance(school_reg)$r.squared
## [1] 0.0512401
Problem for identification: endogeneity
Problem for inference: randomness
OLS estimators \((\hat{\beta_0}\) and \(\hat{\beta_1})\) are computed from a finite (specific) sample of data
Our OLS model contains 2 sources of randomness:
Modeled randomness: \(u\) includes all factors affecting \(Y\) other than \(X\)
Sampling randomness: different samples will generate different OLS estimators
Inferential statistics analyzes a sample to make inferences about a much larger (unobservable) population
Population: all possible individuals that match some well-defined criterion of interest (people, firms, cities, etc)
Sample: some portion of the population of interest to represent the whole
We almost never can directly study the population, so we model it with our samples
Example: Suppose you randomly select 100 people and ask how many hours they spend on the internet each day. You take the mean of your sample, and it comes out to 5.4 hours.
We are more interested in the corresponding parameter of the relevant population (e.g. all Americans)
If we take another sample of \(n=100\) people, would we get the same number?
Roughly, but probably not exactly
Sampling variability describes the effect of a statistic varying somewhat from sample to sample
If we collect many samples, and each sample is randomly drawn from the population (and then replaced), then the distribution of samples is said to be independently and identically distributed (i.i.d.)
Each sample is independent of each other sample (due to replacement)
Each sample comes from the identical underlying population distribution
Calculating OLS estimators for a sample makes the OLS estimators themselves random variables:
Draw of \(i\) is random \(\implies\) value of each \((X_i,Y_i)\) is random \(\implies\) \(\hat{\beta_0},\hat{\beta_1}\) are random
Taking different samples will create different values of \(\hat{\beta_0},\hat{\beta_1}\)
Therefore, \(\hat{\beta_0},\hat{\beta_1}\) each have a sampling distribution across different samples
If neither of these are true, we have other methods (coming shortly!)
The Central Limit Theorem (CLT) is one of the most fundamental principles in all of statistics
Allows for virtually all testing of statistical hypotheses \(\rightarrow\) estimating probabilities of values on a normal distribution
The CLT allows us to approximate the sampling distributions of \(\hat{\beta_0}\) and \(\hat{\beta_1}\) as normal
We care about \(\hat{\beta_1}\) (slope) since it has economic meaning, rarely about \(\hat{\beta_0}\) (intercept)
$$\hat{\beta_1} \sim N(E[\hat{\beta_1}], \sigma_{\hat{\beta_1}})$$
\(E[\hat{\beta_1}]\); what is the center of the distribution? (today)
\(\sigma_{\hat{\beta_1}}\); how precise is our estimate? (next class)
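To see both questions concretely, here is a minimal simulation sketch with hypothetical data (not the class's CASchool data): draw many samples from the same population, estimate the slope in each, and look at where the estimates center and how much they spread.

# simulate the sampling distribution of the OLS slope (hypothetical DGP)
set.seed(20)
slope_estimates <- replicate(1000, {
  x <- rnorm(100, mean = 10, sd = 2) # exogenous X
  u <- rnorm(100, mean = 0, sd = 5)  # error, independent of X
  y <- 2 + 3 * x + u                 # true beta_0 = 2, beta_1 = 3
  coef(lm(y ~ x))[2]                 # the slope estimate from this sample
})
mean(slope_estimates) # centers near the true beta_1 = 3
sd(slope_estimates)   # spread: how precise the estimator is
hist(slope_estimates) # roughly bell-shaped, per the CLT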
In order to talk about \(E[\hat{\beta_1}]\), we need to talk about \(u\)
Recall: \(u\) is a random variable, and we can never measure the error term
The expected value of the errors is 0: $$E[u]=0$$
The variance of the errors over \(X\) is constant: $$var(u|X)=\sigma^2_{u}$$
Errors are not correlated across observations: $$cor(u_i,u_j)=0 \quad \forall i \neq j$$
There is no correlation between \(X\) and the error term: $$cor(X, u)=0 \text{ or } E[u|X]=0$$
The variance of the errors over \(X\) is constant: $$var(u|X)=\sigma^2_{u}$$
Assumption 2 implies that errors are "homoskedastic": they have the same variance across \(X\)
Often this assumption is violated: errors may be "heteroskedastic," meaning they do not have the same variance across \(X\) (see the sketch below)
This is a problem for inference, but we have a simple fix for this (coming shortly)
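A minimal simulated sketch of the difference, with made-up data purely for illustration: homoskedastic errors have the same spread at every \(X\), while heteroskedastic errors fan out as \(X\) grows.

# simulate homoskedastic vs. heteroskedastic errors (made-up data)
set.seed(20)
x        <- runif(500, min = 0, max = 10)
u_homo   <- rnorm(500, mean = 0, sd = 2)       # same spread at every X
u_hetero <- rnorm(500, mean = 0, sd = 0.5 * x) # spread grows with X
par(mfrow = c(1, 2))
plot(x, u_homo,   main = "Homoskedastic errors")
plot(x, u_hetero, main = "Heteroskedastic errors")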
Errors are not correlated across observations: $$cor(u_i,u_j)=0 \quad \forall i \neq j$$
For simple cross-sectional data, this is rarely an issue
Time-series & panel data nearly always contain serial correlation or autocorrelation between errors
e.g. "this week's sales look a lot like last weel's sales, which look like...etc"
There are fixes to deal with autocorrelation (coming much later)
There is no correlation between \(X\) and the error term: $$cor(X, u)=0$$
This is the absolute killer assumption, because it assumes exogeneity
Often called the Zero Conditional Mean assumption: $$E[u|X]=0$$
"Does knowing \(X\) give me any useful information about \(u\)?"
$$E[\hat{\beta_1}]=\beta_1$$
Does not mean any sample gives us \(\hat{\beta_1}=\beta_1\), only that the estimation procedure will, on average, yield the correct value
Random errors above and below the true value cancel out (so that on average, \(E[\hat{u}|X]=0)\)
Example: We want to estimate the average height (H) of U.S. adults (population) and have a random sample of 100 adults.
Calculate the mean height of our sample \((\bar{H})\) to estimate the true mean height of the population \((\mu_H)\)
\(\bar{H}\) is an estimator of \(\mu_H\)
There are many estimators we could use to estimate \(\mu_H\)
Unbiasedness: does the estimator give us the true parameter on average?
Efficiency: an estimator with a smaller variance is better
1 Technically, we also care about consistency: as the sample size grows, the estimator converges to the true parameter value. The Law of Large Numbers, similar to the CLT, ensures this. We don't need to get too advanced about probability in this class.
\(\mathbf{\hat{\beta_1}}\) is the Best Linear Unbiased Estimator (BLUE) of \(\mathbf{\beta_1}\) when \(X\) is exogenous1
No systematic difference, on average, between sample values of \(\hat{\beta_1}\) and the true population \(\beta_1\):
$$E[\hat{\beta_1}]=\beta_1$$
1 The proof for this is known as the Gauss-Markov Theorem. See today's class notes for a simplified proof.
$$cor(X,u)=0$$
$$E(u|X)=0$$
For any known value of \(X\), the expected value of \(u\) is 0
Knowing the value of \(X\) must tell us nothing about the value of \(u\) (anything else relevant to \(Y\) other than \(X\))
We can then confidently assert causation: \(X \rightarrow Y\)
Example: Suppose we estimate the following relationship:
$$\text{Violent crimes}_t=\beta_0+\beta_1\text{Ice cream sales}_t+u_t$$
We find \(\hat{\beta_1}>0\)
Does this mean Ice cream sales \(\rightarrow\) Violent crimes?
$$E[\hat{\beta_1}]=\beta_1+cor(X,u)\frac{\sigma_u}{\sigma_X}$$
If \(X\) is exogenous: \(cor(X,u)=0\), we're just left with \(\beta_1\)
The larger \(cor(X,u)\) is, the larger the bias: \(\left(E[\hat{\beta_1}]-\beta_1 \right)\)
We can "sign" the direction of the bias based on \(cor(X,u)\)
1 See today's class notes for proof.
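To illustrate the formula, a minimal simulation sketch with hypothetical data (not from the class examples): \(X\) and \(u\) share a common factor, so \(cor(X,u)>0\) and the estimated slope is biased upward.

# simulate endogeneity bias: X is correlated with u (hypothetical DGP)
set.seed(20)
biased_slopes <- replicate(1000, {
  z <- rnorm(100)              # common factor driving both X and u
  x <- 10 + 2 * z + rnorm(100) # X is endogenous: it moves with z
  u <- 5 * z + rnorm(100)      # the error also contains z, so cor(X, u) > 0
  y <- 2 + 3 * x + u           # true beta_1 = 3
  coef(lm(y ~ x))[2]
})
mean(biased_slopes) # systematically above the true beta_1 = 3: upward bias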
Example: $$wages_i=\beta_0+\beta_1 education_i+u$$
Is this an accurate reflection of \(educ \rightarrow wages\)?
Does \(E[u|education]=0\)?
What would \(E[u|education]>0\) mean?
Example: $$\text{per capita cigarette consumption}=\beta_0+\beta_1 \text{State cig tax rate}+u $$
Is this an accurate reflection of \(tax \rightarrow cons\)?
Does \(E[u|tax]=0\)?
What would \(E[u|tax]>0\) mean?
Think about an idealized randomized controlled experiment
Subjects randomly assigned to treatment or control group
Implies knowing whether someone is treated \((X)\) tells us nothing about their personal characteristics \((u)\)
Random assignment makes \(u\) independent of \(X\), so
$$cor(X,u)=0\text{ and } E[u|X]=0$$