
2.5: OLS: Precision and Diagnostics

ECON 480 · Econometrics · Fall 2019

Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsf19
metricsF19.classes.ryansafner.com

The Sampling Distribution of $\hat{\beta}_1$

$$\hat{\beta}_1 \sim N(E[\hat{\beta}_1], \sigma_{\hat{\beta}_1})$$

  1. $E[\hat{\beta}_1]$: the center of the distribution (last class)

    • $E[\hat{\beta}_1] = \beta_1$ ¹
  2. $\sigma_{\hat{\beta}_1}$: how precise is our estimate? (today)

    • Variance $\sigma^2_{\hat{\beta}_1}$ or standard error+ $\sigma_{\hat{\beta}_1}$

¹ Under the 4 assumptions about $u$ (particularly, $cor(X, u) = 0$).

+ Standard "error" is the analog of standard deviation when talking about
the sampling distribution of a sample statistic (such as $\bar{X}$ or $\hat{\beta}_1$).
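
To see this concretely, here is a minimal simulation sketch (not from the lecture; the true values $\beta_0 = 2$ and $\beta_1 = 0.5$ and all object names are hypothetical): draw many samples, run OLS on each, and look at the distribution of the slope estimates.

# Hypothetical simulation (base R): the true model is y = 2 + 0.5x + u
set.seed(42)
beta1_hats <- replicate(1000, {
  x <- rnorm(100, mean = 5, sd = 2)  # X values with some spread
  u <- rnorm(100, mean = 0, sd = 10) # errors satisfy E[u] = 0 and cor(X, u) = 0
  y <- 2 + 0.5 * x + u
  coef(lm(y ~ x))[["x"]]             # extract the slope estimate
})
mean(beta1_hats) # should be close to the true beta_1 = 0.5 (unbiasedness)
sd(beta1_hats)   # an estimate of sigma_beta1_hat, the standard error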

Variation in $\hat{\beta}_1$

What Affects Variation in $\hat{\beta}_1$

$$var(\hat{\beta}_1) = \frac{(SER)^2}{n \times var(X)}$$

$$se(\hat{\beta}_1) = \sqrt{var(\hat{\beta}_1)} = \frac{SER}{\sqrt{n} \times sd(X)}$$

  • Variation in $\hat{\beta}_1$ is affected by three things:
  1. Goodness of fit of the model (SER)+
    • Larger SER → larger $var(\hat{\beta}_1)$
  2. Sample size, $n$
    • Larger $n$ → smaller $var(\hat{\beta}_1)$
  3. Variance of $X$
    • Larger $var(X)$ → smaller $var(\hat{\beta}_1)$

+ Recall from last class, the Standard Error of the Regression $\hat{\sigma}_u = \sqrt{\frac{\sum \hat{u}_i^2}{n - 2}}$
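
As a quick sanity check of this formula, here is a sketch (assuming the school_reg regression and CASchool data used throughout these slides) that rebuilds the standard error from its pieces:

# Sketch: rebuild se(beta_1_hat) from the SER, n, and sd(X)
SER <- summary(school_reg)$sigma   # the SER, about 18.58
n   <- nobs(school_reg)            # the sample size, 420
SER / (sqrt(n) * sd(CASchool$str)) # should be close to the 0.48 from summary()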

Variation in $\hat{\beta}_1$: Goodness of Fit

Variation in $\hat{\beta}_1$: Sample Size

Variation in $\hat{\beta}_1$: Variation in X

Presenting Regression Results

Our Class Size Regression: Base R

summary(school_reg) # get full summary
##
## Call:
## lm(formula = testscr ~ str, data = CASchool)
##
## Residuals:
## Min 1Q Median 3Q Max
## -47.727 -14.251 0.483 12.822 48.540
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 698.9330 9.4675 73.825 < 2e-16 ***
## str -2.2798 0.4798 -4.751 2.78e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.58 on 418 degrees of freedom
## Multiple R-squared: 0.05124, Adjusted R-squared: 0.04897
## F-statistic: 22.58 on 1 and 418 DF, p-value: 2.783e-06
  • How can we present all of this information in a tidy way?

Our Class Size Regression: Broom I

  • broom's tidy() gives us a slightly neater table
library(broom)
tidy(school_reg)
term          estimate     std.error   statistic   p.value
(Intercept)   698.932952   9.4674914   73.824514   6.569925e-242
str            -2.279808   0.4798256   -4.751327   2.783307e-06

Our Class Size Regression: Broom II

  • broom's glance() gives us summary statistics about the regression
library(broom)
glance(school_reg)
r.squared   adj.r.squared   sigma      statistic   p.value        df   logLik     AIC        BIC
0.0512401   0.04897033      18.58097   22.57511    2.783307e-06   2    -1822.25   3650.499   3662.62
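
Since glance() returns a one-row data frame, you can also pull out individual statistics directly (a small usage sketch, assuming school_reg from above):

glance(school_reg)$r.squared # grab a single statistic, e.g. the R-squared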

Presenting Regressions in a Table

  • Professional journals and papers often have a regression table, including:

    • Estimates of $\hat{\beta}_0$ and $\hat{\beta}_1$
    • Standard errors of $\hat{\beta}_0$ and $\hat{\beta}_1$ (often below, in parentheses)
    • Indications of statistical significance (often with asterisks)
    • Measures of regression fit: $R^2$, SER, etc.
  • Later: multiple columns for multiple models

Test Score
Intercept 698.93 ***
(9.47)   
STR -2.28 ***
(0.48)   
N 420       
R-Squared 0.05    
SER 18.58    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Regression Output with huxtable I

  • You will need to first install.packages("huxtable")

  • Load with library(huxtable)

  • Command: huxreg()

  • Main argument is the name of your lm object

  • Default output is fine, but often we want to customize a bit

# install.packages("huxtable")
library(huxtable)
huxreg(school_reg)
(1)
(Intercept) 698.933 ***
(9.467)   
str -2.280 ***
(0.480)   
N 420        
R2 0.051    
logLik -1822.250    
AIC 3650.499    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Regression Output with huxtable II

  • Can give a title to each column
"Test Score" = school_reg
  • Can change the names of coefficients from the default
coefs = c("Intercept" = "(Intercept)",
          "STR" = "str")
  • Decide which statistics to include, and rename them
statistics = c("N" = "nobs",
               "R-Squared" = "r.squared",
               "SER" = "sigma")
  • Choose how many decimal places to round to
number_format = 2

Regression Output with huxtable III

huxreg("Test Score" = school_reg,
coefs = c("Intercept" = "(Intercept)",
"STR" = "str"),
statistics = c("N" = "nobs",
"R-Squared" = "r.squared",
"SER" = "sigma"),
number_format = 2)
Test Score
Intercept 698.93 ***
(9.47)   
STR -2.28 ***
(0.48)   
N 420       
R-Squared 0.05    
SER 18.58    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Regression Outputs

  • huxtable is one package you can use

  • I used to use only stargazer, but since it was originally meant for Stata output, it has limits and problems

Diagnostics about Regression

Diagnostics: Residuals I

  • We often look at the residuals of a regression to get more insight about its goodness of fit and its bias

  • Recall that broom's augment() creates some useful new variables:

    • .fitted are the fitted (predicted) values from the model, i.e. $\hat{Y}_i$
    • .resid are the residuals (errors) from the model, i.e. $\hat{u}_i$

Diagnostics: Residuals II

  • Often a good idea to store in a new object (so we can make some plots)
aug_reg <- augment(school_reg)
aug_reg %>%
  head()
testscr str .fitted .se.fit .resid .hat .sigma .cooksd .std.resid
691 17.9 658 1.24 32.7 0.00442 18.5 0.00689  1.76 
661 21.5 650 1.28 11.3 0.00475 18.6 0.000893 0.612
644 18.7 656 1.01 -12.7 0.00297 18.6 0.0007   -0.685
648 17.4 659 1.42 -11.7 0.00586 18.6 0.00117  -0.629
641 18.7 656 1.02 -15.5 0.00301 18.6 0.00105  -0.836
606 21.4 650 1.24 -44.6 0.00446 18.5 0.013    -2.4  

Recap: Assumptions about Errors

  • Recall the 4 critical assumptions about $u$:
  1. The expected value of the residuals is 0: $E[u] = 0$

  2. The variance of the residuals over $X$ is constant, written: $var(u|X) = \sigma^2_u$

  3. Errors are not correlated across observations: $cor(u_i, u_j) = 0 \quad \forall i \neq j$

  4. No correlation between $X$ and the error term: $cor(X, u) = 0$ or $E[u|X] = 0$

Assumptions 1 and 2: Errors are i.i.d.

  • Assumptions 1 and 2 assume that errors come from the same (normal) distribution: $u \sim N(0, \sigma_u)$

    • Assumption 1: $E[u] = 0$
    • Assumption 2: $sd(u|X) = \sigma_u$
      • virtually always unknown...
  • We can often check this visually by plotting a histogram of $u$

Plotting Residuals

ggplot(data = aug_reg)+
  aes(x = .resid)+
  geom_histogram(color = "white")+
  labs(x = expression(paste("Residual, ", hat(u))))+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)
  • Just to check:
aug_reg %>%
  summarize(E_u = mean(.resid),
            sd_u = sd(.resid))

E_u         sd_u
-5.76e-16   18.6

Residual Plot

  • We often draw a residual plot to look for odd patterns in the residuals (a sketch follows below):
    • x-axis: the $X$ values (str)
    • y-axis: the $\hat{u}$ values (.resid)
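
A minimal sketch of such a residual plot, assuming the aug_reg object created above:

ggplot(data = aug_reg)+
  aes(x = str, y = .resid)+
  geom_point(color = "blue")+
  geom_hline(yintercept = 0, linetype = "dashed")+ # residuals average to 0
  labs(x = "Student to Teacher Ratio",
       y = expression(paste("Residual, ", hat(u))))+
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)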

Problem: Heteroskedasticity

Homoskedasticity

  • "Homoskedasticity:" variance of the residuals over X is constant, written: var(u|X)=σ2u

  • Knowing the value of X does not affect the variance (spread) of the errors

Heteroskedasticity I

  • "Heteroskedasticity:" variance of the residuals over X is NOT constant: var(u|X)σ2u

  • This does not cause ^β1 to be biased, but it does cause the standard error of ^β1 to be incorrect

  • This does cause a problem for inference!

Heteroskedasticity II

  • Recall the formula for the standard error of $\hat{\beta}_1$:

$$se(\hat{\beta}_1) = \sqrt{var(\hat{\beta}_1)} = \frac{SER}{\sqrt{n} \times sd(X)}$$

  • This actually assumes homoskedasticity

Heteroskedasticity III

  • Under heteroskedasticity, the standard error of $\hat{\beta}_1$ mutates to:

$$se(\hat{\beta}_1) = \sqrt{\frac{\sum_{i=1}^n (X_i - \bar{X})^2 \hat{u}_i^2}{\left[\sum_{i=1}^n (X_i - \bar{X})^2\right]^2}}$$

  • This is a heteroskedasticity-robust (or just "robust") method of calculating $se(\hat{\beta}_1)$

  • Don't learn the formula; do learn what heteroskedasticity is and how it affects our model!
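
The formula is simple enough to check by hand, though. A sketch, assuming the school_reg and CASchool objects from above (software usually adds small-sample corrections, so the result differs slightly from packaged robust SEs):

# Sketch: compute the robust se(beta_1_hat) manually
u_hat <- residuals(school_reg)                # the regression residuals
x_dev <- CASchool$str - mean(CASchool$str)    # deviations of X from its mean
sqrt(sum(x_dev^2 * u_hat^2) / sum(x_dev^2)^2) # close to the 0.52 robust SE below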

Visualizing Heteroskedasticity I

  • Our original scatterplot with regression line

  • Does the spread of the errors change over different values of str?

    • No: homoskedastic
    • Yes: heteroskedastic

More Obvious Heteroskedasticity

  • Visual cue: data is "fan-shaped"
    • Data points are closer to line in some areas
    • Data points are more spread from line in other areas

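To see where the fan shape comes from, here is a hypothetical simulation sketch (all names and numbers made up): errors whose spread grows with $X$.

# Hypothetical simulation of "fan-shaped" (heteroskedastic) data
library(tidyverse)
set.seed(256)
het_data <- tibble(x = runif(500, min = 0, max = 10),
                   u = rnorm(500, mean = 0, sd = x), # error sd grows with x
                   y = 2 + 0.5 * x + u)
ggplot(data = het_data)+
  aes(x = x, y = y)+
  geom_point()+
  geom_smooth(method = "lm")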

What Might Cause Heteroskedastic Errors?

$$\widehat{wage}_i = \hat{\beta}_0 + \hat{\beta}_1 educ_i$$

Wage
Intercept -0.90    
(0.68)   
Years of Schooling 0.54 ***
(0.05)   
N 526       
R-Squared 0.16    
SER 3.38    
*** p < 0.001; ** p < 0.01; * p < 0.05.

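One way to eyeball the cause: plot wages against education and watch the spread around the line grow with schooling. A sketch, assuming a data frame like the one behind wage_reg (the name wages here is hypothetical):

ggplot(data = wages)+
  aes(x = educ, y = wage)+
  geom_point()+
  geom_smooth(method = "lm")+
  labs(x = "Years of Schooling", y = "Wage")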

Detecting Heteroskedasticity I

  • Several tests exist to check if data is heteroskedastic
  • One common test is the Breusch-Pagan test
  • Can use bptest() from the lmtest package in R
    • $H_0$: homoskedastic
    • If p-value < 0.05, reject $H_0$ → heteroskedastic
# install.packages("lmtest")
library("lmtest")
bptest(school_reg)
##
## studentized Breusch-Pagan test
##
## data: school_reg
## BP = 5.7936, df = 1, p-value = 0.01608

Detecting Heteroskedasticity II

  • How about our wage regression?
# install.packages("lmtest")
library("lmtest")
bptest(wage_reg)
##
## studentized Breusch-Pagan test
##
## data: wage_reg
## BP = 15.306, df = 1, p-value = 9.144e-05

Fixing Heteroskedasticity I

  • Heteroskedasticity is easy to fix with software that can calculate robust standard errors (using the more complicated formula above)

  • Easiest method is to use the estimatr package

    • lm_robust() command (instead of lm) to run the regression
    • set se_type = "stata" to calculate robust SEs using the formula above
#install.packages("estimatr")
library(estimatr)
school_reg_robust <-lm_robust(testscr ~ str, data = CASchool,
se_type = "stata")
school_reg_robust
## Estimate Std. Error t value Pr(>|t|) CI Lower
## (Intercept) 698.932952 10.3643599 67.436191 9.486678e-227 678.560192
## str -2.279808 0.5194892 -4.388557 1.446737e-05 -3.300945
## CI Upper DF
## (Intercept) 719.305713 418
## str -1.258671 418

Fixing Heteroskedasticity II

library(huxtable)
huxreg("Normal" = school_reg,
       "Robust" = school_reg_robust,
       coefs = c("Intercept" = "(Intercept)",
                 "STR" = "str"),
       statistics = c("N" = "nobs",
                      "R-Squared" = "r.squared",
                      "SER" = "sigma"),
       number_format = 2)
Normal Robust
Intercept 698.93 *** 698.93 ***
(9.47)    (10.36)   
STR -2.28 *** -2.28 ***
(0.48)    (0.52)   
N 420        420       
R-Squared 0.05     0.05    
SER 18.58            
*** p < 0.001; ** p < 0.01; * p < 0.05.

Assumption 3: No Serial Correlation

  • Errors are not correlated across observations: $cor(u_i, u_j) = 0 \quad \forall i \neq j$

  • For simple cross-sectional data, this is rarely an issue

  • Time-series & panel data nearly always contain serial correlation or autocorrelation between errors

  • Errors may be clustered

    • by group: e.g. all observations from Maryland, all observations from Virginia, etc.
    • by time: GDP in 2006 around the world, GDP in 2008 around the world, etc.
  • We'll deal with these fixes when we talk about panel data (or time-series if necessary)
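
Detection, at least, works in the same spirit as bptest(). For example, lmtest also provides a Durbin-Watson test for serial correlation (a sketch; some_ts_reg is a hypothetical regression fit on time-series data):

library(lmtest)
dwtest(some_ts_reg) # H0: no (first-order) serial correlation in the errors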

Outliers

Outliers Can Bias OLS! I

  • Outliers can affect the slope (and intercept) of the line and add bias

    • May be the result of human error (measurement, transcribing, etc.)
    • May be meaningful and accurate
  • In any case, compare how including/dropping outliers affects regression and always discuss outliers!

Outliers Can Bias OLS! II

No Outliers Outliers
Intercept 698.93 *** 641.40 ***
(9.47)    (11.21)   
STR -2.28 *** 0.71    
(0.48)    (0.57)   
N 420        423       
R-Squared 0.05     0.00    
SER 18.58     23.76    
*** p < 0.001; ** p < 0.01; * p < 0.05.

Detecting Outliers

  • The car package has an outlierTest command to run on the regression
library("car")
# Use Bonferroni test
outlierTest(school_outlier_reg) # will point out which obs #s seem outliers
## rstudent unadjusted p-value Bonferroni p
## 422 8.822768 3.0261e-17 1.2800e-14
## 423 7.233470 2.2493e-12 9.5147e-10
## 421 6.232045 1.1209e-09 4.7414e-07
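
A possible follow-up sketch: drop the flagged rows and re-estimate to compare (school_outlier_data is a hypothetical name standing in for the data school_outlier_reg was fit on):

# Hypothetical: remove the three flagged observations and re-run the regression
library(dplyr)
school_no_outliers <- school_outlier_data %>%
  slice(-c(421, 422, 423)) # row numbers flagged by outlierTest() above
lm(testscr ~ str, data = school_no_outliers)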
