3.6: Regression with Categorical Data

ECON 480 · Econometrics · Fall 2019

Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsf19
metricsF19.classes.ryansafner.com

Categorical Data

Categorical data place an individual into one of several possible categories
- e.g. sex, season, political party
- may be responses to survey questions
- can be quantitative (e.g. age, zip code)
R calls these factors (we'll deal with them much later in the course)

Example Research Question

Example: do men earn higher wages, on average, than women? If so, how much?

The Pure Statistics of Comparing Group Means

Using basic statistics, we can test for a statistically significant difference in group means with a $t$-test¹
Define:
$\bar{Y}_M$: average earnings of a sample of men
$\bar{Y}_W$: average earnings of a sample of women
Difference in group averages: $d = \bar{Y}_M - \bar{Y}_W$
The hypothesis test is: $H_0: d = 0$ vs. $H_1: d \neq 0$

¹ See today's class notes page for this example
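
A minimal sketch of such a test in base R. The data frame `toy` and its numbers are invented for illustration (they are not the course data), and `t.test()` runs the Welch (unequal-variance) version by default, which may differ from the version used in the class notes.

```r
# Hypothetical toy data (invented numbers, not the course dataset)
toy <- data.frame(
  wage   = c(10, 12, 14, 16, 7, 9, 8, 12),
  female = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Group averages and their difference
mean_men   <- mean(toy$wage[toy$female == 0])
mean_women <- mean(toy$wage[toy$female == 1])
d <- mean_men - mean_women

# Two-sided test of H0: no difference in group means
# (t.test() uses the Welch unequal-variance test by default)
test <- t.test(wage ~ female, data = toy)
test$estimate   # the two group means
test$statistic  # t-statistic
test$p.value    # p-value
```

Passing `var.equal = TRUE` to `t.test()` would instead pool the two group variances.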

Comparing Groups with Regression

In a regression, we can easily compare across groups via a dummy variable¹
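A dummy variable takes only the value $0$ or $1$, depending on whether a condition is met
Signifies whether an observation belongs to a category or not

¹ Also called a binary variable or dichotomous variable

Example: $Female_i = 1$ if individual $i$ is female, and $Female_i = 0$ if individual $i$ is male

Again, the coefficient $\hat{\beta}_1$ in $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i$, with dummy $D_i \in \{0, 1\}$, makes less sense as the "slope" of a line in this context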
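
When the data record the category as text rather than as 0/1, a dummy can be built from a logical comparison. A small sketch on a made-up data frame (the column names `sex` and `wage` and all values are hypothetical):

```r
# Hypothetical toy data: a text category we want to turn into a dummy
toy <- data.frame(
  sex  = c("female", "male", "male", "female"),
  wage = c(9, 13, 11, 10)
)

# The comparison gives TRUE/FALSE; as.integer() converts that to 1/0
toy$female <- as.integer(toy$sex == "female")
toy$female
```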

Comparing Groups in Regression: Scatterplot

Female is our dummy $X$-variable
Hard to see relationships because of overplotting
Use geom_jitter() instead of geom_point() to randomly nudge points
- Only for plotting purposes, does not affect actual data, regression, etc.
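
To see why jittering is harmless, here is a base-R analogue of the idea (an assumption for illustration: it uses `jitter()` and `plot()` on invented data instead of ggplot2). The nudged values live in a separate vector, so the 0/1 column that a regression would use is untouched.

```r
# Hypothetical toy data (invented numbers)
toy <- data.frame(
  wage   = c(10, 12, 14, 16, 7, 9, 8, 12),
  female = c(0, 0, 0, 0, 1, 1, 1, 1)
)

set.seed(1)  # make the random nudges reproducible
# jitter() returns a nudged copy; toy$female itself is unchanged
female_jit <- jitter(toy$female, amount = 0.05)

# Plot the nudged x-values so overlapping points separate
plot(female_jit, toy$wage, xlab = "Female (jittered)", ylab = "Wage")

# The data still hold exact 0/1 values
unique(toy$female)
```

In ggplot2 the same effect comes from swapping geom_point() for geom_jitter(), whose `width` and `height` arguments control the size of the nudge.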

Dummy Variables as Group Means

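In the model $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 D_i$:
When $D_i = 0$ (Control group):
- $\hat{Y}_i = \hat{\beta}_0$, the mean of $Y_i$ when $D_i = 0$
When $D_i = 1$ (Treatment group):
- $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1$, the mean of $Y_i$ when $D_i = 1$

So the difference in group means: $(\hat{\beta}_0 + \hat{\beta}_1) - \hat{\beta}_0 = \hat{\beta}_1$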
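
This can be checked numerically. The sketch below uses an invented data frame (`toy` is hypothetical, not the course data) and compares the OLS coefficients from `lm()` with the group means computed directly.

```r
# Hypothetical toy data (invented numbers)
toy <- data.frame(
  wage   = c(10, 12, 14, 16, 7, 9, 8, 12),
  female = c(0, 0, 0, 0, 1, 1, 1, 1)
)

# Regress the outcome on the dummy
reg <- lm(wage ~ female, data = toy)
b0 <- unname(coef(reg)["(Intercept)"])
b1 <- unname(coef(reg)["female"])

# Group means computed directly
mean_d0 <- mean(toy$wage[toy$female == 0])  # dummy = 0 group
mean_d1 <- mean(toy$wage[toy$female == 1])  # dummy = 1 group

all.equal(b0, mean_d0)            # intercept vs. mean when dummy = 0
all.equal(b0 + b1, mean_d1)       # intercept + coefficient vs. mean when dummy = 1
all.equal(b1, mean_d1 - mean_d0)  # coefficient vs. difference in means
```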

Dummy Variables as Group Means: Our Example

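Example: $\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Female_i$, where $Female_i = 1$ if individual $i$ is female and $0$ if male

Mean wage for men: $E[Wage | Female = 0] = \hat{\beta}_0$
Mean wage for women: $E[Wage | Female = 1] = \hat{\beta}_0 + \hat{\beta}_1$
Difference in wage between men & women: $\hat{\beta}_1$ (women's mean minus men's mean)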

The Data

# from wooldridge package
library(wooldridge)
# save as a dataframe
wages<-wooldridge::wage1
head(wages)

	wage <dbl>	educ <int>	exper <int>	tenure <int>	female <int>	married <int>	numdep <int>	smsa <int>
1	3.10	11	2	0	1	0	2	1
2	3.24	12	22	2	1	1	3	1
3	3.00	11	2	0	0	0	2	0
4	6.00	8	44	28	0	1	0	1
5	5.30	12	7	2	0	1	1	0
6	8.75	16	9	8	0	1	0	1

Get Group Averages & Std. Devs.

library(dplyr) # provides %>%, filter(), and summarize()

# Summarize for Men
wages %>%
  filter(female==0) %>%
  summarize(mean = mean(wage),
            sd = sd(wage))

mean <dbl>	sd <dbl>
7.099489	4.160858

# Summarize for Women
wages %>%
  filter(female==1) %>%
  summarize(mean = mean(wage),
            sd = sd(wage))

mean <dbl>	sd <dbl>
4.587659	2.529363

Visualize Differences

The Regression I

femalereg<-lm(wage~female, data=wages)
summary(femalereg)

## 
## Call:
## lm(formula = wage ~ female, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5995 -1.8495 -0.9877  1.4260 17.8805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   7.0995     0.2100  33.806  < 2e-16 ***
## female       -2.5118     0.3034  -8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15

The Regression II

library(broom)
tidy(femalereg)

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	7.099489	0.2100082	33.805777	8.971839e-134
female	-2.511830	0.3034092	-8.278688	1.041764e-15

Dummy Regression vs. Group Means

From tabulation of group means

Gender	Avg. Wage	Std. Dev.
Female	4.59	2.53
Male	7.10	4.16
Difference	2.51	0.30 (std. error)

From $t$-test of difference in group means

term <chr>	estimate <dbl>	std.error <dbl>
(Intercept)	7.099489	0.2100082
female	-2.511830	0.3034092
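
The link between the two tables is simple arithmetic: the intercept is the men's average, and adding the `female` coefficient gives the women's average. A quick check using the estimates copied from the tidy(femalereg) output (the variable names `b0`, `b1`, and `mean_women` are just labels chosen here):

```r
# Estimates copied from the tidy(femalereg) output
b0 <- 7.099489   # (Intercept): average wage for men
b1 <- -2.511830  # female: women's average minus men's average

# Implied average wage for women
mean_women <- b0 + b1
round(mean_women, 6)  # compare with the tabulated women's mean, 4.587659
```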

Recoding Dummies

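Example:

Suppose instead of $Female_i$ we had used $Male_i$, where $Male_i = 1$ if individual $i$ is male and $Male_i = 0$ if individual $i$ is female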

Recoding Dummies with Data

	wage <dbl>	female <int>	male <dbl>
1	3.10	1	0
2	3.24	1	0
3	3.00	0	1
4	6.00	0	1
5	5.30	0	1
6	8.75	0	1
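
The code that created the `male` column is not shown here. One way to build such a column, sketched in base R on a small data frame holding the first four rows of the table above (the name `toy` is hypothetical, and the slides may well have used a different approach, such as dplyr's mutate()):

```r
# First four rows of wage and female, copied from the table above
toy <- data.frame(
  wage   = c(3.10, 3.24, 3.00, 6.00),
  female = c(1L, 1L, 0L, 0L)
)

# male is the mirror image of female: 1 where female is 0, and vice versa
toy$male <- 1 - toy$female
toy
```

Because every observation is in exactly one of the two groups, `female + male` equals 1 in every row.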

Scatterplot with Male

Dummy Variables as Group Means: With Male

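Example: $\widehat{Wage}_i = \hat{\beta}_0 + \hat{\beta}_1 Male_i$, where $Male_i = 1$ if individual $i$ is male and $0$ if female

Mean wage for men: $E[Wage | Male = 1] = \hat{\beta}_0 + \hat{\beta}_1$
Mean wage for women: $E[Wage | Male = 0] = \hat{\beta}_0$
Difference in wage between men & women: $\hat{\beta}_1$ (men's mean minus women's mean)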

The Regression with Male I

malereg<-lm(wage~male, data=wages)
summary(malereg)

## 
## Call:
## lm(formula = wage ~ male, data = wages)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -5.5995 -1.8495 -0.9877  1.4260 17.8805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   4.5877     0.2190  20.950  < 2e-16 ***
## male          2.5118     0.3034   8.279 1.04e-15 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.476 on 524 degrees of freedom
## Multiple R-squared:  0.1157,    Adjusted R-squared:  0.114 
## F-statistic: 68.54 on 1 and 524 DF,  p-value: 1.042e-15

The Regression with Male II

library(broom)
tidy(malereg)

term <chr>	estimate <dbl>	std.error <dbl>	statistic <dbl>	p.value <dbl>
(Intercept)	4.587659	0.2189834	20.949802	3.012371e-71
male	2.511830	0.3034092	8.278688	1.041764e-15

The Dummy Regression: Male or Female

	(1)	(2)
Constant	4.59 ***	7.10 ***
	(0.22)	(0.21)
Female		-2.51 ***
		(0.30)
Male	2.51 ***
	(0.30)
N	526	526
R-Squared	0.12	0.12
SER	3.48	3.48
*** p < 0.001; ** p < 0.01; * p < 0.05.

Note that it doesn't matter whether we use male or female: either way, men earn $2.51 more than women on average
Compare the constants: each is the average wage for the group whose dummy equals 0 (women in (1), men in (2))
Should you use male AND female? We'll come to that...
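
The equivalence of the two codings can be verified on any data set. A sketch on invented data (`toy` is hypothetical): the coefficient flips sign, the constant switches to the other group's mean, and the fitted values are identical.

```r
# Hypothetical toy data (invented numbers)
toy <- data.frame(
  wage   = c(10, 12, 14, 16, 7, 9, 8, 12),
  female = c(0, 0, 0, 0, 1, 1, 1, 1)
)
toy$male <- 1 - toy$female  # mirror-image dummy

reg_f <- lm(wage ~ female, data = toy)
reg_m <- lm(wage ~ male, data = toy)

coef(reg_f)  # constant = men's mean, coefficient = women minus men
coef(reg_m)  # constant = women's mean, coefficient = men minus women

# Same magnitude, opposite sign
all.equal(unname(coef(reg_m)["male"]), -unname(coef(reg_f)["female"]))

# Both models predict the same wage for every observation
all.equal(fitted(reg_f), fitted(reg_m))
```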

Categorical Variables (More than 2 Categories)

Categorical Variables with More than 2 Categories

A categorical variable expresses membership in a category, where there is no ranking or hierarchy of the categories
- We've looked at categorical variables with 2 categories only
- e.g. Male/Female, Spring/Summer/Fall/Winter, Democratic/Republican/Independent

A variable might instead be ordinal: it expresses rank or an ordering of data, but not necessarily their relative magnitude
- e.g. Order of finalists in a competition (1st, 2nd, 3rd)
- e.g. Highest education attained (1=elementary school, 2=high school, 3=bachelor's degree, 4=graduate degree)
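
In R, both kinds of variable are stored as factors. A brief sketch with made-up values (the vectors `season` and `educ` are hypothetical): an ordinary factor has no ranking, while `ordered = TRUE` together with an explicit `levels` order declares one.

```r
# Unordered categories: factor() stores labels with no ranking
season <- factor(c("Spring", "Fall", "Winter", "Fall"))
levels(season)      # the distinct categories
is.ordered(season)  # is a ranking declared?

# Ordinal categories: supply the ranking explicitly, lowest to highest
educ <- factor(
  c("high school", "graduate", "elementary", "bachelor's"),
  levels  = c("elementary", "high school", "bachelor's", "graduate"),
  ordered = TRUE
)
is.ordered(educ)
educ[1] < educ[2]   # comparisons follow the declared ranking
```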

Using Categorical Variables in Regression I

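Example: How do wages vary by region of the country? Let $Region_i \in \{Northeast, Midwest, South, West\}$

Can we run the following regression? $\widehat{Wages}_i = \hat{\beta}_0 + \hat{\beta}_1 Region_i$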

Using Categorical Variables in Regression II