2.1: Data 101 and Descriptive Statistics

ECON 480 · Econometrics · Fall 2019

Ryan Safner
Assistant Professor of Economics
safner@hood.edu
ryansafner/metricsf19
metricsF19.classes.ryansafner.com

Review from 1.2: Two Big Problems with Data

We want to use econometrics to identify causal relationships and make inferences about them

Problem for identification: endogeneity
Problem for inference: randomness

Review from 1.2: Identification Problem: Endogeneity

An independent variable is exogenous if its variation is unrelated to other factors that affect the dependent variable
An independent variable is endogenous if its variation is related to other factors that affect the dependent variable
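
A minimal simulated sketch of the distinction, assuming a made-up "other factor" `z` that also affects the dependent variable. The names `z`, `x_exog`, and `x_endog` and all numbers are hypothetical illustrations, not course data:

```r
set.seed(480) # arbitrary seed so the simulation is reproducible
n <- 10000

z       <- rnorm(n)     # hypothetical other factor affecting the dependent variable
x_exog  <- rnorm(n)     # varies independently of z: exogenous
x_endog <- z + rnorm(n) # part of its variation comes from z: endogenous

cor_exog  <- cor(x_exog, z)  # should be close to 0
cor_endog <- cor(x_endog, z) # should be clearly positive

print(c(exogenous = cor_exog, endogenous = cor_endog))
```

Correlation with `z` is only a rough stand-in for "related to other factors"; in real data those other factors are often unobserved, which is what makes endogeneity hard to detect.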

Review from 1.2: Inference Problem: Randomness

Data is random due to natural sampling variation
- Taking one sample of a population will yield slightly different information than another sample of the same population
Common in statistics, easy to fix
Inferential Statistics: making claims about a wider population using sample data
- We use common tools and techniques to deal with randomness
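
Sampling variation can be sketched with a small simulation. The population below is hypothetical (normally distributed with an assumed mean of 50 and standard deviation of 10); the point is only that two samples from the same population give slightly different answers:

```r
set.seed(2019) # arbitrary seed so the simulation is reproducible

# Hypothetical population of 100,000 values
population <- rnorm(100000, mean = 50, sd = 10)

# Two separate random samples of 100 from the same population
sample_1 <- sample(population, size = 100)
sample_2 <- sample(population, size = 100)

mean_pop <- mean(population)
mean_1   <- mean(sample_1)
mean_2   <- mean(sample_2)

# The sample means differ from each other and from the population mean
print(c(population = mean_pop, sample_1 = mean_1, sample_2 = mean_2))
```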

The Two Problems: Where We're Heading...Ultimately

We want to identify causal relationships between population variables
- Logically first thing to consider
- Endogeneity problem
We'll use sample statistics to infer something about population parameters
- In practice, we'll only ever have a finite sample distribution of data
- We don't know the population distribution of data
- Randomness problem

Data 101

Data 101 I

Data are information with context
Individuals are the entities described by a set of data
- e.g. persons, households, firms, countries

Data 101 II

Variables are particular characteristics about an individual
- e.g. age, income, profits, population, GDP, marital status, type of legal institutions
Observations or cases are the separate individuals described by a collection of variables
- e.g. for one individual, we have their age, sex, income, education, etc.
Individuals and observations are not necessarily the same:
- e.g. we can have separate observations on the same individual over time
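
As a rough illustration of these terms, here is a small data frame with made-up values (the `id`, `year`, `age`, and `income` columns are hypothetical). Each row is an observation, each column is a variable, and the same individual appears in more than one row:

```r
# Hypothetical data: 2 individuals, each observed in 2 different years
people <- data.frame(
  id     = c(1, 1, 2, 2),                # identifies the individual
  year   = c(2018, 2019, 2018, 2019),    # when the observation was taken
  age    = c(34, 35, 51, 52),
  income = c(42000, 45000, 60000, 61500)
)

n_observations <- nrow(people)              # rows = observations
n_individuals  <- length(unique(people$id)) # distinct individuals
n_columns      <- ncol(people)              # columns = variables (including id)

print(people)
print(c(observations = n_observations, individuals = n_individuals))
```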

Categorical Data

Categorical data place an individual into one of several possible categories
- e.g. sex, season, political party
- may be responses to survey questions
- can be coded as numbers and still be categorical (e.g. zip code, age group)
R calls these factors (we'll deal with them much later in the course)
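
A brief sketch of what a factor looks like in base R, using made-up values (the seasons and zip codes below are hypothetical examples):

```r
# A categorical variable stored as a factor
season <- factor(c("Fall", "Winter", "Fall", "Spring"))

levels(season) # the possible categories, sorted alphabetically by default
table(season)  # how many observations fall in each category

# Zip codes look numeric but are labels, so a factor is a better fit
zip <- factor(c("21701", "21702", "21701"))
nlevels(zip)   # number of distinct zip codes
```

Storing zip codes as a factor rather than a number keeps R from treating them as quantities to average or add.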

Categorical Data: Visualizing I

Summary of diamonds by cut
cut	n	frequency
Fair	1610	0.0298480
Good	4906	0.0909529
Very Good	12082	0.2239896
Premium	13791	0.2556730
Ideal	21551	0.3995365

A good way to represent categorical data is with a frequency table
Count (n): total number of individuals in a category
Frequency: proportion of a category relative to all data
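
The frequency column can be reproduced from the counts alone. This sketch types in the counts from the table above (rather than loading the `diamonds` data) and divides each by the total:

```r
# Counts copied from the frequency table above
counts <- c(Fair = 1610, Good = 4906, `Very Good` = 12082,
            Premium = 13791, Ideal = 21551)

total <- sum(counts) # total number of diamonds

freq_table <- data.frame(
  cut       = names(counts),
  n         = as.vector(counts),
  frequency = as.vector(counts) / total # proportion of all diamonds
)

print(freq_table)
```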

Categorical Data: Visualizing II

Charts and graphs are usually better ways to visualize data than tables
A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category

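library(ggplot2) # provides ggplot() and the diamonds data

ggplot(diamonds, aes(x = cut,
                     fill = cut))+
  geom_bar()+ # bar height = number of diamonds in each cut
  guides(fill = "none")+ # drop the redundant legend
  theme_bw(base_family = "Fira Sans Condensed",
           base_size = 20)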

Categorical Data: Visualizing III

Avoid pie charts!
People are not good at judging 2-d differences (angles, area)
People are good at judging 1-d differences (length)

Categorical Data: Visualizing IV

Maybe a stacked bar chart

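library(tidyverse) # dplyr verbs, the %>% pipe, and ggplot2

diamonds %>%
  count(cut) %>% # one row per cut, with its count n
  ggplot(aes(x = "", y = n, fill = cut))+
  geom_col()+ # a single bar, stacked by cut
  # vjust = 0.5 centers each label inside its segment
  geom_label(aes(label = cut), position = position_stack(vjust = 0.5), color = "white")+
  guides(fill = "none")+
  theme_void()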

Categorical Data: Visualizing IV

Maybe lollipop chart

library(tidyverse) # dplyr verbs, the %>% pipe, and ggplot2

diamonds %>%
  count(cut) %>%
  mutate(cut_name = as.factor(cut)) %>%
  ggplot(aes(x = cut_name, y = n, color = cut))+
  # stick: from 0 to the count, drawn first so the point sits on top
  geom_segment(aes(x = cut_name, y = 0,
                   xend = cut_name,
                   yend = n), linewidth = 2)+ # use size = 2 on ggplot2 < 3.4.0
  # head of the lollipop
  geom_point(size = 12)+
  geom_text(aes(label = n), color = "white", size = 3)+
  coord_flip()+
  labs(x = "Cut")+
  theme_classic(base_family = "Fira Sans Condensed",
                base_size = 20)+
  guides(color = "none")

Categorical Data: Visualizing IV

Maybe a treemap

library(tidyverse) # dplyr verbs, the %>% pipe, and ggplot2
library(treemapify)

diamonds %>%
  count(cut) %>%
  ggplot(aes(area = n, fill = cut)) +
  geom_treemap() + # tile area proportional to the count
  guides(fill = "none") +
  geom_treemap_text(aes(label = cut),
                    colour = "white",
                    place = "center",
                    grow = TRUE)

Quantitative Data I

Quantitative variables take on numerical values of equal units that describe an individual
- Units: points, dollars, inches
- Context: GPA, prices, height
We can mathematically manipulate only quantitative data
- e.g. sum, average, standard deviation
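
A quick sketch of the contrast, using made-up heights and party labels (all values are hypothetical). The arithmetic works on the quantitative variable, while R refuses to average the categorical one:

```r
heights_in <- c(64, 70, 68, 72, 66) # hypothetical heights, in inches

total_height <- sum(heights_in)  # 340
mean_height  <- mean(heights_in) # 68
sd_height    <- sd(heights_in)   # sample standard deviation

party <- factor(c("D", "R", "I", "R")) # hypothetical categorical variable

# mean() of a factor returns NA (with a warning, suppressed here)
mean_party <- suppressWarnings(mean(party))

print(c(sum = total_height, mean = mean_height, sd = sd_height))
print(mean_party)
```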


Context is Key!

How variables are classified depends on the purpose of collecting and using the data

Quick Check: What kind of data (categorical or quantitative) does each variable describe?

Age measured in years
Age measured in ranges (0-9 years, 10-19 years, 20-29 years, etc)
The date a purchase was made
Transaction ID
The amount of money spent on a Super Bowl ad
Customer ratings
The number of correct answers on an exam
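
One way to see how the same information can land in either category is to convert age in years into age ranges. This is a sketch with made-up ages; the break points simply mirror the ranges listed above:

```r
age_years <- c(4, 15, 23, 9, 20) # hypothetical ages, in years (quantitative)

# Bin into ranges: [0,10), [10,20), [20,30) (categorical)
age_range <- cut(age_years,
                 breaks = c(0, 10, 20, 30),
                 right  = FALSE, # intervals closed on the left
                 labels = c("0-9", "10-19", "20-29"))

mean(age_years)  # averaging makes sense for years: 14.2
table(age_range) # counting makes sense for ranges
```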

Discrete Data

Discrete data take a countable number of distinct values
Categorical: e.g. letter grades A, B, C, D, F
Quantitative: integers, e.g. SAT Score, number of children

Continuous Data

Continuous data are infinitely divisible, with an uncountable number of alternatives
- e.g. weights, temperature, GPA
Many discrete variables may be treated as if they are continuous
- e.g. SAT scores, wages
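
A small sketch of the last point, with made-up values: both variables below are discrete (whole numbers), but it is natural to tabulate one and to average the other as if it were continuous:

```r
n_children <- c(0L, 2L, 1L, 3L, 2L)          # hypothetical counts (discrete)
sat        <- c(1100L, 1250L, 1390L, 1480L)  # hypothetical SAT scores (discrete)

# With few distinct values, a table of counts is informative
children_counts <- table(n_children)
print(children_counts)

# With many possible values, we often treat the variable as continuous
mean_sat <- mean(sat) # 1305
print(mean_sat)
```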