Problem for identification: endogeneity
Problem for inference: randomness
An independent variable (X) is exogenous if its variation is unrelated to other factors that affect the dependent variable (Y)
An independent variable (X) is endogenous if its variation is related to other factors that affect the dependent variable (Y)
Data is random due to natural sampling variation
Common in statistics, easy to fix
Inferential Statistics: making claims about a wider population using sample data
$$\text{Sample} \underbrace{\rightarrow}_{\text{statistical inference}} \text{Population} \underbrace{\rightarrow}_{\text{causal identification}} \text{Unobserved Parameters}$$
We want to identify causal relationships between population variables
We'll use sample statistics to infer something about population parameters
Data are information with context
Individuals are the entities described by a set of data
Variables are particular characteristics about an individual
Observations or cases are the separate individuals described by a collection of variables
Individuals and observations are not necessarily the same: one individual observed in several years contributes several observations
Categorical data place an individual into one of several possible categories
R calls these factors (we'll deal with them much later in the course)
cut | n | frequency |
---|---|---|
Fair | 1610 | 0.0298480 |
Good | 4906 | 0.0909529 |
Very Good | 12082 | 0.2239896 |
Premium | 13791 | 0.2556730 |
Ideal | 21551 | 0.3995365 |
A good way to represent categorical data is with a frequency table
Count (n): total number of individuals in a category
Frequency: proportion of a category relative to all data
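The frequency table above can be reproduced in a few lines. This is a minimal sketch, assuming the dplyr and ggplot2 packages are installed (the diamonds data ships with ggplot2); the name cut_table is only for illustration.

```r
library(dplyr)    # count(), mutate()
library(ggplot2)  # provides the diamonds dataset

cut_table <- diamonds %>%
  count(cut) %>%                   # n: number of diamonds in each category
  mutate(frequency = n / sum(n))   # proportion of each category out of all data

cut_table
```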
Charts and graphs are usually better ways to visualize data
A bar graph represents categories as bars, with lengths proportional to the count or relative frequency of each category
ggplot(diamonds, aes(x = cut, fill = cut)) +
  geom_bar() +
  guides(fill = "none") + # hide the redundant legend
  theme_bw(base_family = "Fira Sans Condensed", base_size = 20)
Avoid pie charts!
People are not good at judging 2-d differences (angles, area)
People are good at judging 1-d differences (length)
diamonds %>%
  count(cut) %>%
  ggplot(., aes(x = "", y = n, fill = cut)) +
  geom_col() +
  geom_label(aes(x = "", y = n, label = cut), position = position_stack(), color = "white") +
  guides(fill = "none") +
  theme_void()
diamonds %>%
  count(cut) %>%
  mutate(cut_name = as.factor(cut)) %>%
  ggplot(., aes(x = cut_name, y = n, color = cut)) +
  geom_point(stat = "identity", fill = "black", size = 12) +
  geom_segment(aes(x = cut_name, y = 0, xend = cut_name, yend = n), size = 2) +
  geom_text(aes(label = n), color = "white", size = 3) +
  coord_flip() +
  labs(x = "Cut") +
  theme_classic(base_family = "Fira Sans Condensed", base_size = 20) +
  guides(color = "none")
library(treemapify)
diamonds %>%
  count(cut) %>%
  ggplot(., aes(area = n, fill = cut)) +
  geom_treemap() +
  guides(fill = "none") +
  geom_treemap_text(aes(label = cut), colour = "white", place = "center", grow = TRUE)
Quantitative variables take on numerical values of equal units that describe an individual
We can mathematically manipulate only quantitative data
Quick Check: What kind of data (categorical or quantitative) does each variable describe?
Age measured in years
Age measured in ranges (0-9 years, 10-19 years, 20-29 years, etc)
The date a purchase was made
Transaction ID
The amount of money spent on a Super Bowl ad
Customer ratings
The number of correct answers on an exam
Discrete data take on a countable (often finite) number of distinct values
Categorical: e.g. letter grades A, B, C, D, F
Quantitative: integers, e.g. SAT Score, number of children
Continuous data are infinitely divisible, with an uncountable number of alternatives
Many discrete variables may be treated as if they are continuous
Quick Check: What kind of data (discrete or continuous) does each variable describe?
Weight in pounds
Price in dollars
Grade (Letter)
Grade (Percentage)
Temperature
Amazon Star Rating
Number of customers
ID | Name | Age | Sex | Income |
---|---|---|---|---|
1 | John | 23 | Male | 41000 |
2 | Emile | 18 | Male | 52600 |
3 | Natalya | 28 | Female | 48000 |
4 | Lakisha | 31 | Female | 60200 |
5 | Cheng | 36 | Male | 81900 |
The most common data structure we use is a spreadsheet
In R: a data.frame or tibble
A row contains data about all variables for a single individual
A column contains data about a single variable across all individuals
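To make rows and columns concrete, here is a small sketch that rebuilds the example table as a tibble. It assumes the dplyr package is installed; the object name people is hypothetical and chosen only for illustration.

```r
library(dplyr)  # tibble(), slice(), pull()

people <- tibble(
  ID     = 1:5,
  Name   = c("John", "Emile", "Natalya", "Lakisha", "Cheng"),
  Age    = c(23, 18, 28, 31, 36),
  Sex    = c("Male", "Male", "Female", "Female", "Male"),
  Income = c(41000, 52600, 48000, 60200, 81900)
)

people %>% slice(3)      # one row: all variables for a single individual
people %>% pull(Income)  # one column: a single variable across all individuals
```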
It is common to use some notation like the following:
Let $\{x_1, x_2, \cdots, x_n\}$ be a simple data series on variable $X$
Quick Check: Let x represent the score on a homework assignment: 75,100,92,87,79,0,95
ID | Name | Age | Sex | Income |
---|---|---|---|---|
1 | John | 23 | Male | 41000 |
2 | Emile | 18 | Male | 52600 |
3 | Natalya | 28 | Female | 48000 |
4 | Lakisha | 31 | Female | 60200 |
5 | Cheng | 36 | Male | 81900 |
Cross-sectional data: observations of individuals at a given point in time
Each observation is a unique individual
Simplest and most common data
A "snapshot" to compare differences across individuals
Year | GDP | Unemployment | CPI |
---|---|---|---|
1950 | 8.2 | 0.06 | 100 |
1960 | 9.9 | 0.04 | 118 |
1970 | 10.2 | 0.08 | 130 |
1980 | 12.4 | 0.08 | 190 |
1985 | 13.6 | 0.06 | 196 |
Time-series data: observations of the same individuals over time
Each observation is an individual-year
Often used for macroeconomics, finance, and forecasting
Unique challenges for time series
A "moving picture" to see how individuals change over time
City | Year | Murders | Population | Unemployment |
---|---|---|---|---|
Philadelphia | 1986 | 5 | 3.700 | 8.7 |
Philadelphia | 1990 | 8 | 4.200 | 7.2 |
D.C. | 1986 | 2 | 0.250 | 5.4 |
D.C. | 1990 | 10 | 0.275 | 5.5 |
New York | 1986 | 3 | 6.400 | 9.6 |
Panel, or longitudinal dataset: a time-series for each cross-sectional entity
Must be the same cross-sectional entities over time
More common today for serious researchers
Unique challenges for panel data
A combination of "snapshot" comparisons and differences over time
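The difference between individuals and observations shows up clearly in panel data. The sketch below re-enters only the rows shown in the table above (the object name crime is hypothetical) and counts both, assuming the dplyr package is installed.

```r
library(dplyr)  # tibble(), n_distinct(), count()

crime <- tibble(
  City         = c("Philadelphia", "Philadelphia", "D.C.", "D.C.", "New York"),
  Year         = c(1986, 1990, 1986, 1990, 1986),
  Murders      = c(5, 8, 2, 10, 3),
  Population   = c(3.700, 4.200, 0.250, 0.275, 6.400),
  Unemployment = c(8.7, 7.2, 5.4, 5.5, 9.6)
)

nrow(crime)             # number of observations (city-years)
n_distinct(crime$City)  # number of individuals (cities)
crime %>% count(City)   # how many years each city appears in these rows
```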
Variables take on different values; we can describe a variable's distribution of these values
We want to visualize and analyze distributions to search for meaningful patterns using statistics
Descriptive Statistics: describes or summarizes the properties of a sample
Inferential Statistics: infers properties about a larger population from the properties of a sample¹
¹ We'll encounter inferential statistics mainly in the context of regression later.
A common way to present a quantitative variable's distribution is a histogram
Divide up values into bins of a certain size, and count the number of values falling within each bin, representing them visually as bars
Example: a class of 13 students takes a quiz (out of 100 points) with the following results:
{0,62,66,71,71,74,76,79,83,86,88,93,95}
quizzes<-tibble(scores = c(0,62,66,71,71,74,76,79,83,86,88,93,95))
h <- ggplot(quizzes, aes(x = scores)) +
  geom_histogram(breaks = seq(0, 100, 10), color = "black", fill = "#56B4E9") +
  scale_x_continuous(breaks = seq(0, 100, 10)) +
  labs(x = "Scores", y = "Number of Students") +
  theme_bw(base_family = "Fira Sans Condensed", base_size = 20)
h
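The binning step behind a histogram can also be done by hand. This is a sketch, assuming the dplyr package is installed and bins of width 10 that are closed on the right (an assumption about how the bin edges are treated); the name bin_counts is only for illustration.

```r
library(dplyr)  # tibble(), mutate(), count()

quizzes <- tibble(scores = c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95))

bin_counts <- quizzes %>%
  # assign each score to a bin of width 10; include.lowest keeps the score of 0
  mutate(bin = cut(scores, breaks = seq(0, 100, 10), include.lowest = TRUE)) %>%
  count(bin)  # number of scores falling in each non-empty bin

bin_counts
```

Each row of bin_counts corresponds to the height of one bar.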
The mode of a variable is simply its most frequent value
A variable can have multiple modes
Surprisingly, R has no dedicated function for the statistical mode (base R's mode() reports an object's storage type instead)
A workaround in dplyr:
quizzes %>% count(scores) %>% arrange(desc(n))
## # A tibble: 12 x 2
##    scores     n
##     <dbl> <int>
##  1     71     2
##  2      0     1
##  3     62     1
##  4     66     1
##  5     74     1
##  6     76     1
##  7     79     1
##  8     83     1
##  9     86     1
## 10     88     1
## 11     93     1
## 12     95     1
Looking at a histogram, the modes are the "peaks" of the distribution
May be unimodal, bimodal, trimodal, etc
tibble(scores=c(0,33,33,33,33,35,62,66,71,71,74,76,79,83,86,88,93,95)) %>% count(scores) %>% arrange(desc(n))
## # A tibble: 14 x 2
##    scores     n
##     <dbl> <int>
##  1     33     4
##  2     71     2
##  3      0     1
##  4     35     1
##  5     62     1
##  6     66     1
##  7     74     1
##  8     76     1
##  9     79     1
## 10     83     1
## 11     86     1
## 12     88     1
## 13     93     1
## 14     95     1
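If you need the mode often, a small helper is one option. The function below is a hypothetical sketch (stat_modes is not part of base R or dplyr) that returns every value tied for the highest count, so it also handles multimodal data. It assumes numeric input.

```r
# Hypothetical helper: returns all values that share the highest count
stat_modes <- function(x) {
  counts <- table(x)                                # tally each distinct value
  as.numeric(names(counts)[counts == max(counts)])  # keep every value tied for the top count
}

stat_modes(c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95))  # the quiz scores
stat_modes(c(1, 1, 2, 2, 3))                                      # two modes
```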
A distribution is symmetric if it looks roughly the same on either side of the "center"
The thinner ends (far left and far right) are called the tails of a distribution
Outlier: extreme value that does not appear to be part of the general pattern of a distribution
Can strongly affect descriptive statistics
Might be the most informative part of the data
Could be the result of errors
Should always be explored and discussed!
$$\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{1}{N}\sum_{i=1}^{N} x_i$$
For $N$ values of variable $x$, "mu" is the sum of all individual $x$ values ($x_i$) from 1 to $N$, divided by $N$, the number of values¹
See today's class notes for more about the summation operator, $\Sigma$; it'll come up again!
¹ Note the mean need not be an actual value of the data!
$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$
Example:
{0,62,66,71,71,74,76,79,83,86,88,93,95}
$$\begin{aligned}
\bar{x} &= \frac{1}{13}(0+62+66+71+71+74+76+79+83+86+88+93+95) \\
\bar{x} &= \frac{944}{13} \\
\bar{x} &= 72.62
\end{aligned}$$
quizzes %>% summarize(mean=mean(scores))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  72.6
Example:
{62,66,71,71,74,76,79,83,86,88,93,95}
$$\bar{x} = \frac{1}{12}(62+66+71+71+74+76+79+83+86+88+93+95) = \frac{944}{12} = 78.67$$
quizzes %>% filter(scores>0) %>% summarize(mean=mean(scores))
## # A tibble: 1 x 1
##    mean
##   <dbl>
## 1  78.7
{0,62,66,71,71,74,76,79,83,86,88,93,95}
The median is the midpoint of the distribution
Arrange values in numerical order
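The two steps above can be sketched by hand in base R. With an odd number of values the median is the single middle value; with an even number, the standard convention (also used by R's median()) is to average the two middle values. The variable names here are only for illustration.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

sorted <- sort(scores)  # arrange values in numerical order
n <- length(sorted)

middle <- if (n %% 2 == 1) {
  sorted[(n + 1) / 2]                # odd n: the single middle value
} else {
  mean(sorted[c(n / 2, n / 2 + 1)])  # even n: average of the two middle values
}

middle          # 76, the 7th of 13 sorted values
median(scores)  # the built-in function gives the same answer
```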
symmetric %>% summarize(mean = mean(x), median = median(x))
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1     4      4
leftskew %>% summarize(mean = mean(x), median = median(x))
##       mean median
## 1 4.615385      5
rightskew %>% summarize(mean = mean(x), median = median(x))
## # A tibble: 1 x 2
##    mean median
##   <dbl>  <dbl>
## 1  3.38      3
A common set of summary statistics about a distribution is known as the "five number summary": the minimum, first quartile (Q1), median, third quartile (Q3), and maximum
# Base R summary command (includes Mean)
summary(quizzes$scores)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00   71.00   76.00   72.62   86.00   95.00
quizzes %>% # dplyr
  summarize(Min = min(scores),
            Q1 = quantile(scores, 0.25),
            Median = median(scores),
            Q3 = quantile(scores, 0.75),
            Max = max(scores))
## # A tibble: 1 x 5
##     Min    Q1 Median    Q3   Max
##   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0    71     76    86    95
quizzes %>% summarize("37th percentile" = quantile(scores, 0.37))
## # A tibble: 1 x 1
##   `37th percentile`
##               <dbl>
## 1              72.3
Boxplots are a great way to visualize the 5 number summary
Height of box: Q1 to Q3 (the interquartile range, IQR: the middle 50% of data)
Line inside box: median (50th percentile)
"Whiskers" extend to the most extreme data points within 1.5×IQR of the box (below Q1 or above Q3)
Points beyond whiskers are outliers
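The 1.5×IQR rule can be checked by hand. The sketch below uses base R's default quantile() method on the quiz scores; plotting functions may compute the quartiles slightly differently, so treat the cutoffs as illustrative. The names lower_fence and upper_fence are only for illustration.

```r
scores <- c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95)

q1  <- quantile(scores, 0.25, names = FALSE)  # first quartile
q3  <- quantile(scores, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                                # interquartile range

lower_fence <- q1 - 1.5 * iqr  # whiskers cannot reach below this
upper_fence <- q3 + 1.5 * iqr  # whiskers cannot reach above this

# points outside the fences are drawn as outliers
outliers <- scores[scores < lower_fence | scores > upper_fence]
outliers
```

Here Q1 is 71 and Q3 is 86, so the IQR is 15 and the fences sit at 48.5 and 108.5; only the score of 0 falls outside them.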
Example:
Quiz 1: {0,62,66,71,71,74,76,79,83,86,88,93,95}
Quiz 2: {50,62,72,73,79,81,82,82,86,90,94,98,99}
quizzes_new %>% summary()
##     student       quiz_1          quiz_2
##  Min.   : 1   Min.   : 0.00   Min.   :50.00
##  1st Qu.: 4   1st Qu.:71.00   1st Qu.:73.00
##  Median : 7   Median :76.00   Median :82.00
##  Mean   : 7   Mean   :72.62   Mean   :80.62
##  3rd Qu.:10   3rd Qu.:86.00   3rd Qu.:90.00
##  Max.   :13   Max.   :95.00   Max.   :99.00
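The construction of quizzes_new is not shown in these slides. Below is one way it might be built, followed by side-by-side boxplots of the two quizzes. This is a sketch under assumptions: the column names are taken from the summary output, but pairing the sorted scores with student numbers 1 to 13 is a guess. It assumes dplyr, tidyr, and ggplot2 are installed.

```r
library(dplyr)    # tibble()
library(tidyr)    # pivot_longer()
library(ggplot2)  # ggplot(), geom_boxplot()

# ASSUMPTION: which student earned which score is not given in the source
quizzes_new <- tibble(
  student = 1:13,
  quiz_1  = c(0, 62, 66, 71, 71, 74, 76, 79, 83, 86, 88, 93, 95),
  quiz_2  = c(50, 62, 72, 73, 79, 81, 82, 82, 86, 90, 94, 98, 99)
)

quiz_boxplot <- quizzes_new %>%
  # reshape to one row per student-quiz so both quizzes share one score column
  pivot_longer(cols = c(quiz_1, quiz_2), names_to = "quiz", values_to = "score") %>%
  ggplot(aes(x = quiz, y = score)) +
  geom_boxplot()

quiz_boxplot
```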
I don't like the options available for printing out summary statistics
So I wrote my own R function that makes nice summary tables using dplyr and tidyr
One day I will release it as a package; until then, the .R file is saved here
# load functions from the .R file; make sure it's in YOUR working directory/Project!
source("../files/summaries.R") # MY path (this website's R Project)
# let's summarize variables "hwy" and "cty" from the *mpg* dataset
mpg %>% summary_table(hwy, cty)
## # A tibble: 2 x 9
##   Variable   Obs   Min    Q1 Median    Q3   Max  Mean `Std. Dev.`
##   <chr>    <dbl> <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>       <dbl>
## 1 cty        234     9    14     17    19    35  16.9        4.26
## 2 hwy        234    12    18     24    27    44  23.4        5.95
When knitted in R markdown:
mpg %>% summary_table(hwy, cty) %>% knitr::kable(., format="html")
Variable | Obs | Min | Q1 | Median | Q3 | Max | Mean | Std. Dev. |
---|---|---|---|---|---|---|---|---|
cty | 234 | 9 | 14 | 17 | 19 | 35 | 16.86 | 4.26 |
hwy | 234 | 12 | 18 | 24 | 27 | 44 | 23.44 | 5.95 |
More on markdown and making final products nicer when we discuss your paper project (have you forgotten?)
Every observation $i$ deviates from the mean of the data:
$$\text{deviation}_i = x_i - \mu$$
There are as many deviations as there are data points (n)
We can measure the average or standard deviation of a variable from its mean
Before we get there...
$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$$
Why do we square deviations?
What are these units?
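One common answer to the first question is that raw deviations always cancel out: summed, they equal zero, so their plain average says nothing about spread. Squaring makes every term non-negative, at the cost of putting the result in squared units of the variable. A small base R sketch, using an illustrative series:

```r
x <- c(2, 4, 6, 8, 10)

deviations <- x - mean(x)  # -4, -2, 0, 2, 4

sum(deviations)    # 0: positive and negative deviations cancel
sum(deviations^2)  # 40: squared deviations cannot cancel
```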
$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2}$$
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$$
$$s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$$
Example: Calculate the sample standard deviation for the following series:
{2,4,6,8,10}
sd(c(2,4,6,8,10))
## [1] 3.162278
# first let's save our data in a tibble
sd_example <- tibble(x = c(2, 4, 6, 8, 10))
# first find the mean (just so we know)
sd_example %>% summarize(mean(x))
## # A tibble: 1 x 1
##   `mean(x)`
##       <dbl>
## 1         6
# now let's make some more columns:
sd_example <- sd_example %>%
  mutate(deviations = x - mean(x),        # take deviations from mean
         deviations_sq = deviations^2)    # square them
sd_example # see what we made
## # A tibble: 5 x 3
##       x deviations deviations_sq
##   <dbl>      <dbl>         <dbl>
## 1     2         -4            16
## 2     4         -2             4
## 3     6          0             0
## 4     8          2             4
## 5    10          4            16
sd_example %>%
  # sum the squared deviations
  summarize(sum_sq_devs = sum(deviations_sq),
            # divide by n-1 to get variance
            variance = sum_sq_devs / (n() - 1),
            # square root to get sd
            std_dev = sqrt(variance))
## # A tibble: 1 x 3
##   sum_sq_devs variance std_dev
##         <dbl>    <dbl>   <dbl>
## 1          40       10    3.16
You Try: Calculate the sample standard deviation for the following series:
{1,3,5,7}
sd(c(1,3,5,7))
## [1] 2.581989
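Population size: $N$
Mean: $\mu$
Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N}(x_i - \mu)^2$
Standard deviation: $\sigma = \sqrt{\sigma^2}$
Sample size: $n$
Mean: $\bar{x}$
Variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$
Standard deviation: $s = \sqrt{s^2}$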
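The only computational difference between the two variance formulas is the divisor. R's built-in var() and sd() use the sample version (dividing by n − 1). The sketch below computes both for an illustrative series; the names sample_var and pop_var are only for illustration.

```r
x <- c(2, 4, 6, 8, 10)
n <- length(x)

sample_var <- var(x)                 # sample variance: divides by n - 1
pop_var <- sum((x - mean(x))^2) / n  # population formula: divides by N, treating x as the whole population

sample_var  # 10
pop_var     # 8
```

The gap between the two shrinks as the number of values grows.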
Problem for identification: endogeneity
Problem for inference: randomness