The Popularity of Baby Names

Question 1

Part A
Part B

Question 2
Question 3
Question 4
Question 5

Part A
Part B

Question 6

Part A
Part B

Question 7

Political and Economic Freedom Around the World

Question 8
Question 9
Question 10
Question 11
Question 12
Question 13

Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.

The Popularity of Baby Names

Install and load the package babynames. Run ?babynames to see what the data include.

library(tidyverse)

## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0

## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

# install for first use
# install.packages("babynames")

# load package 
library(babynames)

# explore help
# ?babynames

Question 1

Part A

What are the top 5 boys names for 2017, and what percent of overall names is each?

# save as a new tibble
top_5_boys_2017 <- babynames %>% # take data
  filter(sex=="M", # filter by males
         year==2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_boys_2017

year <dbl>	sex <chr>	name <chr>	n <int>	prop <dbl>	percent <dbl>
2017	M	Liam	18728	0.00953909	0.95
2017	M	Noah	18326	0.00933433	0.93
2017	M	William	14904	0.00759134	0.76
2017	M	James	14232	0.00724906	0.72
2017	M	Logan	13974	0.00711764	0.71

The top 5 boys’ names in 2017 are:

Liam (0.95%)
Noah (0.93%)
William (0.76%)
James (0.72%)
Logan (0.71%)

Part B

What are the top 5 girls names for 2017, and what percent of overall names is each?

# save as a new tibble
top_5_girls_2017 <- babynames %>% # take data
  filter(sex=="F", # filter by females
         year==2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals

# look at our new tibble
top_5_girls_2017

year <dbl>	sex <chr>	name <chr>	n <int>	prop <dbl>	percent <dbl>
2017	F	Emma	19738	0.01052750	1.05
2017	F	Olivia	18632	0.00993760	0.99
2017	F	Ava	15902	0.00848152	0.85
2017	F	Isabella	15100	0.00805377	0.81
2017	F	Sophia	14831	0.00791029	0.79

The top 5 girls’ names in 2017 are:

Emma (1.05%)
Olivia (0.99%)
Ava (0.85%)
Isabella (0.81%)
Sophia (0.79%)

Question 2

Make two bar plots of these top 5 names, one for each sex. Map aesthetics x to name and y to prop,¹ and use geom_col() (since you are declaring a specific y; otherwise you could use geom_bar() with only an x).

ggplot(data = top_5_boys_2017)+
  aes(x = reorder(name, n), #note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  
  # now I'm just making it pretty
  scale_y_continuous(labels=function(x)paste(x,"%",sep=""))+ # optional, add percent signs
  labs(x = "Name",
       y = "Percent of All Boys With Name",
       title = "Most Popular Boys Names in 2017",
       fill = "Boy's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # rotate axes!
  theme(legend.position = "none") # hide legend

ggplot(data = top_5_girls_2017)+
  aes(x = reorder(name, n), #note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # now I'm just making it pretty
  scale_y_continuous(labels=function(x)paste(x,"%",sep=""))+ # optional, add percent signs
  labs(x = "Name",
       y = "Percent of All Girls With Name",
       title = "Most Popular Girls Names in 2017",
       fill = "Girl's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # rotate axes!
  theme(legend.position = "none") # hide legend
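
If reorder() feels like magic, here is a small sketch of what it does to the name variable. The names and counts below are made-up toy values, not taken from babynames. A plain factor sorts its levels alphabetically, while reorder() sorts them by the second argument, smallest first. After coord_flip(), that puts the largest bar at the top.

```r
# Hypothetical toy values, not taken from babynames
name <- c("Ava", "Emma", "Olivia")
n    <- c(15, 20, 18)

# A plain factor orders its levels alphabetically
levels(factor(name))

# reorder() orders the levels by n instead, smallest first
name_by_n <- reorder(name, n)
levels(name_by_n)
```

If you would rather have the largest first, reorder(name, -n) should flip the order.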

Question 3

Find your name.² Count by sex how many babies since 1880 have had your name.³ Also add a variable for the percent of each sex.

babynames %>%
  filter(name == "Ryan") %>%
  count(sex, wt=n) %>%
  mutate(percent = round((n/sum(n)*100),2))

sex <chr>	n <int>	percent <dbl>
F	22910	2.42
M	924877	97.58
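
To see why wt = n matters, here is a minimal sketch on a made-up tibble (toy is hypothetical, shaped like babynames but with invented counts). Without wt, count() tallies rows, which here means years; with wt = n, it adds up the babies in those rows.

```r
library(dplyr)

# Hypothetical toy data in the same shape as babynames
toy <- tibble(
  year = c(2000, 2001, 2002, 2000, 2001),
  sex  = c("M", "M", "M", "F", "F"),
  name = "Ryan",
  n    = c(100, 120, 110, 5, 7)
)

# counts rows per sex (how many years appear)
rows_per_sex <- toy %>% count(sex)

# sums n per sex (how many babies there were)
babies_per_sex <- toy %>% count(sex, wt = n)

rows_per_sex
babies_per_sex
```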

Question 4

Make a line graph of the number of babies with your name over time, colored by sex.

# note here I'm going to wrangle the data and then pipe it directly into ggplot
# you can wrangle the data and save it as a different tibble, then use THAT tibble
# for your (data = ...) command in ggplot

# first wrangle data
babynames %>%
  filter(name == "Ryan") %>%

  # now we pipe into ggplot
  ggplot(data = .)+ # the "." is a placeholder for the stuff above!
  aes(x = year,
      y = n,
      color = sex)+
  geom_line(size=1)+
  labs(x = "Year",
       y = "Number of Babies",
       title = "Popularity of Babies Named 'Ryan'",
       color = "Sex",
       caption = "Source: SSA")+
    theme_classic(base_family = "Fira Sans Condensed", base_size=16)

Question 5

Part A

Make a table of the most common name for boys by year between 1980-2017.⁴

babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "M",
         year>1979) %>% # or year >= 1980
  arrange(desc(n))%>% # start with largest n first
  slice(1) # take first row only

year <dbl>	sex <chr>	name <chr>	n <int>	prop <dbl>
1980	M	Michael	68693	0.03703079
1981	M	Michael	68765	0.03692247
1982	M	Michael	68228	0.03615445
1983	M	Michael	67995	0.03649110
1984	M	Michael	67736	0.03610228
1985	M	Michael	64906	0.03373805
1986	M	Michael	64205	0.03342343
1987	M	Michael	63647	0.03264834
1988	M	Michael	64133	0.03204521
1989	M	Michael	65382	0.03120182
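
The key to this answer is that slice() respects groups: after group_by(year), slice(1) keeps the first row within each year rather than the first row of the whole table. A minimal sketch on made-up data (the names and counts in toy are hypothetical):

```r
library(dplyr)

# Hypothetical toy data: two names in each of two years
toy <- tibble(
  year = c(1980, 1980, 1981, 1981),
  sex  = "M",
  name = c("Michael", "Jason", "Michael", "Jason"),
  n    = c(300, 250, 310, 320)
)

# grouped: one row per year, the name with the largest n in that year
top_by_year <- toy %>%
  group_by(year) %>%
  arrange(desc(n)) %>% # sorts the rows; the grouping is kept
  slice(1)             # first row within each year

# ungrouped: just one row, the largest n overall
top_overall <- toy %>%
  arrange(desc(n)) %>%
  slice(1)

top_by_year
top_overall
```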

Part B

Now do the same for girls.

babynames %>%
  group_by(year) %>% # we want one observation per year
  filter(sex == "F",
         year>1979) %>% # or year >= 1980
  arrange(desc(n))%>% # start with largest n first
  slice(1) # take first row only

year <dbl>	sex <chr>	name <chr>	n <int>	prop <dbl>
1980	F	Jennifer	58376	0.03278886
1981	F	Jennifer	57049	0.03190242
1982	F	Jennifer	57115	0.03148593
1983	F	Jennifer	54342	0.03036962
1984	F	Jennifer	50561	0.02804442
1985	F	Jessica	48346	0.02619098
1986	F	Jessica	52674	0.02854888
1987	F	Jessica	55991	0.02988050
1988	F	Jessica	51538	0.02680669
1989	F	Jessica	47885	0.02403998

Question 6

Now let’s graph the evolution of the most common names since 1880.

Part A

First, find the top 5 most popular names overall for boys and for girls. You may want to create two vectors, each with these top 5 names.

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex=="M") %>%
  summarize(total=sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5) 

name <chr>	total <int>
James	5150472
John	5115466
Robert	4814815
Michael	4350824
William	4102604

# make a vector of the names (we'll need this for our graph below)
top_boys_names<-c("James", "John", "Robert", "Michael", "William")

# you could alternatively add a command, 
# %>% pull(name) to the first chunk of code, 
# and it would do the same thing, but we'd want to save it, 
# for example:

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex=="M") %>%
  summarize(total=sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5) %>%
  pull(name)

## [1] "James"   "John"    "Robert"  "Michael" "William"
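
To actually save the result of pull(), assign the whole pipeline to an object. A minimal sketch on made-up data (toy and top_toy_names are hypothetical names, and the counts are invented):

```r
library(dplyr)

# Hypothetical toy data: three names over two years
toy <- tibble(
  year = c(1900, 1900, 1900, 1901, 1901, 1901),
  sex  = "M",
  name = c("James", "John", "Robert", "James", "John", "Robert"),
  n    = c(50, 60, 10, 70, 40, 20)
)

# the assignment at the top saves whatever comes out of the last step
top_toy_names <- toy %>%
  filter(sex == "M") %>%
  group_by(name) %>%
  summarize(total = sum(n)) %>%
  arrange(desc(total)) %>%
  slice(1:2) %>%
  pull(name) # returns a plain character vector, not a tibble

top_toy_names
```

The saved vector can then go straight into a filter, as in name %in% top_toy_names.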

babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex=="F") %>%
  summarize(total=sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5)

name <chr>	total <int>
Mary	4123200
Elizabeth	1629679
Patricia	1571692
Jennifer	1466281
Linda	1452249

# make a vector of the names (we'll need this for our graph below)
top_girls_names<-c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda")

Part B

Now make two line graphs of these 5 names over time, one for boys and one for girls.

babynames %>%
  group_by(year) %>%
  filter(sex == "M",
         name %in% top_boys_names) %>%
  ggplot(data = .,
         aes(x = year,
             y = prop,
             color = name))+
  geom_line()+
      labs(x = "Year",
         y = "Proportion of Babies with Name",
         title = "Most Popular Boys Names Since 1880",
         color = "Boy's Name",
         caption = "Source: SSA")+
    theme_classic(base_family = "Fira Sans Condensed", base_size=16)

babynames %>%
  group_by(year) %>%
  filter(sex == "F",
         name %in% top_girls_names) %>%
  ggplot(data = .,
         aes(x = year,
             y = prop,
             color = name))+
  geom_line()+
    labs(x = "Year",
         y = "Proportion of Babies with Name",
         title = "Most Popular Girls Names Since 1880",
         color = "Girl's Name",
         caption = "Source: SSA")+
    theme_classic(base_family = "Fira Sans Condensed", base_size=16)

Question 7

Bonus (hard!): What are the 10 most common “gender-neutral” names?⁵

There’s a lot to this, so I’ll break this up step by step and show you what happens at each major step.

We want to find the names where 48% to 52% of the babies with the name are male, as I defined in the footnote. First let’s mutate a variable to figure out how many babies with a particular name are male.

To do this, we’ll need to make two variables that count the number of males and females with each name each year. We’ll use the ifelse() function for each:

Make a male variable where, for each name in each year, if sex=="M", then count the number of males (n) that year, otherwise set it equal to 0.
Make a female variable where, for each name in each year, if sex=="F", then count the number of females (n) that year, otherwise set it equal to 0.

babynames %>%
  mutate(male = ifelse(sex == "M", n, 0),
         female = ifelse(sex == "F", n, 0))

year <dbl>	sex <chr>	name <chr>	n <int>	prop <dbl>	female <dbl>
1880	F	Mary	7065	0.07238359	7065
1880	F	Anna	2604	0.02667896	2604
1880	F	Emma	2003	0.02052149	2003
1880	F	Elizabeth	1939	0.01986579	1939
1880	F	Minnie	1746	0.01788843	1746
1880	F	Margaret	1578	0.01616720	1578
1880	F	Ida	1472	0.01508119	1472
1880	F	Alice	1414	0.01448696	1414
1880	F	Bertha	1320	0.01352390	1320
1880	F	Sarah	1288	0.01319605	1288

Now with these variables, we want to count the total number of males and females with each name over the entire dataset. Let’s first group_by(name) so we’ll get one row for every name. Then we will summarize() and take the sum of our male and of our female variables.

babynames %>%
  mutate(male = ifelse(sex == "M", n, 0),
         female = ifelse(sex == "F", n, 0)) %>%
  group_by(name) %>%
    summarize(Male = sum(male),
              Female = sum(female))

name <chr>	Male <dbl>	Female <dbl>
Aaban	107	0
Aabha	0	35
Aabid	10	0
Aabir	5	0
Aabriella	0	32
Aada	0	5
Aadam	254	0
Aadan	130	0
Aadarsh	199	0
Aaden	4653	5

Now we want to figure out what fraction of the babies with each name is male or female. It doesn’t matter which we use; I’ll do male. mutate() a new variable, which I’ll call perc_male, for the percent of babies with the name who are male. It divides Male by the sum of Male and Female from the previous step, then multiplies by 100 to get a percent (which isn’t necessary, but is easier to read).

babynames %>%
  mutate(male = ifelse(sex == "M", n, 0),
         female = ifelse(sex == "F", n, 0)) %>%
  group_by(name) %>%
    summarize(Male = sum(male),
              Female = sum(female))%>%
  mutate(perc_male = (Male/(Male+Female)*100))

name <chr>	Male <dbl>	Female <dbl>	perc_male <dbl>
Aaban	107	0	100.00000000
Aabha	0	35	0.00000000
Aabid	10	0	100.00000000
Aabir	5	0	100.00000000
Aabriella	0	32	0.00000000
Aada	0	5	0.00000000
Aadam	254	0	100.00000000
Aadan	130	0	100.00000000
Aadarsh	199	0	100.00000000
Aaden	4653	5	99.89265779

Right now, it’s still in alphabetical order. We want to arrange it by perc_male, and more importantly, we want perc_male to be between 48 and 52, so let’s filter accordingly:

babynames %>%
  mutate(male = ifelse(sex == "M", n, 0),
         female = ifelse(sex == "F", n, 0)) %>%
  group_by(name) %>%
    summarize(Male = sum(male),
              Female = sum(female))%>%
  mutate(perc_male = (Male/(Male+Female)*100)) %>%
  arrange(perc_male) %>%
  filter(perc_male > 48,
         perc_male < 52)

name <chr>	Male <dbl>	Female <dbl>	perc_male <dbl>
Demetrice	1623	1754	48.06041
Shenan	25	27	48.07692
Yael	3162	3414	48.08394
Harlo	164	177	48.09384
Daylyn	202	218	48.09524
Oluwatosin	139	150	48.09689
Chaning	13	14	48.14815
Kirin	351	378	48.14815
Odera	13	14	48.14815
Jireh	644	693	48.16754

This gives us a lot of names, all falling between 48% and 52% male. But we want the most popular names that are in this range. So let’s finally mutate a new variable called total that simply adds the number of Male and Female babies with a name. Then let’s arrange our results by desc(total) to get the largest first, and then slice(1:10) to get the top 10 only.

babynames %>%
  mutate(male = ifelse(sex == "M", n, 0),
         female = ifelse(sex == "F", n, 0)) %>%
  group_by(name) %>%
    summarize(Male = sum(male),
              Female = sum(female))%>%
  mutate(perc_male = (Male/(Male+Female)*100)) %>%
  arrange(perc_male) %>%
  filter(perc_male > 48,
         perc_male < 52) %>%
  mutate(total = Male+Female) %>%
  arrange(desc(total)) %>%
  slice(1:10)

name <chr>	Male <dbl>	Female <dbl>	perc_male <dbl>	total <dbl>
Kerry	49596	48534	50.54112	98130
Robbie	20863	22264	48.37573	43127
Justice	17080	15782	51.97493	32862
Blair	14470	14195	50.47968	28665
Kris	13982	13490	50.89546	27472
Elisha	13330	13599	49.50054	26929
Unknown	9307	9416	49.70891	18723
Mckinley	9389	8955	51.18295	18344
Baby	6078	5871	50.86618	11949
Santana	4651	4952	48.43278	9603
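
There is more than one way to get here. As an alternative sketch (not the approach used above), you could total n by name and sex with count(), then spread the two sexes into columns with pivot_wider() from tidyr. The data below are made up, and toy, n_M, n_F, and neutral_toy are hypothetical names.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as babynames
toy <- tibble(
  year = c(2000, 2000, 2001, 2001, 2000),
  sex  = c("M", "F", "M", "F", "M"),
  name = c("Kerry", "Kerry", "Kerry", "Kerry", "Liam"),
  n    = c(50, 48, 51, 52, 300)
)

neutral_toy <- toy %>%
  count(name, sex, wt = n) %>%               # total babies per name and sex
  pivot_wider(names_from = sex,              # one column per sex: n_F and n_M
              values_from = n,
              names_prefix = "n_",
              values_fill = list(n = 0)) %>% # names never used for a sex get 0
  mutate(total = n_M + n_F,
         perc_male = n_M / total * 100) %>%
  filter(perc_male > 48,
         perc_male < 52) %>%
  arrange(desc(total))

neutral_toy
```

In this toy data only Kerry lands between 48% and 52% male; Liam is 100% male and drops out.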

Political and Economic Freedom Around the World

For the remaining questions, we’ll look at the relationship between Economic Freedom and Political Freedom in countries around the world today. Our data for economic freedom comes from the Fraser Institute, and our data for political freedom comes from Freedom House.

Question 8

Download these two datasets that I’ve cleaned up a bit:⁶

Load them with df<-read_csv("name_of_the_file.csv") and save one as econfreedom and the other as polfreedom. Look at each tibble you’ve created.

I am creating this document for/from the website, so these are all stored in a folder called data, one folder up from my current folder, homeworks. To get there, I go up one folder (..) and move to data, where these csv files are stored.

I suggest you either keep the data in the same folder as your R working directory (always check with getwd()), or create an R Project and store the data files in that same folder.

# import data with read_csv from readr

# note these file paths will be different for you
polfreedom<-read_csv("../data/freedomhouse2018.csv")

## Parsed with column specification:
## cols(
##   .default = col_double(),
##   `Country/Territory` = col_character(),
##   Status = col_character()
## )

## See spec(...) for full column specifications.

econfreedom<-read_csv("../data/econfreedom.csv")

## Warning: Missing column names filled in: 'X1' [1]

## Parsed with column specification:
## cols(
##   X1 = col_double(),
##   Country = col_character(),
##   ISO = col_character(),
##   ef = col_double(),
##   gdp = col_double(),
##   continent = col_character()
## )

# look at each dataframe
polfreedom

Country/Territory <chr>	Status <chr>	PR Rating <dbl>	CL Rating <dbl>	A1 <dbl>	A2 <dbl>	A3 <dbl>	A <dbl>	B1 <dbl>	B2 <dbl>
Abkhazia	PF	4	5	3	2	1	6	2	3
Afghanistan	NF	5	6	1	0	1	2	2	2
Albania	PF	3	3	3	3	2	8	3	4
Algeria	NF	6	5	1	1	1	3	1	1
Andorra	F	1	1	4	4	4	12	4	4
Angola	NF	6	6	0	2	1	3	2	1
Antigua and Barbuda	F	2	2	4	4	4	12	3	4
Argentina	F	2	2	4	4	3	11	4	3
Armenia	PF	5	4	1	1	2	4	2	2
Australia	F	1	1	4	4	4	12	4	4

econfreedom

X1 <dbl>	Country <chr>	ISO <chr>	ef <dbl>	gdp <dbl>	continent <chr>
1	Albania	ALB	7.40	4543.0880	Europe
2	Algeria	DZA	5.15	4784.1943	Africa
3	Angola	AGO	5.08	4153.1463	Africa
4	Argentina	ARG	4.81	10501.6603	Americas
5	Australia	AUS	7.93	54688.4459	Oceania
6	Austria	AUT	7.56	47603.7968	Europe
7	Bahrain	BHR	7.60	22347.9708	Asia
8	Bangladesh	BGD	6.35	972.8807	Asia
9	Belgium	BEL	7.51	45181.4382	Europe
10	Benin	BEN	6.22	804.7240	Africa

Question 9

The polfreedom dataset is still a bit messy. Let’s overwrite it (or assign to something like polfreedom2), select Country/Territory and Total (the total freedom score), and rename Country/Territory to Country.

polfreedom<-polfreedom %>%
  select(`Country/Territory`, Total) %>%
  rename(Country=`Country/Territory`)

Question 10

Now we can try to merge these two datasets into one. Since they both have Country as a variable, we can merge these tibbles using left_join(econfreedom, polfreedom, by="Country")⁷ and save this as a new tibble (something like freedom).

This one is a bit advanced to explain (but see the last few slides of 1.5 for more), so just copy what I gave you!

freedom<-left_join(econfreedom, polfreedom, by="Country")
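
For a rough idea of what left_join() does, here is a minimal sketch on two made-up tibbles (econ_toy and pol_toy are hypothetical, as are their values). Every row of the left tibble is kept. Where the right tibble has no matching Country, the new column is filled with NA, which is one likely reason a plot of the merged data would warn about removed rows.

```r
library(dplyr)

# Hypothetical toy tibbles that share a Country column
econ_toy <- tibble(
  Country = c("Albania", "Norway", "Atlantis"),
  ef      = c(7.4, 7.6, 5.0)
)
pol_toy <- tibble(
  Country = c("Albania", "Norway", "Abkhazia"),
  Total   = c(68, 100, 41)
)

# keeps all 3 rows of econ_toy; Atlantis has no match, so its Total is NA
joined <- left_join(econ_toy, pol_toy, by = "Country")

# anti_join() lists the rows of econ_toy with no match in pol_toy
unmatched <- anti_join(econ_toy, pol_toy, by = "Country")

joined
unmatched
```

Abkhazia appears only in the right tibble, so it never makes it into joined.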

Question 11

Now make a scatterplot of Political Freedom (Total)⁸ as y on Economic Freedom (ef) as x, and color by continent.

## Warning: Removed 1 rows containing missing values (geom_point).
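
A minimal sketch of such a scatterplot, using the same ef, Total, and continent variables but on a small made-up stand-in for freedom (freedom_toy and its values are hypothetical):

```r
library(ggplot2)

# Hypothetical stand-in for the merged freedom tibble
freedom_toy <- data.frame(
  ef        = c(7.4, 5.2, 7.9, 6.3),
  Total     = c(68, 35, 98, 45),
  continent = c("Europe", "Africa", "Oceania", "Asia")
)

p <- ggplot(data = freedom_toy)+
  aes(x = ef,
      y = Total,
      color = continent)+
  geom_point()+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       color = "Continent")+
  theme_classic()

p
```

With the real data, swap freedom_toy for freedom.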

Question 12

Let’s do this again, but highlight some key countries. Pick three countries, and make a new tibble from freedom that contains only the observations for those countries. Additionally, install and load a package called ggrepel.⁹ Next, redo your plot from Question 11, but now add a layer: geom_label_repel(). Set its data to your three-country tibble and use the same aesthetics as your overall plot, but be sure to add label = ISO to label each point with its ISO country code.¹⁰

# install.packages("ggrepel") install for first use 
library(ggrepel) # load 

interest<-filter(freedom, ISO %in% c("CHN", "NOR", "USA"))

ggplot(data=freedom, aes(x=ef,y=Total))+
  geom_point(aes(color=continent))+
  geom_label_repel(data=interest, aes(ef, Total, label=ISO,color=continent),alpha=0.6)+
  labs(x="Economic Freedom Score",
       y="Political Freedom Score",
       caption="Sources: Fraser Institute, Freedom House")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)

## Warning: Removed 1 rows containing missing values (geom_point).

Question 13

Make another plot similar to Question 12, except this time use GDP per Capita (gdp) as y. Feel free to add a regression line with geom_smooth()!¹¹ Those of you in my Development course: you just made my graphs from Lesson 2!

ggplot(data=freedom, aes(x=ef,y=gdp))+
  geom_point(aes(color=continent))+
  geom_smooth(data=freedom)+
  geom_label_repel(data=interest, aes(ef, gdp, label=ISO,color=continent),alpha=0.6)+ # place labels at gdp, the y variable of this plot
  labs(x="Economic Freedom Score",
       y="GDP per Capita",
       caption="Sources: Fraser Institute, Freedom House")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
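
As the message says, geom_smooth() picked a loess curve by default here. If you want a straight regression line instead, you can set method = "lm". A minimal sketch on made-up numbers (toy is hypothetical):

```r
library(ggplot2)

# Hypothetical toy data, not the real freedom tibble
toy <- data.frame(
  ef  = c(5, 6, 7, 8),
  gdp = c(1000, 5000, 20000, 50000)
)

p <- ggplot(data = toy, aes(x = ef, y = gdp))+
  geom_point()+
  geom_smooth(method = "lm",   # straight line instead of the default curve
              formula = y ~ x,
              se = FALSE)      # hide the confidence band

p

# the same line, fitted directly
fit <- lm(gdp ~ ef, data = toy)
coef(fit)
```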

¹ Or percent, if you made that variable, as I did.
² If your name isn’t in there :(, pick a random name.
³ Hint: if you do this, you’ll get the number of rows (years) there are in the data. You want to add the number of babies in each row (n), so inside count, add wt=n to weight the count by n.
⁴ Hint: once you’ve got all the right conditions, you’ll get a table with a lot of data. You only want to slice the 1st row for each table.
⁵ This is hard to define. For our purposes, let’s define this as names where between 48 and 52% of the babies with the name are Male.
⁶ If you want, try downloading them from the websites yourself!
⁷ Note, if you saved as something else in Question 9, use that instead of polfreedom!
⁸ Feel free to rename these!
⁹ This automatically adjusts labels so they don’t cover points on a plot!
¹⁰ You might also want to set a low alpha level to make sure the labels don’t obscure other points!
¹¹ If you do, be sure to set its data to the full freedom, not just your three countries!

Problem Set 1 (Answers)

Ryan Safner

ECON 480 - Fall 2019
