Answers may be longer than I would deem sufficient on an exam. Some might vary slightly based on points of interest, examples, or personal experience. These suggested answers are designed to give you both the answer and a short explanation of why it is the answer.
Install and load the package babynames. Get help for ?babynames to see what the data includes.
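A minimal sketch of this setup step, assuming the tidyverse is already installed. The install line is commented out so it only runs when you need it:

```r
# install.packages("babynames") # install once, on first use
library(babynames) # provides the babynames tibble
library(tidyverse) # dplyr, ggplot2, and the other packages used below

# ?babynames        # opens the help page in an interactive session
glimpse(babynames)  # quick look at the columns: year, sex, name, n, prop
```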
## ── Attaching packages ─────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
What are the top 5 boys names for 2017, and what percent of overall names is each?
# save as a new tibble
top_5_boys_2017 <- babynames %>% # take data
  filter(sex=="M", # filter by males
         year==2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals
# look at our new tibble
top_5_boys_2017
year <dbl> | sex <chr> | name <chr> | n <int> | prop <dbl> | percent <dbl> |
---|---|---|---|---|---|
2017 | M | Liam | 18728 | 0.00953909 | 0.95 |
2017 | M | Noah | 18326 | 0.00933433 | 0.93 |
2017 | M | William | 14904 | 0.00759134 | 0.76 |
2017 | M | James | 14232 | 0.00724906 | 0.72 |
2017 | M | Logan | 13974 | 0.00711764 | 0.71 |
The top 5 names are Liam, Noah, William, James, and Logan.
What are the top 5 girls names, and what percent of overall names is each?
# save as a new tibble
top_5_girls_2017 <- babynames %>% # take data
  filter(sex=="F", # filter by females
         year==2017) %>% # and for 2017
  arrange(desc(n)) %>% # arrange in largest-to-smallest order of n (number)
  slice(1:5) %>% # optional, look only at first 5 rows; head(., n=5) also works
  mutate(percent = round(prop*100, 2)) # also optional, make a percent variable rounded to 2 decimals
# look at our new tibble
top_5_girls_2017
year <dbl> | sex <chr> | name <chr> | n <int> | prop <dbl> | percent <dbl> |
---|---|---|---|---|---|
2017 | F | Emma | 19738 | 0.01052750 | 1.05 |
2017 | F | Olivia | 18632 | 0.00993760 | 0.99 |
2017 | F | Ava | 15902 | 0.00848152 | 0.85 |
2017 | F | Isabella | 15100 | 0.00805377 | 0.81 |
2017 | F | Sophia | 14831 | 0.00791029 | 0.79 |
The top 5 names are Emma, Olivia, Ava, Isabella, and Sophia.
Make two barplots of these top 5 names, one for each sex. Map aesthetics x to name and y to prop[1] and use geom_col (since you are declaring a specific y; otherwise you could just use geom_bar() and just an x).
ggplot(data = top_5_boys_2017)+
  aes(x = reorder(name, n), # note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # now I'm just making it pretty
  scale_y_continuous(labels=function(x)paste(x,"%",sep=""))+ # optional, add percent signs
  labs(x = "Name",
       y = "Percent of All Boys With Name",
       title = "Most Popular Boys Names in 2017",
       fill = "Boy's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # rotate axes!
  theme(legend.position = "none") # hide legend
ggplot(data = top_5_girls_2017)+
  aes(x = reorder(name, n), # note this reorders the x variable from small to large n
      y = percent, # you can use prop if you didn't make a percent variable
      fill = name)+ # optional color!
  geom_col()+
  # now I'm just making it pretty
  scale_y_continuous(labels=function(x)paste(x,"%",sep=""))+ # optional, add percent signs
  labs(x = "Name",
       y = "Percent of All Girls With Name",
       title = "Most Popular Girls Names in 2017",
       fill = "Girl's Name",
       caption = "Source: SSA")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)+
  coord_flip()+ # rotate axes!
  theme(legend.position = "none") # hide legend
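To see the difference between geom_col() and geom_bar() mentioned in the question, here is a small sketch on made-up tibbles (the names and counts are toy values, not taken from babynames). It uses layer_data() to show the bar heights ggplot2 computes for each plot:

```r
library(ggplot2)
library(tibble)

# toy data: one row per name, with the bar height already computed
toy_totals <- tibble(name = c("Ann", "Bob"), n = c(3, 1))
# geom_col() draws bars whose heights are the y values you supply
p_col <- ggplot(data = toy_totals) +
  aes(x = name, y = n) +
  geom_col()

# toy data: one row per baby, nothing precomputed
toy_rows <- tibble(name = c("Ann", "Ann", "Ann", "Bob"))
# geom_bar() needs only an x and counts the rows for you
p_bar <- ggplot(data = toy_rows) +
  aes(x = name) +
  geom_bar()

layer_data(p_col)$y # heights taken directly from n
layer_data(p_bar)$y # heights counted from the rows
```

Both plots end up with the same bars. Since babynames already stores n and prop for each row, geom_col() is the natural choice here.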
Find your name.[2] Use count by sex to find how many babies since 1880 were named your name.[3] Also add a variable for the percent of each sex.
sex <chr> | n <int> | percent <dbl> |
---|---|---|
F | 22910 | 2.42 |
M | 924877 | 97.58 |
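One way to build a table like this, sketched with "Ryan" (the name used in the line graph code for the next question) as the example. The object name name_by_sex is my own choice for illustration:

```r
library(babynames)
library(dplyr)

name_by_sex <- babynames %>%
  filter(name == "Ryan") %>%                   # keep only the rows for one name
  count(sex, wt = n) %>%                       # add up n within each sex, rather than counting rows
  mutate(percent = round(n / sum(n) * 100, 2)) # each sex's share of all babies with the name

name_by_sex
```

Without wt = n, count() would report how many rows (years) each sex appears in, not how many babies.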
Make a line graph of the number of babies with your name over time, colored by sex.
# note here I'm going to wrangle the data and then pipe it directly into ggplot
# you can wrangle the data and save it as a different tibble, then use THAT tibble
# for your (data = ...) command in ggplot
# first wrangle data
babynames %>%
filter(name == "Ryan") %>%
# now we pipe into ggplot
ggplot(data = .)+ # the "." is a placeholder for the stuff above!
aes(x = year,
y = n,
color = sex)+
geom_line(size=1)+
labs(x = "Year",
y = "Number of Babies",
title = "Popularity of Babies Named 'Ryan'",
color = "Sex",
caption = "Source: SSA")+
theme_classic(base_family = "Fira Sans Condensed", base_size=16)
Make a table of the most common name for boys by year between 1980-2017.[4]
year <dbl> | sex <chr> | name <chr> | n <int> | prop <dbl> |
---|---|---|---|---|
1980 | M | Michael | 68693 | 0.03703079 |
1981 | M | Michael | 68765 | 0.03692247 |
1982 | M | Michael | 68228 | 0.03615445 |
1983 | M | Michael | 67995 | 0.03649110 |
1984 | M | Michael | 67736 | 0.03610228 |
1985 | M | Michael | 64906 | 0.03373805 |
1986 | M | Michael | 64205 | 0.03342343 |
1987 | M | Michael | 63647 | 0.03264834 |
1988 | M | Michael | 64133 | 0.03204521 |
1989 | M | Michael | 65382 | 0.03120182 |
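A sketch of one approach that yields a table like the one above: sort by n, then keep the first row within each year. The object name top_boy_by_year is an assumption for illustration:

```r
library(babynames)
library(dplyr)

top_boy_by_year <- babynames %>%
  filter(sex == "M",        # boys only
         year >= 1980,      # from 1980...
         year <= 2017) %>%  # ...through 2017
  group_by(year) %>%        # treat each year as its own group
  arrange(desc(n)) %>%      # largest n first
  slice(1) %>%              # keep the 1st (most common) row of each year
  ungroup()

top_boy_by_year
```

Changing "M" to "F" in the filter gives the corresponding table for girls.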
Now do the same for girls.
year <dbl> | sex <chr> | name <chr> | n <int> | prop <dbl> |
---|---|---|---|---|
1980 | F | Jennifer | 58376 | 0.03278886 |
1981 | F | Jennifer | 57049 | 0.03190242 |
1982 | F | Jennifer | 57115 | 0.03148593 |
1983 | F | Jennifer | 54342 | 0.03036962 |
1984 | F | Jennifer | 50561 | 0.02804442 |
1985 | F | Jessica | 48346 | 0.02619098 |
1986 | F | Jessica | 52674 | 0.02854888 |
1987 | F | Jessica | 55991 | 0.02988050 |
1988 | F | Jessica | 51538 | 0.02680669 |
1989 | F | Jessica | 47885 | 0.02403998 |
Now let’s graph the evolution of the most common names since 1880.
First, find out what are the top 5 overall most popular names for boys and for girls. You may want to create two vectors, each with these top 5 names.
name <chr> | total <int> |
---|---|
James | 5150472 |
John | 5115466 |
Robert | 4814815 |
Michael | 4350824 |
William | 4102604 |
# make a vector of the names (we'll need this for our graph below)
top_boys_names<-c("James", "John", "Robert", "Michael", "William")
# you could alternatively add a command,
# %>% pull(name) to the first chunk of code,
# and it would do the same thing, but we'd want to save it,
# for example:
babynames %>%
  group_by(name) %>% # we want one row per name
  filter(sex=="M") %>%
  summarize(total=sum(n)) %>% # add up all of the n's for all years for each name
  arrange(desc(total)) %>% # list largest total first
  slice(1:5) %>%
  pull(name)
## [1] "James" "John" "Robert" "Michael" "William"
name <chr> | total <int> |
---|---|
Mary | 4123200 |
Elizabeth | 1629679 |
Patricia | 1571692 |
Jennifer | 1466281 |
Linda | 1452249 |
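The girls' graph further down filters on a vector called top_girls_names. A sketch of creating it with pull(), mirroring the boys' code above:

```r
library(babynames)
library(dplyr)

top_girls_names <- babynames %>%
  filter(sex == "F") %>%         # girls only
  group_by(name) %>%             # we want one row per name
  summarize(total = sum(n)) %>%  # add up n across all years for each name
  arrange(desc(total)) %>%       # list largest total first
  slice(1:5) %>%                 # keep the top 5
  pull(name)                     # extract the names as a character vector

top_girls_names
```

Typing the vector by hand with c("Mary", "Elizabeth", "Patricia", "Jennifer", "Linda") works just as well.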
Now make two line graphs of these 5 names over time, one for boys, and one for girls.
babynames %>%
group_by(year) %>%
filter(sex == "M",
name %in% top_boys_names) %>%
ggplot(data = .,
aes(x = year,
y = prop,
color = name))+
geom_line()+
labs(x = "Year",
y = "Proportion of Babies with Name",
title = "Most Popular Boys Names Since 1880",
color = "Boy's Name",
caption = "Source: SSA")+
theme_classic(base_family = "Fira Sans Condensed", base_size=16)
babynames %>%
group_by(year) %>%
filter(sex == "F",
name %in% top_girls_names) %>%
ggplot(data = .,
aes(x = year,
y = prop,
color = name))+
geom_line()+
labs(x = "Year",
y = "Proportion of Babies with Name",
title = "Most Popular Girls Names Since 1880",
color = "Girl's Name",
caption = "Source: SSA")+
theme_classic(base_family = "Fira Sans Condensed", base_size=16)
Bonus (hard!): What are the 10 most common “gender-neutral” names?[5]
There’s a lot to this, so I’ll break this up step by step and show you what happens at each major step.
We want to find the names where 48% to 52% of the babies with the name are male, as I defined in the footnote. First let’s mutate a variable to figure out how many babies with a particular name are male.
To do this, we’ll need to make two variables to count the number of males and females of each name each year. We’ll use the ifelse() function for each:
A male variable where, for each name in each year, if sex=="M", then count the number of males (n) that year, otherwise set it equal to 0.
A female variable where, for each name in each year, if sex=="F", then count the number of females (n) that year, otherwise set it equal to 0.
year <dbl> | sex <chr> | name <chr> | n <int> | prop <dbl> | male <dbl> | female <dbl> |
---|---|---|---|---|---|---|
1880 | F | Mary | 7065 | 0.07238359 | 0 | 7065 |
1880 | F | Anna | 2604 | 0.02667896 | 0 | 2604 |
1880 | F | Emma | 2003 | 0.02052149 | 0 | 2003 |
1880 | F | Elizabeth | 1939 | 0.01986579 | 0 | 1939 |
1880 | F | Minnie | 1746 | 0.01788843 | 0 | 1746 |
1880 | F | Margaret | 1578 | 0.01616720 | 0 | 1578 |
1880 | F | Ida | 1472 | 0.01508119 | 0 | 1472 |
1880 | F | Alice | 1414 | 0.01448696 | 0 | 1414 |
1880 | F | Bertha | 1320 | 0.01352390 | 0 | 1320 |
1880 | F | Sarah | 1288 | 0.01319605 | 0 | 1288 |
Now with these variables, we want to count the total number of males and females with each name over the entire dataset. Let’s first group_by(name) so we’ll get one row for every name. We will summarize() and take the sum of our male and of our female variables.
name <chr> | Male <dbl> | Female <dbl> |
---|---|---|
Aaban | 107 | 0 |
Aabha | 0 | 35 |
Aabid | 10 | 0 |
Aabir | 5 | 0 |
Aabriella | 0 | 32 |
Aada | 0 | 5 |
Aadam | 254 | 0 |
Aadan | 130 | 0 |
Aadarsh | 199 | 0 |
Aaden | 4653 | 5 |
Now, we want to figure out what fraction of each name is Male or Female. It doesn’t matter which we do here; I’ll do Male. mutate() a new variable I’ll call perc_male for the percent of the name being for Male babies. It takes the summed variables we made before, and takes the fraction that are Male, multiplying by 100 to get percents (which isn’t necessary, but is easy to read).
name <chr> | Male <dbl> | Female <dbl> | perc_male <dbl> |
---|---|---|---|
Aaban | 107 | 0 | 100.00000000 |
Aabha | 0 | 35 | 0.00000000 |
Aabid | 10 | 0 | 100.00000000 |
Aabir | 5 | 0 | 100.00000000 |
Aabriella | 0 | 32 | 0.00000000 |
Aada | 0 | 5 | 0.00000000 |
Aadam | 254 | 0 | 100.00000000 |
Aadan | 130 | 0 | 100.00000000 |
Aadarsh | 199 | 0 | 100.00000000 |
Aaden | 4653 | 5 | 99.89265779 |
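To convince yourself that the ifelse(), sum, and percent steps do what we want, it can help to run them on a tiny made-up tibble that you can check by hand. The names and counts in toy are invented for illustration and are not from babynames:

```r
library(dplyr)
library(tibble)

# made-up counts, small enough to verify by hand
toy <- tibble(
  name = c("Kerry", "Kerry", "Mary"),
  sex  = c("M", "F", "F"),
  n    = c(3, 2, 5)
)

toy_summary <- toy %>%
  mutate(male   = ifelse(sex == "M", n, 0),          # n if the row is male, otherwise 0
         female = ifelse(sex == "F", n, 0)) %>%      # n if the row is female, otherwise 0
  group_by(name) %>%                                 # one row per name
  summarize(Male   = sum(male),                      # total males with the name
            Female = sum(female)) %>%                # total females with the name
  mutate(perc_male = Male / (Male + Female) * 100)   # percent of the name that is male

toy_summary
# Kerry: Male = 3, Female = 2, perc_male = 60
# Mary:  Male = 0, Female = 5, perc_male = 0
```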
Right now, it’s still in alphabetical order. We want to arrange it by perc_male, and more importantly, we want perc_male to be between 48 and 52, so let’s filter accordingly:
name <chr> | Male <dbl> | Female <dbl> | perc_male <dbl> |
---|---|---|---|
Demetrice | 1623 | 1754 | 48.06041 |
Shenan | 25 | 27 | 48.07692 |
Yael | 3162 | 3414 | 48.08394 |
Harlo | 164 | 177 | 48.09384 |
Daylyn | 202 | 218 | 48.09524 |
Oluwatosin | 139 | 150 | 48.09689 |
Chaning | 13 | 14 | 48.14815 |
Kirin | 351 | 378 | 48.14815 |
Odera | 13 | 14 | 48.14815 |
Jireh | 644 | 693 | 48.16754 |
This gives us a lot of names, all falling between 48% and 52% male. But we want the most popular names that are in this range. So let’s finally mutate a new variable called total that simply adds the number of Male and Female babies with a name. Then let’s arrange our results by desc(total) to get the largest first, and then slice(1:10) to get the top 10 only.
babynames %>%
mutate(male = ifelse(sex == "M", n, 0),
female = ifelse(sex == "F", n, 0)) %>%
group_by(name) %>%
summarize(Male = sum(male),
Female = sum(female))%>%
mutate(perc_male = (Male/(Male+Female)*100)) %>%
arrange(perc_male) %>%
filter(perc_male > 48,
perc_male < 52) %>%
mutate(total = Male+Female) %>%
arrange(desc(total)) %>%
slice(1:10)
name <chr> | Male <dbl> | Female <dbl> | perc_male <dbl> | total <dbl> |
---|---|---|---|---|
Kerry | 49596 | 48534 | 50.54112 | 98130 |
Robbie | 20863 | 22264 | 48.37573 | 43127 |
Justice | 17080 | 15782 | 51.97493 | 32862 |
Blair | 14470 | 14195 | 50.47968 | 28665 |
Kris | 13982 | 13490 | 50.89546 | 27472 |
Elisha | 13330 | 13599 | 49.50054 | 26929 |
Unknown | 9307 | 9416 | 49.70891 | 18723 |
Mckinley | 9389 | 8955 | 51.18295 | 18344 |
Baby | 6078 | 5871 | 50.86618 | 11949 |
Santana | 4651 | 4952 | 48.43278 | 9603 |
For the remaining questions, we’ll look at the relationship between Economic Freedom and Political Freedom in countries around the world today. Our data for economic freedom comes from the Fraser Institute, and our data for political freedom comes from Freedom House.
Download these two datasets that I’ve cleaned up a bit:[6]
Load them with df<-read_csv("name_of_the_file.csv") and save one as econfreedom and the other as polfreedom. Look at each tibble you’ve created.
I am creating this document for/from the website, so these are all stored in a folder called data, one folder up from my current folder, homeworks. To get there, I go up one folder (..) and move to data, where these csv files are stored.
I suggest you either keep the data in the same folder as your R working directory (always check with getwd()), or create an R Project and store the data files in that same folder.
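A small sketch of the loading step. Since the real file names are not given here, it writes a tiny stand-in CSV (made-up countries and scores, hypothetical file name toy_freedom.csv) to a temporary folder and reads it back, so it runs anywhere:

```r
library(readr)

getwd() # always check where R is looking for files

# stand-in for a downloaded file: made-up values in a temporary folder
toy_path <- file.path(tempdir(), "toy_freedom.csv")
write_csv(data.frame(Country = c("Freedonia", "Ruritania"),
                     ef = c(7.1, 5.4)),
          toy_path)

toy_df <- read_csv(toy_path) # the same call you would use on the real file
toy_df

# With the folder layout described above, the path would instead look like
# read_csv("../data/name_of_the_file.csv")
```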
## Parsed with column specification:
## cols(
## .default = col_double(),
## `Country/Territory` = col_character(),
## Status = col_character()
## )
## See spec(...) for full column specifications.
## Warning: Missing column names filled in: 'X1' [1]
## Parsed with column specification:
## cols(
## X1 = col_double(),
## Country = col_character(),
## ISO = col_character(),
## ef = col_double(),
## gdp = col_double(),
## continent = col_character()
## )
Country/Territory <chr> | Status <chr> | PR Rating <dbl> | CL Rating <dbl> | A1 <dbl> | A2 <dbl> | A3 <dbl> | A <dbl> | B1 <dbl> | B2 <dbl> |
---|---|---|---|---|---|---|---|---|---|
Abkhazia | PF | 4 | 5 | 3 | 2 | 1 | 6 | 2 | 3 |
Afghanistan | NF | 5 | 6 | 1 | 0 | 1 | 2 | 2 | 2 |
Albania | PF | 3 | 3 | 3 | 3 | 2 | 8 | 3 | 4 |
Algeria | NF | 6 | 5 | 1 | 1 | 1 | 3 | 1 | 1 |
Andorra | F | 1 | 1 | 4 | 4 | 4 | 12 | 4 | 4 |
Angola | NF | 6 | 6 | 0 | 2 | 1 | 3 | 2 | 1 |
Antigua and Barbuda | F | 2 | 2 | 4 | 4 | 4 | 12 | 3 | 4 |
Argentina | F | 2 | 2 | 4 | 4 | 3 | 11 | 4 | 3 |
Armenia | PF | 5 | 4 | 1 | 1 | 2 | 4 | 2 | 2 |
Australia | F | 1 | 1 | 4 | 4 | 4 | 12 | 4 | 4 |
X1 <dbl> | Country <chr> | ISO <chr> | ef <dbl> | gdp <dbl> | continent <chr> |
---|---|---|---|---|---|
1 | Albania | ALB | 7.40 | 4543.0880 | Europe |
2 | Algeria | DZA | 5.15 | 4784.1943 | Africa |
3 | Angola | AGO | 5.08 | 4153.1463 | Africa |
4 | Argentina | ARG | 4.81 | 10501.6603 | Americas |
5 | Australia | AUS | 7.93 | 54688.4459 | Oceania |
6 | Austria | AUT | 7.56 | 47603.7968 | Europe |
7 | Bahrain | BHR | 7.60 | 22347.9708 | Asia |
8 | Bangladesh | BGD | 6.35 | 972.8807 | Asia |
9 | Belgium | BEL | 7.51 | 45181.4382 | Europe |
10 | Benin | BEN | 6.22 | 804.7240 | Africa |
The polfreedom dataset is still a bit messy. Let’s overwrite it (or assign to something like polfreedom2) and select Country/Territory and Total (total freedom score) and rename Country/Territory to Country.
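A sketch of the select and rename step on a made-up stand-in for polfreedom (fictional countries and invented scores; the real file has many more columns). Because the column name contains a slash, it needs backticks in code:

```r
library(dplyr)
library(tibble)

# made-up stand-in for polfreedom
toy_pol <- tibble(
  `Country/Territory` = c("Freedonia", "Ruritania"),
  Status = c("F", "PF"),
  Total  = c(90, 55)
)

toy_pol2 <- toy_pol %>%
  select(`Country/Territory`, Total) %>%  # keep only the two columns we need
  rename(Country = `Country/Territory`)   # rename(new_name = old_name)

toy_pol2
```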
Now we can try to merge these two datasets into one. Since they both have Country as a variable, we can merge these tibbles using left_join(econfreedom, polfreedom, by="Country")[7] and save this as a new tibble (something like freedom).
This one is a bit advanced to explain (but see the last few slides of 1.5 for more), so just copy what I gave you!
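If you want to see what left_join() does, here is a sketch on two made-up tibbles (fictional countries, invented scores). It keeps every row of the first tibble and attaches matching columns from the second:

```r
library(dplyr)
library(tibble)

toy_econ <- tibble(Country = c("Freedonia", "Ruritania", "Sylvania"),
                   ef = c(7.1, 5.4, 6.0))
toy_pol  <- tibble(Country = c("Freedonia", "Ruritania"),
                   Total = c(90, 55))

# match rows on Country; every row of toy_econ is kept
toy_freedom <- left_join(toy_econ, toy_pol, by = "Country")
toy_freedom
# Sylvania has no match in toy_pol, so its Total is NA
```

Missing values like this are what geom_point() drops (with a warning) when you plot.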
Now make a scatterplot of Political Freedom (Total)[8] as y on Economic Freedom (ef) as x and color by continent.
## Warning: Removed 1 rows containing missing values (geom_point).
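A minimal sketch of such a scatterplot, run on a made-up stand-in for the merged tibble (the values are invented). With your real data, you would pass your own merged tibble to data instead:

```r
library(ggplot2)
library(tibble)

# made-up stand-in for the merged freedom tibble
toy_freedom <- tibble(
  ef        = c(7.1, 5.4, 6.0),
  Total     = c(90, 55, 70),
  continent = c("Europe", "Asia", "Africa")
)

p <- ggplot(data = toy_freedom) +
  aes(x = ef,               # economic freedom on the horizontal axis
      y = Total,            # political freedom on the vertical axis
      color = continent) +  # one color per continent
  geom_point() +
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       color = "Continent")
p
```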
Let’s do this again, but highlight some key countries. Pick three countries, and make a new tibble from freedom that is only the observations of those countries. Additionally, install and load a package called ggrepel.[9] Next, redo your plot from question 11, but now add a layer: geom_label_repel and set its data to your three-country tibble, use the same aesthetics as your overall plot, but be sure to add label = ISO, to use the ISO country code to label.[10]
# install.packages("ggrepel") # install for first use
library(ggrepel) # load
interest <- filter(freedom, ISO %in% c("CHN", "NOR", "USA"))
ggplot(data=freedom, aes(x=ef, y=Total))+
  geom_point(aes(color=continent))+
  geom_label_repel(data=interest, aes(ef, Total, label=ISO, color=continent), alpha=0.6)+
  labs(x = "Economic Freedom Score",
       y = "Political Freedom Score",
       caption = "Sources: Fraser Institute, Freedom House")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)
## Warning: Removed 1 rows containing missing values (geom_point).
Make another plot similar to 12, except this time use GDP per Capita (gdp) as y. Feel free to try to put a regression line with geom_smooth()![11] Those of you in my Development course, you just made my graphs from Lesson 2!
ggplot(data=freedom, aes(x=ef, y=gdp))+
  geom_point(aes(color=continent))+
  geom_smooth(data=freedom)+
  geom_label_repel(data=interest, aes(ef, gdp, label=ISO, color=continent), alpha=0.6)+ # y must be gdp here, to match the plot
  labs(x = "Economic Freedom Score",
       y = "GDP per Capita",
       caption = "Sources: Fraser Institute, Freedom House")+
  theme_classic(base_family = "Fira Sans Condensed", base_size=16)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
[1] Or percent, if you made that variable, as I did.
[2] If your name isn’t in there :(, pick a random name.
[3] Hint: if you do this, you’ll get the number of rows (years) there are in the data. You want to add the number of babies in each row (n), so inside count, add wt=n to weight the count by n.
[4] Hint: once you’ve got all the right conditions, you’ll get a table with a lot of data. You only want to slice the 1st row for each table.
[5] This is hard to define. For our purposes, let’s define this as names where between 48 and 52% of the babies with the name are Male.
[6] If you want, try downloading them from the websites yourself!
[7] Note, if you saved as something else in question 9, use that instead of polfreedom!
[8] Feel free to rename these!
[9] This automatically adjusts labels so they don’t cover points on a plot!
[10] You might also want to set a low alpha level to make sure the labels don’t obscure other points!
[11] If you do, be sure to set its data to the full freedom, not just your three countries!