• Getting Set Up
  • Warm Up to dplyr With gapminder
    • 4.
    • 5.
    • 6.
    • 7.
    • 8.
    • 9.
    • 10.
    • 11.
    • 12.
    • 13.
    • 14.
    • 15.
    • 16.
    • 17.
  • Example I: Majors
    • 18.
  • Wrangling Data (with dplyr)
    • 19. Look at the data with glimpse(). This is a suped-up version of str() in tidyverse.
    • 20.
    • 21.
      • Which major has the lowest unemployment rate?
    • 22.
      • 23.
      • 24.
      • 25.

Getting Set Up

Before we begin, start a new file with File New File R Script. As you work through this sheet in the console in R, also add (copy/paste) your commands that work into this new file. At the end, save it, and run to execute all of your commands at once.

First things first, load tidyverse.

library("tidyverse")
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0     ✔ purrr   0.3.2
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   0.8.3     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

Warm Up to dplyr With gapminder

1. Load gapminder. select() the variables year, lifeExp, country.

library(gapminder)
# select(gapminder, c(year, lifeExp, country))

# using the pipe

gapminder %>%
  select(c(year, lifeExp, country))
ABCDEFGHIJ0123456789
year
<int>
lifeExp
<dbl>
country
<fctr>
195228.80100Afghanistan
195730.33200Afghanistan
196231.99700Afghanistan
196734.02000Afghanistan
197236.08800Afghanistan
197738.43800Afghanistan
198239.85400Afghanistan
198740.82200Afghanistan
199241.67400Afghanistan
199741.76300Afghanistan

2. select() all variables except pop.

# select(gapminder, -pop)

# using the pipe

gapminder %>%
  select(-pop)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
gdpPercap
<dbl>
AfghanistanAsia195228.80100779.4453
AfghanistanAsia195730.33200820.8530
AfghanistanAsia196231.99700853.1007
AfghanistanAsia196734.02000836.1971
AfghanistanAsia197236.08800739.9811
AfghanistanAsia197738.43800786.1134
AfghanistanAsia198239.85400978.0114
AfghanistanAsia198740.82200852.3959
AfghanistanAsia199241.67400649.3414
AfghanistanAsia199741.76300635.3414

3. rename() continent to cont.

# rename(gapminder, cont=continent)

# using the pipe

gapminder %>%
  rename(cont=continent)
ABCDEFGHIJ0123456789
country
<fctr>
cont
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
AfghanistanAsia195730.332009240934820.8530
AfghanistanAsia196231.9970010267083853.1007
AfghanistanAsia196734.0200011537966836.1971
AfghanistanAsia197236.0880013079460739.9811
AfghanistanAsia197738.4380014880372786.1134
AfghanistanAsia198239.8540012881816978.0114
AfghanistanAsia198740.8220013867957852.3959
AfghanistanAsia199241.6740016317921649.3414
AfghanistanAsia199741.7630022227415635.3414

4.

arrange() by year.

# arrange(gapminder, year)

# with the pipe
gapminder %>%
  arrange(year)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
AlbaniaEurope195255.2300012826971601.0561
AlgeriaAfrica195243.0770092795252449.0082
AngolaAfrica195230.0150042320953520.6103
ArgentinaAmericas195262.48500178769565911.3151
AustraliaOceania195269.12000869121210039.5956
AustriaEurope195266.8000069277726137.0765
BahrainAsia195250.939001204479867.0848
BangladeshAsia195237.4840046886859684.2442
BelgiumEurope195268.0000087304058343.1051

5.

arrange() by year, but in descending order.

# arrange(gapminder, desc(year))

# with the pipe
gapminder %>%
  arrange(desc(year))
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia200743.8280031889923974.5803
AlbaniaEurope200776.4230036005235937.0295
AlgeriaAfrica200772.30100333332166223.3675
AngolaAfrica200742.73100124204764797.2313
ArgentinaAmericas200775.320004030192712779.3796
AustraliaOceania200781.235002043417634435.3674
AustriaEurope200779.82900819978336126.4927
BahrainAsia200775.6350070857329796.0483
BangladeshAsia200764.062001504483391391.2538
BelgiumEurope200779.441001039222633692.6051

6.

arrange() by year, then by life expectancy.

# arrange(gapminder, year, lifeExp)

# with the pipe

gapminder %>%
  arrange(year, lifeExp)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
AfghanistanAsia195228.801008425333779.4453
GambiaAfrica195230.00000284320485.2307
AngolaAfrica195230.0150042320953520.6103
Sierra LeoneAfrica195230.331002143249879.7877
MozambiqueAfrica195231.286006446316468.5260
Burkina FasoAfrica195231.975004469979543.2552
Guinea-BissauAfrica195232.50000580653299.8503
Yemen, Rep.Asia195232.548004963829781.7176
SomaliaAfrica195232.9780025269941135.7498
GuineaAfrica195233.609002664249510.1965

7.

filter() observations with pop greater than 1 billion.

# arrange(filter, pop>1000000000)

# with the pipe

gapminder %>%
  filter(pop>1000000000)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
ChinaAsia198265.5251000281000962.4214
ChinaAsia198767.27410840350001378.9040
ChinaAsia199268.69011649700001655.7842
ChinaAsia199770.42612300750002289.2341
ChinaAsia200272.02812804000003119.2809
ChinaAsia200772.96113186830964959.1149
IndiaAsia200262.87910341725471746.7695
IndiaAsia200764.69811103963312452.2104

8.

Of those, look only at India.

gapminder %>%
  filter(pop>1000000000) %>%
  filter(country=="India")
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
IndiaAsia200262.87910341725471746.769
IndiaAsia200764.69811103963312452.210

9.

Try out the pipe (%) if you haven’t already, by chaining commands: select() your data to look only at year, gdpPercap, and country in the year 1997, for countries that have a gdpPercap greater than 20,000, and arrange() them alphabetically.

gapminder %>%
  select(year, gdpPercap, country) %>%
  filter(year==1997,
         gdpPercap>20000) %>%
  arrange(country)
ABCDEFGHIJ0123456789
year
<int>
gdpPercap
<dbl>
country
<fctr>
199726997.94Australia
199729095.92Austria
199720292.02Bahrain
199727561.20Belgium
199728954.93Canada
199729804.35Denmark
199723723.95Finland
199725889.78France
199727788.88Germany
199728377.63Hong Kong, China

10.

mutate() a new variable called GDP that is equal to gdpPercap * pop.

gapminder %>%
  mutate(GDP = gdpPercap * pop)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
GDP
<dbl>
AfghanistanAsia195228.801008425333779.44536.567086e+09
AfghanistanAsia195730.332009240934820.85307.585449e+09
AfghanistanAsia196231.9970010267083853.10078.758856e+09
AfghanistanAsia196734.0200011537966836.19719.648014e+09
AfghanistanAsia197236.0880013079460739.98119.678553e+09
AfghanistanAsia197738.4380014880372786.11341.169766e+10
AfghanistanAsia198239.8540012881816978.01141.259856e+10
AfghanistanAsia198740.8220013867957852.39591.182099e+10
AfghanistanAsia199241.6740016317921649.34141.059590e+10
AfghanistanAsia199741.7630022227415635.34141.412200e+10

11.

mutate() a new population variable that is the pop in millions.

gapminder %>%
  mutate(popm = pop/1000000)
ABCDEFGHIJ0123456789
country
<fctr>
continent
<fctr>
year
<int>
lifeExp
<dbl>
pop
<int>
gdpPercap
<dbl>
popm
<dbl>
AfghanistanAsia195228.801008425333779.44538.425333
AfghanistanAsia195730.332009240934820.85309.240934
AfghanistanAsia196231.9970010267083853.100710.267083
AfghanistanAsia196734.0200011537966836.197111.537966
AfghanistanAsia197236.0880013079460739.981113.079460
AfghanistanAsia197738.4380014880372786.113414.880372
AfghanistanAsia198239.8540012881816978.011412.881816
AfghanistanAsia198740.8220013867957852.395913.867957
AfghanistanAsia199241.6740016317921649.341416.317921
AfghanistanAsia199741.7630022227415635.341422.227415

12.

summarize() to get the average GDP per capita.

gapminder %>%
  summarize(mean(gdpPercap))
ABCDEFGHIJ0123456789
mean(gdpPercap)
<dbl>
7215.327

13.

Get the number of observations, average, minimum, maximum, and standard deviation for GDP per capita.

gapminder %>%
  summarize(Obs = n(),
            Average = mean(gdpPercap),
            Minimum = min(gdpPercap),
            Maximum = max(gdpPercap),
            SD = sd(gdpPercap))
ABCDEFGHIJ0123456789
Obs
<int>
Average
<dbl>
Minimum
<dbl>
Maximum
<dbl>
SD
<dbl>
17047215.327241.1659113523.19857.455

14.

Get the average GDP per capita over time. Hint, first group_by() year.

gapminder %>%
  group_by(year) %>%
  summarize(Average_GDP = mean(gdpPercap))
ABCDEFGHIJ0123456789
year
<int>
Average_GDP
<dbl>
19523725.276
19574299.408
19624725.812
19675483.653
19726770.083
19777313.166
19827518.902
19877900.920
19928158.609
19979090.175

15.

Get the average GDP per capita by continent.

gapminder %>%
  group_by(continent) %>%
  summarize(Average_GDP = mean(gdpPercap))
ABCDEFGHIJ0123456789
continent
<fctr>
Average_GDP
<dbl>
Africa2193.755
Americas7136.110
Asia7902.150
Europe14469.476
Oceania18621.609

16.

Get the average GDP per capita by year and by continent.1 Then save this as another tibble called gdp. Create a ggplot of a line graph of average continent GDP over time using the gdp data.

gdp <- gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap))

ggplot(data = gdp)+
  aes(x = year,
      y = Average_GDP,
      color = continent)+
  geom_line()

17.

Try it again all in one command with the pipe %>%. Instead of saving the data as gdp, pipe it right into ggplot!2

gapminder %>%
  group_by(year, continent) %>%
  summarize(Average_GDP = mean(gdpPercap)) %>%
ggplot(data = .)+
  aes(x = year,
      y = Average_GDP,
      color = continent)+
  geom_line()

Example I: Majors

Now let’s step it up to work with some data “in the wild” to answer some research questions. This will have you combine your dplyr skills and add some new things such as importing with readr.

Let’s look at fivethirtyeight’s article " The Economic Guide To Picking A College Major ". fivethirtyeight is great about making the data behind their articles public, we can download all of their data here. Search for college majors and click download (the blue arrow button).3 We will look at the recent-grads.csv file.

The description in the readme file for the data is as follows:

Header Description
Rank Rank by median earnings
Major_code Major code, FO1DP in ACS PUMS
Major Major description
Major_category Category of major from Carnevale et al
Total Total number of people with major
Sample_size Sample size (unweighted) of full-time, year-round ONLY (used for earnings)
Men Male graduates
Women Female graduates
ShareWomen Women as share of total
Employed Number employed (ESR == 1 or 2)
Full_time Employed 35 hours or more
Part_time Employed less than 35 hours
Full_time_year_round Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35)
Unemployed Number unemployed (ESR == 3)
Unemployment_rate Unemployed / (Unemployed + Employed)
Median Median earnings of full-time, year-round workers
P25th 25th percentile of earnigns
P75th 75th percentile of earnings
College_jobs Number with job requiring a college degree
Non_college_jobs Number with job not requiring a college degree
Low_wage_jobs Number in low-wage service jobs

18.

Import the data with read_csv() and assign it to tibble called majors.4 The first argument of this command the name of the original file, in quotes.5

majors<-read_csv("../data/recent-grads.csv") # note my location on my computer is different than yours!
## Parsed with column specification:
## cols(
##   .default = col_double(),
##   Major = col_character(),
##   Major_category = col_character()
## )
## See spec(...) for full column specifications.

Wrangling Data (with dplyr)

19. Look at the data with glimpse(). This is a suped-up version of str() in tidyverse.

glimpse(majors)
## Observations: 173
## Variables: 21
## $ Rank                 <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ Major_code           <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5…
## $ Major                <chr> "PETROLEUM ENGINEERING", "MINING AND MINERA…
## $ Total                <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 17…
## $ Men                  <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 83…
## $ Women                <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 1…
## $ Major_category       <chr> "Engineering", "Engineering", "Engineering"…
## $ ShareWomen           <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132,…
## $ Sample_size          <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 3…
## $ Employed             <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 152…
## $ Full_time            <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 10…
## $ Part_time            <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13…
## $ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827…
## $ Unemployed           <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3…
## $ Unemployment_rate    <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05…
## $ Median               <dbl> 110000, 75000, 73000, 70000, 65000, 65000, …
## $ P25th                <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 5…
## $ P75th                <dbl> 125000, 90000, 105000, 80000, 75000, 102000…
## $ College_jobs         <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972…
## $ Non_college_jobs     <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16…
## $ Low_wage_jobs        <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 31…

20.

What are all of the unique values of Major? How many are there?

majors %>%
  distinct(Major)
ABCDEFGHIJ0123456789
Major
<chr>
PETROLEUM ENGINEERING
MINING AND MINERAL ENGINEERING
METALLURGICAL ENGINEERING
NAVAL ARCHITECTURE AND MARINE ENGINEERING
CHEMICAL ENGINEERING
NUCLEAR ENGINEERING
ACTUARIAL SCIENCE
ASTRONOMY AND ASTROPHYSICS
MECHANICAL ENGINEERING
ELECTRICAL ENGINEERING
majors %>%
  summarize(n_distinct(Major))
ABCDEFGHIJ0123456789
n_distinct(Major)
<int>
173

21.

Which major has the lowest unemployment rate?

majors %>%
  arrange(Unemployment_rate) %>%
  select(Major, Unemployment_rate)
ABCDEFGHIJ0123456789
Major
<chr>
Unemployment_rate
<dbl>
MATHEMATICS AND COMPUTER SCIENCE0.000000000
MILITARY TECHNOLOGIES0.000000000
BOTANY0.000000000
SOIL SCIENCE0.000000000
EDUCATIONAL ADMINISTRATION AND SUPERVISION0.000000000
ENGINEERING MECHANICS PHYSICS AND SCIENCE0.006334343
COURT REPORTING0.011689692
MATHEMATICS TEACHER EDUCATION0.016202835
PETROLEUM ENGINEERING0.018380527
GENERAL AGRICULTURE0.019642463

22.

What are the top 3 majors that have the highest percentage of women?

majors %>%
  arrange(desc(ShareWomen)) %>%
  select(Major, ShareWomen) %>%
  slice(1:3) # head(., n=3) also works
ABCDEFGHIJ0123456789
Major
<chr>
ShareWomen
<dbl>
EARLY CHILDHOOD EDUCATION0.9689537
COMMUNICATION DISORDERS SCIENCES AND SERVICES0.9679981
MEDICAL ASSISTING SERVICES0.9278072

23.

Make a boxplot of Median wage by Major_Category.6

ggplot(data = majors)+
  aes(x = Major_category,
      y = Median,
      fill = Major_category)+
  geom_boxplot()+
  theme(axis.text.x=element_text(angle=45, hjust=1), legend.position = "")

24.

25.

Is there a systematic difference between STEM majors and non-STEM majors? First define:

stem_categories <- c("Biology & Life Science",
                     "Computers & Mathematics",
                     "Engineering",
                     "Physical Sciences")

Next, make a variable stem, for whether or not a Major_category is "stem" or "not_stem".8

majors <- majors %>%
  mutate(stem = ifelse(Major_category %in% stem_categories,
                       "stem",
                       "not_stem")) %>%
  select(stem, everything()) # just so we can see it up front to verify

Then summarize() median for stem and not stem groups.

majors %>%
  group_by(stem) %>%
  summarize(mean(Median))
ABCDEFGHIJ0123456789
stem
<chr>
mean(Median)
<dbl>
not_stem35624.77
stem47860.94

  1. Hint: do year first, if you do continent first, there are no years to group by!↩︎

  2. Hint: You can use . as a placeholder.↩︎

  3. This will download a .zip file that contains many spreadsheets. Unzip it with a program that unzips files (such as WinZip, 7-zip, the Unarchiver, etc).↩︎

  4. One way to avoid error messages is to move it to the same file as R’s working directory, which again you can determine with getwd().↩︎

  5. If the file is in a different folder, the argument is the full path in quotes.↩︎

  6. You won’t be able to read the labels easily, so add theme(axis.text.x=element_text(angle=45, hjust=1) to angle x-axis labels (and move them down by 1)↩︎

  7. Hint: use group_by first.↩︎

  8. Hint: try out the ifelse() function which has three inputs: condition(s) for a variable(s), what to do if TRUE (the if), and what to if FALSE (the else).↩︎