dplyr
With gapminder
dplyr
)glimpse()
. This is a suped-up version of str()
in tidyverse
.Before we begin, start a new file with File
→ New File
→ R Script
. As you work through this sheet in the console in R
, also add (copy/paste) your commands that work into this new file. At the end, save it, and run to execute all of your commands at once.
First things first, load tidyverse
.
## ── Attaching packages ───────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 3.2.0 ✔ purrr 0.3.2
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 0.8.3 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
dplyr
With gapminder
gapminder
. select()
the variables year
, lifeExp
, country
.year <int> | lifeExp <dbl> | country <fctr> | ||
---|---|---|---|---|
1952 | 28.80100 | Afghanistan | ||
1957 | 30.33200 | Afghanistan | ||
1962 | 31.99700 | Afghanistan | ||
1967 | 34.02000 | Afghanistan | ||
1972 | 36.08800 | Afghanistan | ||
1977 | 38.43800 | Afghanistan | ||
1982 | 39.85400 | Afghanistan | ||
1987 | 40.82200 | Afghanistan | ||
1992 | 41.67400 | Afghanistan | ||
1997 | 41.76300 | Afghanistan |
select()
all variables except pop
.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | gdpPercap <dbl> |
---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 779.4453 |
Afghanistan | Asia | 1957 | 30.33200 | 820.8530 |
Afghanistan | Asia | 1962 | 31.99700 | 853.1007 |
Afghanistan | Asia | 1967 | 34.02000 | 836.1971 |
Afghanistan | Asia | 1972 | 36.08800 | 739.9811 |
Afghanistan | Asia | 1977 | 38.43800 | 786.1134 |
Afghanistan | Asia | 1982 | 39.85400 | 978.0114 |
Afghanistan | Asia | 1987 | 40.82200 | 852.3959 |
Afghanistan | Asia | 1992 | 41.67400 | 649.3414 |
Afghanistan | Asia | 1997 | 41.76300 | 635.3414 |
rename()
continent
to cont
.country <fctr> | cont <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.33200 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.99700 | 10267083 | 853.1007 |
Afghanistan | Asia | 1967 | 34.02000 | 11537966 | 836.1971 |
Afghanistan | Asia | 1972 | 36.08800 | 13079460 | 739.9811 |
Afghanistan | Asia | 1977 | 38.43800 | 14880372 | 786.1134 |
Afghanistan | Asia | 1982 | 39.85400 | 12881816 | 978.0114 |
Afghanistan | Asia | 1987 | 40.82200 | 13867957 | 852.3959 |
Afghanistan | Asia | 1992 | 41.67400 | 16317921 | 649.3414 |
Afghanistan | Asia | 1997 | 41.76300 | 22227415 | 635.3414 |
arrange()
by year.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 8425333 | 779.4453 |
Albania | Europe | 1952 | 55.23000 | 1282697 | 1601.0561 |
Algeria | Africa | 1952 | 43.07700 | 9279525 | 2449.0082 |
Angola | Africa | 1952 | 30.01500 | 4232095 | 3520.6103 |
Argentina | Americas | 1952 | 62.48500 | 17876956 | 5911.3151 |
Australia | Oceania | 1952 | 69.12000 | 8691212 | 10039.5956 |
Austria | Europe | 1952 | 66.80000 | 6927772 | 6137.0765 |
Bahrain | Asia | 1952 | 50.93900 | 120447 | 9867.0848 |
Bangladesh | Asia | 1952 | 37.48400 | 46886859 | 684.2442 |
Belgium | Europe | 1952 | 68.00000 | 8730405 | 8343.1051 |
arrange()
by year, but in descending order.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> |
---|---|---|---|---|---|
Afghanistan | Asia | 2007 | 43.82800 | 31889923 | 974.5803 |
Albania | Europe | 2007 | 76.42300 | 3600523 | 5937.0295 |
Algeria | Africa | 2007 | 72.30100 | 33333216 | 6223.3675 |
Angola | Africa | 2007 | 42.73100 | 12420476 | 4797.2313 |
Argentina | Americas | 2007 | 75.32000 | 40301927 | 12779.3796 |
Australia | Oceania | 2007 | 81.23500 | 20434176 | 34435.3674 |
Austria | Europe | 2007 | 79.82900 | 8199783 | 36126.4927 |
Bahrain | Asia | 2007 | 75.63500 | 708573 | 29796.0483 |
Bangladesh | Asia | 2007 | 64.06200 | 150448339 | 1391.2538 |
Belgium | Europe | 2007 | 79.44100 | 10392226 | 33692.6051 |
arrange()
by year, then by life expectancy.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> |
---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 8425333 | 779.4453 |
Gambia | Africa | 1952 | 30.00000 | 284320 | 485.2307 |
Angola | Africa | 1952 | 30.01500 | 4232095 | 3520.6103 |
Sierra Leone | Africa | 1952 | 30.33100 | 2143249 | 879.7877 |
Mozambique | Africa | 1952 | 31.28600 | 6446316 | 468.5260 |
Burkina Faso | Africa | 1952 | 31.97500 | 4469979 | 543.2552 |
Guinea-Bissau | Africa | 1952 | 32.50000 | 580653 | 299.8503 |
Yemen, Rep. | Asia | 1952 | 32.54800 | 4963829 | 781.7176 |
Somalia | Africa | 1952 | 32.97800 | 2526994 | 1135.7498 |
Guinea | Africa | 1952 | 33.60900 | 2664249 | 510.1965 |
filter()
observations with pop
greater than 1 billion.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> |
---|---|---|---|---|---|
China | Asia | 1982 | 65.525 | 1000281000 | 962.4214 |
China | Asia | 1987 | 67.274 | 1084035000 | 1378.9040 |
China | Asia | 1992 | 68.690 | 1164970000 | 1655.7842 |
China | Asia | 1997 | 70.426 | 1230075000 | 2289.2341 |
China | Asia | 2002 | 72.028 | 1280400000 | 3119.2809 |
China | Asia | 2007 | 72.961 | 1318683096 | 4959.1149 |
India | Asia | 2002 | 62.879 | 1034172547 | 1746.7695 |
India | Asia | 2007 | 64.698 | 1110396331 | 2452.2104 |
%
) if you haven’t already, by chaining commands: select()
your data to look only at year
, gdpPercap
, and country
in the year 1997
, for countries that have a gdpPercap
greater than 20,000, and arrange()
them alphabetically.year <int> | gdpPercap <dbl> | country <fctr> | ||
---|---|---|---|---|
1997 | 26997.94 | Australia | ||
1997 | 29095.92 | Austria | ||
1997 | 20292.02 | Bahrain | ||
1997 | 27561.20 | Belgium | ||
1997 | 28954.93 | Canada | ||
1997 | 29804.35 | Denmark | ||
1997 | 23723.95 | Finland | ||
1997 | 25889.78 | France | ||
1997 | 27788.88 | Germany | ||
1997 | 28377.63 | Hong Kong, China |
mutate()
a new variable called GDP
that is equal to gdpPercap * pop
.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> | GDP <dbl> |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 8425333 | 779.4453 | 6.567086e+09 |
Afghanistan | Asia | 1957 | 30.33200 | 9240934 | 820.8530 | 7.585449e+09 |
Afghanistan | Asia | 1962 | 31.99700 | 10267083 | 853.1007 | 8.758856e+09 |
Afghanistan | Asia | 1967 | 34.02000 | 11537966 | 836.1971 | 9.648014e+09 |
Afghanistan | Asia | 1972 | 36.08800 | 13079460 | 739.9811 | 9.678553e+09 |
Afghanistan | Asia | 1977 | 38.43800 | 14880372 | 786.1134 | 1.169766e+10 |
Afghanistan | Asia | 1982 | 39.85400 | 12881816 | 978.0114 | 1.259856e+10 |
Afghanistan | Asia | 1987 | 40.82200 | 13867957 | 852.3959 | 1.182099e+10 |
Afghanistan | Asia | 1992 | 41.67400 | 16317921 | 649.3414 | 1.059590e+10 |
Afghanistan | Asia | 1997 | 41.76300 | 22227415 | 635.3414 | 1.412200e+10 |
mutate()
a new population variable that is the pop
in millions.country <fctr> | continent <fctr> | year <int> | lifeExp <dbl> | pop <int> | gdpPercap <dbl> | popm <dbl> |
---|---|---|---|---|---|---|
Afghanistan | Asia | 1952 | 28.80100 | 8425333 | 779.4453 | 8.425333 |
Afghanistan | Asia | 1957 | 30.33200 | 9240934 | 820.8530 | 9.240934 |
Afghanistan | Asia | 1962 | 31.99700 | 10267083 | 853.1007 | 10.267083 |
Afghanistan | Asia | 1967 | 34.02000 | 11537966 | 836.1971 | 11.537966 |
Afghanistan | Asia | 1972 | 36.08800 | 13079460 | 739.9811 | 13.079460 |
Afghanistan | Asia | 1977 | 38.43800 | 14880372 | 786.1134 | 14.880372 |
Afghanistan | Asia | 1982 | 39.85400 | 12881816 | 978.0114 | 12.881816 |
Afghanistan | Asia | 1987 | 40.82200 | 13867957 | 852.3959 | 13.867957 |
Afghanistan | Asia | 1992 | 41.67400 | 16317921 | 649.3414 | 16.317921 |
Afghanistan | Asia | 1997 | 41.76300 | 22227415 | 635.3414 | 22.227415 |
Obs <int> | Average <dbl> | Minimum <dbl> | Maximum <dbl> | SD <dbl> |
---|---|---|---|---|
1704 | 7215.327 | 241.1659 | 113523.1 | 9857.455 |
tibble
called gdp
. Create a ggplot
of a line
graph of average continent GDP over time using the gdp
data.%>%
. Instead of saving the data as gdp
, pipe it right into ggplot
!2Now let’s step it up to work with some data “in the wild” to answer some research questions. This will have you combine your dplyr
skills and add some new things such as importing with readr
.
Let’s look at fivethirtyeight’s article " The Economic Guide To Picking A College Major ". fivethirtyeight is great about making the data behind their articles public, we can download all of their data here. Search for college majors
and click download (the blue arrow button).3 We will look at the recent-grads.csv
file.
The description in the readme
file for the data is as follows:
Header | Description |
---|---|
Rank |
Rank by median earnings |
Major_code |
Major code, FO1DP in ACS PUMS |
Major |
Major description |
Major_category |
Category of major from Carnevale et al |
Total |
Total number of people with major |
Sample_size |
Sample size (unweighted) of full-time, year-round ONLY (used for earnings) |
Men |
Male graduates |
Women |
Female graduates |
ShareWomen |
Women as share of total |
Employed |
Number employed (ESR == 1 or 2) |
Full_time |
Employed 35 hours or more |
Part_time |
Employed less than 35 hours |
Full_time_year_round |
Employed at least 50 weeks (WKW == 1) and at least 35 hours (WKHP >= 35) |
Unemployed |
Number unemployed (ESR == 3) |
Unemployment_rate |
Unemployed / (Unemployed + Employed) |
Median |
Median earnings of full-time, year-round workers |
P25th |
25th percentile of earnigns |
P75th |
75th percentile of earnings |
College_jobs |
Number with job requiring a college degree |
Non_college_jobs |
Number with job not requiring a college degree |
Low_wage_jobs |
Number in low-wage service jobs |
read_csv()
and assign it to tibble
called majors
.4 The first argument of this command the name of the original file, in quotes.5## Parsed with column specification:
## cols(
## .default = col_double(),
## Major = col_character(),
## Major_category = col_character()
## )
## See spec(...) for full column specifications.
dplyr
)glimpse()
. This is a suped-up version of str()
in tidyverse
.## Observations: 173
## Variables: 21
## $ Rank <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, …
## $ Major_code <dbl> 2419, 2416, 2415, 2417, 2405, 2418, 6202, 5…
## $ Major <chr> "PETROLEUM ENGINEERING", "MINING AND MINERA…
## $ Total <dbl> 2339, 756, 856, 1258, 32260, 2573, 3777, 17…
## $ Men <dbl> 2057, 679, 725, 1123, 21239, 2200, 2110, 83…
## $ Women <dbl> 282, 77, 131, 135, 11021, 373, 1667, 960, 1…
## $ Major_category <chr> "Engineering", "Engineering", "Engineering"…
## $ ShareWomen <dbl> 0.1205643, 0.1018519, 0.1530374, 0.1073132,…
## $ Sample_size <dbl> 36, 7, 3, 16, 289, 17, 51, 10, 1029, 631, 3…
## $ Employed <dbl> 1976, 640, 648, 758, 25694, 1857, 2912, 152…
## $ Full_time <dbl> 1849, 556, 558, 1069, 23170, 2038, 2924, 10…
## $ Part_time <dbl> 270, 170, 133, 150, 5180, 264, 296, 553, 13…
## $ Full_time_year_round <dbl> 1207, 388, 340, 692, 16697, 1449, 2482, 827…
## $ Unemployed <dbl> 37, 85, 16, 40, 1672, 400, 308, 33, 4650, 3…
## $ Unemployment_rate <dbl> 0.018380527, 0.117241379, 0.024096386, 0.05…
## $ Median <dbl> 110000, 75000, 73000, 70000, 65000, 65000, …
## $ P25th <dbl> 95000, 55000, 50000, 43000, 50000, 50000, 5…
## $ P75th <dbl> 125000, 90000, 105000, 80000, 75000, 102000…
## $ College_jobs <dbl> 1534, 350, 456, 529, 18314, 1142, 1768, 972…
## $ Non_college_jobs <dbl> 364, 257, 176, 102, 4440, 657, 314, 500, 16…
## $ Low_wage_jobs <dbl> 193, 50, 0, 0, 972, 244, 259, 220, 3253, 31…
What are all of the unique values of Major
? How many are there?
Major <chr> | |||
---|---|---|---|
PETROLEUM ENGINEERING | |||
MINING AND MINERAL ENGINEERING | |||
METALLURGICAL ENGINEERING | |||
NAVAL ARCHITECTURE AND MARINE ENGINEERING | |||
CHEMICAL ENGINEERING | |||
NUCLEAR ENGINEERING | |||
ACTUARIAL SCIENCE | |||
ASTRONOMY AND ASTROPHYSICS | |||
MECHANICAL ENGINEERING | |||
ELECTRICAL ENGINEERING |
n_distinct(Major) <int> | ||||
---|---|---|---|---|
173 |
Major <chr> | Unemployment_rate <dbl> | |
---|---|---|
MATHEMATICS AND COMPUTER SCIENCE | 0.000000000 | |
MILITARY TECHNOLOGIES | 0.000000000 | |
BOTANY | 0.000000000 | |
SOIL SCIENCE | 0.000000000 | |
EDUCATIONAL ADMINISTRATION AND SUPERVISION | 0.000000000 | |
ENGINEERING MECHANICS PHYSICS AND SCIENCE | 0.006334343 | |
COURT REPORTING | 0.011689692 | |
MATHEMATICS TEACHER EDUCATION | 0.016202835 | |
PETROLEUM ENGINEERING | 0.018380527 | |
GENERAL AGRICULTURE | 0.019642463 |
Major <chr> | ShareWomen <dbl> | |||
---|---|---|---|---|
EARLY CHILDHOOD EDUCATION | 0.9689537 | |||
COMMUNICATION DISORDERS SCIENCES AND SERVICES | 0.9679981 | |||
MEDICAL ASSISTING SERVICES | 0.9278072 |
Major_category <chr> | n <int> | |||
---|---|---|---|---|
Interdisciplinary | 1 | |||
Communications & Journalism | 4 | |||
Law & Public Policy | 5 | |||
Industrial Arts & Consumer Services | 7 | |||
Arts | 8 | |||
Psychology & Social Work | 9 | |||
Social Science | 9 | |||
Agriculture & Natural Resources | 10 | |||
Physical Sciences | 10 | |||
Computers & Mathematics | 11 |
Next, make a variable stem
, for whether or not a Major_category
is "stem"
or "not_stem"
.8
Then summarize()
median
for stem and not stem groups.
stem <chr> | mean(Median) <dbl> | |||
---|---|---|---|---|
not_stem | 35624.77 | |||
stem | 47860.94 |
Hint: do year
first, if you do continent
first, there are no years to group by!↩︎
Hint: You can use .
as a placeholder.↩︎
This will download a .zip
file that contains many spreadsheets. Unzip it with a program that unzips files (such as WinZip, 7-zip, the Unarchiver, etc).↩︎
One way to avoid error messages is to move it to the same file as R’s working directory, which again you can determine with getwd()
.↩︎
If the file is in a different folder, the argument is the full path in quotes.↩︎
You won’t be able to read the labels easily, so add theme(axis.text.x=element_text(angle=45, hjust=1)
to angle x
-axis labels (and move them down by 1)↩︎
Hint: use group_by
first.↩︎
Hint: try out the ifelse()
function which has three inputs: condition(s) for a variable(s), what to do if TRUE
(the if), and what to if FALSE
(the else).↩︎