
Getting Set Up

Before we begin, start a new file with File > New File > R Script. As you work through this sheet in the R console, also copy and paste the commands that work into this new file. At the end, save it and run it to execute all of your commands at once.

Exploring the Data

1.

We will look at GDP per capita and life expectancy using data from the Gapminder project. There is a handy package called gapminder that provides a small excerpt of this data for exploratory analysis. Install and load the package gapminder. Type ?gapminder and hit enter to see a description of the data.

# first time only
# install.packages("gapminder")

# load gapminder
library(gapminder)

# get help
?gapminder

2.

Let’s get a quick look at gapminder to see what we’re dealing with.

a.

Get the structure of the gapminder data.

str(gapminder)
## Classes 'tbl_df', 'tbl' and 'data.frame':    1704 obs. of  6 variables:
##  $ country  : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
##  $ year     : int  1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
##  $ lifeExp  : num  28.8 30.3 32 34 36.1 ...
##  $ pop      : int  8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
##  $ gdpPercap: num  779 821 853 836 740 ...

b.

What variables are there?

# - country: a factor
# - continent: a factor
# - year: an integer
# - lifeExp: a number
# - pop: an integer
# - gdpPercap: a number
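
The same information can also be pulled out programmatically rather than read off the str() output. A minimal sketch using only base R functions on the gapminder data frame (the object name col_classes is just for illustration):

```r
library(gapminder)

# column names, in order
names(gapminder)

# the class of each column (factor, integer, or numeric)
col_classes <- sapply(gapminder, class)
col_classes

# number of rows and columns
dim(gapminder)
```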

c.

Look at the head of the dataset to get an idea of what the data looks like.

head(gapminder)
## country     continent  year lifeExp      pop gdpPercap
## <fctr>      <fctr>    <int>   <dbl>    <int>     <dbl>
## Afghanistan Asia       1952  28.801  8425333  779.4453
## Afghanistan Asia       1957  30.332  9240934  820.8530
## Afghanistan Asia       1962  31.997 10267083  853.1007
## Afghanistan Asia       1967  34.020 11537966  836.1971
## Afghanistan Asia       1972  36.088 13079460  739.9811
## Afghanistan Asia       1977  38.438 14880372  786.1134

d.

Get summary statistics of all variables.

summary(gapminder)
##         country        continent        year         lifeExp     
##  Afghanistan:  12   Africa  :624   Min.   :1952   Min.   :23.60  
##  Albania    :  12   Americas:300   1st Qu.:1966   1st Qu.:48.20  
##  Algeria    :  12   Asia    :396   Median :1980   Median :60.71  
##  Angola     :  12   Europe  :360   Mean   :1980   Mean   :59.47  
##  Argentina  :  12   Oceania : 24   3rd Qu.:1993   3rd Qu.:70.85  
##  Australia  :  12                  Max.   :2007   Max.   :82.60  
##  (Other)    :1632                                                
##       pop              gdpPercap       
##  Min.   :6.001e+04   Min.   :   241.2  
##  1st Qu.:2.794e+06   1st Qu.:  1202.1  
##  Median :7.024e+06   Median :  3531.8  
##  Mean   :2.960e+07   Mean   :  7215.3  
##  3rd Qu.:1.959e+07   3rd Qu.:  9325.5  
##  Max.   :1.319e+09   Max.   :113523.1  
## 

Simple Plots in Base R

3.

Let’s make sure you can do some basic plots before we get into the gg. Use base R’s hist() function to plot a histogram of gdpPercap.

hist(gapminder$gdpPercap)

4.

Use base R’s boxplot() function to plot a boxplot of gdpPercap.

boxplot(gapminder$gdpPercap)

5.

Now make it a boxplot by continent.¹

boxplot(gapminder$gdpPercap~gapminder$continent)

# alternate method
# boxplot(gdpPercap~continent, data = gapminder)
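
In formula notation, y ~ group reads as "y split by group". As a rough illustration of what the boxplot summarizes, the sketch below passes the same formula to aggregate() to compute the median gdpPercap for each continent, which corresponds to the center line of each box. The object name medians is just for illustration.

```r
library(gapminder)

# median GDP per capita for each continent, using the same
# formula that boxplot() accepts
medians <- aggregate(gdpPercap ~ continent, data = gapminder, FUN = median)
medians
```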

6.

Now make a scatterplot with gdpPercap on the x-axis and lifeExp on the y-axis.

plot(gapminder$lifeExp~gapminder$gdpPercap)

# alternate method
# plot(lifeExp~gdpPercap, data = gapminder)

Plots with ggplot2

7.

Load the package ggplot2. You should have installed it previously; if not, install it first with install.packages("ggplot2").

# install if you don't have
# install.packages("ggplot2")

# load ggplot2 
library(ggplot2)

8.

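Let’s first make a bar graph to see how many observations are in each continent. (Each row is one country in one year, so the bars count country-year observations, not distinct countries.) The only aesthetic you need is to map continent to x. Bar graphs are great for representing categories, but not quantitative data.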

ggplot(data = gapminder,
       aes(x = continent))+
  geom_bar()

9.

For quantitative data, we want a histogram to visualize the distribution of a variable. Make a histogram of gdpPercap. Your only aesthetic here is to map gdpPercap to x.

ggplot(data = gapminder,
       aes(x = gdpPercap))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
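
The message above is a reminder that 30 bins is only a default. A sketch of two ways to override it follows. The specific values (50 bins, a bin width of 5000) are arbitrary choices for illustration, not recommendations, and the object names p_bins and p_width are hypothetical.

```r
library(gapminder)
library(ggplot2)

# choose the number of bins directly
p_bins <- ggplot(data = gapminder,
                 aes(x = gdpPercap))+
  geom_histogram(bins = 50)

# or choose the width of each bin, in units of x
p_width <- ggplot(data = gapminder,
                  aes(x = gdpPercap))+
  geom_histogram(binwidth = 5000)

p_bins
p_width
```

Either argument silences the message, since you are no longer relying on the default.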

10.

Now let’s try adding some color, specifically, add an aesthetic that maps continent to fill.2

ggplot(data = gapminder,
       aes(x = gdpPercap,
           fill = continent))+
  geom_histogram()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

11.

Instead of a histogram, change the geom to make it a density graph. To avoid overplotting, add alpha=0.4 as an argument to the geom (alpha changes the transparency of a fill).

ggplot(data = gapminder,
       aes(x = gdpPercap,
           fill = continent))+
  geom_density(alpha=0.4)

12.

Redo your plot from 11 for lifeExp instead of gdpPercap.

ggplot(data = gapminder,
       aes(x = lifeExp,
           fill = continent))+
  geom_density(alpha=0.4)

13.

Now let’s try a scatterplot of lifeExp (as y) on gdpPercap (as x). You’ll need both as aesthetics. The geom here is geom_point().

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point()

14.

Add some color by mapping continent to color in your aesthetics.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp,
           color = continent))+
  geom_point()

15.

Now let’s try adding a regression line with geom_smooth(). Add this layer on top of your geom_point() layer.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp,
           color = continent))+
  geom_point()+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
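
As the message says, geom_smooth() fits a flexible smoother by default (loess here) rather than a straight line. If a linear regression line is what you want, one option is to pass method = "lm". A minimal sketch, with p_lm as a hypothetical object name:

```r
library(gapminder)
library(ggplot2)

# same plot as above, but with a straight-line (least squares) fit
# for each continent instead of the default smoother
p_lm <- ggplot(data = gapminder,
               aes(x = gdpPercap,
                   y = lifeExp,
                   color = continent))+
  geom_point()+
  geom_smooth(method = "lm")

p_lm
```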

16.

Did you notice that you got multiple regression lines (colored by continent)? That’s because we set a global aesthetic of mapping continent to color. If we want just one regression line, we need to instead move the color = continent inside the aes of geom_point. This will only map continent to color for points, not for anything else.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent))+
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

17.

Now add an aesthetic to your points to map pop to size.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

18.

Change the color of the regression line to "black". Try first by putting this inside an aes() in your geom_smooth, and try a second time by just putting it inside geom_smooth without an aes(). What’s the difference, and why?

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth(aes(color = "black"))
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# putting it inside aes() maps color to the constant string "black",
# which ggplot treats as data (a category named "black"), so the line
# gets a color from the default palette and a legend entry, not black

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth(color = "black")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

# putting it outside aesthetics (correctly) sets color to black
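
One way to convince yourself that aes() treats the string as data rather than as a color name: any string behaves the same way. In the sketch below, "trend line" is an arbitrary label chosen for illustration. It is assigned a palette color and shows up in the color legend under that name.

```r
library(gapminder)
library(ggplot2)

# a string inside aes() is treated as a one-value variable:
# it gets a palette color and appears in the color legend
p_label <- ggplot(data = gapminder,
                  aes(x = gdpPercap,
                      y = lifeExp))+
  geom_point(aes(color = continent))+
  geom_smooth(aes(color = "trend line"))

p_label
```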

19.

Another way to separate out continents is with faceting. Add +facet_wrap(~continent) to create subplots by continent.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth(color = "black")+
  facet_wrap(~continent)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
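
By default every panel shares the same axes, which squeezes continents with low GDP per capita against the left edge. facet_wrap() has a scales argument that relaxes this. The sketch below uses scales = "free_x" so that each panel gets its own x-axis range, at the cost of making panels harder to compare directly. The object name p_free is hypothetical.

```r
library(gapminder)
library(ggplot2)

# one panel per continent, each with its own x-axis range
p_free <- ggplot(data = gapminder,
                 aes(x = gdpPercap,
                     y = lifeExp))+
  geom_point(aes(color = continent,
                 size = pop))+
  geom_smooth(color = "black")+
  facet_wrap(~continent, scales = "free_x")

p_free
```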

20.

Remove the facet layer. The scale of the x-axis is quite awkward: a lot of points are clustered at the low end. Let’s try changing the scale by adding a layer: +scale_x_log10().

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth(color="black")+
  scale_x_log10()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
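
To see why the log scale helps, it can be useful to look at the numbers directly. The sketch below (base R only, with hypothetical object names) compares the range of gdpPercap before and after taking log10, and computes the share of observations that fall in the left half of the untransformed x-axis.

```r
library(gapminder)

# raw GDP per capita spans several orders of magnitude
range(gapminder$gdpPercap)

# on a log10 scale the same values cover a much narrower range
log_range <- range(log10(gapminder$gdpPercap))
log_range

# share of observations below the midpoint of the raw x-axis
raw_mid <- mean(range(gapminder$gdpPercap))
share_below <- mean(gapminder$gdpPercap < raw_mid)
share_below
```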

21.

Now let’s fix the labels by adding +labs(). Inside labs, give proper axis titles for x and y, and a title for the plot. If you want to change the names of the legends, add entries for color and size.

ggplot(data = gapminder,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth(color="black")+
  scale_x_log10()+
  labs(x = "GDP per Capita",
       y = "Life Expectancy",
       title = "Life Expectancy and GDP per Capita",
       color = "Continent",
       size = "Population")
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'

22.

Now let’s try subsetting by looking only at the Americas. Take the gapminder dataframe and subset it to only the rows where continent=="Americas". Assign this to a new dataframe object (call it something like america). Now, use this as your data, and redo the graph from question 17. (You might want to take a look at your new dataframe first to make sure it worked!)

america<-gapminder[gapminder$continent=="Americas",]

# verify this worked
america
## country   continent  year lifeExp      pop gdpPercap
## <fctr>    <fctr>    <int>   <dbl>    <int>     <dbl>
## Argentina Americas   1952  62.485 17876956  5911.315
## Argentina Americas   1957  64.399 19610538  6856.856
## Argentina Americas   1962  65.142 21283783  7133.166
## Argentina Americas   1967  65.634 22934225  8052.953
## Argentina Americas   1972  67.065 24779799  9443.039
## Argentina Americas   1977  68.481 26983828 10079.027
## Argentina Americas   1982  69.942 29341374  8997.897
## Argentina Americas   1987  70.774 31620918  9139.671
## Argentina Americas   1992  71.868 33958947  9308.419
## Argentina Americas   1997  73.275 36203463 10967.282

ggplot(data = america,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
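
The bracket-style subsetting above can also be written with base R's subset(), which some people find easier to read. A minimal sketch, with america_alt as a hypothetical name chosen so it does not overwrite your america object:

```r
library(gapminder)

# subset() evaluates the condition inside the data frame,
# so the gapminder$ prefix is not needed
america_alt <- subset(gapminder, continent == "Americas")

# number of rows kept; compare with the Americas count in summary(gapminder)
nrow(america_alt)
```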

23.

Try this again for the whole world, but just for observations in the year 2002.

gap_2002<-gapminder[gapminder$year==2002,]

# verify this worked
gap_2002
## country     continent  year lifeExp       pop  gdpPercap
## <fctr>      <fctr>    <int>   <dbl>     <int>      <dbl>
## Afghanistan Asia       2002  42.129  25268405   726.7341
## Albania     Europe     2002  75.651   3508512  4604.2117
## Algeria     Africa     2002  70.994  31287142  5288.0404
## Angola      Africa     2002  41.003  10866106  2773.2873
## Argentina   Americas   2002  74.340  38331121  8797.6407
## Australia   Oceania    2002  80.370  19546792 30687.7547
## Austria     Europe     2002  78.980   8148312 32417.6077
## Bahrain     Asia       2002  74.795    656397 23403.5593
## Bangladesh  Asia       2002  62.013 135656790  1136.3904
## Belgium     Europe     2002  78.320  10311970 30485.8838

ggplot(data = gap_2002,
       aes(x = gdpPercap,
           y = lifeExp))+
  geom_point(aes(color = continent, 
                 size = pop))+
  geom_smooth()
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
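
The two kinds of subset used in questions 22 and 23 can be combined by joining the conditions with &. A sketch under that assumption, keeping only Americas observations from 2002 (the name america_2002 is hypothetical):

```r
library(gapminder)

# keep rows that satisfy both conditions at once
america_2002 <- gapminder[gapminder$continent == "Americas" &
                            gapminder$year == 2002, ]

# verify: one row per country in the Americas
nrow(america_2002)
```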


  1. Hint: use formula notation with ~.

  2. In general, color refers to the outside borders of a geom (except points), while fill is the interior of an object.