This is a notebook that introduces the most important features of the tidyverse: ggplot() (for plots), piping, and some key functions for data wrangling (filter, arrange, gather, spread, group_by & summarise).

GETTING STARTED

First, we install and load the most important collection of packages in R for Data Science: the tidyverse.

if (!require("tidyverse")) install.packages(c("tidyverse")) # Only installs if missing
library(tidyverse)

A glimpse of the data: the first step is to have a look! We use the head() function: it gives the first 6 rows of its argument. We will work with the diamonds dataset, which is embedded within the tidyverse. The tail() function shows the last 6 rows.

head(diamonds) # diamonds is a built-in dataset in the tidyverse (ggplot2, actually)
# Below the name of each variable, you will find the variable type:
# dbl = double = real number,
# int = integer = integer number,
# ord = ordered factor,
# fct = unordered factor,
# chr = character.
tail(diamonds)
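
As a small complement, here is a sketch of a few other ways to take a first look. It assumes only the diamonds dataset used above; the choices n = 10 and n = 3 are arbitrary.

```r
library(tidyverse)

head(diamonds, n = 10)   # first 10 rows instead of the default 6
tail(diamonds, n = 3)    # last 3 rows
glimpse(diamonds)        # one line per column: name, type, first values
dim(diamonds)            # number of rows, then number of columns
```

glimpse() is convenient for wide datasets because every column fits on screen, whatever the number of variables.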

FIRST PLOTS

A first example of one use of ggplot, the master function for plotting in R. The syntax is a bit strange at first, but after many examples, it becomes clear. The type of graph is determined by a ‘geom’: geom_point for a scatter plot, geom_line for a line and geom_histogram, whose name is self-explanatory. Plots are always defined by the variable shown on the x-axis and sometimes by that featured on the y-axis. This information is embedded in the aesthetics wrapper aes(). The aes() can be specified either at the root of ggplot() or at the geom level. As we show below, many (many) plotting options are available.

ggplot(diamonds) + geom_point(aes(x = carat, y = price), size = 0.5, color = "#004C99") + xlim(0, 3) +
    stat_smooth(aes(x = carat, y = price), method = "lm", color = "red")

Note: 32 instances are absent because the corresponding diamonds are too big: they exceed the upper limit of 3 carats set by xlim(). The red line (stat_smooth part) shows the linear fit of price against size. The color can be provided using the hexadecimal code for RGB colors. See, e.g., https://www.w3schools.com/colors/colors_picker.asp
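
The plot above specifies aes() inside each geom. As a sketch of the alternative mentioned earlier, the same mapping can be declared once at the root of ggplot(), in which case the layers inherit it. The object name p_root is only illustrative.

```r
library(tidyverse)

# aes() at the root: both layers below inherit x = carat and y = price
p_root <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(size = 0.5, color = "#004C99") +
  stat_smooth(method = "lm", color = "red") +
  xlim(0, 3)
p_root

# Diamonds heavier than 3 carats, i.e. those that fall outside xlim(0, 3)
sum(diamonds$carat > 3)
```

Declaring the mapping at the root avoids repeating it, while geom-level aes() remains useful when layers need different variables.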

We can add features: titles, limits on both axes, nonlinear scales for these axes, etc. In the graph below, the clarity of each diamond is shown with a color and the quality of the cut is represented by the size of the point. The first choice is a good one, the second one could be debated.

ggplot(diamonds) + geom_point(aes(x = carat, y = price, color = clarity, size = cut), alpha = 3/10) + 
  scale_y_sqrt() + ggtitle("Plot of diamond prices") + xlim(0.1,4) 

Another popular graph type is the barplot. It does not require a y-axis specification: it counts the number of occurrences in the sample. If a y is specified, then the option stat = "identity" must be set inside geom_bar(). One example will show how to proceed below.

ggplot(diamonds) + geom_bar(aes(x = cut, fill = clarity)) + ggtitle("Number of diamonds") 
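
To make the counting step explicit, the following sketch computes the counts with count() and then draws the same bars from the precomputed table. The name bar_counts is hypothetical; the column n is the one count() creates by default.

```r
library(tidyverse)

# Counts that geom_bar() computes internally: one row per (cut, clarity) pair
bar_counts <- count(diamonds, cut, clarity)
bar_counts

# Same picture from the precomputed counts: y is given, so stat = "identity"
ggplot(bar_counts) +
  geom_bar(aes(x = cut, y = n, fill = clarity), stat = "identity") +
  ggtitle("Number of diamonds")
```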

Plotting is closely related to layers (like in Photoshop). Several layers can be superimposed, and they are drawn in the order in which they are added. Alternative colour palettes can be used. Below, we add one layer that consists of one horizontal line.

ggplot(diamonds,aes(x = price, fill = color)) + geom_hline(yintercept = 1000) + 
  geom_histogram(bins = 15, position= "dodge") + ggtitle("Number of diamonds") +
    scale_fill_brewer(palette = "RdYlBu")
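
The layer logic can be sketched by building the plot step by step and storing each stage in a variable. The object names are illustrative. Here the horizontal line is added last, so it is drawn on top of the bars rather than underneath.

```r
library(tidyverse)

base_plot <- ggplot(diamonds, aes(x = price, fill = color))    # no layer yet
with_hist <- base_plot + geom_histogram(bins = 15, position = "dodge")
with_line <- with_hist + geom_hline(yintercept = 1000)         # added last, drawn on top

length(with_hist$layers)   # number of layers after the histogram
length(with_line$layers)   # number of layers after the line
with_line
```

Storing intermediate plots makes it easy to try several variants of the last layer without retyping the whole call.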

Further references for graph types:
https://developers.google.com/chart/interactive/docs/gallery
https://plot.ly/r/
https://www.r-graph-gallery.com/all-graphs/
https://plot.ly/r/shiny-gallery/ (dynamic & interactive)

DATA MANIPULATION

First, it is imperative to get to know the data. Usually, we first resort to descriptive statistics.

summary(diamonds)
     carat               cut        color        clarity          depth           table      
 Min.   :0.2000   Fair     : 1610   D: 6775   SI1    :13065   Min.   :43.00   Min.   :43.00  
 1st Qu.:0.4000   Good     : 4906   E: 9797   VS2    :12258   1st Qu.:61.00   1st Qu.:56.00  
 Median :0.7000   Very Good:12082   F: 9542   SI2    : 9194   Median :61.80   Median :57.00  
 Mean   :0.7979   Premium  :13791   G:11292   VS1    : 8171   Mean   :61.75   Mean   :57.46  
 3rd Qu.:1.0400   Ideal    :21551   H: 8304   VVS2   : 5066   3rd Qu.:62.50   3rd Qu.:59.00  
 Max.   :5.0100                     I: 5422   VVS1   : 3655   Max.   :79.00   Max.   :95.00  
                                    J: 2808   (Other): 2531                                  
     price             x                y                z         
 Min.   :  326   Min.   : 0.000   Min.   : 0.000   Min.   : 0.000  
 1st Qu.:  950   1st Qu.: 4.710   1st Qu.: 4.720   1st Qu.: 2.910  
 Median : 2401   Median : 5.700   Median : 5.710   Median : 3.530  
 Mean   : 3933   Mean   : 5.731   Mean   : 5.735   Mean   : 3.539  
 3rd Qu.: 5324   3rd Qu.: 6.540   3rd Qu.: 6.540   3rd Qu.: 4.040  
 Max.   :18823   Max.   :10.740   Max.   :58.900   Max.   :31.800  
                                                                   

If need be, the data can be ordered, according to specific variables, using the arrange() function.

arrange(diamonds, desc(carat), price) # First, descending carat, then, increasing price.

The data is first ordered according to descending weight, and for a given weight (carat), it is sorted according to increasing price.
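
The tie-breaking rule is easier to see on a tiny table. The sketch below uses a made-up four-row tibble (the values are invented for illustration only).

```r
library(tidyverse)

toy <- tibble(carat = c(1.0, 2.0, 2.0, 1.0),
              price = c(500, 900, 700, 300))

sorted <- arrange(toy, desc(carat), price)
sorted
# carat: 2, 2, 1, 1
# price: 700, 900, 300, 500
```

Within each carat value, rows are ordered by increasing price, which is exactly what happens on the full dataset.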

Filters

Often, people are interested in subsets of a dataset. Subsetting (i.e. filtering) can be performed over rows (occurrences). filter() is an amazing function that does just that.

filter(diamonds, carat > 4)
filter(diamonds, carat > 3 & cut == "Ideal")
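
Conditions can be combined in several ways. The sketch below shows three common patterns; the thresholds and variable names are arbitrary choices for illustration.

```r
library(tidyverse)

# Separating conditions with a comma is equivalent to combining them with &
big_ideal_and   <- filter(diamonds, carat > 3 & cut == "Ideal")
big_ideal_comma <- filter(diamonds, carat > 3, cut == "Ideal")
identical(big_ideal_and, big_ideal_comma)

# | keeps rows that satisfy at least one of the conditions
filter(diamonds, carat > 4 | price > 18800)

# %in% tests membership in a set of values
low_cut <- filter(diamonds, cut %in% c("Fair", "Good"))
nrow(low_cut)   # 1610 Fair + 4906 Good, as reported by summary(diamonds)
```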

Subsetting can be performed over columns as well. Indeed, sometimes, only a few columns matter and it can be useful to keep only those (possibly in a separate variable). We use select() for this purpose.

select(diamonds, carat, cut, color, price)
select(diamonds, - clarity, - x, - y, - z) # Using the minus sign performs the opposite manipulation and removes the corresponding variable
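
Beyond listing names one by one, select() accepts ranges and helper functions. A minimal sketch, assuming a reasonably recent dplyr; the variable names are illustrative:

```r
library(tidyverse)

dims     <- select(diamonds, x:z)                   # contiguous range of columns
c_cols   <- select(diamonds, starts_with("c"))      # carat, cut, color, clarity
reorder  <- select(diamonds, price, everything())   # price first, then all the rest

names(dims)
names(c_cols)
names(reorder)
```

The last form is a handy way to move a column to the front without dropping anything.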

Piping

Data manipulation is all about sequences of tasks: filtering rows, selecting columns, re-arranging, plotting, etc. A very convenient way to write these sequences is to resort to the PIPE operator %>%. It works like this:
Data variable %>% Instruction 1 %>% Instruction 2 %>% Instruction 3 %>% …
Below, we successively apply a filter and a selection. A plot can also be appended at the end of the chain.

filter(diamonds, carat > 4) %>% select(carat, cut, color, price) # Or, equivalently,
diamonds %>% 
    filter(carat > 4) %>% 
    select(carat, cut, color, price)
diamonds %>% 
    filter(cut == "Fair") %>% 
    ggplot(aes(x = carat, y = price)) + geom_point() + geom_smooth() 

# The last part (geom_smooth) aims to ease visual pattern detection

The blue curve shows the conditional (local) average of prices for a given level of size (carat). For ease of readability, it is customary to start a new line after each pipe. I often fail to comply with this rule and apologise for that.
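
To see what the pipe does, the sketch below compares a nested call with its piped version. It also uses the native pipe |>, which assumes R 4.1 or later; the object names are illustrative.

```r
library(tidyverse)

# Nested call: must be read from the inside out
nested <- select(filter(diamonds, carat > 4), carat, cut, color, price)

# Piped call: the left-hand side becomes the first argument of the next function
piped <- diamonds %>%
  filter(carat > 4) %>%
  select(carat, cut, color, price)

# Native pipe (assumes R >= 4.1): same result here
native <- diamonds |>
  filter(carat > 4) |>
  select(carat, cut, color, price)

identical(nested, piped)
identical(piped, native)
```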

Pivot tables

The tidyverse is very efficient at building pivot tables. Effortlessly (almost), they can then be plotted. The procedure requires 2 steps and 2 functions. First, the variables of interest are specified via group_by(). Second, the desired metric is defined via summarise().

diamonds %>% 
    group_by(cut) %>%                                # We focus on cut
    summarise(number = n(), avg_price = mean(price)) # n() simply counts the number of occurrences
diamonds %>% 
    group_by(cut, clarity) %>% 
    summarise(avg_carat = mean(carat), avg_price = mean(price)) %>%
    ggplot() + geom_point(aes(x = avg_carat, y = avg_price, color = clarity, shape = cut), size = 2) # Yes, you can control the point size! A constant size goes outside aes()

diamonds %>% 
    group_by(clarity) %>%
    summarise(avg_carat = mean(carat)) %>%
    ggplot() + geom_bar(aes(x = clarity, y = avg_carat), stat = "identity")

In the second graph, points cluster by colour, hence by clarity. Clarity is a big deal for large diamonds. Internally flawless diamonds are very expensive. We used shape to denote cut, but this decision is ill-advised: cut is an ordered variable, whereas shapes carry no natural order. Size would be slightly better, though not perfect.
The last graph shows that, as clarity improves, diamond size shrinks (on average).
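
As a further sketch, a pivot table can hold several metrics at once and be sorted afterwards. The .groups = "drop" argument assumes dplyr 1.0.0 or later; it returns an ungrouped table. The names pivot and max_carat are illustrative.

```r
library(tidyverse)

pivot <- diamonds %>%
  group_by(cut, clarity) %>%
  summarise(number    = n(),           # count per (cut, clarity) pair
            avg_price = mean(price),
            max_carat = max(carat),
            .groups   = "drop") %>%    # drop the grouping in the result
  arrange(desc(avg_price))             # most expensive groups first
pivot
```

Dropping the grouping avoids surprises when the table is passed to later steps that should operate on all rows at once.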

Spreading and gathering

Rectangular data can be organized in several ways. ggplot() expects ‘columnised’ data, with one column per variable. From a matrix (2 variables: one for rows and one for columns), you obtain a columnised variant via gather(). The opposite operation is performed by spread().

diamonds %>% group_by(cut, color) %>% summarise(avg_price = mean(price)) 
# this is one way to see things.. how do I get a matrix out of this? => spread()!
diamonds %>% group_by(cut, color) %>% summarise(avg_price = mean(price)) %>% spread(key = cut, value = avg_price)

The other way around, via gather().

m <- matrix(1:15, nrow = 5) %>% data.frame      # Creates a matrix embedded into a dataframe
colnames(m) <- c("small", "medium", "large")    # Let's add some column names
m                                               # Let's see it    
gather(m, key = size, value = number)           # 'columnised version'

The gather() function has taken all columns and concatenated them vertically. The names of the original columns are stored in the column named by the key argument (here, ‘size’), and the content of the original matrix is stored in the column named by the value argument (‘number’ in our case).
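
In recent versions of tidyr (1.0.0 and later), gather() and spread() are superseded by pivot_longer() and pivot_wider(). The sketch below reproduces the two operations above with the newer functions; the object names are illustrative. Note that pivot_longer() orders the output row by row, whereas gather() stacks whole columns, so the row order differs.

```r
library(tidyverse)

m <- data.frame(small = 1:5, medium = 6:10, large = 11:15)

# Wide to long: counterpart of gather(m, key = size, value = number)
long_old <- gather(m, key = size, value = number)
long_new <- pivot_longer(m, cols = everything(),
                         names_to = "size", values_to = "number")

# Long to wide: counterpart of spread(key = cut, value = avg_price)
wide_new <- diamonds %>%
  group_by(cut, color) %>%
  summarise(avg_price = mean(price), .groups = "drop") %>%
  pivot_wider(names_from = cut, values_from = avg_price)
wide_new
```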

EXERCISES

First steps

  1. Set your working directory and load the tidyverse package
  2. Import the anes dataset (choose the RData file: technical spoiler).
  3. Have a quick look at the data (e.g., the first 6 lines).
  4. Summarise the dataset very briefly.

Working tidily

  1. Keep only the males aged exactly 87 at the time of the study. How numerous are they?
  2. How old is the oldest man/woman in the sample?
  3. Create a pivot table (PT) that computes, for each gender and religion, the average age of the respondents.
  4. From that PT, create a 2*4 matrix (each row corresponds to a gender, each column to a religion).

Plotting

  1. Plot the histogram of the Age variable. Choose the ‘best’ number of bins!
  2. Create a pivot table that counts the number of respondents for each gender, religion and party affiliation.
  3. Plot (in any way you find relevant) the resulting table.
