R and RStudio (1/2)
- R is a programming language
- RStudio is the most widespread IDE associated with R
- With Python, R is one of the two top data science languages
- Millions of R users worldwide and the number increases rapidly
- Examples of use cases:
Roughly speaking (very personal opinion):
- R is usually better for statistics;
- Python is preferred by computer scientists and people working on deep learning;
- both are amazing for data science, graphs and simple machine learning.
R and RStudio (2/2)
Beyond statistical analyses and data science, you can do lots of things with R: reports, websites, books, applications, and slides (like these ones).
You can also produce Word & PowerPoint files (see https://ardata-fr.github.io/officeverse/), but these are usually ugly.
R & RStudio can easily be combined with other languages (Python, C/C++, SQL, JavaScript, etc.).
Moreover, the R community is very inclusive and kind.
In short, R is pretty cool. 😎
The learning curve
“The greatest teacher, failure is.” Master Yoda in The Last Jedi.
“Failure is an option. If things are not failing, you are not innovating enough.”
The end of the learning curve
What does “where I want you to be” mean?
Here are 2 examples of some of my students’ dynamic dashboards:
My role/hope is to make you code similar objects.
But you need to start working hard early.
The philosophy of the course
Knowledge is not free: you have to pay attention.
- Learning comes from YOU! More than 80% of what you will get from the course will come from your efforts. Passive listening \(\neq\) learning.
- Remote teaching is a major barrier, because attention spans are short, especially in front of pedagogical videos!
- Remote teaching is not a major hurdle, because progress only comes from practice anyway (on your computer, outside the video sessions).
- I will always be there to help. Google, Stack Overflow & chatGPT are your best friends. I’m next on the list.
- To optimize my feedback, be as precise as possible. (best solution: send me files & code, not screenshots!)
About LLMs
- Large Language Models (LLMs) such as GPT & co can be useful.
- But the wise student should know the difference between asking for help and outsourcing.
- The goal is to learn to code, not to prompt!
- If chatGPT can do a job: you won’t get the job!
- Also: chatGPT makes a lot of errors. You need to (double) check.
Errors, errors, errors
=> Debugging!
The data science workflow
In this course, we will mostly overlook the Model part.
My only goal: that Excel becomes marginal in your workflow! 😉
How shall we proceed?
Course structure (1/2)
- In-class sessions: a mix of slides & tutorials
- Exercises: practice is the most important
- A personal project, in two steps:
  - A short presentation of the project + dataset search & formatting (~10-15h work), due
  - The full report, code & deployment (~20-60h work), due
- Please do not ask for adjournments (the deadlines are comfortable).
Questions are crucial: don’t be shy, there is no such thing as a bad question.
There are only bad teachers :)
Seriously, don’t be shy. 😊
Often, seemingly ridiculous questions are not ridiculous at all.
R is new to you, it can be overwhelming (and I know it).
Course structure (2/2)
- Introduction to R/RStudio & the tidyverse
- Baseline R & data structures + import/export
- Plots + options
- Shiny 1 - User interface & Server
- Shiny 2 - UI layout (organization: tabs, rows, columns, boxes, menus, etc.)
- Shiny 3 - Deployment + further options (CSS, themes & advanced tricks)
- Geocomputing / Text Mining / APIs
- Leveraging chatGPT / Advanced modelling / + options \(\rightarrow\) possibly to be defined together
\(\rightarrow\) Don’t hesitate to submit ideas or wishes!
What RStudio looks like
The greatness of notebooks
About packages
One of the great strengths of R is a (very) large collection of packages.
Packages are libraries that expand the capabilities of R.
The most important one is in fact a collection of packages called the tidyverse.
It includes:
- ggplot2: the best plotting engine in the world (seriously)
- dplyr: for data manipulation
- tidyr: helps you work with tidy data (more on that soon)
- readr and readxl: import rectangular data files (csv, Excel, etc.)
- and more!
Install packages
There are two steps: first you need to download the package (only once, like pip in Python). The packages are downloaded from servers all around the world. So we start by choosing one.
# chooseCRANmirror() to set downloading location
chooseCRANmirror(ind = 1) # see getCRANmirrors() for geographical details
install.packages("tidyverse") # Install (download) package: only once
NOTE (reminder): it’s easy to install packages directly in RStudio (“Tools” menu). The .libPaths() function shows where packages are installed on your computer.
Then, if you want to work with it, you need to load/activate it (for every session).
library(tidyverse) # Loading, like "import..." in Python
“Unofficial” packages
Sometimes, packages are not checked and distributed by CRAN (the Comprehensive R Archive Network).
They are simply hosted on Github.
To install them, use the devtools package.
But you have to install devtools first!
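As a rough sketch of the two steps above (install devtools from CRAN, then install from Github), one could wrap them in a small helper. The helper name install_from_github and the repository "some_user/some_package" are placeholders made up for illustration; devtools::install_github() is the function that does the actual work.

```r
# "some_user/some_package" is a placeholder, NOT a real repository.
repo <- "some_user/some_package"

# Hypothetical helper: installs devtools if needed, then the Github package.
install_from_github <- function(repo) {
  # devtools itself is on CRAN, so it is installed the usual way (only once).
  if (!requireNamespace("devtools", quietly = TRUE)) {
    install.packages("devtools")
  }
  # Download and install the package hosted at github.com/<repo>.
  devtools::install_github(repo)
}

# install_from_github(repo)  # uncomment once repo points to a real repository
```

Nothing is downloaded until the last line is uncommented, so the chunk is safe to run as is.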
Comments and output
- One hashtag # precedes a comment in R code.
- Two hashtags ## precede output from a code sequence in the slides or notebooks.
1+1 # Test!
## [1] 2
In these slides, code will appear in grey areas (rectangles).
The assignment operator <-
In R, we don’t use “=” to create variables, but an arrow sign “<-”.
Though “=” works, too.
a <- 6 # This creates a variable but does not show it!
a # If you want to see it, ask for it!
## [1] 6
b <- 11:42 # This creates a variable but does not show it!
b # If you want to see it, ask for it!
## [1] 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
## [26] 36 37 38 39 40 41 42
The brackets (e.g., [26]) indicate the position of the first element in the row.
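The same positions can be used to extract elements. A minimal illustration, reusing the vector b from above (the other names are only there for the example):

```r
b <- 11:42               # same vector as above
n_elements <- length(b)  # number of elements in b
elem_26 <- b[26]         # the 26th element: the first one printed on the second row
first_three <- b[1:3]    # several positions at once
elem_26
first_three
```

Note that positions start at 1 in R (not at 0 as in Python).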
Tidy data via the package tidyr
Many ways to organize data in rectangles, but:
source: Allison Horst & Julia Stewart Lowndes https://www.openscapes.org/blog/2020/10/12/tidy-data/
Instances vs variables
The diamonds dataset is included in the tidyverse (it ships with ggplot2). The head() function shows the first lines of a dataset. The tail() function shows the last lines.
head(diamonds, 4) # The number gives the number of rows shown
carat | cut | color | clarity | depth | table | price | x | y | z |
0.23 | Ideal | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 |
0.21 | Premium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 |
0.23 | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 |
0.29 | Premium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 |
One instance = one observation = one row.
One variable = one unique characteristic = one column.
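To count instances and variables, a quick sketch with base R functions. It uses iris, a dataset built into R, instead of diamonds, simply so that no package is required:

```r
# iris is a built-in dataset, swapped in here so the example needs no package.
head(iris, 3)               # first 3 rows (instances)
tail(iris, 3)               # last 3 rows
n_instances <- nrow(iris)   # number of rows = number of instances
n_variables <- ncol(iris)   # number of columns = number of variables
names(iris)                 # the variable (column) names
```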
Variable types
- number (numerical): integer (int) or double (dbl)
- character: text (chr)
- factor: categorical, ordered (ord) or not (fct)
- logical (boolean): TRUE or FALSE / T or F (lgl)
- date: day precision (date) or second precision (dttm); internally counted from 1970-01-01
NOTE: we are only concerned with rectangular / structured datasets.
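The types listed above can be created and inspected in base R. This is only an illustration; all variable names and values are made up:

```r
x_int  <- 3L                                   # integer (note the L)
x_dbl  <- 3.14                                 # double
x_chr  <- "diamond"                            # character
x_fct  <- factor(c("E", "I", "E"))             # unordered factor
x_ord  <- factor(c("Good", "Ideal", "Good"),
                 levels = c("Good", "Ideal"),
                 ordered = TRUE)               # ordered factor
x_lgl  <- TRUE                                 # logical (boolean)
x_date <- as.Date("1970-01-02")                # date, day precision
x_time <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")  # date-time, second precision

typeof(x_int)        # "integer"
typeof(x_dbl)        # "double"
class(x_chr)         # "character"
is.ordered(x_fct)    # FALSE
is.ordered(x_ord)    # TRUE
as.numeric(x_date)   # 1: days elapsed since 1970-01-01
as.numeric(x_time)   # 60: seconds elapsed since 1970-01-01 00:00:00 UTC
```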
Tidiness structures data & thought
source: Allison Horst & Julia Stewart Lowndes
Tidy data benefits from hundreds of tools
source: Allison Horst & Julia Stewart Lowndes
Tidy data: example via gapminder
install.packages("gapminder") # Install (download) package: only once
Tidy data satisfies the (row = instance) & (column = variable) structure.
library(gapminder) # Activate: each time you launch RStudio
head(gapminder, 3) # Have a look!
country | continent | year | lifeExp | pop | gdpPercap |
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 |
Tidy data: counter-example
The table below shows how the population of three countries evolves over time.
pivot_wider(gapminder[c(1:4,13:16,25:28), c(1,3,5)], # Don't look at this code!
            names_from = "country", values_from = "pop")
year | Afghanistan | Albania | Algeria |
1952 | 8425333 | 1282697 | 9279525 |
1957 | 9240934 | 1476505 | 10270856 |
1962 | 10267083 | 1728137 | 11000948 |
1967 | 11537966 | 1984060 | 12760499 |
\(\rightarrow\) Not tidy! The columns are not VARIABLES!
(This is typically the Excel format.)
Tidy tools (illustration)
Tidy tools
The tidyverse has two functions that switch from matrix/Excel format to tidy data and back:
- pivot_longer(): from matrix/Excel to tidy data (wide_to_long/melt in pandas)
- pivot_wider(): from tidy data to matrix/Excel (pivot in pandas)
Year | France | Germany | UK |
1970 | 52 | 61 | 56 |
1990 | 59 | 80 | 57 |
2010 | 65 | 82 | 63 |
BE VERY CAREFUL: letter case matters in R!
Continent \(\neq\) continent.
When referring to a variable (column name), a mistake in the case will lead to an error.
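A small sketch of what this means in practice. The table is rebuilt by hand from the values shown above under the name not_tidy_pop; the unit of the figures (presumably millions of inhabitants) is an assumption. The variable has_error is only there for the illustration.

```r
library(dplyr)  # provides tibble() and select()

# Wide ("Excel-like") table, typed in by hand from the slide.
not_tidy_pop <- tibble(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

select(not_tidy_pop, Year)   # works: the column is called Year

# year (lower case) is NOT a column: select() stops with an error.
has_error <- tryCatch({
  select(not_tidy_pop, year)
  FALSE
}, error = function(e) TRUE)
has_error
```

When in doubt, names(not_tidy_pop) lists the exact spelling of the columns.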
Tidy tool: pivot_longer()! From wide to long.
pivot_longer() (formerly gather()) stacks columns which belong to the same variable.
tidy_pop <- pivot_longer(not_tidy_pop, cols = -Year, names_to = "Country", values_to = "Population")
tidy_pop[1:7,] # First 7 lines (only) shown
Year | Country | Population |
1970 | France | 52 |
1970 | Germany | 61 |
1970 | UK | 56 |
1990 | France | 59 |
1990 | Germany | 80 |
1990 | UK | 57 |
2010 | France | 65 |
The syntax is the following:
\(\quad\) cols = columns to tidy,
\(\quad\) names_to = name_of_the_new_variable,
\(\quad\) values_to = name_of_the_column_with_values
names_to = "Country" because the columns are all countries.
values_to = "Population" because the data pertains to population values.
We use -Year because the Year variable is excluded from the pivoting.
Source: software carpentry
Tidy tools: pivot_wider()! From long to wide.
The reverse operation (no need for the “cols” argument this time).
pivot_wider(tidy_pop, names_from = "Country", values_from = "Population")
Year | France | Germany | UK |
1970 | 52 | 61 | 56 |
1990 | 59 | 80 | 57 |
2010 | 65 | 82 | 63 |
\(\quad\) names_from = variable_to_be_put_in_columns,
\(\quad\) values_from = where_to_get_values
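To check that the two functions are indeed the reverse of each other, here is a self-contained round trip. The table is typed in by hand from the slides; the names back and same_table are only used for this check.

```r
library(tidyr)  # provides pivot_longer() and pivot_wider()

# Wide table, rebuilt by hand from the slide.
not_tidy_pop <- tibble::tibble(
  Year    = c(1970, 1990, 2010),
  France  = c(52, 59, 65),
  Germany = c(61, 80, 82),
  UK      = c(56, 57, 63)
)

# Wide -> long: one row per (Year, Country) pair.
tidy_pop <- pivot_longer(not_tidy_pop, cols = -Year,
                         names_to = "Country", values_to = "Population")
dim(tidy_pop)   # 9 rows (3 years x 3 countries), 3 columns

# Long -> wide: back to one column per country.
back <- pivot_wider(tidy_pop, names_from = "Country", values_from = "Population")
same_table <- isTRUE(all.equal(as.data.frame(back), as.data.frame(not_tidy_pop)))
same_table
```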
Data manipulation via the package dplyr
filter() rows - Part I
Often, analyses are performed on subsets of data (query in pandas).
filter(gapminder, lifeExp > 81.5) # Countries where people live long lives on average
country | continent | year | lifeExp | pop | gdpPercap |
Hong Kong, China | Asia | 2007 | 82.208 | 6980412 | 39724.98 |
Iceland | Europe | 2007 | 81.757 | 301931 | 36180.79 |
Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59 |
Japan | Asia | 2007 | 82.603 | 127467972 | 31656.07 |
Switzerland | Europe | 2007 | 81.701 | 7554661 | 37506.42 |
filter() rows - Part II
Filters can be combined (with commas preferably, the & operator works, too).
filter(gapminder, country == "Japan", year > 2000)
country | continent | year | lifeExp | pop | gdpPercap |
Japan | Asia | 2002 | 82.000 | 127065841 | 28604.59 |
Japan | Asia | 2007 | 82.603 | 127467972 | 31656.07 |
Only two observations for Japan post-2000.
NOTE: as in most languages, there are TWO EQUAL SIGNS (==) for the comparison.
One “=” is like the arrow (<-) and is used to assign values.
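A short sketch comparing the ways to combine conditions. The five rows are copied from the gapminder output shown earlier; the table name life and the result names are made up for the example.

```r
library(dplyr)

# Five rows copied from the filter() output above.
life <- tibble(
  country = c("Hong Kong, China", "Iceland", "Japan", "Japan", "Switzerland"),
  year    = c(2007, 2007, 2002, 2007, 2007),
  lifeExp = c(82.208, 81.757, 82.000, 82.603, 81.701)
)

with_comma <- filter(life, country == "Japan", year > 2002)   # comma = AND
with_and   <- filter(life, country == "Japan" & year > 2002)  # same result
with_or    <- filter(life, country == "Japan" | lifeExp > 82.1)  # | = OR

nrow(with_comma)   # 1: only Japan in 2007
nrow(with_or)      # 3: both Japan rows + Hong Kong
```

Writing country = "Japan" (one equal sign) inside filter() is a classic mistake and triggers an error.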
select() columns
Sometimes, you might want to keep just a few variables to ease readability.
select(gapminder[1:5,], country, year, pop)
country | year | pop |
Afghanistan | 1952 | 8425333 |
Afghanistan | 1957 | 9240934 |
Afghanistan | 1962 | 10267083 |
Afghanistan | 1967 | 11537966 |
Afghanistan | 1972 | 13079460 |
Use select(data, -variable) to remove variable: the minus sign!
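Both ways of selecting, side by side. The three rows are copied from the gapminder output shown earlier; the table name afgh and the result names are only for illustration.

```r
library(dplyr)

# Three rows copied from head(gapminder, 3).
afgh <- tibble(
  country   = c("Afghanistan", "Afghanistan", "Afghanistan"),
  continent = c("Asia", "Asia", "Asia"),
  year      = c(1952, 1957, 1962),
  lifeExp   = c(28.801, 30.332, 31.997),
  pop       = c(8425333, 9240934, 10267083)
)

kept    <- select(afgh, country, year, pop)     # keep only these columns
removed <- select(afgh, -continent, -lifeExp)   # drop these columns: minus sign

names(kept)      # "country" "year" "pop"
names(removed)   # "country" "year" "pop"
```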
Sort via arrange()
This is when you want to order your data (sort_values in pandas). Here, from smallest pop to largest.
head(arrange(gapminder, pop)) # Alternative: arrange(gapminder, desc(lifeExp)); desc() is for descending
country | continent | year | lifeExp | pop | gdpPercap |
Sao Tome and Principe | Africa | 1952 | 46.471 | 60011 | 879.5836 |
Sao Tome and Principe | Africa | 1957 | 48.945 | 61325 | 860.7369 |
Djibouti | Africa | 1952 | 34.812 | 63149 | 2669.5295 |
Sao Tome and Principe | Africa | 1962 | 51.893 | 65345 | 1071.5511 |
Sao Tome and Principe | Africa | 1967 | 54.425 | 70787 | 1384.8406 |
Djibouti | Africa | 1957 | 37.328 | 71851 | 2864.9691 |
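A compact illustration of ascending, descending and multi-column sorting. The four rows are copied from the output above; the table name small and the result names are made up.

```r
library(dplyr)

# Four rows copied from the arrange() output above, deliberately shuffled.
small <- tibble(
  country = c("Djibouti", "Sao Tome and Principe", "Djibouti", "Sao Tome and Principe"),
  year    = c(1957, 1952, 1952, 1957),
  pop     = c(71851, 60011, 63149, 61325)
)

ascending  <- arrange(small, pop)                   # smallest pop first
descending <- arrange(small, desc(pop))             # largest pop first
two_keys   <- arrange(small, country, desc(year))   # by country, then most recent year first

ascending$pop       # 60011 61325 63149 71851
descending$pop[1]   # 71851
two_keys$year       # 1957 1952 1957 1952
```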
Create new columns via mutate()
With pop and gdpPercap, you can compute total GDP!
head(mutate(gapminder, gdp = pop * gdpPercap))
country | continent | year | lifeExp | pop | gdpPercap | gdp |
Afghanistan | Asia | 1952 | 28.801 | 8425333 | 779.4453 | 6567086330 |
Afghanistan | Asia | 1957 | 30.332 | 9240934 | 820.8530 | 7585448670 |
Afghanistan | Asia | 1962 | 31.997 | 10267083 | 853.1007 | 8758855797 |
Afghanistan | Asia | 1967 | 34.020 | 11537966 | 836.1971 | 9648014150 |
Afghanistan | Asia | 1972 | 36.088 | 13079460 | 739.9811 | 9678553274 |
Afghanistan | Asia | 1977 | 38.438 | 14880372 | 786.1134 | 11697659231 |
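mutate() can create several columns at once, and a new column can reuse one created just before it in the same call. A sketch on three rows copied from the output above (the column gdp_billion and the table names are added for illustration):

```r
library(dplyr)

# Three rows copied from the gapminder output above.
afgh <- tibble(
  year      = c(1952, 1957, 1962),
  pop       = c(8425333, 9240934, 10267083),
  gdpPercap = c(779.4453, 820.8530, 853.1007)
)

afgh_gdp <- mutate(afgh,
                   gdp         = pop * gdpPercap,   # total GDP
                   gdp_billion = gdp / 1e9)         # reuses gdp, created on the line above
afgh_gdp
```

The original table afgh is left untouched: to keep the new columns, the result must be assigned (here to afgh_gdp).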
Piping: %>%
(or |>)
Definition: sequences of operations
Very often, one simple analysis will require several steps. They can be combined via the %>% or |> operators.
A fake sequence:
me %>%
\(\quad\) wake_up(time = "06:20") %>%
\(\quad\) shower(temp = 40) %>%
\(\quad\) go_to(place = "baker", with = "scooter") %>%
\(\quad\) buy(item = "bread") %>%
\(\quad\) go_to(place = "home", with = "scooter") %>%
\(\quad\) breakfast(drink = "hot_chocolate", eat = c("toast", "kiwi")) %>%
\(\quad\) toothbrush(duration = 2)
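The sequence above is fake, but the mechanics are real: x %>% f() is simply f(x). A tiny runnable check (the variable names are arbitrary; the native pipe |> requires R 4.1 or later):

```r
library(dplyr)  # provides %>% (originally from the magrittr package)

x <- c(1, 4, 9)

nested         <- sum(sqrt(x))            # read from the inside out
piped_magrittr <- x %>% sqrt() %>% sum()  # read from left to right
piped_native   <- x |> sqrt() |> sum()    # native pipe, R >= 4.1

c(nested, piped_magrittr, piped_native)   # 6 6 6
```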
Example (short)
select(filter(diamonds, carat > 4), carat, price, clarity) # Yuck!
diamonds |> filter(carat > 4) |> select(carat, price, clarity) # So simple!
carat | price | clarity |
4.01 | 15223 | I1 |
4.01 | 15223 | I1 |
4.13 | 17329 | I1 |
5.01 | 18018 | I1 |
4.50 | 18531 | I1 |
Example (long)
diamonds %>%
  filter(carat > 2, cut == "Ideal") %>% # First we filter
  mutate(car_price_ratio = carat/price) %>% # Then, we create a new column
  arrange(desc(car_price_ratio)) %>% # We order the data
  select(-x, -y, -z) %>% # We take out some superfluous columns
  head(4) # Finally, we ask for the top 4 instances
carat | cut | color | clarity | depth | table | price | car_price_ratio |
3.50 | Ideal | H | I1 | 62.8 | 57 | 12587 | 0.0002781 |
3.22 | Ideal | I | I1 | 62.6 | 55 | 12545 | 0.0002567 |
2.16 | Ideal | H | I1 | 62.2 | 56 | 8709 | 0.0002480 |
2.25 | Ideal | E | I1 | 61.4 | 54 | 9072 | 0.0002480 |
Pivot tables
“A pivot table is a table of statistics that summarizes the data of a more extensive table.”
— Wikipedia
There are two dimensions in a pivot table:
- which variable(s) we want to analyze (gender, continent/country, size, etc.);
- which statistic we want to compute (mean, min, max, number of instances, variance etc.).
In R, these two steps are separated via two functions: group_by() and summarise()
Example I
diamonds |>
  group_by(clarity, cut) |> # Define the variables
  summarise(avg_price = mean(price), # Define the statistics
            max_price = max(price),
            avg_carat = mean(carat),
            max_carat = max(carat)) |>
  head(3)
clarity | cut | avg_price | max_price | avg_carat | max_carat |
I1 | Fair | 3703.533 | 18531 | 1.361000 | 5.01 |
I1 | Good | 3596.635 | 11548 | 1.203021 | 3.00 |
I1 | Very Good | 4078.226 | 15984 | 1.281905 | 4.00 |
Example II
You can even pipe inside a function!
gapminder %>%
  group_by(continent, year) %>%
  summarise(avg_lifeExp = mean(lifeExp) %>% round(2)) %>%
  head(4)
continent | year | avg_lifeExp |
Africa | 1952 | 39.14 |
Africa | 1957 | 41.27 |
Africa | 1962 | 43.32 |
Africa | 1967 | 45.33 |
The round() function rounds numbers to a given number of decimals.
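To see exactly what group_by() and summarise() compute, here is a toy table small enough to check by hand. All values are made up; n() counts the instances in each group and .groups = "drop" returns an ungrouped table.

```r
library(dplyr)

# Toy data (invented values), small enough to verify mentally.
toy <- tibble(
  continent = c("Africa", "Africa", "Asia", "Asia", "Asia"),
  lifeExp   = c(40, 50, 60, 70, 80)
)

pivot <- toy %>%
  group_by(continent) %>%                  # Define the variable
  summarise(n_obs       = n(),             # Define the statistics
            avg_lifeExp = mean(lifeExp),
            max_lifeExp = max(lifeExp),
            .groups = "drop")
pivot
# Africa: 2 instances, average 45, maximum 50
# Asia:   3 instances, average 70, maximum 80

round(3.14159, 2)   # 3.14
```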
Bonus: tutors!
- visualize tidyverse code: https://tidydatatutor.com
- equivalent for pandas: https://pandastutor.com
R is an incredibly powerful tool for data science. The preferred environment is the tidyverse. As its name indicates, the core concept is TIDY DATA!
In short, it’s just a question of functions:
- pivot_longer() and pivot_wider() to work with tidy data;
- filter(), select(), arrange() and mutate() for wrangling/manipulation;
- group_by() and summarise() for pivot tables;
- head() and tail() to see the first and last lines of a dataset.
That’s it!
Resources / Links
- an online series of exercises with solutions: https://gcoqueret.shinyapps.io/Exercises/
- the Bible of Data Science with R: http://r4ds.hadley.nz/
- Two great books on data science: https://rafalab.github.io/dsbook/
- Shiny use cases: https://shiny.posit.co/r/gallery/
- One other app (marketing):
What are your questions?
- Programming AND learning to code is HARD:
- Ask questions so I can make it (slightly) easier.
- I can’t guess your questions, don’t be shy!
- Nothing will replace practice & making mistakes.
Tip: for your project, choose a dataset with a mixture of numerical and categorical data. It is the best combination to create a nice looking dashboard!