The very first step after launching RStudio is to specify the working directory. You can do it directly in the Files pane (using the blue wheel), or via the setwd() function. The getwd() function returns the location of the current working directory.
getwd()
[1] "/Users/coqueret/Documents/IT/Cours/RStats/Git"
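As a minimal sketch of the setwd()/getwd() pair, the snippet below moves to a temporary folder and back. The use of tempdir() is only an assumption made so that the example runs anywhere; in practice you would pass the path of your own project folder.

```r
old_wd <- getwd()     # Remember where we started
setwd(tempdir())      # Move to a temporary folder (illustrative choice)
new_wd <- getwd()     # The working directory has changed
new_wd
setwd(old_wd)         # Come back to the original location
```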
Working with packages requires:
1 - an installation, only once (it downloads the package's code and files to your computer), unless you want to update to a newer version
2 - an activation, each time you start RStudio
if(!require(openxlsx)) {install.packages("openxlsx") } # Only installs if missing
if(!require(readxl)) {install.packages("readxl") } # Only installs if missing
# The hash sign (#) starts a comment: R ignores the rest of the line
library(openxlsx)
library(readxl)
library(tidyverse)
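Before activating packages, it can help to see which ones are missing. The sketch below only checks availability with requireNamespace() and installs nothing; the variable names wanted and is_installed are hypothetical.

```r
wanted <- c("openxlsx", "readxl", "tidyverse")   # Packages used in this session
# requireNamespace() returns TRUE if the package is installed, without attaching it
is_installed <- vapply(wanted, requireNamespace, logical(1), quietly = TRUE)
is_installed                                     # Named logical vector
wanted[!is_installed]                            # Packages that still need install.packages()
```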
Importing data is usually done directly in the user interface (Files pane), or with packages like openxlsx or readxl (to import Excel files), via the functions read.xlsx() or read_excel(). The basic case looks like this:
test_data <- read.xlsx("MyFile.xlsx")
OR
test_data <- read_excel("MyFile.xlsx")
This stores your data in the test_data variable. It assumes that the Excel file "MyFile.xlsx" exists in your working directory.
anes <- read.xlsx("anes.xlsx")
anes <- read_excel("anes.xlsx") # Same, with another package
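The import functions fail if the file is not in the working directory, so a quick check with file.exists() can save time. The sketch below uses a small CSV file and base R's read.csv() instead of an Excel file, purely so that it runs without extra packages; the file name and its content are invented for illustration.

```r
path <- file.path(tempdir(), "MyFile.csv")            # Hypothetical file location
toy <- data.frame(id = 1:3, score = c(10, 12, 9))     # Made-up data
write.csv(toy, path, row.names = FALSE)               # Create the file
if (file.exists(path)) {                              # Only read if the file is there
  test_data <- read.csv(path)
}
test_data
```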
You can create data from scratch, for instance with the colon operator, or with the seq() function for more general sequences. You can replicate items (e.g., sequences) using the rep() function.
1:10 # All integers from 1 to 10
[1] 1 2 3 4 5 6 7 8 9 10
3:17 # All integers from 3 to 17
[1] 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
seq(1,2,0.1) # The syntax is: seq(begin, end, step size)
[1] 1.0 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
rep("HA", 6)
[1] "HA" "HA" "HA" "HA" "HA" "HA"
rep(c(1,2,3,4), 3) # Syntax: rep(what_you_want_to_replicate, nb_replications)
[1] 1 2 3 4 1 2 3 4 1 2 3 4
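Two other arguments are often handy: length.out in seq() fixes the number of points instead of the step size, and each in rep() repeats every element before moving to the next one. A short sketch (the variable names are arbitrary):

```r
pts <- seq(0, 1, length.out = 5)        # 5 evenly spaced points between 0 and 1
pts                                     # 0.00 0.25 0.50 0.75 1.00
r_each <- rep(c("a", "b"), each = 2)    # Each element is repeated twice
r_each                                  # "a" "a" "b" "b"
r_times <- rep(c("a", "b"), times = 2)  # The whole vector is repeated twice
r_times                                 # "a" "b" "a" "b"
```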
A very important function: the c() function concatenates and encapsulates numbers (or text):
c(2,5,7)
[1] 2 5 7
c(1:6,12:20)
[1] 1 2 3 4 5 6 12 13 14 15 16 17 18 19 20
c("R", " is ", "awesome")
[1] "R" " is " "awesome"
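One point worth keeping in mind: a vector holds a single type, so mixing numbers and text inside c() silently turns everything into text. A small sketch:

```r
mixed <- c(1, 2, "three")   # Numbers and text together
mixed                       # "1" "2" "three": the numbers became text
class(mixed)                # "character"
```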
Another way to concatenate data is to use row-bind and column-bind functions rbind() and cbind().
rbind(c(2,5,7),c(3,1,8)) # Binding rows
[,1] [,2] [,3]
[1,] 2 5 7
[2,] 3 1 8
cbind(c(2,5,7),c(3,1,8)) # Binding columns
[,1] [,2]
[1,] 2 3
[2,] 5 1
[3,] 7 8
You can also fill in matrices: the syntax is matrix(DATA, nrow = r, ncol = c) where the length of DATA must be equal to r*c! If one dimension is omitted, R will do the division (assuming the format is correct).
m <- matrix(1:20, nrow = 4)
m2 <- matrix(1:20, nrow = 4, byrow = TRUE) # A matrix can be filled by rows or by columns
m
[,1] [,2] [,3] [,4] [,5]
[1,] 1 5 9 13 17
[2,] 2 6 10 14 18
[3,] 3 7 11 15 19
[4,] 4 8 12 16 20
m2
[,1] [,2] [,3] [,4] [,5]
[1,] 1 2 3 4 5
[2,] 6 7 8 9 10
[3,] 11 12 13 14 15
[4,] 16 17 18 19 20
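To illustrate the remark on omitted dimensions, the sketch below gives only ncol and lets R infer the number of rows. It also suggests that filling by rows amounts to filling by columns and then transposing with t(). The object names are hypothetical.

```r
m3 <- matrix(1:20, ncol = 5)                     # nrow is inferred: 20 / 5 = 4
dim(m3)                                          # 4 5
by_row <- matrix(1:20, nrow = 4, byrow = TRUE)   # Filled row by row
by_col_t <- t(matrix(1:20, nrow = 5))            # Filled by columns, then transposed
all(by_row == by_col_t)                          # TRUE: both constructions coincide
```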
Simple matrix operations in R: transposing, multiplying.
t(m) # t() is used for transposing; it works for vectors too! By default, vectors are columns
[,1] [,2] [,3] [,4]
[1,] 1 2 3 4
[2,] 5 6 7 8
[3,] 9 10 11 12
[4,] 13 14 15 16
[5,] 17 18 19 20
m*m2 # This is term-by-term multiplication
[,1] [,2] [,3] [,4] [,5]
[1,] 1 10 27 52 85
[2,] 12 42 80 126 180
[3,] 33 84 143 210 285
[4,] 64 136 216 304 400
m%*%t(m2) # This is matrix multiplication
[,1] [,2] [,3] [,4]
[1,] 175 400 625 850
[2,] 190 440 690 940
[3,] 205 480 755 1030
[4,] 220 520 820 1120
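Matrix multiplication requires compatible dimensions: the number of columns on the left must equal the number of rows on the right. The sketch below, which rebuilds m and m2 so that it is self-contained, checks the dimensions of valid products and catches the error raised by an invalid one.

```r
m  <- matrix(1:20, nrow = 4)                 # 4 x 5
m2 <- matrix(1:20, nrow = 4, byrow = TRUE)   # 4 x 5
dim(m %*% t(m2))                             # 4 4: (4 x 5) times (5 x 4)
dim(t(m) %*% m2)                             # 5 5: (5 x 4) times (4 x 5)
# m %*% m2 is not defined: (4 x 5) times (4 x 5) is non-conformable
res <- tryCatch(m %*% m2, error = function(e) "non-conformable")
res
```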
R is great at generating random data.
runif(15) # uniform distribution: 15 samples
[1] 0.29740595 0.43197770 0.99838659 0.79067341 0.99508619 0.08679626 0.74249126 0.99556576
[9] 0.43868022 0.08526857 0.05788177 0.68824065 0.13493421 0.04587817 0.91213121
rnorm(10) # Gaussian distribution (parameters could be specified), 10 samples
[1] 0.70532545 0.26341877 -0.27780366 0.40497219 -0.10642416 -0.66211115 0.38518040 0.06023922
[9] 0.20728269 1.15380725
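Random draws change at every run. If you need reproducible results, set.seed() fixes the state of the generator. The sketch below also shows how the parameters of the distributions can be specified; the seed value 42 is arbitrary.

```r
set.seed(42)                          # Arbitrary seed
x <- runif(5, min = 0, max = 10)      # Uniform draws between 0 and 10
set.seed(42)                          # Same seed again...
y <- runif(5, min = 0, max = 10)      # ...same draws
identical(x, y)                       # TRUE
z <- rnorm(1000, mean = 5, sd = 2)    # Gaussian with mean 5 and standard deviation 2
mean(z)                               # Close to 5 (not exactly, because of sampling noise)
```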
An overview of all distributions available across R packages can be found here: https://cran.r-project.org/web/views/Distributions.html
Datasets often mix text and numbers. R can handle that too, with data frames (the modern version of the data frame is the tibble). Let’s create one with the data.frame() function. We use the round() function, which rounds numbers to the nearest integer.
nb_gender <- 7 # Number of people of each gender
Gender <- rep(c("Male"),nb_gender) # nb_gender men in total
Weight <- rnorm(nb_gender, mean = 70, sd = 8) %>% round() # in kilos
Height <- rnorm(nb_gender, mean = 178, sd = 10) %>% round() # in cm
Age <- rnorm(nb_gender, mean = 40, sd = 7) %>% round()
data <- data.frame(Gender,Weight,Height,Age) # data with only men
Gender <- rep(c("Female"),nb_gender) # nb_gender women in total
Weight <- rnorm(nb_gender, 60, sd = 8) %>% round() # in kilos
Height <- rnorm(nb_gender, 167, sd = 10) %>% round() # in cm
Age <- rnorm(nb_gender, mean = 40, sd = 7) %>% round()
data <- rbind(data, data.frame(Gender,Weight,Height,Age)) # grouping women with men
data
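A quick way to check what a data frame contains is to look at the type of each column. The sketch below uses a tiny, made-up data frame (names and values are invented) and also illustrates how round() behaves compared with ceiling() and floor().

```r
toy <- data.frame(Name = c("Ann", "Bob", "Cy"),     # Text column
                  Weight = c(61.4, 72.6, 80.2),     # Numeric column
                  stringsAsFactors = FALSE)         # Keep text as text
sapply(toy, class)                                  # Type of each column
str(toy)                                            # Compact overview
round(toy$Weight)                                   # Nearest integer: 61 73 80
ceiling(toy$Weight)                                 # Always up: 62 73 81
floor(toy$Weight)                                   # Always down: 61 72 80
```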
You can use rownames() or colnames() to get or set the names of rows or columns: colnames(data).
colnames(data)[4] <- "Years" # Change last column name
head(data) # Show the impact
colnames(data)[4] <- "Age" # Coming back to original name
You can obtain the dimensions of a matrix or data frame with the dim() function: dim(data) returns the number of rows and the number of columns. Each dimension can be obtained separately with nrow() and ncol(). For vectors, the number of elements can be found with the length() function.
dim(data)
[1] 14 4
nrow(data) # Number of rows
[1] 14
ncol(data) # Number of columns
[1] 4
length(3:35) # Length of a vector
[1] 33
In R, it is useful to perform tests. For instance, given the sequence 1:12, we want to know which values are strictly greater than 6. The simple command 1:12>6 will provide the answer: the statement is false for the first six elements (1 to 6) and true for the last six (7 to 12).
(1:12) > 6
[1] FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE TRUE TRUE
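Because TRUE counts as 1 and FALSE as 0, the result of a test can be summarised directly. A short sketch with base functions:

```r
x <- 1:12
test <- x > 6      # Logical vector
sum(test)          # 6: number of values above 6
which(test)        # 7 8 9 10 11 12: positions where the test is TRUE
any(x > 12)        # FALSE: no value exceeds 12
all(x > 0)         # TRUE: every value is positive
```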
Accessing the values of a variable can be done with the square brackets [ ] thanks to indexing. For instance, the value in the third row and second column of data is data[3,2].
When columns have names, it is possible to use them to isolate a particular column with the dollar ( $ ) operator:
data$Height # The Height column, selected by name
Another way to proceed is to omit the row numbers: since Height is the third column of data, the result is the same with data[,3]. This gives you all of the third column. Likewise, data[3,] will return all of the third row.
data[,3] # Third column
[1] 174 171 192 173 182 170 188 159 181 166 169 171 167 181
data[3,] # Third row
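Indexing also works with names, with several indices at once, and with negative indices (which exclude elements). The sketch below relies on a small, made-up data frame so that it runs on its own.

```r
toy <- data.frame(Weight = c(70, 65, 82), Height = c(178, 169, 185), Age = c(40, 35, 51))
toy[2, "Height"]              # Row 2, column selected by name: 169
toy[c(1, 3), ]                # Rows 1 and 3, all columns
toy[, -1]                     # All columns except the first one
toy[, 2]                      # A single column comes back as a vector
toy[, 2, drop = FALSE]        # drop = FALSE keeps the data frame structure
```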
You can extract data with boolean vectors! For instance, if we want to select the people who are older than 42 years old: simple!
data$Age>42
[1] TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE FALSE TRUE FALSE FALSE
returns a logical vector that flags the corresponding rows. To extract the data, you just need to select the right rows and all columns:
data[data$Age > 42, ] # Mind the comma: we keep all columns!
data[data$Age > 42 & data$Weight > 70, ] # The & operator allows you to add filtering criteria
Only the TRUE rows are kept. As we see, the filter() function of the tidyverse does just that.
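Conditions can be combined with & (and), | (or) and ! (not). A sketch on a made-up data frame:

```r
toy <- data.frame(Age = c(45, 30, 50, 41), Weight = c(80, 62, 68, 75))
keep <- toy$Age > 42 & toy$Weight > 70    # Both conditions must hold
toy[keep, ]                               # Only row 1
toy[toy$Age > 42 | toy$Weight > 70, ]     # At least one condition: rows 1, 3 and 4
toy[!(toy$Age > 42), ]                    # Negation: rows 2 and 4
```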
Writing to data frames, vectors, or matrices can be done with the arrow operator (<-):
data[3,2] <- 99
data[c(7,9),3] <- 166 # Replace 2 cells at a time! Seventh and ninth row on the third column.
data[c(6,8),3] <- c(199,176) # Same, but with 2 different values.
data # CHECK where the new values are!
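Assignment can also be combined with a test, which replaces all the values that satisfy a condition in one go. A minimal sketch on a made-up vector:

```r
x <- c(3, 12, 7, 25, 1)
x[x > 10] <- 10        # Cap all values above 10
x                      # 3 10 7 10 1
x[2] <- NA             # A single element can be overwritten too
x                      # 3 NA 7 10 1
```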
Unlike in Excel, the data is not directly shown in R. You have to ask for it! To see the content of a variable, you have to type its name and press ENTER. Sometimes, the content of the variable is huge and cannot properly be displayed. You can also double-click on the variable name in the environment pane.
The head() function shows the first 6 lines and the tail() function shows the last 6 lines.
head(data, 8) # First 8 rows (default number of rows is 6)
tail(data) # Last 6 rows
If you want to see the different possible values of a factor (categorical variables), you can use the levels() function.
levels(diamonds$clarity)
[1] "I1" "SI2" "SI1" "VS2" "VS1" "VVS2" "VVS1" "IF"
Though honestly, you would get the same information using the summary() function as well, because there aren’t too many levels.
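For a self-contained illustration, the sketch below builds a small factor by hand (the values are invented) and inspects its levels and counts.

```r
grade <- factor(c("good", "fair", "good", "ideal", "fair", "good"))
levels(grade)      # "fair" "good" "ideal": alphabetical order by default
nlevels(grade)     # 3
summary(grade)     # Counts per level: fair 2, good 3, ideal 1
table(grade)       # Same counts, as a table
```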
There are usually several types of loops, but we will focus on the for loop. Its structure is simple: the idea is to repeat a task a finite number of times. This makes it possible to automate the changes in a variable. For instance, the Fibonacci sequence:
nb <- 20 # Number of desired numbers
Fib <- 1 # First value
Fib[2] <- 1 # Second value
for(k in 3:nb){
  Fib[k] <- Fib[k-1] + Fib[k-2] # New value equals the sum of the 2 previous ones
}
Fib # Show the sequence
[1] 1 1 2 3 5 8 13 21 34 55 89 144 233 377 610 987 1597 2584 4181
[20] 6765
GOING FURTHER: loops can take time. To speed things up, have a look at the map() family of functions. The apply functions are also cool (see below).
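As a sketch of the loop-free alternatives mentioned above, the snippet below computes squares first with a for loop, then with sapply(), and finally with plain vectorised arithmetic. It uses base R only; map() from the purrr package plays a role similar to sapply() but is not used here.

```r
n <- 10
sq_loop <- numeric(n)                        # Pre-allocate the result vector
for (k in 1:n) {
  sq_loop[k] <- k^2                          # Fill one element per iteration
}
sq_apply <- sapply(1:n, function(k) k^2)     # Same task, no explicit loop
sq_vec <- (1:n)^2                            # Vectorised arithmetic: the shortest version
all(sq_loop == sq_apply)                     # TRUE
all(sq_apply == sq_vec)                      # TRUE: all three agree
```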
Below, we present a few useful functions.
Usual modes for variables are:
- logical (Boolean, TRUE or FALSE). NOTE: they are also converted into numbers: FALSE = 0, TRUE = 1
- numeric (numbers),
- character (text),
- factor (unordered category) and
- ordered factor (ordered category)
It is sometimes possible to switch from one to another. One counter-example is translating a character string that does not represent a number (such as "abc") into a number. For a factor, the levels are the values that can be taken by the variable. See levels(). Some examples below.
c("3", "8", "7") # Numbers viewed as text
[1] "3" "8" "7"
c("3", "8", "7") %>% as.numeric() # Change the above into *true* numbers
[1] 3 8 7
c(3,4,6) %>% as.character() # The opposite: change fields into characters
[1] "3" "4" "6"
data$Age %>% as.factor() %>% summary()
21 27 30 33 34 37 38 42 43 46 49 52
1 1 1 1 1 1 1 2 1 2 1 1
# as.factor() transforms the fields into categories, the final step computes the number of elements in each category
fac <- rep(c(1,2,3),2) %>% as.factor() # We create a simple factor and change its values below
fac <- fac %>% recode_factor(`1` = "Low", `2`= "Medium", `3`= "High", .ordered = TRUE)
fac
[1] Low Medium High Low Medium High
Levels: Low < Medium < High
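Two pitfalls are worth a quick sketch: text that does not represent a number becomes NA, and converting a factor directly to numbers returns the internal level codes rather than the labels. The object names are hypothetical.

```r
na_test <- suppressWarnings(as.numeric(c("3", "abc")))   # "abc" cannot be converted
na_test                                                  # 3 NA
f <- factor(c("10", "20", "10"))
as.numeric(f)                                            # 1 2 1: level codes, not the values
as.numeric(as.character(f))                              # 10 20 10: go through text first
sum(c(TRUE, FALSE, TRUE))                                # 2: logicals count as 1 and 0
```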
GOING FURTHER:
- lists in R are very flexible structures that can embed different types of modes.
- arrays are like matrices, but are allowed to have higher dimensions (>2).
- tibbles are the dataframes 2.0 stemming from the tidyverse.
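A minimal sketch of the first two structures, with invented content:

```r
person <- list(name = "Ann", scores = c(12, 15, 9), member = TRUE)  # Mixed modes in one list
person$scores                          # Elements are accessed with $ ...
person[["name"]]                       # ... or with double brackets
length(person)                         # 3 elements
arr <- array(1:24, dim = c(2, 3, 4))   # 2 rows, 3 columns, 4 layers
dim(arr)                               # 2 3 4
arr[1, 2, 3]                           # One index per dimension: 15
```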
When given a rectangular dataset with numbers only, it is often useful to apply the same function to all rows or all columns. Normally, this would require a for loop. Fortunately, alternative solutions exist. For simple functions, like sums and means, dedicated functions have been created.
data_num <- select(data, -Gender) # Take out gender because it's not a number
colMeans(data_num) # Mean of all columns
Weight Height Age
69.14286 175.21429 38.57143
rowSums(data_num) # Sum of all rows: not very meaningful
[1] 289 279 324 306 284 315 250 276 254 266 274 281 276 287
For more general functions, apply() is the way to go. You need to specify the dimension across which the function is computed.
apply(data_num, 2, mean) # The syntax is: apply(data, dimension, function), dimension 2 = column
Weight Height Age
69.14286 175.21429 38.57143
apply(data_num, 1, sum) # Dimension 1 = row
[1] 289 279 324 306 284 315 250 276 254 266 274 281 276 287
apply(data_num, 2, sd) # Computing the standard deviation of each column
Weight Height Age
10.875924 10.116084 8.829272
Note that you can use apply() on arrays with more than two dimensions.
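To illustrate, the sketch below applies a custom function to the columns of a matrix and then uses apply() along the third dimension of an array. The objects are made up for the example.

```r
mat <- matrix(c(1, 4, 2, 8, 5, 7), nrow = 2)               # 2 x 3 matrix
col_range <- apply(mat, 2, function(col) max(col) - min(col))
col_range                                                  # Range of each column: 3 6 2
arr <- array(1:24, dim = c(2, 3, 4))                       # 3-dimensional array
layer_sum <- apply(arr, 3, sum)                            # Sum of each layer
layer_sum                                                  # 21 57 93 129
cell_sum <- apply(arr, c(1, 2), sum)                       # Sum across layers: a 2 x 3 matrix
dim(cell_sum)                                              # 2 3
```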
R lets you create your own functions! Below, we create \(f(x)=(1+x^2)^{-1}\) and plot it.
dens <- function(x){
  return(1/(1+x^2))
}
dens(1:3) # Test values
[1] 0.5 0.2 0.1
ggplot(data.frame(x = c(-4,4)), aes(x = x)) + stat_function(fun = dens) # Plot
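Functions can take several arguments, some with default values. The sketch below generalises dens() with a hypothetical scale argument, which is not part of the original example.

```r
dens_scaled <- function(x, scale = 1){
  # scale stretches the curve horizontally; scale = 1 gives back dens()
  return(1 / (1 + (x / scale)^2))
}
dens_scaled(1:3)              # Default scale: 0.5 0.2 0.1
dens_scaled(2, scale = 2)     # 0.5
```

Arguments with a default value can be omitted in the call, which keeps simple uses short.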
Missing data is a common problem in data science. Here, we only aim to locate missing data, using the is.na() function. Dealing with absent values is out of our scope.
data[3,4] <- NA # NA is often the default for missing data points
is.na(data) # This is the brute-force method. Ok for small datasets
Gender Weight Height Age
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE TRUE
[4,] FALSE FALSE FALSE FALSE
[5,] FALSE FALSE FALSE FALSE
[6,] FALSE FALSE FALSE FALSE
[7,] FALSE FALSE FALSE FALSE
[8,] FALSE FALSE FALSE FALSE
[9,] FALSE FALSE FALSE FALSE
[10,] FALSE FALSE FALSE FALSE
[11,] FALSE FALSE FALSE FALSE
[12,] FALSE FALSE FALSE FALSE
[13,] FALSE FALSE FALSE FALSE
[14,] FALSE FALSE FALSE FALSE
is.na(data) %>% rowSums # Locating rows
[1] 0 0 1 0 0 0 0 0 0 0 0 0 0 0
is.na(data) %>% colSums # Locating columns
Gender Weight Height Age
0 0 0 1
The is.na() function, combined with row and column sums, locates the missing data very simply.
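Other base functions can help as well. The sketch below, on a made-up data frame, uses anyNA() for a quick yes/no answer, which() with arr.ind = TRUE to get the coordinates of the missing cells, and complete.cases() to flag the rows without any missing value.

```r
toy <- data.frame(Weight = c(70, NA, 82), Age = c(40, 35, NA))
anyNA(toy)                                  # TRUE: at least one value is missing
pos <- which(is.na(toy), arr.ind = TRUE)    # Row and column of each missing cell
pos
complete.cases(toy)                         # TRUE FALSE FALSE: only row 1 is complete
```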
The purpose of the questions below is to manipulate functions with integrated loops.
1) Create a function with one argument, n, that returns the values of j^j for j = 1…n. Test it on n = 10 and check! Is there any way we could do this faster?