R is an open source programming language that is extremely versatile and is rapidly becoming the top choice for data analysis tasks both in academia and in industry. It differs from ‘point and click’ software (such as SPSS) in that you need to write code to tell R what you want it to do. This means a steep learning curve, but even in the short run it lets you produce code that is custom made for your specific problem and can be reproduced reliably by you or others.
Evidence 1:
Evidence 2:
Source: https://stackoverflow.blog/2017/10/10/impressive-growth-r/
Supplementary resources: RStudio IDE Cheat Sheet
R comes with a pretty spartan GUI so we will work with the RStudio IDE (integrated development environment).
The workflow with RStudio consists of using:
Some tips for using RStudio:
Under Tools -> Global Options you can change the following:
- Code -> Editing -> Soft wrap R source files: If you check this, the lines in your script file do not run “out of the window”.
- Appearance: You can select your colour scheme here. If you stare at the screen for a long time, white text against a darker background might be less hard on the eyes.
- Pane layout: Here you can select how the window space in RStudio is arranged. It might be useful to keep your source file on one side and the console on the other side, and not on top of each other. (see pic below)
A few essential keyboard short-cuts (for Windows):
Using Projects with RStudio will simplify your workflow. Essentially, all your project related files are collected in your selected folder, so you don’t need to specify a working directory. Your project will be able to run as long as you copy the entire folder.
How to set one up: File -> New Project, then choose a directory where you want to have your R scripts, data and history files. You should also disable “Restore most recently opened project at startup” and “Restore .RData into workspace at startup”, and set “Save workspace to .RData on exit” to Never in Tools -> Global Options -> General.
For more help and materials on using projects, see RStudio’s own resource page or a well argued reasoning from Jenny Bryan.
Let’s talk about keeping your R and other projects safe from tornadoes, toddlers, toasters or T-rexes. Ideally, your work lives on your (1) hard drive AND a (2) back-up hard drive (preferably an SSD) AND a (3) cloud service (such as Dropbox, Google Drive, MS OneDrive).
I would recommend Dropbox as it seems to be the most robust of these three. It also has file version history, so even if you accidentally delete something you can get it back. A big plus for Dropbox and R projects is that they play nice with each other, as opposed to Google Drive and R projects, which will annoy you into oblivion with error messages because of writing and reading conflicts.
General tips:
- Check the R coding style guide
- Comment your code heavily (with the #) because code that seems straightforward now will not be so in the future
- Use sensible file names (e.g.: 01_data_cleaning.R)
- R is case sensitive, so use lowercase file and variable names. Separate words with underscore _ (e.g.: ols_reg_1) (or you can do the camelCase thing, but be consistent)
It is OK to get stuck in R and look for help. Don’t worry if you don’t remember a function’s name or arguments by heart; as with everything, the more you write R, the more you can recall from memory. Programming (R included) requires great Google search skills (or DuckDuckGo, if you are not keen on Google), but just like drawing, math or sword forging for the Japanese emperor, it requires a great amount of practice and not some innate mystical ability that only 5% of the living population possess. My advice: find your pet projects, find joy in R, do not give up, and use Google and StackOverflow without any hesitation.
Some effective ways to seek help with R related problems:
- Type ?function_name and you will be shown the function help. This is often not that informative.

Since R is an open source project, it is a common courtesy to cite R and the packages you use, as people (often in academia) put many hours into developing tools and it is in our common interest to give some public recognition to these efforts and contributions. To see how to cite R, you can just type the following:
citation()
#>
#> To cite R in publications use:
#>
#> R Core Team (2020). R: A language and environment for statistical
#> computing. R Foundation for Statistical Computing, Vienna, Austria.
#> URL https://www.R-project.org/.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Manual{,
#> title = {R: A Language and Environment for Statistical Computing},
#> author = {{R Core Team}},
#> organization = {R Foundation for Statistical Computing},
#> address = {Vienna, Austria},
#> year = {2020},
#> url = {https://www.R-project.org/},
#> }
#>
#> We have invested a lot of time and effort in creating R, please cite it
#> when using it for data analysis. See also 'citation("pkgname")' for
#> citing R packages.
You can cite a specific package with the following:
citation("quanteda")
#>
#> Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A
#> (2018). "quanteda: An R package for the quantitative analysis of
#> textual data." _Journal of Open Source Software_, *3*(30), 774. doi:
#> 10.21105/joss.00774 (URL: https://doi.org/10.21105/joss.00774), <URL:
#> https://quanteda.io>.
#>
#> A BibTeX entry for LaTeX users is
#>
#> @Article{,
#> title = {quanteda: An R package for the quantitative analysis of textual data},
#> journal = {Journal of Open Source Software},
#> author = {Kenneth Benoit and Kohei Watanabe and Haiyan Wang and Paul Nulty and Adam Obeng and Stefan Müller and Akitaka Matsuo},
#> doi = {10.21105/joss.00774},
#> url = {https://quanteda.io},
#> volume = {3},
#> number = {30},
#> pages = {774},
#> year = {2018},
#> }
Main packages used: base R
Main functions covered: help, c(), typeof(), length(), sum(), data.frame(), matrix(), list(), [, [[
Supplementary resources: Base R Cheat Sheet
To get started:
- Open a new R script with Ctrl + Shift + N (or from the top menus).
- Save your script (Ctrl + s)!

You can copy and paste code from this html document to your script and run it, but I recommend that you type everything, as it allows for a deeper learning experience. If you get stuck you can always check this document. Don’t forget to comment your code with # (hashtag). Anything in the given line after # will not be taken into consideration when the code runs.
We can make R carry out basic calculations with the usual symbols: + - / *. You can run the current line (you don’t need to select the code) with the Ctrl + Enter shortcut.
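As a minimal sketch of such calculations (the numbers are arbitrary), each of the lines below can be run on its own with Ctrl + Enter:

```r
# basic arithmetic: R prints the result of each line to the console
2 + 3    # addition
10 - 4   # subtraction
6 * 7    # multiplication
9 / 3    # division
```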
In addition to carrying out numerical operations, you can ask R to check if certain logical conditions are met, such as whether a value is greater than, less than or equal to another. It is essentially asking R the question “is this value greater than that value?”, to which we will receive an output of FALSE or TRUE.
5 > 4 # greater than
#> [1] TRUE
6 < 8
#> [1] TRUE
7 == 7 # equal to
#> [1] TRUE
10 >= 10 # greater than or equal to
#> [1] TRUE
42 != 42 # not equal to
#> [1] FALSE
The conditions that you can use in R:
- a == b: Are equal
- a != b: Not equal
- a > b: Greater than
- a < b: Less than
- a >= b: Greater than or equal to
- a <= b: Less than or equal to
- !x: Not x
- x | y: x OR y
- x & y: x AND y
- is.na(a): Is missing
- is.null(a): Is null

Functions do the heavy lifting in R. They have the format below:
For example, you can compute the square root of 7 by hand or by using R’s built-in sqrt() function.
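A short sketch of the two approaches, where "by hand" is taken to mean raising 7 to the power of 0.5:

```r
# "by hand": raising a number to the power of 0.5 gives its square root
7^0.5

# with the built-in function, called as function_name(argument)
sqrt(7)

# both approaches give the same value
all.equal(7^0.5, sqrt(7))
```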
R comes with a variety of math functions if needed. For example, log() computes the natural logarithm by default. If you have something else in mind, you can specify it with the base = argument.
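A few of these math functions, as an illustrative sketch with arbitrary input values:

```r
log(100)                    # natural logarithm (base e)
log(100, base = 10)         # logarithm with base 10
exp(1)                      # e raised to the power of 1
abs(-5)                     # absolute value
round(3.14159, digits = 2)  # rounding to two decimal places
```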
R lets you save data by storing it in an object (basically just a fancy name for stored data). You can do it with the assignment operator: <- (shortcut: Left Alt + -). The = sign also works, but it is R coding practice to use <- to assign values to objects and to use = within functions. Using the shortcut helps!
Let’s create two objects, where we store the results of some calculations.
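The sketch below creates two such objects. The names a and b are the ones used in a later exercise; the calculations themselves are assumptions, chosen only so that the stored values (96 and 6) match the output shown in that exercise.

```r
# store the results of two calculations in objects
a <- 16 * 6   # illustrative calculation, stores 96
b <- 30 / 5   # illustrative calculation, stores 6

# running the name of an object prints its value to the console
a
b
```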
Objects are an essential part of the R workflow. You can see your current objects in the right pane named ‘Environment’.
You can check (evaluate) your object by running its name. Writing the name of your object is equivalent to printing it to your console.
More importantly, we can perform all sorts of operations on our objects, which will be the foundation of our workflow later on. This means that we can have multiple datasets and objects containing all sorts of information (regression outputs, plots, etc.) in memory.
A data frame is a rectangular data structure, where usually each row is an observation and each column is a variable. It can contain multiple types of data, but each column can only contain one type. The data frame below, called df, looks like this. Note the <chr>, <dbl> and <fctr> tags below their names!
df
#> country pop continent
#> 1 Thailand 68.7 Asia
#> 2 Norway 5.2 Europe
#> 3 North Korea 24.0 Asia
#> 4 Canada 47.8 North America
#> 5 Slovenia 2.0 Europe
#> 6 France 63.6 Europe
#> 7 Venezuela 31.6 South America
To understand how each of these types works and how a data frame is constructed we will have to have a more in-depth look at each one. In R parlance, each column is a vector of a given data type.
You can also combine values into a vector. To do this, we use the c() function. Below we will create numeric vectors with a length of four. When you perform operations with vectors, keep in mind that R matches the first element of the first vector to the first element of the second vector (called element-wise execution). This will result in a new vector with the same length as the originals. You can specify each element of the vector or give a range (e.g.: c(1:4)).
c(5, 10, 15, 20)
#> [1] 5 10 15 20
# operations with vectors
c(1:4) + c(10,20,30,40)
#> [1] 11 22 33 44
QUICK EXERCISE: check what happens if you try to do operations on numerical vectors of different sizes!
These vectors can have six types: doubles, integers, characters, logicals, complex, and raw. To check if we are indeed dealing with a vector of a given type, we can ask an is.*() question (for example is.vector() or is.numeric()). We can also check its length, just in case. If you are not sure about the type, you can skip the trial and error with the typeof() function. (We’ll skip complex and raw, as they are so niche that you can just look them up in case you ever need them.)
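A minimal sketch of these checks on a numeric vector; the object name num and its values are assumptions that follow the vector shown earlier:

```r
num <- c(5, 10, 15, 20)

is.vector(num)     # are we dealing with a vector?
is.numeric(num)    # is it numeric?
is.character(num)  # is it character?
length(num)        # number of elements
typeof(num)        # the storage type, no trial and error needed
```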
If you want to refer to a specific value in a vector, you must use square brackets after the name of the object: [ and [[. The brackets contain the sequence number of the value you want to refer to. Such indexing can also be used to replace values in objects. BEWARE that R happily overwrites your objects without any warning or double checks and there is no undo button! It is best to create new objects if you plan to further tinker with them.
Assigning a new value to the n-th element of our vector works by combining the assignment operator (<-) and the [ indexing we just learned.
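Indexing and replacement can be sketched as follows. The names num and num_new are illustrative; the copy is made so that the original vector stays intact.

```r
num <- c(5, 10, 15, 20)

num[2]     # the second element
num[[2]]   # [[ always returns exactly one element
num[2:3]   # a range of positions

# work on a copy, because R overwrites values without asking
num_new <- num
num_new[3] <- 100   # replace the third element
num_new
num                 # the original is unchanged
```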
R functions use the names “double” and “numeric” interchangeably (and so will I during the course). (“Double” comes from computer science and refers to the number of bytes it takes to store a number.) Numerics can be positive or negative, with or without decimal digits; they are regular numbers. If you insist on having an integer vector, you can specify it by adding an L after the numeric value. In most cases you will use numerics instead of integers, and R defaults to numerics as well if you do not specify your needs.
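A brief sketch of the difference (the object name int_vec is illustrative):

```r
typeof(5)    # "double": the default for numbers
typeof(5L)   # "integer": note the L suffix

int_vec <- c(1L, 5L, 10L)
is.integer(int_vec)
is.numeric(int_vec)   # integers count as numeric as well
```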
For characters, you have to wrap the values between " " (or ' ') for R to recognize them as such.
# a vector with character (string) values, with a length of 3 and 1
text1 <- c("Han", "shot", "first")
text2 <- c("Hello world")
typeof(text1)
#> [1] "character"
length(text1)
#> [1] 3
length(text2)
#> [1] 1
QUICK EXERCISE: create a character vector which would give the following result.
solution:
#> [1] "42" "4" "2"
You can also combine vectors into one with the c() function.
text3 <- c(text1, text2, "this is", "R")
text3
#> [1] "Han" "shot" "first" "Hello world" "this is"
#> [6] "R"
QUICK EXERCISE: combine our previous numerical vectors into one (num, a and b). You should see the same result as below. What happens if you try to mix the two types of vectors (num and text1)?
#> [1] 5 10 15 20 96 6
You can store logical values in a vector as well. R assigns numerical values to them in some cases, where TRUE is 1 and FALSE is 0. See the example below.
logic <- c(TRUE, FALSE, FALSE)
typeof(logic)
#> [1] "logical"
# or store the result of a logical evaluation
test <- text2 == "Hello world"
test
#> [1] TRUE
# to count how many `TRUE` values we have, let's sum up the logic vector
sum(logic)
#> [1] 1
This latter function comes in handy if we want to know, for example, how many values are above or below a certain threshold in our vector. We are going to use the sum function for this.
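For instance, counting the values above a threshold can be sketched like this (the vector and the threshold of 10 are illustrative):

```r
num <- c(5, 10, 15, 20)

num > 10        # element-wise comparison returns a logical vector
sum(num > 10)   # TRUE counts as 1, so this is the number of values above 10
```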
Another common data type in R is the factor, where you assign discrete levels to your values. It is commonly used for survey responses or for other categorical data (eye color, gender, political party preference, etc.). We can create a factor variable with the factor function, where we can add the elements and specify the levels.
party_pref <- c("social democrat", "social conservative", "liberal", "green", "green", "social conservative")
# transform our character vector to factor
party_pref <- factor(party_pref, levels = c("social democrat", "social conservative", "liberal", "green"))
party_pref
#> [1] social democrat social conservative liberal
#> [4] green green social conservative
#> Levels: social democrat social conservative liberal green
# if we want to set a given order, we can do that too.
survey_response <- factor(c("agree", "neutral", "disagree", "neutral", "disagree", "disagree", "agree"),
levels = c("agree", "neutral", "disagree"),
ordered = TRUE)
survey_response
#> [1] agree neutral disagree neutral disagree disagree agree
#> Levels: agree < neutral < disagree
Missing values are denoted with NA.
You can check if a value is missing with the is.na function.
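A minimal sketch of working with missing values. The object name v follows the exercise below, but its contents here are an assumption.

```r
# a vector with missing values (the contents are illustrative)
v <- c(1, NA, 3, NA, 5)

is.na(v)               # TRUE where the value is missing
mean(v)                # NA, because missing values propagate
mean(v, na.rm = TRUE)  # remove the NAs before computing the mean
```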
QUICK EXERCISE: Check how many NAs we have in the object v we just created. The correct solution should be the following output. (Hint: remember that logicals have numerical values!)
#> [1] 2
You can change the type of data inside a vector. This is fairly straightforward and not used regularly. One example is converting from integer to double:
integers <- c(1L, 5L, 10L)
typeof(integers)
#> [1] "integer"
# then converting
numerics <- as.numeric(integers)
typeof(numerics)
#> [1] "double"
Coercion functions start with as.*, where * marks the data type. Start typing as. in RStudio and see how many functions are suggested with this beginning.
As promised before, we can weave all the vectors into one data frame. To do this, we use the data.frame function. First, we will create some vectors and then do the combination.
student <- c("Weber", "Hobbs", "Curie", "Lovelace", "Perlman")
grade <- factor(c("A", "C", "A", "B", "A"), levels = c("A", "B", "C"), ordered = TRUE)
height <- c(178, 165, 170, 190, 157)
Now we combine the various vectors into one data frame, which we will appropriately call pupils.
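The combination step can be sketched as follows, repeating the three vectors from above so that the snippet runs on its own:

```r
student <- c("Weber", "Hobbs", "Curie", "Lovelace", "Perlman")
grade <- factor(c("A", "C", "A", "B", "A"), levels = c("A", "B", "C"), ordered = TRUE)
height <- c(178, 165, 170, 190, 157)

# each vector becomes a column, and the column names are taken from the object names
pupils <- data.frame(student, grade, height)
pupils

# compact overview of the column types
str(pupils)
```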
You can select individual rows and columns similarly to how we did before with vectors. R uses the following logic: data_frame[rows, columns]. While this approach works for rectangular data (such as data frames and matrices), you can also refer to columns by their names. For this, use the $ sign. Remember: rows by columns is the order for indexing in R!
# check the second row
pupils[2, ]
#> student grade height
#> 2 Hobbs C 165
# check the first column
pupils[, 1]
#> [1] "Weber" "Hobbs" "Curie" "Lovelace" "Perlman"
Note that in R versions before 4.0.0 the data.frame() function creates factors from character vectors by default. If you want to avoid this (which is usually the case), add an argument telling R not to do that: data.frame(country, pop, stringsAsFactors = FALSE). Since R 4.0.0, stringsAsFactors = FALSE is the default.
Access columns by their name. After the $ sign, press tab and RStudio will give you a list of the columns in the data frame.
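A small sketch of column access with $, using a cut-down version of the pupils data frame. The last line is an illustration of a capitalisation mistake and is an assumption, not necessarily the example originally shown here.

```r
pupils <- data.frame(
  student = c("Weber", "Hobbs", "Curie", "Lovelace", "Perlman"),
  height = c(178, 165, 170, 190, 157)
)

pupils$height         # the height column as a vector
mean(pupils$height)   # columns can be passed straight to functions

pupils$Height         # NULL: there is no column with a capital H
```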
What just happened?
Kind reminder: R is case sensitive. This is annoying at first, but you get used to it fast (as it is a common source of errors).
You can check the attributes of your object with the attributes function.
At this point we want more than what base R can offer us. Let’s install and load some packages! Packages are the cornerstone of the R ecosystem: there are thousands of super useful packages (the most common repository for them is CRAN). Whenever you face a specific problem (which can be highly domain specific), there is a good chance that there is at least one package that offers a solution.
An R package is a collection of functions that work much the same way as the ones we saw earlier. These functions and packages are written by R users and shared with the community. The focus and range of these packages are wide: from data cleaning, to data visualization, through ecological and environmental data analysis, there is a package for everyone and everything. This ample supply might be intimidating at first, but it also means that there is a solution out there to a given problem.
To install a package from the CRAN repository we will use the install.packages() function. Note that it requires the package’s name as a character.
# data import / export
install.packages("readr")
install.packages("readtext")
install.packages("quanteda") # for text analysis
install.packages("dplyr") # for data manipulation
install.packages("ggplot2") # for data visualization
install.packages("stringr") # for string manipulation
After you have installed a given package, you need to load it to be able to use its functions. We do this with the library() command. It is good practice to load all the packages at the beginning of your script.
Important note: whenever there is a conflicting function name (e.g.: two packages have the same function name) you can specify which function you want to use with the package::function syntax. Below, when loading in the data, I use readr::read_csv to signify which package the function comes from.
We will look at the Quality of Government basic data set and import it with different file extensions. First let’s load the .csv file (csv stands for comma separated values). You can either load it from your project folder or directly from the GitHub repo. We are using the readr package, which can read comma separated values with the read_csv function. It is a specific case of the read_delim function, where you can specify the character that is used as a delimiter (e.g.: in Europe the comma is used as a decimal mark, so the delimiter is often a semicolon).
In the code below, I put the data file into the data folder within my project folder (the path looks like this: mydrive:/folder/project_folder/data). The "data/file.csv" part is called the relative path: when using a project we do not need to type out the whole path to the file, just its location relative to our main project folder.
With readr::read_csv I specified that I use the function from that specific package. The package::function syntax is useful if there are conflicting functions in the loaded packages, or if you want to make your package use explicit when functions have very similar names. In this case, base R also has a read.csv function, which is a bit slower than the one in readr.
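The import step can be sketched without the course data by writing a tiny file to a temporary location first. The file contents and object names below are made up for illustration, and the snippet assumes that the readr package is installed.

```r
# write a tiny, illustrative comma separated file to a temporary location
tmp_csv <- tempfile(fileext = ".csv")
writeLines(c("country,pop", "Norway,5.2", "Slovenia,2.0"), tmp_csv)

df_readr <- readr::read_csv(tmp_csv)  # readr version, returns a tibble
df_base <- read.csv(tmp_csv)          # base R version, returns a data frame

# a semicolon separated file is read with read_delim and an explicit delimiter
tmp_semi <- tempfile(fileext = ".csv")
writeLines(c("country;pop", "Norway;5.2", "Slovenia;2.0"), tmp_semi)
df_semi <- readr::read_delim(tmp_semi, delim = ";")
```

With a project, the temporary path would be replaced by a relative path such as "data/file.csv".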
We use the readtext package to import texts into R. The data is the first UN General Assembly speech by US presidents after their inauguration. The readtext() function can read all text documents in a given folder with the *.txt expression. It is a versatile package and can read texts from URLs and zip files, even with unusual encodings.
From this session on we will mostly use packages from the tidyverse ecosystem. These include:
- readr for reading text data (.csv and .tsv)
- tidyr for reshaping your data
- dplyr for wrangling data (filtering your data, subsetting, transforming and recoding variables, etc.)
- purrr for functional programming
- ggplot2 for data visualization
- R markdown for creating reports straight from R (.pdf, .html or .doc)

dplyr to wrangle data
Main packages used: dplyr
Main functions covered: dplyr::filter(), dplyr::select(), dplyr::mutate(), dplyr::*_join(), is.na(), tidyr::drop_na()
Supplementary resources:
The pipe is this operator: %>%
You can access it with the shortcut Ctrl+Shift+M (or type it out every time, but there are better things in life than that).
It passes the object on the left-hand side as the first argument (or as the . argument) of the function on the right-hand side.
As an example:
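A minimal sketch with arbitrary numbers, assuming that dplyr is installed (loading it makes %>% available):

```r
library(dplyr)

x <- c(1, 4, 9)

# nested function calls are read from the inside out
nested <- sum(sqrt(x))

# the same steps with the pipe are read from left to right
piped <- x %>% sqrt() %>% sum()

# the dot marks where the left-hand side goes when it is not the first argument
log_2_of_8 <- 2 %>% log(8, base = .)
```

Here nested and piped hold the same value, and the last line is equivalent to log(8, base = 2).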
Piping together various steps of our data manipulation process greatly increases code readability and our quality of life. In a moment we will see how piping can be super useful.
We will use the gapminder data for demonstrations.
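How gapminder_df is created is not shown here. One way to obtain it, assuming that the gapminder package from CRAN is installed, might look like this:

```r
library(gapminder)

# the package ships the data as an object called gapminder;
# the name gapminder_df matches the one used in the examples that follow
gapminder_df <- gapminder

dim(gapminder_df)      # number of rows and columns
head(gapminder_df, 3)  # first three rows
```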
For subsetting rows we use the filter() function from dplyr. As arguments we can give logical conditions similar to the ones we used before. If we want to see data for countries in 1962 where life expectancy was above 70 years, we can do it with the following code:
gapminder_df %>%
filter(year == 1962, lifeExp > 70)
#> # A tibble: 16 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Australia Oceania 1962 70.9 10794968 12217.
#> 2 Belgium Europe 1962 70.2 9218400 10991.
#> 3 Canada Americas 1962 71.3 18985849 13462.
#> 4 Denmark Europe 1962 72.4 4646899 13583.
#> 5 France Europe 1962 70.5 47124000 10560.
#> 6 Germany Europe 1962 70.3 73739117 12902.
#> 7 Iceland Europe 1962 73.7 182053 10350.
#> 8 Ireland Europe 1962 70.3 2830000 6632.
#> 9 Netherlands Europe 1962 73.2 11805689 12791.
#> 10 New Zealand Oceania 1962 71.2 2488550 13176.
#> 11 Norway Europe 1962 73.5 3638919 13450.
#> 12 Slovak Republic Europe 1962 70.3 4237384 7481.
#> 13 Sweden Europe 1962 73.4 7561588 12329.
#> 14 Switzerland Europe 1962 71.3 5666000 20431.
#> 15 United Kingdom Europe 1962 70.8 53292000 12477.
#> 16 United States Americas 1962 70.2 186538000 16173.
You can filter based on logical operators and string matching as well. Here we want to see data for Sweden after 1990.
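The call behind the output below is not shown. A filter along the following lines should return these rows; it assumes that dplyr and gapminder are installed and that gapminder_df is the gapminder data.

```r
library(dplyr)
library(gapminder)
gapminder_df <- gapminder  # assumption: same data as in the rest of the section

# exact string match on country, combined with a condition on year
sweden_after_1990 <- gapminder_df %>%
  filter(country == "Sweden", year > 1990)
sweden_after_1990
```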
#> # A tibble: 4 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Sweden Europe 1992 78.2 8718867 23880.
#> 2 Sweden Europe 1997 79.4 8897619 25267.
#> 3 Sweden Europe 2002 80.0 8954175 29342.
#> 4 Sweden Europe 2007 80.9 9031088 33860.
We could also use the x %in% y expression, which will filter every row where x matches one of the values of y. With this we can filter for two countries in our data.
gapminder_df %>%
filter(country %in% c("Sweden", "Norway"))
#> # A tibble: 24 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Norway Europe 1952 72.7 3327728 10095.
#> 2 Norway Europe 1957 73.4 3491938 11654.
#> 3 Norway Europe 1962 73.5 3638919 13450.
#> 4 Norway Europe 1967 74.1 3786019 16362.
#> 5 Norway Europe 1972 74.3 3933004 18965.
#> 6 Norway Europe 1977 75.4 4043205 23311.
#> 7 Norway Europe 1982 76.0 4114787 26299.
#> 8 Norway Europe 1987 75.9 4186147 31541.
#> 9 Norway Europe 1992 77.3 4286357 33966.
#> 10 Norway Europe 1997 78.3 4405672 41283.
#> # ... with 14 more rows
Filtering on a range can be done with two logical requirements or with the dplyr::between() function.
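The variant with two logical requirements is sketched here, assuming that dplyr and gapminder are installed. The bounds 40 and 40.5 are taken from the between() call shown further down.

```r
library(dplyr)
library(gapminder)
gapminder_df <- gapminder  # assumption: same data as in the rest of the section

# both conditions must hold; between() is inclusive, hence >= and <=
range_two <- gapminder_df %>%
  filter(lifeExp >= 40, lifeExp <= 40.5)

# the same filter with between(), for comparison
range_between <- gapminder_df %>%
  filter(between(lifeExp, 40, 40.5))
```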
#> # A tibble: 16 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Benin Africa 1957 40.4 1925173 960.
#> 2 Bolivia Americas 1952 40.4 2883315 2677.
#> 3 Cambodia Asia 1972 40.3 7450606 422.
#> 4 Cameroon Africa 1957 40.4 5359923 1313.
#> 5 Cote d'Ivoire Africa 1952 40.5 2977019 1389.
#> 6 Eritrea Africa 1962 40.2 1666618 381.
#> 7 Ethiopia Africa 1962 40.1 25145372 419.
#> 8 Gabon Africa 1962 40.5 455661 6631.
#> 9 India Asia 1957 40.2 409000000 590.
#> 10 Mozambique Africa 1972 40.3 9809596 725.
#> 11 Niger Africa 1967 40.1 4534062 1054.
#> 12 Oman Asia 1957 40.1 561977 2243.
#> 13 Rwanda Africa 1952 40 2534927 493.
#> 14 Sierra Leone Africa 1987 40.0 3868905 1294.
#> 15 Vietnam Asia 1952 40.4 26246839 605.
#> 16 Zambia Africa 1997 40.2 9417789 1071.
They give the same results.
gapminder_df %>%
filter(between(lifeExp, 40, 40.5))
#> # A tibble: 16 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Benin Africa 1957 40.4 1925173 960.
#> 2 Bolivia Americas 1952 40.4 2883315 2677.
#> 3 Cambodia Asia 1972 40.3 7450606 422.
#> 4 Cameroon Africa 1957 40.4 5359923 1313.
#> 5 Cote d'Ivoire Africa 1952 40.5 2977019 1389.
#> 6 Eritrea Africa 1962 40.2 1666618 381.
#> 7 Ethiopia Africa 1962 40.1 25145372 419.
#> 8 Gabon Africa 1962 40.5 455661 6631.
#> 9 India Asia 1957 40.2 409000000 590.
#> 10 Mozambique Africa 1972 40.3 9809596 725.
#> 11 Niger Africa 1967 40.1 4534062 1054.
#> 12 Oman Asia 1957 40.1 561977 2243.
#> 13 Rwanda Africa 1952 40 2534927 493.
#> 14 Sierra Leone Africa 1987 40.0 3868905 1294.
#> 15 Vietnam Asia 1952 40.4 26246839 605.
#> 16 Zambia Africa 1997 40.2 9417789 1071.
We should try out more logical operators to filter. If you are just interested in the top results, you can select rows by their position with the dplyr::slice() function.
slice(gapminder_df, 1:8) # select the first 8 rows
#> # A tibble: 8 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
Some examples of certain logical operators. To see the role of each line, check the comments in the code snippet below.
#> # A tibble: 27 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Botswana Africa 1997 52.6 1536536 8647.
#> 2 Botswana Africa 2002 46.6 1630347 11004.
#> 3 Botswana Africa 2007 50.7 1639131 12570.
#> 4 Equatorial Guinea Africa 2007 51.6 551201 12154.
#> 5 Gabon Africa 1967 44.6 489004 8359.
#> 6 Gabon Africa 1972 48.7 537977 11402.
#> 7 Gabon Africa 1977 52.8 706367 21746.
#> 8 Gabon Africa 1982 56.6 753874 15113.
#> 9 Gabon Africa 1987 60.2 880397 11864.
#> 10 Gabon Africa 1992 61.4 985739 13522.
#> # ... with 17 more rows
gapminder_df %>%
filter(!continent %in% c("Africa", "Europe") ) # everything but Africa and Europe (!%in% won't work)
#> # A tibble: 720 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Afghanistan Asia 1952 28.8 8425333 779.
#> 2 Afghanistan Asia 1957 30.3 9240934 821.
#> 3 Afghanistan Asia 1962 32.0 10267083 853.
#> 4 Afghanistan Asia 1967 34.0 11537966 836.
#> 5 Afghanistan Asia 1972 36.1 13079460 740.
#> 6 Afghanistan Asia 1977 38.4 14880372 786.
#> 7 Afghanistan Asia 1982 39.9 12881816 978.
#> 8 Afghanistan Asia 1987 40.8 13867957 852.
#> 9 Afghanistan Asia 1992 41.7 16317921 649.
#> 10 Afghanistan Asia 1997 41.8 22227415 635.
#> # ... with 710 more rows
gapminder_df %>%
filter(year > 1990, !lifeExp < 80) # we filter for the years after 1990 where the lifeExp < 80 condition is FALSE
#> # A tibble: 22 x 6
#> country continent year lifeExp pop gdpPercap
#> <fct> <fct> <int> <dbl> <int> <dbl>
#> 1 Australia Oceania 2002 80.4 19546792 30688.
#> 2 Australia Oceania 2007 81.2 20434176 34435.
#> 3 Canada Americas 2007 80.7 33390141 36319.
#> 4 France Europe 2007 80.7 61083916 30470.
#> 5 Hong Kong, China Asia 1997 80 6495918 28378.
#> 6 Hong Kong, China Asia 2002 81.5 6762476 30209.
#> 7 Hong Kong, China Asia 2007 82.2 6980412 39725.
#> 8 Iceland Europe 2002 80.5 288030 31163.
#> 9 Iceland Europe 2007 81.8 301931 36181.
#> 10 Israel Asia 2007 80.7 6426679 25523.
#> # ... with 12 more rows
To select columns (variables) we will use the dplyr::select() function. The logic is the same as for filtering rows.
gapminder_df %>%
select(continent)
#> # A tibble: 1,704 x 1
#> continent
#> <fct>
#> 1 Asia
#> 2 Asia
#> 3 Asia
#> 4 Asia
#> 5 Asia
#> 6 Asia
#> 7 Asia
#> 8 Asia
#> 9 Asia
#> 10 Asia
#> # ... with 1,694 more rows
You can select multiple columns easily by their name:
gapminder_df %>%
select(continent, year)
#> # A tibble: 1,704 x 2
#> continent year
#> <fct> <int>
#> 1 Asia 1952
#> 2 Asia 1957
#> 3 Asia 1962
#> 4 Asia 1967
#> 5 Asia 1972
#> 6 Asia 1977
#> 7 Asia 1982
#> 8 Asia 1987
#> 9 Asia 1992
#> 10 Asia 1997
#> # ... with 1,694 more rows
Or give a range:
gapminder_df %>%
select(country:year)
#> # A tibble: 1,704 x 3
#> country continent year
#> <fct> <fct> <int>
#> 1 Afghanistan Asia 1952
#> 2 Afghanistan Asia 1957
#> 3 Afghanistan Asia 1962
#> 4 Afghanistan Asia 1967
#> 5 Afghanistan Asia 1972
#> 6 Afghanistan Asia 1977
#> 7 Afghanistan Asia 1982
#> 8 Afghanistan Asia 1987
#> 9 Afghanistan Asia 1992
#> 10 Afghanistan Asia 1997
#> # ... with 1,694 more rows
The select function also works with column positions, which is handy if you have a very large dataset and want to access columns by their location rather than their name. Let’s say we want the first two and the fifth variable.
gapminder_df %>%
select(1:2, 5)
#> # A tibble: 1,704 x 3
#> country continent pop
#> <fct> <fct> <int>
#> 1 Afghanistan Asia 8425333
#> 2 Afghanistan Asia 9240934
#> 3 Afghanistan Asia 10267083
#> 4 Afghanistan Asia 11537966
#> 5 Afghanistan Asia 13079460
#> 6 Afghanistan Asia 14880372
#> 7 Afghanistan Asia 12881816
#> 8 Afghanistan Asia 13867957
#> 9 Afghanistan Asia 16317921
#> 10 Afghanistan Asia 22227415
#> # ... with 1,694 more rows
You can remove columns with select(data, -column). This code removes the columns between year and GDP per capita.
gapminder_df %>%
select(-(year:gdpPercap))
#> # A tibble: 1,704 x 2
#> country continent
#> <fct> <fct>
#> 1 Afghanistan Asia
#> 2 Afghanistan Asia
#> 3 Afghanistan Asia
#> 4 Afghanistan Asia
#> 5 Afghanistan Asia
#> 6 Afghanistan Asia
#> 7 Afghanistan Asia
#> 8 Afghanistan Asia
#> 9 Afghanistan Asia
#> 10 Afghanistan Asia
#> # ... with 1,694 more rows
There are various helper functions that you can embed within select:
- starts_with("xyz"): selects columns whose names start with the specified "xyz" string
- ends_with("jfk"): matches the string ("jfk" in this case) with the end of the column name
- contains("klm"): matches names that contain "klm"
- num_range("x", 1:3): matches x1, x2, x3

The select() function also lets us do some other data manipulation tasks. You can use it to reorder and rename your variables. The order in which you specify the columns in the select() function will be the new order. You can also set the name with select(newname = oldname), although in that case it will drop all other columns not specified. To avoid this, you can be explicit about renaming with the dplyr::rename() function.
# reorder our variables and rename them.
gapminder_df %>%
select(country, continent, year, gdpPercap, lifeExp, -pop) %>% # we reorder the columns and drop the pop column
rename(gdp_percap = gdpPercap, life_exp = lifeExp)
#> # A tibble: 1,704 x 5
#> country continent year gdp_percap life_exp
#> <fct> <fct> <int> <dbl> <dbl>
#> 1 Afghanistan Asia 1952 779. 28.8
#> 2 Afghanistan Asia 1957 821. 30.3
#> 3 Afghanistan Asia 1962 853. 32.0
#> 4 Afghanistan Asia 1967 836. 34.0
#> 5 Afghanistan Asia 1972 740. 36.1
#> 6 Afghanistan Asia 1977 786. 38.4
#> 7 Afghanistan Asia 1982 978. 39.9
#> 8 Afghanistan Asia 1987 852. 40.8
#> 9 Afghanistan Asia 1992 649. 41.7
#> 10 Afghanistan Asia 1997 635. 41.8
#> # ... with 1,694 more rows
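Returning to the select() helpers listed earlier, here is a minimal sketch on a made-up one-row data frame. The object toy and its x1 to x3 columns are purely illustrative, and dplyr is assumed to be installed.

```r
library(dplyr)

toy <- data.frame(
  country = "Norway", continent = "Europe",
  lifeExp = 80.2, gdpPercap = 49357,
  x1 = 1, x2 = 2, x3 = 3
)

sel_start <- toy %>% select(starts_with("co"))    # country, continent
sel_end <- toy %>% select(ends_with("Exp"))       # lifeExp
sel_has <- toy %>% select(contains("Percap"))     # gdpPercap
sel_num <- toy %>% select(num_range("x", 1:3))    # x1, x2, x3
```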
If you want you can store the column names in a character vector and plug that in to the function.
vars <- c("lifeExp", "pop", "gdpPercap") # columns we want selected
gapminder_df %>%
select(all_of(vars)) # all_of() makes it explicit that vars is a character vector, not a column
#> # A tibble: 1,704 x 3
#> lifeExp pop gdpPercap
#> <dbl> <int> <dbl>
#> 1 28.8 8425333 779.
#> 2 30.3 9240934 821.
#> 3 32.0 10267083 853.
#> 4 34.0 11537966 836.
#> 5 36.1 13079460 740.
#> 6 38.4 14880372 786.
#> 7 39.9 12881816 978.
#> 8 40.8 13867957 852.
#> 9 41.7 16317921 649.
#> 10 41.8 22227415 635.
#> # ... with 1,694 more rows
We can also re-order our cases by a given column, either in descending or ascending order. The dplyr::arrange()
function will re-order in ascending order by default.
# let's pipe together a select and an arrange function
gapminder_df %>%
select(lifeExp) %>%
arrange(lifeExp)
#> # A tibble: 1,704 x 1
#> lifeExp
#> <dbl>
#> 1 23.6
#> 2 28.8
#> 3 30
#> 4 30.0
#> 5 30.3
#> 6 30.3
#> 7 31.2
#> 8 31.3
#> 9 31.6
#> 10 32.0
#> # ... with 1,694 more rows
You can use dplyr::desc()
within arrange()
to order the values in descending order.
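As a short sketch of descending order, assuming again that gapminder_df is the data from the gapminder package (the name longest is only illustrative):

```r
library(dplyr)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# sort the country-years from the highest to the lowest life expectancy
longest <- gapminder_df %>%
  select(country, year, lifeExp) %>%
  arrange(desc(lifeExp))

longest
```

You can list several columns in arrange(), for example arrange(continent, desc(lifeExp)), in which case later columns are used to break ties in the earlier ones.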
We can also combine select() and filter() to apply the same filtering condition to all of the selected variables. To do this, we use the filter_all() function with all_vars() inside it.
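A minimal sketch of this combination, assuming gapminder_df is the data from the gapminder package. The condition (above the column average) and the name above_avg are chosen only for illustration:

```r
library(dplyr)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# keep the rows where EVERY selected column is above its own average;
# inside all_vars() the dot stands for each selected column in turn
above_avg <- gapminder_df %>%
  select(lifeExp, gdpPercap) %>%
  filter_all(all_vars(. > mean(.)))

above_avg
```

Recent dplyr versions mark filter_all() and the other scoped verbs as superseded, so in new code you may prefer filter() together with if_all(); the version above sticks to the functions named in the text.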
dplyr
makes it easy to recode our columns and create new ones with the dplyr::mutate()
and dplyr::transmute()
functions. mutate()
lets you do all the stuff that we covered when we looked at vectors. You have the option to store the calculation results in a new column (preferable) or overwrite an existing one (probably not the best idea).
Let’s recode the pop variable to show population in thousands using the mutate() function. We will call our new variable pop_k.
gapminder_df %>%
select(country, year, pop) %>%
mutate(pop_k = pop/1000) # creating the new column, pop_k
#> # A tibble: 1,704 x 4
#> country year pop pop_k
#> <fct> <int> <int> <dbl>
#> 1 Afghanistan 1952 8425333 8425.
#> 2 Afghanistan 1957 9240934 9241.
#> 3 Afghanistan 1962 10267083 10267.
#> 4 Afghanistan 1967 11537966 11538.
#> 5 Afghanistan 1972 13079460 13079.
#> 6 Afghanistan 1977 14880372 14880.
#> 7 Afghanistan 1982 12881816 12882.
#> 8 Afghanistan 1987 13867957 13868.
#> 9 Afghanistan 1992 16317921 16318.
#> 10 Afghanistan 1997 22227415 22227.
#> # ... with 1,694 more rows
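For comparison, here is a small sketch of transmute(), which behaves like mutate() but keeps only the columns you list. It assumes gapminder_df is the data from the gapminder package, and the name pop_only is illustrative:

```r
library(dplyr)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# transmute() creates pop_k and drops every column that is not mentioned
pop_only <- gapminder_df %>%
  transmute(country, year, pop_k = pop / 1000)

names(pop_only)  # "country" "year" "pop_k"
```

This saves the separate select() step when you only need the newly created column and a few identifiers.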
We can carry out operations with our existing columns as well. Let’s calculate the GDP from the GDP per capita and population data.
#> Warning: Problem with `mutate()` input `gdp`.
#> i longer object length is not a multiple of shorter object length
#> i Input `gdp` is `gdpPercap * pop`.
#> Warning in gdpPercap * pop: longer object length is not a multiple of shorter
#> object length
#> # A tibble: 1,704 x 4
#> country year gdpPercap gdp
#> <fct> <int> <dbl> <dbl>
#> 1 Afghanistan 1952 779. 53548.
#> 2 Afghanistan 1957 821. 4268.
#> 3 Afghanistan 1962 853. 20474.
#> 4 Afghanistan 1967 836. 39970.
#> 5 Afghanistan 1972 740. 1480.
#> 6 Afghanistan 1977 786. 49997.
#> 7 Afghanistan 1982 978. 30905.
#> 8 Afghanistan 1987 852. 58560.
#> 9 Afghanistan 1992 649. 3377.
#> 10 Afghanistan 1997 635. 15248.
#> # ... with 1,694 more rows
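What is the problem? The output has no pop column, which suggests that pop was dropped with select() before mutate() was called, so the multiplication could not use the population column from our data. We should be careful about the order in which we pipe together various functions: create the new column first and select afterwards.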
gapminder_df %>%
mutate(gdp_mil = ((gdpPercap * pop)/10^6)) %>% # multiply the two columns and then divide by a million
select(country, year, gdp_mil)
#> # A tibble: 1,704 x 3
#> country year gdp_mil
#> <fct> <int> <dbl>
#> 1 Afghanistan 1952 6567.
#> 2 Afghanistan 1957 7585.
#> 3 Afghanistan 1962 8759.
#> 4 Afghanistan 1967 9648.
#> 5 Afghanistan 1972 9679.
#> 6 Afghanistan 1977 11698.
#> 7 Afghanistan 1982 12599.
#> 8 Afghanistan 1987 11821.
#> 9 Afghanistan 1992 10596.
#> 10 Afghanistan 1997 14122.
#> # ... with 1,694 more rows
Data visualization is a great way to get acquainted with your data, and sometimes it makes more sense than looking at large tables. In this section we get into the ggplot2
package which we’ll use throughout the class. It is the cutting edge of R’s data visualization toolset (not just in academia, but in business and data journalism as well).
ggplot2
We will spend most of our time using ggplot2
for visualizing in the class and I would personally encourage the course participants to stick to ggplot2
. If for some reason you would like a non ggplot way of plotting in R, there is a section on base R plotting at the end of this notebook.
The name stands for grammar of graphics. It enables you to build your plot layer by layer and gives you the ability to control every detail of the output (if you so wish). It is used by many in academia, by Uber, Stack Overflow, Airbnb, the Financial Times, BBC and FiveThirtyEight writers, among many others.
You create plots with the below syntax:
Source: Healy, Kieran. Data Visualization: A Practical Introduction. Princeton University Press, 2018. (Ch. 3)
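As a rough sketch of the general pattern, a ggplot2 call usually has three parts: the data, the aesthetic mapping, and at least one geom. The example assumes gapminder_df is the data from the gapminder package; the object name p is only illustrative:

```r
library(ggplot2)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# general pattern:
# ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
p <- ggplot(data = gapminder_df,           # which data frame to use
            mapping = aes(x = year,        # which column goes on the x axis
                          y = lifeExp)) +  # which column goes on the y axis
  geom_point()                             # how to draw the observations

p
```

Further layers (scales, labels, facets) are added to the same call with additional + signs.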
To get some idea about our variables, let’s plot them on a histogram. First, we examine the GDP per capita variable from our gapminder dataset. To do this, we just use the geom_histogram() function of ggplot2. It gives a bare-bones histogram, that is, the frequency distribution of our chosen continuous variable.
Let’s create the foundation of our plot by specifying for ggplot
the data we use and the variable we want to plot.
We need to specify what sort of shape we want our data to be displayed as. We can do this by adding the geom_histogram() function with a +.
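The two steps described above can be sketched as follows, assuming gapminder_df is the data from the gapminder package (p_base and p_hist are illustrative names):

```r
library(ggplot2)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# step 1: the foundation, with data and mapping but no geom yet;
# printing it gives an empty panel with the x axis only
p_base <- ggplot(data = gapminder_df,
                 mapping = aes(x = gdpPercap))
p_base

# step 2: add the histogram layer with a +
p_hist <- p_base +
  geom_histogram()
p_hist
```

Printing p_hist also shows a message about the default number of bins.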
Looks a little bit skewed. Let’s log transform our variable with the scale_x_log10()
function.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap)) +
geom_histogram() +
scale_x_log10()
#> `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
As the message says, we can mess around with the binwidth argument, so let’s do that.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap)) +
geom_histogram(binwidth = 0.05) +
scale_x_log10()
Of course if one prefers a boxplot, that is possible as well. We will check how life expectancy varies between and within continents. We’ll use geom_boxplot()
. In this approach, we create an object for our plot. You don’t need to do this (although there are instances where it is useful), but it shows you that just about everything can be an object in R.
p_box <- ggplot(data = gapminder_df,
mapping = aes(x = continent,
y = lifeExp)) +
geom_boxplot()
p_box
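The box plot is interpreted as follows. The box contains the middle 50% of the values, and the line inside the box is the median. The lower and upper edges of the box are the first and third quartiles, respectively. The whiskers extend to the smallest and largest values that are not outliers; by default, geom_boxplot() treats points more than 1.5 times the interquartile range away from the box edges as outliers and draws them individually.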
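The quantities that the box plot draws can also be computed directly. This is a minimal sketch, assuming gapminder_df is the data from the gapminder package and using dplyr's group_by() and summarise(); the name box_stats is illustrative:

```r
library(dplyr)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# quartiles of life expectancy within each continent
box_stats <- gapminder_df %>%
  group_by(continent) %>%
  summarise(q1     = quantile(lifeExp, 0.25),  # lower edge of the box
            median = median(lifeExp),          # line inside the box
            q3     = quantile(lifeExp, 0.75),  # upper edge of the box
            iqr    = IQR(lifeExp))             # height of the box

box_stats
```

Comparing these numbers with the plot is a quick way to check that you are reading the boxes correctly.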
In visual form:
Let’s use the gapminder dataset we have loaded and investigate the relationship between the life expectancy and GDP per capita variables. We’ll use the geom_point() function.
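A bare-bones version of this scatter plot might look like the sketch below, assuming gapminder_df is the data from the gapminder package (p_scatter is an illustrative name):

```r
library(ggplot2)
library(gapminder)

# assumption: gapminder_df is the dataset from the gapminder package
gapminder_df <- gapminder

# GDP per capita on the x axis, life expectancy on the y axis,
# one point for each country-year
p_scatter <- ggplot(data = gapminder_df,
                    mapping = aes(x = gdpPercap,
                                  y = lifeExp)) +
  geom_point()

p_scatter
```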
Let’s refine this plot slightly: add labels, title, caption, and also transform the GDP variable. (plus some other minor cosmetics)
Check the comments in the code snippet to see what each line does!
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap,
y = lifeExp)) +
geom_point(alpha = 0.25) + # inside the geom_ we can modify its attributes. Here we set the transparency levels of the points
scale_x_log10() + # rescale our x axis
labs(x = "GDP per capita",
y = "Life expectancy",
title = "Connection between GDP and Life expectancy",
subtitle = "Points are country-years",
caption = "Source: Gapminder")
So far so good. With some minor additions the plot looks all right. But what if we want to see how each continent fares in this relationship? We need to change the plot to include a new argument in the aes() mapping: color = variable. Now it is clear that European countries (country-years) are clustered in the high-GDP/high life expectancy upper right corner.
ggplot(data = gapminder_df,
mapping = aes(x = gdpPercap,
y = lifeExp,
color = continent)) + # this is where we specify that we want to color the data by continents.
geom_point(alpha = 0.75) +
scale_x_log10() + # rescale our x axis
labs(x = "GDP per capita (log $)",
y = "Life expectancy",
title = "Connection between GDP and Life expectancy",
subtitle = "Points are country-years",
caption = "Source: Gapminder dataset")
When we are done with our nice figure, we can save it as well. I’d suggest always saving with code, and never from the “Plots” pane on the right.
ggsave("gapminder_scatter.png", dpi = 600) # saves the last plot displayed; a higher dpi gives a sharper image (and a larger file)
We can see how life expectancy changed in Mexico, Afghanistan, Sudan and Slovenia by using the geom_line()
geom. For this, we create a new dataset by subsetting the gapminder one. The %in%
operator does the same thing as the ==
but for multiple values. For subsetting we use the dplyr::filter()
function. Don’t worry if this sounds too much, we will spend a whole session on how to subset and clean our data.
#subset the dataset to have our selected countries.
comp_df <- gapminder_df %>%
filter(country %in% c("Mexico", "Afghanistan", "Sudan", "Slovenia"))
# create the ggplot object with the data and mapping info
ggplot(data = comp_df,
mapping = aes(x = year,
y = lifeExp,
color = country)) +
geom_line(aes(group = country)) # we need to tell ggplot that we want to group our lines by countries
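To see how %in% differs from ==, here is a small base R sketch. The vector countries is a made-up example, not part of the gapminder data:

```r
# hypothetical toy vector, used only to illustrate the two operators
countries <- c("Mexico", "Chile", "Sudan", "Norway")

# == compares every element with a single value
is_mexico <- countries == "Mexico"
is_mexico    # TRUE FALSE FALSE FALSE

# %in% checks every element against a whole set of values
in_set <- countries %in% c("Mexico", "Sudan")
in_set       # TRUE FALSE  TRUE FALSE

# == with a longer vector compares element by element (with recycling),
# so "Sudan" is missed here; this is why we use %in% in filter()
pairwise <- countries == c("Mexico", "Sudan")
pairwise     # TRUE FALSE FALSE FALSE
```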
ggplot2
makes it easy to create individual subplots for each category by “faceting” our data. Let’s plot the growth in life expectancy over time on each continent. We use the geom_line()
function to draw a line and we tell ggplot to facet by adding the facet_wrap(~ variable)
function.
ggplot(data = gapminder_df,
mapping = aes(x = year,
y = lifeExp)) +
geom_line(aes(group = country)) + # we need to tell ggplot that we want to group our lines by countries
facet_wrap(~ continent) # create a small graph for each continent