class: center, middle, inverse, title-slide

# Tiny introduction to the Tidyverse
## R-ladies BCN
### Sarah Bonnin
### 2019-11-14

---

## What is the Tidyverse?

--

* A set of packages designed for **data science**:
  * Preparing / cleaning
  * Wrangling
  * Visualizing

--

* All packages share **good practices** in terms of:
  * philosophy
  * grammar
  * data structure.

---

## Why might you want to learn it?

* More **intuitive programming**

--

* Code **easier to read** than with base R

--

* More **efficient**

--

**Tidyverse**

```r
diamonds %>%
  select(cut, color, carat, price) %>%
  filter(cut == "Ideal") %>%
  arrange(desc(price))
```

--

**Base R**

```r
diamonds2 <- diamonds[diamonds$cut == "Ideal", c("cut", "color", "carat", "price")]
diamonds2[order(diamonds2$price, decreasing = TRUE),]
```

---
class: inverse, center, middle

## Disclaimer

--

Old R user...

--

But rather new Tidyverse user!

--

* First tutorial on the Tidyverse...

.center[
![](https://media.giphy.com/media/26ufq8k6RuyKjmdTW/giphy.gif)
]

---

## Tidyverse core packages

As of Tidyverse 1.2.0, the following 8 packages are included in the core tidyverse:

--

* Data Wrangling and Transformation: dplyr, tidyr, stringr, forcats

--

* Data Import and Management: tibble, readr

--

* Functional Programming: purrr

--

* Data Visualization and Exploration: ggplot2

---
class: inverse

## Tidyverse core packages

### Data Wrangling and Transformation

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://dplyr.tidyverse.org/logo.png"></p></td>
<td><b>dplyr</b>: Package for data manipulation and exploratory data analysis.</td>
</tr>
</table>

--

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://tidyr.tidyverse.org/logo.png"></p></td>
<td><b>tidyr</b>: Package that aims at creating tidy data.
Tidy data describes a standard way of storing data.</td>
</tr>
</table>

--

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://stringr.tidyverse.org/logo.png"></p></td>
<td><b>stringr</b>: Package that provides a set of functions for user-friendly string manipulation.</td>
</tr>
</table>

--

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://forcats.tidyverse.org/logo.png"></p></td>
<td><b>forcats</b>: Package that helps you deal with factors.</td>
</tr>
</table>

---
class: inverse

## Tidyverse core packages

### Data Import and Management

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://readr.tidyverse.org/logo.png"></p></td>
<td><b>readr</b>: Package for fast and efficient import and export of data.</td>
</tr>
</table>

--

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://ih1.redbubble.net/image.543363717.2207/flat,750x,075,f-pad,750x1000,f8f8f8.jpg"></p></td>
<td><b>tibble</b>: Tibbles are improved (easier to manage) data frames.</td>
</tr>
</table>

---
class: inverse

## Tidyverse core packages

### Functional Programming

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://purrr.tidyverse.org/logo.png"></p></td>
<td><b>purrr</b>: Package that aims at enhancing R's functional programming toolkit.
It provides a set of tools for working with functions and vectors.</td>
</tr>
</table>

---
class: inverse

## Tidyverse core packages

### Data Visualization and Exploration

<table cellspacing="0" cellpadding="0" style="width:100%">
<tr>
<td><p style="width: 50px;"><img src="https://ggplot2.tidyverse.org/logo.png"></p></td>
<td><b>ggplot2</b>: Package for data visualization based on Leland Wilkinson's <b>G</b>rammar of <b>G</b>raphics: graphics are built one layer at a time.</td>
</tr>
</table>

---

## Outline

In this teeny tiny workshop, we will focus on:

* dplyr (mainly)
* tidyr
* tibble

--

And we will see a tiny bit of:

* stringr
* ggplot2

---

## Load all tidyverse packages

```r
library(tidyverse)
```

```
## ── Attaching packages ─────────────────────────────────────── tidyverse 1.2.1 ──
```

```
## ✔ ggplot2 3.2.1 ✔ purrr 0.3.3
## ✔ tibble 2.1.3 ✔ dplyr 0.8.3
## ✔ tidyr 1.0.0 ✔ stringr 1.4.0
## ✔ readr 1.3.1 ✔ forcats 0.4.0
```

```
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
```

---

## tibble

What are **tibbles**?

* A modern re-thinking of data frames.
* They leave behind the old, user-unfriendly features of data frames.

--

Let's create a simple tibble with **tibble()**:

```r
mytibble <- tibble(
  letters = LETTERS,
  numbers = 1:26
)
```

---

## tibble

.pull-left[

```
## # A tibble: 26 x 2
## letters numbers
## <chr> <int>
## 1 A 1
## 2 B 2
## 3 C 3
## 4 D 4
## 5 E 5
## 6 F 6
## 7 G 7
## 8 H 8
## 9 I 9
## 10 J 10
## # … with 16 more rows
```

]

--

.pull-right[

Why do we like tibbles?

- **Dimensions** are shown.
- Information about **data types** is shown.
- **No character to factor conversion**.
- No automatic change of column names.
- Only the first rows are displayed.
] --- ## tibble Print the first 15 rows: ```r *print(mytibble, n=15) ``` ``` ## # A tibble: 26 x 2 ## letters numbers ## <chr> <int> ## 1 A 1 ## 2 B 2 ## 3 C 3 ## 4 D 4 ## 5 E 5 ## 6 F 6 ## 7 G 7 ## 8 H 8 ## 9 I 9 ## 10 J 10 ## 11 K 11 ## 12 L 12 ## 13 M 13 ## 14 N 14 ## 15 O 15 ## # … with 11 more rows ``` --- ## tibble Print all rows: ```r *print(mytibble, n=Inf) ``` ``` ## # A tibble: 26 x 2 ## letters numbers ## <chr> <int> ## 1 A 1 ## 2 B 2 ## 3 C 3 ## 4 D 4 ## 5 E 5 ## 6 F 6 ## 7 G 7 ## 8 H 8 ## 9 I 9 ## 10 J 10 ## 11 K 11 ## 12 L 12 ## 13 M 13 ## 14 N 14 ## 15 O 15 ## 16 P 16 ## 17 Q 17 ## 18 R 18 ## 19 S 19 ## 20 T 20 ## 21 U 21 ## 22 V 22 ## 23 W 23 ## 24 X 24 ## 25 Y 25 ## 26 Z 26 ``` --- ## tidyr ### Tidy data The goal of **{tidyr}** is to help you create **tidy data**. <br> Tidy data is data where: -- - Each **column** describes a **variable**. -- - Each **row** describes an **observation**. -- - Each **value** is a **cell**. --- ## tidyr ### separate & unite * **separate()**: separate a column into 2 (or more) * separate(tibble, col, into, sep) -- * **unite()**: does just the opposite! 
  * unite(tibble, col, *column names*, sep)

--

Let's practice on the **table5** data set:

```r
table5
```

```
## # A tibble: 6 x 4
## country century year rate
## * <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745/19987071
## 2 Afghanistan 20 00 2666/20595360
## 3 Brazil 19 99 37737/172006362
## 4 Brazil 20 00 80488/174504898
## 5 China 19 99 212258/1272915272
## 6 China 20 00 213766/1280428583
```

---

## tidyr

### separate & unite

* **Separate** column **rate** into 2:
  * **cases** and **population**
* **Unite** columns **century** and **year** into 1:
  * **year**: **1999** instead of **19** and **99**

---

## tidyr

### separate & unite

```r
# separate column "rate"
table5a <- separate(table5,
                    col=rate,
                    into=c("cases", "population"),
                    sep="/")
table5a
```

```
## # A tibble: 6 x 5
## country century year cases population
## <chr> <chr> <chr> <chr> <chr>
## 1 Afghanistan 19 99 745 19987071
## 2 Afghanistan 20 00 2666 20595360
## 3 Brazil 19 99 37737 172006362
## 4 Brazil 20 00 80488 174504898
## 5 China 19 99 212258 1272915272
## 6 China 20 00 213766 1280428583
```

---

## tidyr

### separate & unite

```r
# unite columns "century" and "year"
table5b <- unite(table5a,
                 col=year,
                 c("century", "year"),
                 sep="")
table5b
```

```
## # A tibble: 6 x 4
## country year cases population
## <chr> <chr> <chr> <chr>
## 1 Afghanistan 1999 745 19987071
## 2 Afghanistan 2000 2666 20595360
## 3 Brazil 1999 37737 172006362
## 4 Brazil 2000 80488 174504898
## 5 China 1999 212258 1272915272
## 6 China 2000 213766 1280428583
```

---

## tidyr

### Tidy data

Practice a bit more: let's create a toy **untidy tibble**:

```r
patients <- tibble(
  names = c("A", "B", "C", "D"),
  age = c(21, 32, 25, 43),
  c("188cm/93kg", "167cm/55kg", "155cm/51kg", "175cm/72kg")
)
```

---

## tidyr

### Tidy data

```
## # A tibble: 4 x 3
## names age `c("188cm/93kg", "167cm/55kg", "155cm/51kg", "175cm/72kg")`
## <chr> <dbl> <chr>
## 1 A 21 188cm/93kg
## 2 B 32 167cm/55kg
## 3 C 25 155cm/51kg
## 4 D 43 175cm/72kg
```

What is
wrong here?

--

**2 variables in the third column!**

--

* Split the column with **separate()**:

--

```r
# data: tibble/data frame
# col: column to separate
# sep: character to use to split the column
# into: names of the columns that are created after separation
patients <- separate(data=patients,
                     col=3,
                     sep="/",
                     into=c("height", "weight"))
```

---

## tidyr

### Tidy data

Anything else wrong with **patients** now?

```
## # A tibble: 4 x 4
## names age height weight
## <chr> <dbl> <chr> <chr>
## 1 A 21 188cm 93kg
## 2 B 32 167cm 55kg
## 3 C 25 155cm 51kg
## 4 D 43 175cm 72kg
```

--

**Extra characters in the height and weight columns!**

--

* Remove "cm" and "kg"!

--

* Here we introduce the **str_remove()** function from the **{stringr}** package:

```r
patients$height <- str_remove(patients$height, "cm")
patients$weight <- str_remove(patients$weight, "kg")
```

---

## tidyr

### Tidy data

Is there still a problem?

```
## # A tibble: 4 x 4
## names age height weight
## <chr> <dbl> <chr> <chr>
## 1 A 21 188 93
## 2 B 32 167 55
## 3 C 25 155 51
## 4 D 43 175 72
```

--

**Columns height and weight are treated as characters!**

--

* We need to convert them to numeric.

--

* Here we introduce **mutate_at()** from the **{dplyr}** package:

```r
# first argument: the tibble
# second argument: a vector of column names to mutate
# third argument: how to mutate those columns
patients <- mutate_at(patients, c("height", "weight"), as.numeric)
```

---

## tidyr

### Tidy data

**patients** is now tidy!

```
## # A tibble: 4 x 4
## names age height weight
## <chr> <dbl> <dbl> <dbl>
## 1 A 21 188 93
## 2 B 32 167 55
## 3 C 25 155 51
## 4 D 43 175 72
```

---

## dplyr

Introduces a grammar of data manipulation.
[Cheat sheet](https://github.com/rstudio/cheatsheets/blob/master/data-transformation.pdf)

--

We will introduce the **5 intuitively-named key functions** from **{dplyr}**:

--

* **mutate()** adds new variables (columns) that are functions of existing variables.

--

* **select()** picks variables (columns) based on their names.

--

* **filter()** picks observations (rows) based on their values.

--

* **summarise()** collapses multiple values down to a single summary.

--

* **arrange()** changes the ordering of the rows.

---

## dplyr

All 5 functions work in a similar and consistent way:

--

* The first argument is a **data frame** or a **tibble**.

--

* The result is a new data frame.

> *Note that* ***{dplyr}*** *never modifies the input: you need to* ***assign the output*** *to a new (or the same) object.*

---

## dplyr

Let's try!

We will use the **presidential** data set.

*It contains data about the terms of* ***presidents of the USA***, *from Eisenhower to Obama:*

* Name
* Term starting date
* Term ending date
* Political party

--

```r
print(presidential, n=6)
```

```
## # A tibble: 11 x 4
## name start end party
## <chr> <date> <date> <chr>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican
## 2 Kennedy 1961-01-20 1963-11-22 Democratic
## 3 Johnson 1963-11-22 1969-01-20 Democratic
## 4 Nixon 1969-01-20 1974-08-09 Republican
## 5 Ford 1974-08-09 1977-01-20 Republican
## 6 Carter 1977-01-20 1981-01-20 Democratic
## # … with 5 more rows
```

---

## dplyr

### mutate & transmute

**mutate()** allows you to create new columns that are functions of the existing ones.
--

* Create a new column with the duration of each term:

--

```r
# Subtract column start from column end
mutate(presidential,
*      duration_days=end - start)
```

```
## # A tibble: 11 x 5
## name start end party duration_days
## <chr> <date> <date> <chr> <drtn>
## 1 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
## 3 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
## 4 Nixon 1969-01-20 1974-08-09 Republican 2027 days
## 5 Ford 1974-08-09 1977-01-20 Republican 895 days
## 6 Carter 1977-01-20 1981-01-20 Democratic 1461 days
## 7 Reagan 1981-01-20 1989-01-20 Republican 2922 days
## 8 Bush 1989-01-20 1993-01-20 Republican 1461 days
## 9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
## 10 Bush 2001-01-20 2009-01-20 Republican 2922 days
## 11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
```

---

## dplyr

### mutate & transmute

> Use **unquoted** column names.

--

> Note that new columns are added at the end of the data frame.

--

> Note that **mutate()** keeps all columns.
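
---

## dplyr

### mutate & transmute

The notes above can be checked on a small made-up tibble. This is only a sketch: the tibble **toy** and its columns (**id**, **start**, **end**) are invented for illustration and are not part of the workshop data sets.

```r
library(tibble)
library(dplyr)

# hypothetical toy data, invented for this illustration
toy <- tibble(
  id    = c("a", "b", "c"),
  start = c(1, 5, 10),
  end   = c(4, 9, 20)
)

# column names are unquoted; "duration" is computed from existing columns
toy2 <- mutate(toy, duration = end - start)

names(toy2)  # original columns are kept, the new column comes last
names(toy)   # the input tibble itself is not modified
```

Because **toy** is left untouched, the result of **mutate()** has to be assigned to an object (here **toy2**) if you want to keep it.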
--- ## dplyr ### mutate & transmute Keep only the **newly created column(s)** (drop the remaining ones) with **transmute()** instead of **mutate()**: -- ```r transmute(presidential, * duration_days=end - start) ``` ``` ## # A tibble: 11 x 1 ## duration_days ## <drtn> ## 1 2922 days ## 2 1036 days ## 3 1886 days ## 4 2027 days ## 5 895 days ## 6 1461 days ## 7 2922 days ## 8 1461 days ## 9 2922 days ## 10 2922 days ## 11 2922 days ``` --- ## dplyr ### mutate & transmute Re-assign to a new - or the same - data frame/tibble using the R **assignment operator: <-** ```r presidential <- mutate(presidential, duration_days=end - start) ``` --- ## dplyr ### select Select column **name** only: <table class="table table-striped" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> start </th> <th style="text-align:left;"> end </th> <th style="text-align:left;"> party </th> <th style="text-align:left;"> duration_days </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;background-color: yellow !important;"> Eisenhower </td> <td style="text-align:left;"> 1953-01-20 </td> <td style="text-align:left;"> 1961-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Kennedy </td> <td style="text-align:left;"> 1961-01-20 </td> <td style="text-align:left;"> 1963-11-22 </td> <td style="text-align:left;"> Democratic </td> <td style="text-align:left;"> 1036 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Johnson </td> <td style="text-align:left;"> 1963-11-22 </td> <td style="text-align:left;"> 1969-01-20 </td> <td style="text-align:left;"> Democratic </td> <td style="text-align:left;"> 1886 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Nixon </td> <td style="text-align:left;"> 
1969-01-20 </td> <td style="text-align:left;"> 1974-08-09 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2027 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Ford </td> <td style="text-align:left;"> 1974-08-09 </td> <td style="text-align:left;"> 1977-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 895 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Carter </td> <td style="text-align:left;"> 1977-01-20 </td> <td style="text-align:left;"> 1981-01-20 </td> <td style="text-align:left;"> Democratic </td> <td style="text-align:left;"> 1461 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Reagan </td> <td style="text-align:left;"> 1981-01-20 </td> <td style="text-align:left;"> 1989-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Bush </td> <td style="text-align:left;"> 1989-01-20 </td> <td style="text-align:left;"> 1993-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 1461 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Clinton </td> <td style="text-align:left;"> 1993-01-20 </td> <td style="text-align:left;"> 2001-01-20 </td> <td style="text-align:left;"> Democratic </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Bush </td> <td style="text-align:left;"> 2001-01-20 </td> <td style="text-align:left;"> 2009-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Obama </td> <td style="text-align:left;"> 2009-01-20 </td> <td style="text-align:left;"> 2017-01-20 </td> <td 
style="text-align:left;"> Democratic </td> <td style="text-align:left;"> 2922 days </td> </tr> </tbody> </table> --- ## dplyr ### select ```r select(presidential, name) ``` ``` ## # A tibble: 11 x 1 ## name ## <chr> ## 1 Eisenhower ## 2 Kennedy ## 3 Johnson ## 4 Nixon ## 5 Ford ## 6 Carter ## 7 Reagan ## 8 Bush ## 9 Clinton ## 10 Bush ## 11 Obama ``` --- ## dplyr ### select Select columns **party** and **name** (in that order): ```r select(presidential, * party, name) ``` ``` ## # A tibble: 11 x 2 ## party name ## <chr> <chr> ## 1 Republican Eisenhower ## 2 Democratic Kennedy ## 3 Democratic Johnson ## 4 Republican Nixon ## 5 Republican Ford ## 6 Democratic Carter ## 7 Republican Reagan ## 8 Republican Bush ## 9 Democratic Clinton ## 10 Republican Bush ## 11 Democratic Obama ``` --- ## dplyr ### select Rename a column as you select it: ```r select(presidential, * party, President=name) ``` ``` ## # A tibble: 11 x 2 ## party President ## <chr> <chr> ## 1 Republican Eisenhower ## 2 Democratic Kennedy ## 3 Democratic Johnson ## 4 Republican Nixon ## 5 Republican Ford ## 6 Democratic Carter ## 7 Republican Reagan ## 8 Republican Bush ## 9 Democratic Clinton ## 10 Republican Bush ## 11 Democratic Obama ``` --- ## dplyr ### select Select all columns **except** party: ```r select(presidential, * -party) ``` ``` ## # A tibble: 11 x 4 ## name start end duration_days ## <chr> <date> <date> <drtn> ## 1 Eisenhower 1953-01-20 1961-01-20 2922 days ## 2 Kennedy 1961-01-20 1963-11-22 1036 days ## 3 Johnson 1963-11-22 1969-01-20 1886 days ## 4 Nixon 1969-01-20 1974-08-09 2027 days ## 5 Ford 1974-08-09 1977-01-20 895 days ## 6 Carter 1977-01-20 1981-01-20 1461 days ## 7 Reagan 1981-01-20 1989-01-20 2922 days ## 8 Bush 1989-01-20 1993-01-20 1461 days ## 9 Clinton 1993-01-20 2001-01-20 2922 days ## 10 Bush 2001-01-20 2009-01-20 2922 days ## 11 Obama 2009-01-20 2017-01-20 2922 days ``` --- ## dplyr ### select Select all columns between **start** and **party** (inclusive) ```r 
select(presidential, * start:party) ``` ``` ## # A tibble: 11 x 3 ## start end party ## <date> <date> <chr> ## 1 1953-01-20 1961-01-20 Republican ## 2 1961-01-20 1963-11-22 Democratic ## 3 1963-11-22 1969-01-20 Democratic ## 4 1969-01-20 1974-08-09 Republican ## 5 1974-08-09 1977-01-20 Republican ## 6 1977-01-20 1981-01-20 Democratic ## 7 1981-01-20 1989-01-20 Republican ## 8 1989-01-20 1993-01-20 Republican ## 9 1993-01-20 2001-01-20 Democratic ## 10 2001-01-20 2009-01-20 Republican ## 11 2009-01-20 2017-01-20 Democratic ``` --- ## dplyr ### select_if Select only columns containing characters with **select_if()**: ```r select_if(presidential, * is.character) ``` ``` ## # A tibble: 11 x 2 ## name party ## <chr> <chr> ## 1 Eisenhower Republican ## 2 Kennedy Democratic ## 3 Johnson Democratic ## 4 Nixon Republican ## 5 Ford Republican ## 6 Carter Democratic ## 7 Reagan Republican ## 8 Bush Republican ## 9 Clinton Democratic ## 10 Bush Republican ## 11 Obama Democratic ``` --- ## dplyr ### filter **filter()** is used to filter rows in a data frame/tibble. 
Keep rows if party is **Democratic**: <table class="table table-striped" style="font-size: 14px; margin-left: auto; margin-right: auto;"> <thead> <tr> <th style="text-align:left;"> name </th> <th style="text-align:left;"> start </th> <th style="text-align:left;"> end </th> <th style="text-align:left;"> party </th> <th style="text-align:left;"> duration_days </th> </tr> </thead> <tbody> <tr> <td style="text-align:left;"> Eisenhower </td> <td style="text-align:left;"> 1953-01-20 </td> <td style="text-align:left;"> 1961-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Kennedy </td> <td style="text-align:left;background-color: yellow !important;"> 1961-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> 1963-11-22 </td> <td style="text-align:left;background-color: yellow !important;"> Democratic </td> <td style="text-align:left;background-color: yellow !important;"> 1036 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Johnson </td> <td style="text-align:left;background-color: yellow !important;"> 1963-11-22 </td> <td style="text-align:left;background-color: yellow !important;"> 1969-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> Democratic </td> <td style="text-align:left;background-color: yellow !important;"> 1886 days </td> </tr> <tr> <td style="text-align:left;"> Nixon </td> <td style="text-align:left;"> 1969-01-20 </td> <td style="text-align:left;"> 1974-08-09 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2027 days </td> </tr> <tr> <td style="text-align:left;"> Ford </td> <td style="text-align:left;"> 1974-08-09 </td> <td style="text-align:left;"> 1977-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 895 days </td> </tr> <tr> <td 
style="text-align:left;background-color: yellow !important;"> Carter </td> <td style="text-align:left;background-color: yellow !important;"> 1977-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> 1981-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> Democratic </td> <td style="text-align:left;background-color: yellow !important;"> 1461 days </td> </tr> <tr> <td style="text-align:left;"> Reagan </td> <td style="text-align:left;"> 1981-01-20 </td> <td style="text-align:left;"> 1989-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;"> Bush </td> <td style="text-align:left;"> 1989-01-20 </td> <td style="text-align:left;"> 1993-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 1461 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Clinton </td> <td style="text-align:left;background-color: yellow !important;"> 1993-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> 2001-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> Democratic </td> <td style="text-align:left;background-color: yellow !important;"> 2922 days </td> </tr> <tr> <td style="text-align:left;"> Bush </td> <td style="text-align:left;"> 2001-01-20 </td> <td style="text-align:left;"> 2009-01-20 </td> <td style="text-align:left;"> Republican </td> <td style="text-align:left;"> 2922 days </td> </tr> <tr> <td style="text-align:left;background-color: yellow !important;"> Obama </td> <td style="text-align:left;background-color: yellow !important;"> 2009-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> 2017-01-20 </td> <td style="text-align:left;background-color: yellow !important;"> Democratic </td> <td style="text-align:left;background-color: yellow !important;"> 2922 days </td> </tr> </tbody> </table> --- ## 
dplyr

### filter

Keep rows if party is **Democratic**:

```r
filter(presidential,
*      party=="Democratic")
```

```
## # A tibble: 5 x 5
## name start end party duration_days
## <chr> <date> <date> <chr> <drtn>
## 1 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
## 2 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
## 3 Carter 1977-01-20 1981-01-20 Democratic 1461 days
## 4 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
## 5 Obama 2009-01-20 2017-01-20 Democratic 2922 days
```

---

## dplyr

### filter

You can filter using several variables/columns:

```r
filter(presidential,
*      party=="Republican", name=="Bush")

# Comma-separated conditions are implicitly combined with "&"
# (both have to be TRUE), so this is equivalent:
filter(presidential,
*      party=="Republican" & name=="Bush")

# Any expression that returns a logical vector can be used
filter(presidential,
*      name %in% c("Bush", "Kennedy"))
```

---

## dplyr

### summarise & group_by

**summarise()** collapses a data frame to a single row (base R: **aggregate()**).

--

Get the average length of terms:

```r
summarise(presidential,
*         mean(duration_days))
```

```
## # A tibble: 1 x 1
## `mean(duration_days)`
## <drtn>
## 1 2125.091 days
```

---

## dplyr

### summarise & group_by

**summarise()** collapses a data frame to a single row (base R: **aggregate()**).

Get the average length of terms and the number of terms:

```r
summarise(presidential,
          mean(duration_days),
*         n())
```

```
## # A tibble: 1 x 2
## `mean(duration_days)` `n()`
## <drtn> <int>
## 1 2125.091 days 11
```

---

## dplyr

### summarise & group_by

You can combine **summarise()** with **group_by()** to get the average length of terms **per political party**:

--

* **group_by()** defines a grouping based on existing variables.
--

* **summarise()** then computes the summary separately for each group.

--

```r
*groups <- group_by(presidential,
*                   party)
summarise(groups, mean(duration_days), n())
```

```
## # A tibble: 2 x 3
## party `mean(duration_days)` `n()`
## <chr> <drtn> <int>
## 1 Democratic 2045.4 days 5
## 2 Republican 2191.5 days 6
```

---

## dplyr

### arrange

Order rows by increasing term duration with **arrange()**:

```r
arrange(presidential, duration_days)
```

```
## # A tibble: 11 x 5
## name start end party duration_days
## <chr> <date> <date> <chr> <drtn>
## 1 Ford 1974-08-09 1977-01-20 Republican 895 days
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
## 3 Carter 1977-01-20 1981-01-20 Democratic 1461 days
## 4 Bush 1989-01-20 1993-01-20 Republican 1461 days
## 5 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
## 6 Nixon 1969-01-20 1974-08-09 Republican 2027 days
## 7 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
## 8 Reagan 1981-01-20 1989-01-20 Republican 2922 days
## 9 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
## 10 Bush 2001-01-20 2009-01-20 Republican 2922 days
## 11 Obama 2009-01-20 2017-01-20 Democratic 2922 days
```

```r
# decreasing order:
arrange(presidential, desc(duration_days))
```

---

## dplyr

### arrange

You can use several columns for the sorting:

```r
arrange(presidential, duration_days, name)
```

```
## # A tibble: 11 x 5
## name start end party duration_days
## <chr> <date> <date> <chr> <drtn>
## 1 Ford 1974-08-09 1977-01-20 Republican 895 days
## 2 Kennedy 1961-01-20 1963-11-22 Democratic 1036 days
## 3 Bush 1989-01-20 1993-01-20 Republican 1461 days
## 4 Carter 1977-01-20 1981-01-20 Democratic 1461 days
## 5 Johnson 1963-11-22 1969-01-20 Democratic 1886 days
## 6 Nixon 1969-01-20 1974-08-09 Republican 2027 days
## 7 Bush 2001-01-20 2009-01-20 Republican 2922 days
## 8 Clinton 1993-01-20 2001-01-20 Democratic 2922 days
## 9 Eisenhower 1953-01-20 1961-01-20 Republican 2922 days
## 10 Obama 2009-01-20 2017-01-20 Democratic 2922
days
## 11 Reagan 1981-01-20 1989-01-20 Republican 2922 days
```

---

## magrittr

### %>% : forward-pipe operator

The **{magrittr}** package introduced the **forward-pipe operator**.

<img src="https://magrittr.tidyverse.org/logo.png" width="30%" style="display: block; margin: auto;" />

---

## magrittr

### %>% : forward-pipe operator

Basic piping, read ***from left to right***: the pipe passes the output of one function forward as the **first argument of the next function**.

--

* **mytibble %>% function1** is equivalent to **function1(mytibble)**

--

* **mytibble %>% function1(y)** is equivalent to **function1(mytibble, y)**

--

* **mytibble %>% function1 %>% function2** is equivalent to **function2(function1(mytibble))**

--

Example:

```r
mutate(presidential, duration_days=end-start) %>%
  filter(duration_days < 1000)
```

```
## # A tibble: 1 x 5
## name start end party duration_days
## <chr> <date> <date> <chr> <drtn>
## 1 Ford 1974-08-09 1977-01-20 Republican 895 days
```

---

## magrittr

### %>% : forward-pipe operator

Example:

```r
mutate(presidential, duration_days=end-start) %>%
  filter(party == "Democratic") %>%
  summarise(mean(duration_days))
```

```
## # A tibble: 1 x 1
## `mean(duration_days)`
## <drtn>
## 1 2045.4 days
```

--

```r
# same as
presidential %>%
  mutate(duration_days=end-start) %>%
  filter(party == "Democratic") %>%
  summarise(mean(duration_days))
```

---

## magrittr

### %>% : forward-pipe operator

Another example:

```r
mutate(presidential, duration_days=end-start) %>%
  group_by(party) %>%
  summarise(mean(duration_days))
```

```
## # A tibble: 2 x 2
## party `mean(duration_days)`
## <chr> <drtn>
## 1 Democratic 2045.4 days
## 2 Republican 2191.5 days
```

---

## dplyr & %>%

### Hands on !
We will work with the **storms** data set: * Positions and attributes of **198 tropical storms**, measured every 6 hours ```r storms ``` ``` ## # A tibble: 10,010 x 13 ## name year month day hour lat long status category wind pressure ## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> ## 1 Amy 1975 6 27 0 27.5 -79 tropi… -1 25 1013 ## 2 Amy 1975 6 27 6 28.5 -79 tropi… -1 25 1013 ## 3 Amy 1975 6 27 12 29.5 -79 tropi… -1 25 1013 ## 4 Amy 1975 6 27 18 30.5 -79 tropi… -1 25 1013 ## 5 Amy 1975 6 28 0 31.5 -78.8 tropi… -1 25 1012 ## 6 Amy 1975 6 28 6 32.4 -78.7 tropi… -1 25 1012 ## 7 Amy 1975 6 28 12 33.3 -78 tropi… -1 25 1011 ## 8 Amy 1975 6 28 18 34 -77 tropi… -1 30 1006 ## 9 Amy 1975 6 29 0 34.4 -75.8 tropi… 0 35 1004 ## 10 Amy 1975 6 29 6 34 -74.8 tropi… 0 40 1002 ## # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>, ## # hu_diameter <dbl> ``` --- ## dplyr & %>% ### Hands on ! We will work with the **storms** data set: 1. Remove columns **month**, **day**, **hour**, **lat**, **long**, **ts_diameter** and **hu_diameter**. * Calculate the **median pressure for each storm status**. 2. Calculate the **minimum wind speed for each storm (name)**. * What storm has the smallest minimum wind speed? 3. Calculate **how many storms happened each year**. * *TIP: find what* ***distinct()*** *from* ***{dplyr}*** can do... * What are the years with the **maximum** number of storms? --- ## dplyr & %>% ### Hands on ! Remove columns **month**, **day**, **hour**, **lat**, **long**, **ts_diameter** and **hu_diameter**. * Calculate the **median pressure for each storm status**. 
--

```r
# remove columns
select(storms, -month, -day, -hour, -lat, -long, -ts_diameter, -hu_diameter)
# same as:
select(storms, -c(month, day, hour, lat, long, ts_diameter, hu_diameter))
```

--

```r
# group by status and calculate median pressure per storm status
select(storms, -month, -day, -hour, -lat, -long, -ts_diameter, -hu_diameter) %>%
  group_by(status) %>%
  summarise(median(pressure))
```

```
## # A tibble: 3 x 2
## status `median(pressure)`
## <chr> <dbl>
## 1 hurricane 973
## 2 tropical depression 1008
## 3 tropical storm 1000
```

---

## dplyr & %>%

### Hands on !

Calculate the **minimum wind speed for each storm (name)**.

* Which storm has the **smallest minimum wind speed**?

--

```r
# group storms by name and calculate minimum wind speed
storms %>%
  group_by(name) %>%
  summarise(min_wind = min(wind))
```

```
## # A tibble: 198 x 2
## name min_wind
## <chr> <int>
## 1 AL011993 25
## 2 AL012000 25
## 3 AL021992 25
## 4 AL021994 15
## 5 AL021999 25
## 6 AL022000 25
## 7 AL022001 25
## 8 AL022003 30
## 9 AL022006 30
## 10 AL031987 10
## # … with 188 more rows
```

---

## dplyr & %>%

### Hands on !

Calculate the **minimum wind speed for each storm (name)**.

* Which storm has the **smallest minimum wind speed**?

```r
# sort by increasing minimum wind speed
# introducing top_n(): keeps the n rows with the largest values
# of the last column (here min_wind), whatever the row order
# note: top_n(2) therefore returns the 2 storms with the LARGEST minimum
# wind speed; use head(2) after arrange(), or top_n(-2), for the smallest
storms %>%
  group_by(name) %>%
  summarise(min_wind = min(wind)) %>%
  arrange(min_wind) %>%
  top_n(2)
```

```
## Selecting by min_wind
```

```
## # A tibble: 2 x 2
## name min_wind
## <chr> <int>
## 1 Sean 40
## 2 Doris 45
```

---

## dplyr & %>%

### Hands on !

3. Calculate **how many storms happened each year**.
  * *TIP: find what* ***distinct()*** *from* ***{dplyr}*** *can do...*
  * What are the years with the **maximum** number of storms?

--

```r
# get unique rows when considering both name and year columns
distinct(storms, name, year)
```

--

```r
# group by year and count the number of storms
distinct(storms, name, year) %>%
  group_by(year) %>%
  summarise(storms_per_year=n())
```

---

## dplyr & %>%

### Hands on !

3.
Calculate **how many storms happened each year**. * *TIP: find what* ***distinct()*** *from* ***{dplyr}*** can do... * What are the years with the **maximum** number of storms? ```r # sort by decreasing number of storms distinct(storms, name, year) %>% group_by(year) %>% summarise(storms_per_year=n()) %>% arrange(desc(storms_per_year)) %>% top_n(4) ``` ``` ## Selecting by storms_per_year ``` ``` ## # A tibble: 4 x 2 ## year storms_per_year ## <dbl> <int> ## 1 1995 21 ## 2 2005 21 ## 3 2003 20 ## 4 2010 20 ``` --- ## ggplot2 ### Visualization As Mireia said in October's [ggplot2 workshop](https://mireia-bioinfo.github.io/workshop_ggplot2/index.html): **build plots one layer at a time!** (separated with **+**) * Data * Aesthetics * Geometries -- ```r ggplot(storms, aes(x=wind, y=pressure)) + geom_point() ``` ![](101112_rladies_tidyverse_files/figure-html/unnamed-chunk-52-1.png)<!-- --> --- ## ggplot2 ### Visualization Combine with the data wrangling / selection with **%>%**: -- ```r storms %>% filter(name %in% c("Tony", "Paloma", "Zeta", "Luis", "Juliet", "Clara")) ``` ``` ## # A tibble: 161 x 13 ## name year month day hour lat long status category wind pressure ## <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr> <ord> <int> <int> ## 1 Clara 1977 9 5 12 32.8 -80 tropi… -1 20 1015 ## 2 Clara 1977 9 5 18 33.2 -79 tropi… -1 20 1014 ## 3 Clara 1977 9 6 0 33.6 -78.2 tropi… -1 20 1013 ## 4 Clara 1977 9 6 6 33.8 -77.6 tropi… -1 25 1012 ## 5 Clara 1977 9 6 12 34 -77 tropi… -1 25 1011 ## 6 Clara 1977 9 6 18 34.2 -76.4 tropi… -1 25 1010 ## 7 Clara 1977 9 7 0 34.4 -75.8 tropi… -1 30 1010 ## 8 Clara 1977 9 7 6 34.6 -75 tropi… -1 30 1010 ## 9 Clara 1977 9 7 12 34.7 -74.3 tropi… -1 30 1010 ## 10 Clara 1977 9 7 18 34.9 -73 tropi… -1 30 1010 ## # … with 151 more rows, and 2 more variables: ts_diameter <dbl>, ## # hu_diameter <dbl> ``` --- ## ggplot2 ### Visualization Combine with the data wrangling / selection with **%>%**: ```r storms %>% filter(name %in% c("Tony", "Paloma", "Zeta", 
"Luis", "Juliet", "Clara")) %>% ggplot(aes(x=wind, y=pressure, col=name)) + geom_point() + theme_classic() + ggtitle("Storms: wind vs pressure") ``` ![](101112_rladies_tidyverse_files/figure-html/unnamed-chunk-54-1.png)<!-- --> --- class: inverse, center, middle # THANK YOU ! -- *slides created with the [xaringan package](https://github.com/yihui/xaringan)* -- Follow us on Twitter ! @RLadiesBCN <img src="https://media.giphy.com/media/SMKiEh9WDO6ze/giphy.gif" width="55%" style="display: block; margin: auto;" /> --- # Some resources * [Tidyverse website](https://www.tidyverse.org/) * [R Studio cheatsheet](https://rstudio.com/resources/cheatsheets/) * [R for data science](https://r4ds.had.co.nz/) * [Text mining with R](https://www.tidytextmining.com/) * [Advanced R](https://adv-r.hadley.nz/) --- class: inverse # Let's see what you have learnt today ! <img src="https://media.giphy.com/media/1BgNCE4bMilwje4SBi/giphy.gif" width="40%" style="display: block; margin: auto;" /> Go to: #[kahoot.it](https://kahoot.it/) ---