class: center, middle, inverse, title-slide .title[ # Reshaping data ] .author[ ### Heike Hofmann ] --- class: middle, inverse, center # The `tidyr` package --- # What is tidy data? > Happy families are all alike; every unhappy family is unhappy in its own way.<br> Leo Tolstoy - `tidyr` vignette available as `vignette("tidy-data", package="tidyr")` - vignette is updated version of the [tidy data paper](https://www.jstatsoft.org/article/view/v059i10) with updated code --- # Outline - Different sources of messiness - Key-Value pairs - `tidyr`: `pivot_longer` and `pivot_wider` (former `gather` and `spread`) --- # Data often is in a spreadsheet format, but ... there's different ways of encoding the same information: Option #1 ``` ## name treatmenta treatmentb ## 1 John Smith NA 18 ## 2 Jane Doe 4 1 ## 3 Mary Johnson 6 7 ``` Option #2 ``` ## treatment John.Smith Jane.Doe Mary.Johnson ## 1 a NA 4 6 ## 2 b 18 1 7 ``` Neither #1 nor #2 are "clean" versions of the data: observed information is part of the data structure; some implicit information is assumed --- # Sources of Messiness 1. Column headers are values, not variable names.<br> e.g. *treatmenta, treatmentb* 2. Multiple variables are stored in one column.<br> e.g. *Fall 2015, Spring 2016* or *"1301 8th St SE, Orange City, Iowa 51041 (42.99755, -96.04149)", "2102 Durant, Harlan, Iowa 51537 (41.65672, -95.33780)"* 3. Multiple observational units are stored in the same table. 4. A single observational unit is stored in multiple tables. --- # Tidy data 1. Each variable forms one column. 2. Each observation forms one row. 3. Each type of observational unit forms a table. --- # Clean version of the example ``` ## treatment patient score ## 1 a John Smith NA ## 2 b John Smith 18 ## 3 a Jane Doe 4 ## 4 b Jane Doe 1 ## 5 a Mary Johnson 6 ## 6 b Mary Johnson 7 ``` - `treatment` and `patient` uniquely describe a single row in the dataset. - `treatment` and `patient` are **key variables**, - `score` is a **measurement variable** - this makes `treatment-patient` and `score` a **key-value pair** --- # Key-value pairs (KVP) **Key-Value pairs** (KVP) - also *attribute-value*, *field-value*, *name-value*: abstract data representation that allows a lot of flexibility One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row: | | | |:------------- |:------------- | | | | | <img src="images/kvp-unhappy.png" width=150> | <img src="images/kvp-happy.png" width=150> | |Untidy data | Tidy data | --- # Tidying data - Plan of attack Very few functions are needed for tidying data: - Messy (1): `tidyr` functions `pivot_longer` and `pivot_wider`. `pivot_longer(data, cols, names_to = "name", values_to = "values")`: take multiple columns and collapse into key-value pairs <br> `pivot_wider (data, names_from=name, values_from=value)`: spread a key-value pair across multiple columns. - Messy (2): `tidyr` function `separate (data, col, into, sep = "[^[:alnum:]]+")`: separate one column into multiple columns - Messy (3): `dplyr` - some combination of the functions discussed previously - Messy (4): `dplyr` functionality `join` (and friends) to combine multiple data sets --- # French fries data ```r data(french_fries, package="reshape2") ``` - data from sensory experiment conducted at Iowa State University in 2004 - investigators were interested in comparing effects of three different fryer oils on taste of fries ```r french_fries %>% head() ``` ``` ## time treatment subject rep potato buttery grassy rancid painty ## 61 1 1 3 1 2.9 0.0 0.0 0.0 5.5 ## 25 1 1 3 2 14.0 0.0 0.0 1.1 0.0 ## 62 1 1 10 1 11.0 6.4 0.0 0.0 0.0 ## 26 1 1 10 2 9.9 5.9 2.9 2.2 0.0 ## 63 1 1 15 1 1.2 0.1 0.0 1.1 5.1 ## 27 1 1 15 2 8.8 3.0 3.6 1.5 2.3 ``` --- # Gather/Pivot to long ```r ffm <- french_fries %>% pivot_longer(cols = potato:painty, names_to = "scale", values_to = "score") ffm ``` ``` ## # A tibble: 3,480 × 6 ## time treatment subject rep scale score ## <fct> <fct> <fct> <dbl> <chr> <dbl> ## 1 1 1 3 1 potato 2.9 ## 2 1 1 3 1 buttery 0 ## 3 1 1 3 1 grassy 0 ## 4 1 1 3 1 rancid 0 ## 5 1 1 3 1 painty 5.5 ## 6 1 1 3 2 potato 14 ## 7 1 1 3 2 buttery 0 ## 8 1 1 3 2 grassy 0 ## 9 1 1 3 2 rancid 1.1 ## 10 1 1 3 2 painty 0 ## # ℹ 3,470 more rows ``` --- # Gather/Pivot to long <img src="images/gather-spread.png" height="500"> --- # Pivot to wide ```r ffm %>% pivot_wider(names_from = rep, values_from = score) ``` ``` ## # A tibble: 1,740 × 6 ## time treatment subject scale `1` `2` ## <fct> <fct> <fct> <chr> <dbl> <dbl> ## 1 1 1 3 potato 2.9 14 ## 2 1 1 3 buttery 0 0 ## 3 1 1 3 grassy 0 0 ## 4 1 1 3 rancid 0 1.1 ## 5 1 1 3 painty 5.5 0 ## 6 1 1 10 potato 11 9.9 ## 7 1 1 10 buttery 6.4 5.9 ## 8 1 1 10 grassy 0 2.9 ## 9 1 1 10 rancid 0 2.2 ## 10 1 1 10 painty 0 0 ## # ℹ 1,730 more rows ``` --- # Pivot to wide formerly: ```r ffm %>% spread(key = rep, value = score) ``` ``` ## # A tibble: 1,740 × 6 ## time treatment subject scale `1` `2` ## <fct> <fct> <fct> <chr> <dbl> <dbl> ## 1 1 1 3 buttery 0 0 ## 2 1 1 3 grassy 0 0 ## 3 1 1 3 painty 5.5 0 ## 4 1 1 3 potato 2.9 14 ## 5 1 1 3 rancid 0 1.1 ## 6 1 1 10 buttery 6.4 5.9 ## 7 1 1 10 grassy 0 2.9 ## 8 1 1 10 painty 0 0 ## 9 1 1 10 potato 11 9.9 ## 10 1 1 10 rancid 0 2.2 ## # ℹ 1,730 more rows ``` --- # Pivot to wide ```r ffm %>% pivot_wider(names_from = rep, values_from = score, names_prefix="Rep") ``` ``` ## # A tibble: 1,740 × 6 ## time treatment subject scale Rep1 Rep2 ## <fct> <fct> <fct> <chr> <dbl> <dbl> ## 1 1 1 3 potato 2.9 14 ## 2 1 1 3 buttery 0 0 ## 3 1 1 3 grassy 0 0 ## 4 1 1 3 rancid 0 1.1 ## 5 1 1 3 painty 5.5 0 ## 6 1 1 10 potato 11 9.9 ## 7 1 1 10 buttery 6.4 5.9 ## 8 1 1 10 grassy 0 2.9 ## 9 1 1 10 rancid 0 2.2 ## 10 1 1 10 painty 0 0 ## # ℹ 1,730 more rows ``` --- ```r ffm %>% pivot_wider(names_from = rep, values_from = score, names_prefix="Rep") %>% ggplot(aes(x = Rep1, y = Rep2)) + geom_point() + facet_wrap(~scale) + geom_abline(colour = "grey50") ``` ![](05_tidyr_files/figure-html/unnamed-chunk-10-1.png)<!-- --> --- class: inverse # Your turn For this your turn use the `french_fries` data from the `reshape2` package: `data("french_fries", package="reshape2")` - Use `pivot_longer` from the `tidyr` package to combine the different scales for assessing french fries into a single variable. Call the key-value pair "scale" and "score". - Use `pivot_wider` from the `tidyr` package to get a format in which you can directly compare values from week 1 to week 10. Plot a scatterplot of values in week 1 against week 10. Facet by treatment and scale, color by individuals and use different shapes for the replicates. Is there a pattern visible? --- class: inverse # Your turn For this your turn use the `fbiwide` data from the `classdata` package: `data("fbiwide", package="classdata")` - Use `pivot_longer` from the `tidyr` package to combine the variables for the different types of crimes into one variable. Call the key-value pair "Type" and "Incidences". Compute a crime rate - Only consider crime rates for Iowa and Minnesota. Use `pivot_wider` to create incidence columns for each of these states. Plot crimes in Iowa against crimes in Minnesota, colour by type of crime. Note: you need to exclude some variables. --- # `pivot_wider` works as ... the inverse of `pivot_longer` ... almost <img src="images/spread-gather-clean.png" height="450"> --- # If the front part is not identical ... it's not collapsible <img src="images/spread-gather-unclean.png" height=450> --- # Where using `pivot_wider` is tricky - `spread`/`pivot_wider` is the (almost) inverse operation of `gather`/`pivot_longer` - when we use `gather`/`pivot_longer` the key information is duplicated - when using `spread`/`pivot_wider` we need to make sure that the information outside the key and value is identical for all levels (and combinations of levels) of the `key` variable. - If the info is not identical across levels, missing values are introduced into the new data set. --- # `spread`/`pivot_wider` in the example Other state specific values don't allow spread to collapse the front part and introduce missing values: ```r fbi %>% filter(year==2014, state_abbr %in% c("IA", "MN"), type=="burglary") %>% pivot_wider(names_from="state_abbr", values_from="count") ``` ``` ## # A tibble: 2 × 8 ## state state_id year population type violent_crime IA MN ## <chr> <int> <int> <int> <chr> <lgl> <int> <int> ## 1 Iowa 16 2014 3109481 burglary FALSE 14363 NA ## 2 Minnesota 27 2014 5457125 burglary FALSE NA 20773 ``` --- # fixing `pivot_wider` in the example Negative selection of all state specific variables gets the example to "work": ```r fbi %>% filter(year==2014, state_abbr %in% c("IA", "MN"), type=="burglary") %>% select(-state, -state_id, -population) %>% pivot_wider(names_from="state_abbr", values_from="count") ``` ``` ## # A tibble: 1 × 5 ## year type violent_crime IA MN ## <int> <chr> <lgl> <int> <int> ## 1 2014 burglary FALSE 14363 20773 ```