Reshaping data

class: center, middle, inverse, title-slide

.title[
# Reshaping data
]
.author[
### Heike Hofmann
]

---

class: middle, inverse, center
# The `tidyr` package

---

# What is tidy data?

> Happy families are all alike; every unhappy family is unhappy in its own way.<br>
Leo Tolstoy

- `tidyr` vignette available as `vignette("tidy-data", package="tidyr")`

- vignette is updated version of the [tidy data paper](https://www.jstatsoft.org/article/view/v059i10) with updated code

---

# Outline

- Different sources of messiness

- Key-Value pairs

- `tidyr`: `pivot_longer` and `pivot_wider` (former `gather` and `spread`)

---

# Data often is in a spreadsheet format, but ...

there's different ways of encoding the same information:

Option #1

```
##           name treatmenta treatmentb
## 1   John Smith         NA         18
## 2     Jane Doe          4          1
## 3 Mary Johnson          6          7
```

Option #2

```
##   treatment John.Smith Jane.Doe Mary.Johnson
## 1         a         NA        4            6
## 2         b         18        1            7
```

Neither #1 nor #2 are "clean" versions of the data: observed information is part of the data structure; some implicit information is assumed

---

# Sources of Messiness

1. Column headers are values, not variable names.<br>
e.g. *treatmenta, treatmentb*

2. Multiple variables are stored in one column.<br>
e.g. *Fall 2015, Spring 2016* or *"1301 8th St SE, Orange City, Iowa 51041
(42.99755, -96.04149)", "2102 Durant, Harlan, Iowa 51537
(41.65672, -95.33780)"*

3. Multiple observational units are stored in the same table.

4. A single observational unit is stored in multiple tables.

---

# Tidy data

1. Each variable forms one column.

2. Each observation forms one row.

3. Each type of observational unit forms a table.

---

# Clean version of the example

```
##   treatment      patient score
## 1         a   John Smith    NA
## 2         b   John Smith    18
## 3         a     Jane Doe     4
## 4         b     Jane Doe     1
## 5         a Mary Johnson     6
## 6         b Mary Johnson     7
```

- `treatment` and `patient` uniquely describe a single row in the dataset.

- `treatment` and `patient` are **key variables**,

- `score` is a **measurement variable**

- this makes `treatment-patient` and `score` a **key-value pair**

---

# Key-value pairs (KVP)
 
**Key-Value pairs** (KVP) - also *attribute-value*, *field-value*, *name-value*:
abstract data representation that allows a lot of flexibility

One way of telling whether a data set is tidy is to check that all keys for a value are aligned in one row:

---

# Tidying data - Plan of attack

Very few functions are needed for tidying data:

- Messy (1): `tidyr` functions `pivot_longer` and `pivot_wider`.  
`pivot_longer(data,  cols, names_to = "name", values_to = "values")`: take multiple columns and collapse into key-value pairs <br>
`pivot_wider (data, names_from=name, values_from=value)`: spread a key-value pair across multiple columns.

- Messy (2): `tidyr` function `separate (data, col, into, sep = "[^[:alnum:]]+")`:
separate one column into multiple columns

- Messy (3): `dplyr` - some combination of the functions discussed previously

- Messy (4): `dplyr` functionality `join` (and friends) to combine multiple data sets

---

# French fries data

```r
data(french_fries, package="reshape2")
```

- data from sensory experiment conducted at Iowa State University in 2004

- investigators were interested in comparing effects of three different fryer oils on taste of fries

```r
french_fries %>% head()
```

```
##    time treatment subject rep potato buttery grassy rancid painty
## 61    1         1       3   1    2.9     0.0    0.0    0.0    5.5
## 25    1         1       3   2   14.0     0.0    0.0    1.1    0.0
## 62    1         1      10   1   11.0     6.4    0.0    0.0    0.0
## 26    1         1      10   2    9.9     5.9    2.9    2.2    0.0
## 63    1         1      15   1    1.2     0.1    0.0    1.1    5.1
## 27    1         1      15   2    8.8     3.0    3.6    1.5    2.3
```

---

# Gather/Pivot to long

```r
ffm <- french_fries %>% 
  pivot_longer(cols = potato:painty, 
               names_to = "scale", 
               values_to = "score")
ffm
```

```
## # A tibble: 3,480 × 6
##    time  treatment subject   rep scale   score
##    <fct> <fct>     <fct>   <dbl> <chr>   <dbl>
##  1 1     1         3           1 potato    2.9
##  2 1     1         3           1 buttery   0  
##  3 1     1         3           1 grassy    0  
##  4 1     1         3           1 rancid    0  
##  5 1     1         3           1 painty    5.5
##  6 1     1         3           2 potato   14  
##  7 1     1         3           2 buttery   0  
##  8 1     1         3           2 grassy    0  
##  9 1     1         3           2 rancid    1.1
## 10 1     1         3           2 painty    0  
## # ℹ 3,470 more rows
```

---

# Gather/Pivot to long

---

# Pivot to wide

```r
ffm %>% pivot_wider(names_from = rep, values_from = score)
```

```
## # A tibble: 1,740 × 6
##    time  treatment subject scale     `1`   `2`
##    <fct> <fct>     <fct>   <chr>   <dbl> <dbl>
##  1 1     1         3       potato    2.9  14  
##  2 1     1         3       buttery   0     0  
##  3 1     1         3       grassy    0     0  
##  4 1     1         3       rancid    0     1.1
##  5 1     1         3       painty    5.5   0  
##  6 1     1         10      potato   11     9.9
##  7 1     1         10      buttery   6.4   5.9
##  8 1     1         10      grassy    0     2.9
##  9 1     1         10      rancid    0     2.2
## 10 1     1         10      painty    0     0  
## # ℹ 1,730 more rows
```

---

# Pivot to wide

formerly:

```r
ffm %>% spread(key = rep, value = score)
```

```
## # A tibble: 1,740 × 6
##    time  treatment subject scale     `1`   `2`
##    <fct> <fct>     <fct>   <chr>   <dbl> <dbl>
##  1 1     1         3       buttery   0     0  
##  2 1     1         3       grassy    0     0  
##  3 1     1         3       painty    5.5   0  
##  4 1     1         3       potato    2.9  14  
##  5 1     1         3       rancid    0     1.1
##  6 1     1         10      buttery   6.4   5.9
##  7 1     1         10      grassy    0     2.9
##  8 1     1         10      painty    0     0  
##  9 1     1         10      potato   11     9.9
## 10 1     1         10      rancid    0     2.2
## # ℹ 1,730 more rows
```
---

# Pivot to wide

```r
ffm %>% pivot_wider(names_from = rep, values_from = score, names_prefix="Rep")
```

```
## # A tibble: 1,740 × 6
##    time  treatment subject scale    Rep1  Rep2
##    <fct> <fct>     <fct>   <chr>   <dbl> <dbl>
##  1 1     1         3       potato    2.9  14  
##  2 1     1         3       buttery   0     0  
##  3 1     1         3       grassy    0     0  
##  4 1     1         3       rancid    0     1.1
##  5 1     1         3       painty    5.5   0  
##  6 1     1         10      potato   11     9.9
##  7 1     1         10      buttery   6.4   5.9
##  8 1     1         10      grassy    0     2.9
##  9 1     1         10      rancid    0     2.2
## 10 1     1         10      painty    0     0  
## # ℹ 1,730 more rows
```

---

```r
ffm %>% pivot_wider(names_from = rep, values_from = score, names_prefix="Rep") %>%
  ggplot(aes(x = Rep1, y = Rep2)) + geom_point() +
  facet_wrap(~scale) + geom_abline(colour = "grey50")
```

![](05_tidyr_files/figure-html/unnamed-chunk-10-1.png)

---
class: inverse
# Your turn

For this your turn use the `french_fries` data from the `reshape2` package: `data("french_fries", package="reshape2")`

- Use `pivot_longer` from the `tidyr` package to combine the different scales for assessing french fries into a single variable. Call the key-value pair "scale" and "score".

- Use  `pivot_wider` from the `tidyr` package to get a format in which you can directly compare values from week 1 to week 10. Plot a scatterplot of values in week 1 against week 10. Facet by treatment and scale, color by individuals   and use different shapes for the replicates. Is there a pattern visible?

---
class: inverse
# Your turn

For this your turn use the `fbiwide` data from the `classdata` package: `data("fbiwide", package="classdata")`

- Use  `pivot_longer` from the `tidyr` package to combine the variables for the different types of crimes into one variable. Call the key-value pair "Type" and "Incidences". Compute a crime rate

- Only consider crime rates for Iowa and Minnesota. Use `pivot_wider` to create incidence columns for each of these states. Plot crimes in Iowa against crimes in Minnesota, colour by type of crime. Note: you need to exclude some variables.

---

# `pivot_wider` works as

... the inverse of `pivot_longer` ... almost

---

# If the front part is not identical

... it's not collapsible

---

# Where using `pivot_wider` is tricky

- `spread`/`pivot_wider` is the (almost) inverse operation of `gather`/`pivot_longer`

- when we use `gather`/`pivot_longer` the key information is duplicated

- when using `spread`/`pivot_wider` we need to make sure that the information outside the key and value is identical for all levels (and combinations of levels) of the `key` variable.

- If the info is not identical across levels, missing values are introduced into the new data set.

---

# `spread`/`pivot_wider` in the example

Other state specific values don't allow spread to collapse the front part and introduce missing values:

```r
fbi %>% filter(year==2014, state_abbr %in% c("IA", "MN"),
               type=="burglary") %>%
  pivot_wider(names_from="state_abbr", values_from="count")
```

```
## # A tibble: 2 × 8
##   state     state_id  year population type     violent_crime    IA    MN
##   <chr>        <int> <int>      <int> <chr>    <lgl>         <int> <int>
## 1 Iowa            16  2014    3109481 burglary FALSE         14363    NA
## 2 Minnesota       27  2014    5457125 burglary FALSE            NA 20773
```

---

# fixing `pivot_wider` in the example

Negative selection of all state specific variables gets the example to "work":

```r
fbi %>% filter(year==2014, state_abbr %in% c("IA", "MN"),
               type=="burglary") %>%
  select(-state, -state_id, -population) %>%
  pivot_wider(names_from="state_abbr", values_from="count")
```

```
## # A tibble: 1 × 5
##    year type     violent_crime    IA    MN
##   <int> <chr>    <lgl>         <int> <int>
## 1  2014 burglary FALSE         14363 20773
```