class: center, middle, inverse, title-slide .title[ # Reshaping data with tidyr - working with separate and unite ] .author[ ### Heike Hofmann ] --- class: middle, inverse, center # Separate and Unite --- # Different Types of Messiness 1. Column headers are values, not variable names.<br> e.g. *treatmenta, treatmentb* 2. Multiple variables are stored in one column.<br> e.g. *Fall 2015, Spring 2016* or *"1301 8th St SE, Orange City, Iowa 51041 (42.99755, -96.04149)", "2102 Durant, Harlan, Iowa 51537 (41.65672, -95.33780)"* 3. Multiple observational units are stored in the same table. 4. A single observational unit is stored in multiple tables. --- # Messiness (2) Messy (2): Multiple pieces of information are stored in one column ```r library(tidyverse) df <- data.frame(x = c(NA, "a.b", "a.d", "b.c")) df ``` ``` ## x ## 1 <NA> ## 2 a.b ## 3 a.d ## 4 b.c ``` ```r df %>% separate_wider_delim(x, delim=".", names = c("A", "B")) ``` ``` ## # A tibble: 4 × 2 ## A B ## <chr> <chr> ## 1 <NA> <NA> ## 2 a b ## 3 a d ## 4 b c ``` --- class: inverse # Your Turn (5 min) The Iowa Data Portal is a wealth of information on and about the State of Iowa. The website [Liquor Sales](https://data.iowa.gov/Sales-Distribution/Iowa-Liquor-Sales/m3tr-qhgy) provides data on every liquor order a licensed store in Iowa makes (to then presumably sell it to the public). The code below reads (part of) the data into an R session. ``` url <- "https://github.com/Stat579-at-ISU/materials/blob/master/03_tidyverse/data/Iowa_Liquor_Sales.csv.zip?raw=TRUE" download.file(url, "iowa.zip", mode="wb") iowa <- readr::read_csv("iowa.zip") ``` Assess the 'messiness' of the data. List issues that prevent us from working with the data directly. Which of these issues are of type (1) or (2) of messiness? --- class: inverse # Your Turn - Fast Fingers Run the following code to load the Iowa Liquor Sales into your working session ```r url <- "https://github.com/Stat579-at-ISU/materials/blob/master/03_tidyverse/data/Iowa_Liquor_Sales.csv.zip?raw=TRUE" download.file(url, "iowa.zip", mode="wb") iowa <- readr::read_csv("iowa.zip") ``` - Number of variables? number of observations? - How many different stores in Ames order liquor? - What is the time frame of the data? --- # Problems with the data - `Date` is text, in the format of Month/Day/Year (Messy 2) - Store location is a textual expression of form `POINT (`...`)` and geographic latitude and longitude. (Messy 2) no Messy 1? - problems of type Messy 1 are typically hard to detect and often up to interpretation/dependent on the analysis to be done. --- # ... finding other people's solutions - Working with dates: [package `lubridate`](https://lubridate.tidyverse.org/) ```r iowa <- iowa %>% mutate( proper_date = mdy(Date) ) iowa %>% select(proper_date) %>% summary() ``` ``` ## proper_date ## Min. :2012-01-03 ## 1st Qu.:2014-02-26 ## Median :2016-02-03 ## Mean :2016-01-07 ## 3rd Qu.:2017-12-06 ## Max. :2019-09-30 ``` ... but sometimes we still have to do things ourselves ... --- class: inverse # Your Turn (10 min) - Check the help for the function `parse_number` in the `readr` package and use it on the store location. What result do you get? - Use `separate_wider` with a delimiter of " " to separate `date` into strings that contain geographic latitude and longitude, then apply `parse_number` - For a challenge: `separate_wider_regex` allows to do the above step in one go ---