class: center, middle, inverse, title-slide .title[ # Pulling data from different sources ] .author[ ### Heike Hofmann ] --- class: middle, inverse, center # Joining data from different tables --- # Messy (4) Messy (4): *Values for a single observational unit are stored across multiple tables.* <!--After data are normalized by splitting, we want to de-normalize again by *joining* datasets.--> --- # Example: Lahman package Sean Lahman is a database journalist, who started databases of historical sports statistics, in particular, the Lahman database on baseball. ```r library(Lahman) # ?Lahman head(LahmanData) ``` ``` ## file class nobs nvar title ## 1 AllstarFull data.frame 5454 8 AllstarFull table ## 2 Appearances data.frame 110423 21 Appearances table ## 3 AwardsManagers data.frame 179 6 AwardsManagers table ## 4 AwardsPlayers data.frame 6531 6 AwardsPlayers table ## 5 AwardsShareManagers data.frame 425 7 AwardsShareManagers table ## 6 AwardsSharePlayers data.frame 6879 7 AwardsSharePlayers table ``` --- # Lahman database The Lahman database consists of 27 data frames that are linked by `playerID`. This is clean, but not very readable. The `People` table includes names and other attributes for each player (and coaches, managers, ...) *Joining* multiple tables helps us to bring together (pieces of) information from multiple sources. --- # Example: Hall of Fame Who was inducted in the Hall of Fame in 2022? ```r HallOfFame %>% filter(yearID==2022) ``` ``` ## playerID yearID votedBy ballots needed votes inducted category ## 1 ortizda01 2022 BBWAA 394 296 307 Y Player ## 2 bondsba01 2022 BBWAA 394 296 260 N Player ## 3 clemero02 2022 BBWAA 394 296 257 N Player ## 4 rolensc01 2022 BBWAA 394 296 249 N Player ## 5 schilcu01 2022 BBWAA 394 296 231 N Player ## 6 heltoto01 2022 BBWAA 394 296 205 N Player ## 7 wagnebi02 2022 BBWAA 394 296 201 N Player ## 8 jonesan01 2022 BBWAA 394 296 163 N Player ## 9 sheffga01 2022 BBWAA 394 296 160 N Player ## 10 rodrial01 2022 BBWAA 394 296 135 N Player ## 11 kentje01 2022 BBWAA 394 296 129 N Player ## 12 ramirma02 2022 BBWAA 394 296 114 N Player ## 13 vizquom01 2022 BBWAA 394 296 94 N Player ## 14 sosasa01 2022 BBWAA 394 296 73 N Player ## 15 pettian01 2022 BBWAA 394 296 42 N Player ## 16 rolliji01 2022 BBWAA 394 296 37 N Player ## 17 abreubo01 2022 BBWAA 394 296 34 N Player ## 18 buehrma01 2022 BBWAA 394 296 23 N Player ## 19 hunteto01 2022 BBWAA 394 296 21 N Player ## 20 nathajo01 2022 BBWAA 394 296 17 N Player ## 21 hudsoti01 2022 BBWAA 394 296 12 N Player ## 22 linceti01 2022 BBWAA 394 296 9 N Player ## 23 howarry01 2022 BBWAA 394 296 8 N Player ## 24 teixema01 2022 BBWAA 394 296 6 N Player ## 25 papeljo01 2022 BBWAA 394 296 5 N Player ## 26 morneju01 2022 BBWAA 394 296 5 N Player ## 27 pierzaj01 2022 BBWAA 394 296 2 N Player ## 28 fieldpr01 2022 BBWAA 394 296 2 N Player ## 29 peavyja01 2022 BBWAA 394 296 0 N Player ## 30 crawfca02 2022 BBWAA 394 296 0 N Player ## 31 fowlebu99 2022 Veterans NA NA NA Y Pioneer/Executive ## 32 hodgegi01 2022 Veterans NA NA NA Y Player ## 33 kaatji01 2022 Veterans NA NA NA Y Player ## 34 minosmi01 2022 Veterans NA NA NA Y Player ## 35 oneilbu01 2022 Veterans NA NA NA Y Pioneer/Executive ## 36 olivato01 2022 Veterans NA NA NA Y Player ## needed_note ## 1 <NA> ## 2 <NA> ## 3 <NA> ## 4 <NA> ## 5 <NA> ## 6 <NA> ## 7 <NA> ## 8 <NA> ## 9 <NA> ## 10 <NA> ## 11 <NA> ## 12 <NA> ## 13 <NA> ## 14 <NA> ## 15 <NA> ## 16 <NA> ## 17 <NA> ## 18 <NA> ## 19 <NA> ## 20 <NA> ## 21 <NA> ## 22 <NA> ## 23 <NA> ## 24 <NA> ## 25 <NA> ## 26 <NA> ## 27 <NA> ## 28 <NA> ## 29 <NA> ## 30 <NA> ## 31 <NA> ## 32 <NA> ## 33 <NA> ## 34 <NA> ## 35 <NA> ## 36 <NA> ``` --- class: inverse # Your turn For this your turn use the `HallOfFame` and `People` data from the `Lahman` package - Identify all players who were inducted in the Hall of Fame in 2022, by filtering the `People` data for their player IDs. --- # Joins - general idea ![](images/join.png) --- # Joins - more specific idea - Data sets are joined along values of variables. - In `dplyr` there are several join functions: `left_join`, `inner_join`, `full_join`, ... - Differences between join functions only visible, if not all values in one set have values in the other --- ```r df1 <- data.frame(id = 1:6, trt = rep(c("A", "B", "C"), rep=c(2,1,3)), value = c(5,3,7,1,2,3)) df1 ``` ``` ## id trt value ## 1 1 A 5 ## 2 2 B 3 ## 3 3 C 7 ## 4 4 A 1 ## 5 5 B 2 ## 6 6 C 3 ``` ```r df2 <- data.frame(id=c(4,4,5,5,7,7), stress=rep(c(0,1), 3), bpm = c(65, 125, 74, 136, 48, 110)) df2 ``` ``` ## id stress bpm ## 1 4 0 65 ## 2 4 1 125 ## 3 5 0 74 ## 4 5 1 136 ## 5 7 0 48 ## 6 7 1 110 ``` --- # Left join - all elements in the *left* data set are kept - non-matches are filled in by `NA` - `right_join` works symmetric ```r left_join(df1, df2, by="id") ``` ``` ## id trt value stress bpm ## 1 1 A 5 NA NA ## 2 2 B 3 NA NA ## 3 3 C 7 NA NA ## 4 4 A 1 0 65 ## 5 4 A 1 1 125 ## 6 5 B 2 0 74 ## 7 5 B 2 1 136 ## 8 6 C 3 NA NA ``` --- # Inner join - only matches from both data sets are kept ```r inner_join(df1, df2, by = "id") ``` ``` ## id trt value stress bpm ## 1 4 A 1 0 65 ## 2 4 A 1 1 125 ## 3 5 B 2 0 74 ## 4 5 B 2 1 136 ``` --- # Full join - all ids are kept, missings are filled in with `NA` ```r full_join(df1, df2, by = "id") ``` ``` ## id trt value stress bpm ## 1 1 A 5 NA NA ## 2 2 B 3 NA NA ## 3 3 C 7 NA NA ## 4 4 A 1 0 65 ## 5 4 A 1 1 125 ## 6 5 B 2 0 74 ## 7 5 B 2 1 136 ## 8 6 C 3 NA NA ## 9 7 <NA> NA 0 48 ## 10 7 <NA> NA 1 110 ``` --- class:inverse # Your turn (10 mins) - Load the `Lahman` package into your R session. - Join (relevant pieces of) the `People` data set and the `HallOfFame` data. - Find all Hall of Famers who were alive as of 2022. - How often was Sammy Sosa on the HoF ballot? --- # Traps of joins - sometimes we unexpectedly cannot match values: missing values, different spelling, ... - join can be along multiple variables, e.g. `by = c("ID", "Date")` - joining variable(s) can have different names, e.g. `by = c("State" = "Name")` - always make sure to check dimensions of data before and after a join - check on missing values; help with that: `anti_join` --- # Anti join - a neat function in `dplyr` - careful, not symmetric! ```r anti_join(df1, df2, by="id") # no values for id in df2 ``` ``` ## id trt value ## 1 1 A 5 ## 2 2 B 3 ## 3 3 C 7 ## 4 6 C 3 ``` ```r anti_join(df2, df1, by="id") # no values for id in df1 ``` ``` ## id stress bpm ## 1 7 0 48 ## 2 7 1 110 ``` --- class: inverse # Your turn For this example, use the data from the `classdata` package - Load the `classdata` package into your R session. - Investigate data sets `box` and `budget`. - Join the two datasets to incorporate the release date of movies into the box office gross. - Check on the dimensions of the data sets before and after the join. Where are potential problems? - Use `anti_join` to detect problematic cases. - Use `box` to show cumulative box office revenue over time. Extra point: label very successful movies in a readable manner. - Identify box office hits and flops and visualize them.