Pulling data from different sources

class: center, middle, inverse, title-slide

.title[
# Pulling data from different sources
]
.author[
### Heike Hofmann
]

---

class: middle, inverse, center
# Joining data from different tables

---

# Messy (4)

Messy (4): *Values for a single observational unit are stored across multiple tables.*

---

# Example: Lahman package

Sean Lahman is a database journalist, who started databases of historical sports statistics, in particular, the Lahman database on baseball.

```r
library(Lahman)
# ?Lahman
head(LahmanData)
```

```
##                  file      class   nobs nvar                     title
## 1         AllstarFull data.frame   5454    8         AllstarFull table
## 2         Appearances data.frame 110423   21         Appearances table
## 3      AwardsManagers data.frame    179    6      AwardsManagers table
## 4       AwardsPlayers data.frame   6531    6       AwardsPlayers table
## 5 AwardsShareManagers data.frame    425    7 AwardsShareManagers table
## 6  AwardsSharePlayers data.frame   6879    7  AwardsSharePlayers table
```

---

# Lahman database

The Lahman database consists of 27 data frames that are linked by `playerID`.

This is clean, but not very readable.

The `People` table includes names and other attributes for each player (and coaches, managers, ...)

*Joining* multiple tables helps us to bring together (pieces of) information from multiple sources.

---

# Example: Hall of Fame

Who was inducted in the Hall of Fame in 2022?

```r
HallOfFame %>% filter(yearID==2022) 
```

```
##     playerID yearID  votedBy ballots needed votes inducted          category
## 1  ortizda01   2022    BBWAA     394    296   307        Y            Player
## 2  bondsba01   2022    BBWAA     394    296   260        N            Player
## 3  clemero02   2022    BBWAA     394    296   257        N            Player
## 4  rolensc01   2022    BBWAA     394    296   249        N            Player
## 5  schilcu01   2022    BBWAA     394    296   231        N            Player
## 6  heltoto01   2022    BBWAA     394    296   205        N            Player
## 7  wagnebi02   2022    BBWAA     394    296   201        N            Player
## 8  jonesan01   2022    BBWAA     394    296   163        N            Player
## 9  sheffga01   2022    BBWAA     394    296   160        N            Player
## 10 rodrial01   2022    BBWAA     394    296   135        N            Player
## 11  kentje01   2022    BBWAA     394    296   129        N            Player
## 12 ramirma02   2022    BBWAA     394    296   114        N            Player
## 13 vizquom01   2022    BBWAA     394    296    94        N            Player
## 14  sosasa01   2022    BBWAA     394    296    73        N            Player
## 15 pettian01   2022    BBWAA     394    296    42        N            Player
## 16 rolliji01   2022    BBWAA     394    296    37        N            Player
## 17 abreubo01   2022    BBWAA     394    296    34        N            Player
## 18 buehrma01   2022    BBWAA     394    296    23        N            Player
## 19 hunteto01   2022    BBWAA     394    296    21        N            Player
## 20 nathajo01   2022    BBWAA     394    296    17        N            Player
## 21 hudsoti01   2022    BBWAA     394    296    12        N            Player
## 22 linceti01   2022    BBWAA     394    296     9        N            Player
## 23 howarry01   2022    BBWAA     394    296     8        N            Player
## 24 teixema01   2022    BBWAA     394    296     6        N            Player
## 25 papeljo01   2022    BBWAA     394    296     5        N            Player
## 26 morneju01   2022    BBWAA     394    296     5        N            Player
## 27 pierzaj01   2022    BBWAA     394    296     2        N            Player
## 28 fieldpr01   2022    BBWAA     394    296     2        N            Player
## 29 peavyja01   2022    BBWAA     394    296     0        N            Player
## 30 crawfca02   2022    BBWAA     394    296     0        N            Player
## 31 fowlebu99   2022 Veterans      NA     NA    NA        Y Pioneer/Executive
## 32 hodgegi01   2022 Veterans      NA     NA    NA        Y            Player
## 33  kaatji01   2022 Veterans      NA     NA    NA        Y            Player
## 34 minosmi01   2022 Veterans      NA     NA    NA        Y            Player
## 35 oneilbu01   2022 Veterans      NA     NA    NA        Y Pioneer/Executive
## 36 olivato01   2022 Veterans      NA     NA    NA        Y            Player
##    needed_note
## 1         <NA>
## 2         <NA>
## 3         <NA>
## 4         <NA>
## 5         <NA>
## 6         <NA>
## 7         <NA>
## 8         <NA>
## 9         <NA>
## 10        <NA>
## 11        <NA>
## 12        <NA>
## 13        <NA>
## 14        <NA>
## 15        <NA>
## 16        <NA>
## 17        <NA>
## 18        <NA>
## 19        <NA>
## 20        <NA>
## 21        <NA>
## 22        <NA>
## 23        <NA>
## 24        <NA>
## 25        <NA>
## 26        <NA>
## 27        <NA>
## 28        <NA>
## 29        <NA>
## 30        <NA>
## 31        <NA>
## 32        <NA>
## 33        <NA>
## 34        <NA>
## 35        <NA>
## 36        <NA>
```

---
class: inverse
# Your turn

For this your turn use the `HallOfFame` and `People` data from the `Lahman` package

- Identify all players who were inducted in the Hall of Fame in 2022, by filtering the `People` data for their player IDs.

---

# Joins - general idea

![](images/join.png)

---

# Joins - more specific idea

- Data sets are joined along values of variables.

- In `dplyr` there are several join functions: `left_join`, `inner_join`, `full_join`, ...

- Differences between join functions only visible, if not all values in one set have values in the other

---

```r
df1 <- data.frame(id = 1:6, trt = rep(c("A", "B", "C"), rep=c(2,1,3)), value = c(5,3,7,1,2,3))
df1
```

```
##   id trt value
## 1  1   A     5
## 2  2   B     3
## 3  3   C     7
## 4  4   A     1
## 5  5   B     2
## 6  6   C     3
```

```r
df2 <- data.frame(id=c(4,4,5,5,7,7), stress=rep(c(0,1), 3), bpm = c(65, 125, 74, 136, 48, 110))
df2
```

```
##   id stress bpm
## 1  4      0  65
## 2  4      1 125
## 3  5      0  74
## 4  5      1 136
## 5  7      0  48
## 6  7      1 110
```

---

# Left join

- all elements in the *left* data set are kept

- non-matches are filled in by `NA`

- `right_join` works symmetric

```r
left_join(df1, df2, by="id")
```

```
##   id trt value stress bpm
## 1  1   A     5     NA  NA
## 2  2   B     3     NA  NA
## 3  3   C     7     NA  NA
## 4  4   A     1      0  65
## 5  4   A     1      1 125
## 6  5   B     2      0  74
## 7  5   B     2      1 136
## 8  6   C     3     NA  NA
```

---

# Inner join

- only matches from both data sets are kept

```r
inner_join(df1, df2, by = "id")
```

```
##   id trt value stress bpm
## 1  4   A     1      0  65
## 2  4   A     1      1 125
## 3  5   B     2      0  74
## 4  5   B     2      1 136
```

---

# Full join

- all ids are kept, missings are filled in with `NA`

```r
full_join(df1, df2, by = "id")
```

```
##    id  trt value stress bpm
## 1   1    A     5     NA  NA
## 2   2    B     3     NA  NA
## 3   3    C     7     NA  NA
## 4   4    A     1      0  65
## 5   4    A     1      1 125
## 6   5    B     2      0  74
## 7   5    B     2      1 136
## 8   6    C     3     NA  NA
## 9   7 <NA>    NA      0  48
## 10  7 <NA>    NA      1 110
```

---
class:inverse
# Your turn (10 mins)

- Load the `Lahman` package into your R session.

- Join (relevant pieces of) the `People` data set and the `HallOfFame` data.

- Find all Hall of Famers who were alive as of 2022.

- How often was Sammy Sosa on the HoF ballot?

---

# Traps of joins

- sometimes we unexpectedly cannot match values: missing values, different spelling, ...

- join can be along multiple variables, e.g. `by = c("ID", "Date")`

- joining variable(s) can have different names, e.g. `by = c("State" = "Name")`

- always make sure to check dimensions of data before and after a join

- check on missing values; help with that: `anti_join`

---

# Anti join

- a neat function in `dplyr`

- careful, not symmetric!

```r
anti_join(df1, df2, by="id") # no values for id in df2
```

```
##   id trt value
## 1  1   A     5
## 2  2   B     3
## 3  3   C     7
## 4  6   C     3
```

```r
anti_join(df2, df1, by="id") # no values for id in df1
```

```
##   id stress bpm
## 1  7      0  48
## 2  7      1 110
```

---
class: inverse
# Your turn

For this example, use the data from the `classdata` package

- Load the `classdata` package into your R session.

- Investigate data sets `box` and `budget`.

- Join the two datasets to incorporate the release date of movies into the box office gross.

- Check on the dimensions of the data sets before and after the join. Where are potential problems?

- Use `anti_join` to detect problematic cases.

- Use `box` to show cumulative box office revenue over time. Extra point: label very successful movies in a readable manner.
- Identify box office hits and flops and visualize them.