Do you ride a bike?
---
Bike sharing is the idea that you can rent a bike at one station and ride it to another. Users are charged according to how long they keep a bike out.
The data set contains information about a bike sharing service in Washington DC (the data source is Capital Bikeshare, but you should not need any of that information for the exam).
Each row in the dataset consists of one rental/trip.
You will be asked questions about this dataset. For all of your answers, provide the R code necessary for a complete solution. Write your answers in the form of an R Markdown script (this script already contains the questions).
Question 1
---
Load the data at https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv into your R session. The dataset contains data for each trip. How many trips were there overall? How many factor variables are in the data, how many others (describe which ones)?
```{r cache=TRUE}
bikes <- read.csv("https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv")
nrow(bikes)
str(bikes)
# 77186 trips
```
Question 2
---
The variable Duration contains the length of the bike rental in seconds. How long was the longest rental (convert into days)? What other information is in the data on this trip? How many trips lasted more than one day?
```{r}
max(bikes$Duration)/60/60/24
# almost nine days
bikes[which.max(bikes$Duration),]
nrow(subset(bikes, Duration > 24*60*60))
```
Question 3
---
Start.Station describes the start of each trip. From which station did most trips get started? How many trips?
Is that the same station at which most trips ended (End.Station)?
When a bike is not returned, the End.Station is marked as "". How often do bikes not get returned? What is reported for the duration of those trips? Change the value of Duration to NA.
```{r}
library(dplyr)
bikes %>% count(Start.Station) %>% slice_max(n, n=3)
bikes %>% count(End.Station) %>% slice_max(n, n=3)
# same thing using base commands
sort(table(bikes$Start.Station), decreasing=TRUE)[1:3]
sort(table(bikes$End.Station), decreasing=TRUE)[1:3]
bikes %>% filter(End.Station =="") %>% nrow()
bikes %>% filter(End.Station =="")
bikes <- bikes %>% mutate(
Duration = ifelse(End.Station=="", NA, Duration)
)
```
Question 4
---
Plot barcharts of the number of trips on each day of the week, facet by Subscriber.Type. Make sure that the days of the week are in the usual order (start with Mondays). Describe any patterns you see in one to two sentences.
Introduce a new variable 'weekend' into the data set that is TRUE for all Saturdays and Sundays and FALSE for all other days of the week. Draw the same plot as before but substitute 'wday' by 'weekend'. Describe the pattern.
```{r message = FALSE}
bikes$wday <- factor(bikes$wday, levels=c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
library(tidyverse)
bikes %>%
ggplot(aes(x = wday)) +
geom_bar() +
facet_grid(~Subscriber.Type)
# more trips by registered users than casual; casual use more on weekends, registered use more on weekdays
bikes <- bikes %>% mutate(
weekend = wday %in% c("Sat", "Sun")
)
# same thing as above
# bikes$weekend <- FALSE
# bikes$weekend[bikes$wday %in% c("Sat", "Sun")] <- TRUE
bikes %>%
ggplot(aes(x = weekend)) +
geom_bar() +
facet_grid(~Subscriber.Type)
# casual users have almost the same number of trips on weekends as on weekdays, even though the weekend is just two days.
```
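The weekday ordering above depends on the labels in `levels` matching the labels in the data exactly. The following is a minimal sketch on a made-up vector (the names `days`, `day_levels`, `days_f`, and `mismatch` are hypothetical and not part of the bike data) showing how the level order controls the order of the counts, and what happens when a label does not match.

```{r}
# Toy weekday labels, invented for illustration (not taken from bikes.csv)
days <- c("Sun", "Mon", "Wed", "Mon", "Sat")
day_levels <- c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun")

# The order of `levels` determines the order of bars and table entries
days_f <- factor(days, levels = day_levels)
table(days_f)

# A label that is not among the levels silently becomes NA
mismatch <- factor(c("Mon", "Tue"), levels = day_levels)
sum(is.na(mismatch))
```

If the second result were non-zero on the real data, the level labels would need to be adjusted to the spelling used in `wday`.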
Question 5
---
The first ten minutes of each trip are free. What is the percentage of free trips by Subscriber.Type? Calculate precisely, then visualize.
```{r}
bikes$free <- bikes$Duration < 600
# percentage of free trips among all trips of each subscriber type
sum(subset(bikes, Subscriber.Type=="Casual")$free, na.rm=TRUE)/nrow(subset(bikes, Subscriber.Type=="Casual"))*100
sum(subset(bikes, Subscriber.Type=="Registered")$free, na.rm=TRUE)/nrow(subset(bikes, Subscriber.Type=="Registered"))*100
bikes %>%
ggplot(aes(x = Subscriber.Type, fill=free)) +
geom_bar()
bikes %>%
ggplot(aes(x = Subscriber.Type, fill=free)) +
geom_bar(position="fill")
```
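The percentage calculation above divides by the number of all trips, including those whose Duration is missing. As a small sketch of how the choice of denominator matters, the following toy example (the data frame `toy` and the names `pct_all` and `pct_known` are made up for illustration) computes the percentage both ways with base R.

```{r}
# Toy data, invented for illustration: one casual trip has no duration
toy <- data.frame(
  Subscriber.Type = c("Casual", "Casual", "Casual", "Registered", "Registered"),
  Duration = c(300, 1200, NA, 500, 550)
)
toy$free <- toy$Duration < 600

# Denominator: all trips of the group (missing durations count as not free)
pct_all <- tapply(toy$free, toy$Subscriber.Type,
                  function(f) sum(f, na.rm = TRUE) / length(f) * 100)

# Denominator: only trips with a known duration
pct_known <- tapply(toy$free, toy$Subscriber.Type, mean, na.rm = TRUE) * 100

pct_all
pct_known
```

Which denominator is appropriate depends on how unreturned bikes should be treated; either choice should be stated with the result.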
Question 6
---
Using methods from the dplyr package come up with summaries of the bike trip data for the two types of subscribers:
Over the course of the day, calculate
- how many trips are done,
- average length of trips (in minutes),
- standard deviation of the trip duration.
Draw a plot showing at least four of these variables. Describe patterns in the plot in one to two sentences.
```{r}
library(dplyr)
bike.summary <- bikes %>%
group_by(Subscriber.Type, hour) %>%
summarize(
n=n(),
length=mean(Duration/60, na.rm=TRUE),
sdlength=sd(Duration/60, na.rm=TRUE)
)
bike.summary %>%
ggplot(aes(x = length, y = sdlength, size = n)) +
geom_point() + facet_grid(~Subscriber.Type)
# registered users have lots of short trips (almost all < 20 mins), casual users have trips between 20 mins and 60 mins on average. With increasing average length the variability increases.
```
Question 7
---
Write a function ```moment(x, k, na.rm=FALSE)``` to calculate the central $k$th moment $m_k$ of a sample x, defined as
$m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu)^k$,
where $\mu$ is the sample mean of the numeric vector x and $n$ is its length.
Make sure that the function handles the parameter ```na.rm``` appropriately.
Use your function in the following framework:
For all combinations of ```Start.Station``` and ```End.Station```, construct a dataset with the following statistics:
1) the number of rentals/trips
2) the average duration of a trip
3) the median duration of the trips
4) the second moment of the durations
5) the third moment of the durations
Skewness $\gamma$ of a distribution is defined as the ratio of the third moment divided by the second moment raised to the power of 1.5, i.e.
\[
\gamma = \frac{m_3}{m_2^{1.5}}.
\]
For between-station summaries that are based on at least 50 trips, plot the difference between median and mean (on the horizontal axis) and skewness (on the vertical axis) in a scatterplot. Describe the pattern.
```{r}
# your code goes here
moment <- function (x, k, na.rm=FALSE) {
if (na.rm) {
x <- na.omit(x)
}
n <- length(x)
mu <- mean(x)
res <- sum( 1/n*(x - mu)^k )
return(res)
}
library(dplyr)
between.stations <- bikes %>%
group_by(Start.Station, End.Station) %>%
summarize(
n = length(Duration),
mean=mean(Duration, na.rm=TRUE),
median=median(Duration, na.rm=TRUE),
second=moment(Duration, 2, na.rm=TRUE),
third=moment(Duration, 3, na.rm=TRUE)
)
between.stations %>%
filter( n >= 50) %>%
# skewness = third moment / (second moment)^1.5
ggplot(aes(x = median-mean, y = third/second^1.5)) +
geom_point()
```
Replace this text by your answer.
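One way to gain confidence in a moment function is to check it against quantities that are known independently. The sketch below uses a stand-alone copy under the hypothetical name `central_moment` and a made-up vector `x`, so that the chunk runs by itself; it relies on the fact that the second central moment equals `var(x)` rescaled by $(n-1)/n$.

```{r}
# Stand-alone copy under a hypothetical name, so this chunk is self-contained
central_moment <- function(x, k, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  mean((x - mean(x))^k)   # same as sum((x - mu)^k) / n
}

# Toy sample, invented for illustration, with one missing value
x <- c(1, 2, 3, 4, 10, NA)
n <- sum(!is.na(x))

m2 <- central_moment(x, 2, na.rm = TRUE)
m3 <- central_moment(x, 3, na.rm = TRUE)

# var() divides by n - 1, the second central moment divides by n
all.equal(m2, var(x, na.rm = TRUE) * (n - 1) / n)

# Skewness as defined above: third moment over second moment to the power 1.5
skew <- m3 / m2^1.5
skew

# Without na.rm = TRUE the missing value propagates
central_moment(x, 2)
```

For this right-skewed toy sample the mean exceeds the median and the skewness is positive, which is the kind of relationship the scatterplot in this question is meant to reveal.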