Midterm Stat 579 Fall 2014

Do you ride a bike?

Bike sharing is the idea that you can rent a bike at one station and ride it to another. Users are charged by the amount of time that they take out a bike. The data set contains information about a bike sharing service in Washington DC (here is the data source: Capital Bikeshare, but you should not need any of the information for the exam).

Each row in the dataset consists of one rental/trip. You will be asked questions about this dataset. For all of your answers provide the R code necessary for a complete solution. Write your answers in form of an R markdown script (this script already contains the questions).

Question 1

Load the data at https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv into your R session. The dataset contains data for each trip. How many trips were there overall? How many factor variables are in the data, how many others (describe which ones)?

bikes <- read.csv("https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv")
nrow(bikes)

## [1] 77186

str(bikes)

## 'data.frame':    77186 obs. of  8 variables:
##  $ Duration       : int  420 600 1080 540 540 1320 600 1020 300 120 ...
##  $ Start.date     : chr  "6/7/2014 23:59" "6/7/2014 23:59" "6/7/2014 23:58" "6/7/2014 23:58" ...
##  $ wday           : chr  "Sat" "Sat" "Sat" "Sat" ...
##  $ hour           : int  23 23 23 23 23 23 23 23 23 23 ...
##  $ Start.Station  : chr  "10th & U St NW" "10th & Florida Ave NW" "21st & M St NW" "18th & M St NW" ...
##  $ End.Station    : chr  "11th & M St NW" "17th & Rhode Island Ave NW" "4th & M St SW" "11th & M St NW" ...
##  $ Subscriber.Type: chr  "Registered" "Registered" "Registered" "Registered" ...
##  $ Bike.          : chr  "W01133" "W20972" "W01289" "W20676" ...

# 77186 trips

Question 2

The variable Duration contains the length of the bike rental in minutes. How long was the longest rental (convert into days)? What other information is in the data on this trip? How many trips lasted more than one day?

max(bikes$Duration)/60/60/24

## [1] 8.976389

# almost nine days
bikes[which.max(bikes$Duration),]

##       Duration     Start.date wday hour                 Start.Station
## 42057   775560 6/4/2014 10:19  Wed   10 Metro Center / 12th & G St NW
##         End.Station Subscriber.Type  Bike.
## 42057 3rd & H St NW          Casual W21208

nrow(subset(bikes, Duration > 24*60*60))

## [1] 7

Question 3

Start.Station describes the start of each trip. From which station did most trips get started? How many trips? Is that the same station at which most trips ended (End.Station)? When a bike is not returned, the End.Station is marked as ““. How often do bikes not get returned? What is reported for the duration of those trips? Change the value of Duration to NA.

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

bikes %>% count(Start.Station) %>% slice_max(n, n=3)

##                          Start.Station    n
## 1 Massachusetts Ave & Dupont Circle NW 1595
## 2      Columbus Circle / Union Station 1587
## 3                     Lincoln Memorial 1541

bikes %>% count(End.Station) %>% slice_max(n, n=3)

##                            End.Station    n
## 1 Massachusetts Ave & Dupont Circle NW 1736
## 2      Columbus Circle / Union Station 1645
## 3                     Lincoln Memorial 1506

# same thing using base commands
sort(table(bikes$Start.Station), decreasing=TRUE)[1:3]

## 
## Massachusetts Ave & Dupont Circle NW      Columbus Circle / Union Station 
##                                 1595                                 1587 
##                     Lincoln Memorial 
##                                 1541

sort(table(bikes$End.Station), decreasing=TRUE)[1:3]

## 
## Massachusetts Ave & Dupont Circle NW      Columbus Circle / Union Station 
##                                 1736                                 1645 
##                     Lincoln Memorial 
##                                 1506

bikes %>% filter(End.Station =="") %>% nrow()

## [1] 1

bikes %>% filter(End.Station =="")

##   Duration     Start.date wday hour                  Start.Station End.Station
## 1      120 6/4/2014 18:47  Wed   18 Georgia & New Hampshire Ave NW            
##   Subscriber.Type Bike.
## 1      Registered

bikes <- bikes %>% mutate(
  Duration = ifelse(End.Station=="", NA, Duration)
)

Question 4

Plot barcharts of the number of trips on each day of the week, facet by Subscriber.Type. Make sure that the days of the week are in the usual order (start with Mondays). Describe any patterns you see in one to two sentences. Introduce a new variable ‘weekend’ into the data set that is TRUE for all Saturdays and Sundays and FALSE for all other days of the week. Draw the same plot as before but substitute ‘wday’ by ‘weekend’. Describe the pattern.

bikes$wday <- factor(bikes$wday, levels=c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
library(tidyverse)
bikes %>% 
  ggplot(aes(x = wday)) + 
  geom_bar() +
  facet_grid(~Subscriber.Type)

# more trips by registered users than casual; casual use more on weekends, registered use more on weekdays
bikes <- bikes %>% mutate(
  weekend = wday %in% c("Sat", "Sun")
)

# same thing as above
# wday bikes$weekend <- FALSE
# bikes$weekend[bikes$wday %in% c("Sat", "Sun")] <- TRUE

bikes %>% 
  ggplot(aes(x = weekend)) + 
  geom_bar() + 
  facet_grid(~Subscriber.Type)

# casual users have almost the same number of trips on weekends as on weekdays, even though the weekend is just two days.

Question 5

The first ten minutes of each trip are free. What is the percentage of free trips by Subscriber.Type? Calculate precisely, then visualize.

bikes$free <- bikes$Duration < 600
sum(subset(bikes, Subscriber.Type=="Casual")$free, na.rm=T)/nrow(subset(bikes, Subscriber.Type=="Casual"))*100

## [1] 11.2961

sum(subset(bikes, Subscriber.Type=="Registered")$free, na.rm=T)/nrow(subset(bikes, Subscriber.Type=="Registered"))*100

## [1] 48.44112

bikes %>% 
  ggplot(aes(x = Subscriber.Type, fill=free)) +
  geom_bar()

bikes %>% 
  ggplot(aes(x = Subscriber.Type, fill=free)) +
  geom_bar(position="fill")

Question 6

Using methods from the dplyr package come up with summaries of the bike trip data for the two types of subscribers: Over the course of the day, calculate - how many trips are done,
- average length of trips (in minutes), - standard deviation of the trip duration,

Draw a plot showing at least four of these variables. Describe patterns in the plot in one to two sentences.

library(dplyr)
bike.summary <- bikes %>%
  group_by(Subscriber.Type, hour) %>%
  summarize(
    n=n(),
    length=mean(Duration/60, na.rm=T),
    sdlength=sd(Duration/60, na.rm=T)
  )

## `summarise()` has grouped output by 'Subscriber.Type'. You can override using
## the `.groups` argument.

bike.summary %>% 
  ggplot(aes(x = length, y = sdlength, size = n)) +
  geom_point() + facet_grid(~Subscriber.Type)

# subscribers have lots of short trips (almost all < 20 mins), casual renters have trips between 20 mins and 60 mins on average. With increasing average length the variability increases.

Question 7

Write a function moment (x,k, na.rm=F) to calculate the (central) kth moment \(m_k\) of sample x as given in \(m_k = 1/n \cdot \sum_i (x_i - \mu)^k\), where \(\mu\) is the sample mean of numeric vector x and \(n\) is its length.

Make sure that the function deals with the parameter na.rm appropriately.

Use your function in the following framework: Construct a dataset for all combinations of Start.Station and End.Station get the following statistics: 1) the number of rentals/trips 2) the average duration of a trip 3) the median duration of the trips 3) the second moment of the durations 4) the third moment of the durations

Skewness \(\gamma\) of a distribution is defined as the ratio of the third moment divided by the second moment raised to the power of 1.5, i.e. \[ \gamma = \frac{m_3}{m_2^1.5}. \] For between station summaries that are based on at least 50 trips, plot the difference in median and mean (on the horizontal axis) and skewness (vertical axis) in a scatterplot. Describe the pattern.

# your code goes here
moment <- function (x, k, na.rm=FALSE) {
  if (na.rm==T) {
    x <- na.omit(x)
  }
  n <- length(x)
  mu <- mean(x)
  res <- sum( 1/n*(x - mu)^k )
  
  return(res)
}
library(dplyr)

between.stations <- bikes %>%
  group_by(Start.Station, End.Station) %>%
  summarize(
  n = length(Duration),
  mean=mean(Duration, na.rm=T),
  median=median(Duration, na.rm=T),
  second=moment(Duration, 2, na.rm=T),
  third=moment(Duration, 3, na.rm=T)
  )

## `summarise()` has grouped output by 'Start.Station'. You can override using the
## `.groups` argument.

between.stations %>% 
  filter( n > 50) %>% 
  ggplot(aes(x = median-mean, y = third/second^(2/3))) +
  geom_point()

Replace this text by your answer.