Bike sharing is the idea that you can rent a bike at one station and ride it to another. Users are charged by the amount of time that they take out a bike. The data set contains information about a bike sharing service in Washington DC (here is the data source: Capital Bikeshare, but you should not need any of the information for the exam).
Each row in the dataset consists of one rental/trip. You will be asked questions about this dataset. For all of your answers provide the R code necessary for a complete solution. Write your answers in form of an R markdown script (this script already contains the questions).
Load the data at https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv into your R session. The dataset contains data for each trip. How many trips were there overall? How many factor variables are in the data, how many others (describe which ones)?
bikes <- read.csv("https://raw.githubusercontent.com/Stat579-at-ISU/stat579-at-isu.github.io/master/exams/data/bikesharing/bikes.csv")
nrow(bikes)
## [1] 77186
str(bikes)
## 'data.frame': 77186 obs. of 8 variables:
## $ Duration : int 420 600 1080 540 540 1320 600 1020 300 120 ...
## $ Start.date : chr "6/7/2014 23:59" "6/7/2014 23:59" "6/7/2014 23:58" "6/7/2014 23:58" ...
## $ wday : chr "Sat" "Sat" "Sat" "Sat" ...
## $ hour : int 23 23 23 23 23 23 23 23 23 23 ...
## $ Start.Station : chr "10th & U St NW" "10th & Florida Ave NW" "21st & M St NW" "18th & M St NW" ...
## $ End.Station : chr "11th & M St NW" "17th & Rhode Island Ave NW" "4th & M St SW" "11th & M St NW" ...
## $ Subscriber.Type: chr "Registered" "Registered" "Registered" "Registered" ...
## $ Bike. : chr "W01133" "W20972" "W01289" "W20676" ...
# 77186 trips
The variable Duration contains the length of the bike rental in minutes. How long was the longest rental (convert into days)? What other information is in the data on this trip? How many trips lasted more than one day?
max(bikes$Duration)/60/60/24
## [1] 8.976389
# almost nine days
bikes[which.max(bikes$Duration),]
## Duration Start.date wday hour Start.Station
## 42057 775560 6/4/2014 10:19 Wed 10 Metro Center / 12th & G St NW
## End.Station Subscriber.Type Bike.
## 42057 3rd & H St NW Casual W21208
nrow(subset(bikes, Duration > 24*60*60))
## [1] 7
Start.Station describes the start of each trip. From which station did most trips get started? How many trips? Is that the same station at which most trips ended (End.Station)? When a bike is not returned, the End.Station is marked as ““. How often do bikes not get returned? What is reported for the duration of those trips? Change the value of Duration to NA.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
bikes %>% count(Start.Station) %>% slice_max(n, n=3)
## Start.Station n
## 1 Massachusetts Ave & Dupont Circle NW 1595
## 2 Columbus Circle / Union Station 1587
## 3 Lincoln Memorial 1541
bikes %>% count(End.Station) %>% slice_max(n, n=3)
## End.Station n
## 1 Massachusetts Ave & Dupont Circle NW 1736
## 2 Columbus Circle / Union Station 1645
## 3 Lincoln Memorial 1506
# same thing using base commands
sort(table(bikes$Start.Station), decreasing=TRUE)[1:3]
##
## Massachusetts Ave & Dupont Circle NW Columbus Circle / Union Station
## 1595 1587
## Lincoln Memorial
## 1541
sort(table(bikes$End.Station), decreasing=TRUE)[1:3]
##
## Massachusetts Ave & Dupont Circle NW Columbus Circle / Union Station
## 1736 1645
## Lincoln Memorial
## 1506
bikes %>% filter(End.Station =="") %>% nrow()
## [1] 1
bikes %>% filter(End.Station =="")
## Duration Start.date wday hour Start.Station End.Station
## 1 120 6/4/2014 18:47 Wed 18 Georgia & New Hampshire Ave NW
## Subscriber.Type Bike.
## 1 Registered
bikes <- bikes %>% mutate(
Duration = ifelse(End.Station=="", NA, Duration)
)
Plot barcharts of the number of trips on each day of the week, facet by Subscriber.Type. Make sure that the days of the week are in the usual order (start with Mondays). Describe any patterns you see in one to two sentences. Introduce a new variable ‘weekend’ into the data set that is TRUE for all Saturdays and Sundays and FALSE for all other days of the week. Draw the same plot as before but substitute ‘wday’ by ‘weekend’. Describe the pattern.
bikes$wday <- factor(bikes$wday, levels=c("Mon", "Tues", "Wed", "Thurs", "Fri", "Sat", "Sun"))
library(tidyverse)
bikes %>%
ggplot(aes(x = wday)) +
geom_bar() +
facet_grid(~Subscriber.Type)
# more trips by registered users than casual; casual use more on weekends, registered use more on weekdays
bikes <- bikes %>% mutate(
weekend = wday %in% c("Sat", "Sun")
)
# same thing as above
# wday bikes$weekend <- FALSE
# bikes$weekend[bikes$wday %in% c("Sat", "Sun")] <- TRUE
bikes %>%
ggplot(aes(x = weekend)) +
geom_bar() +
facet_grid(~Subscriber.Type)
# casual users have almost the same number of trips on weekends as on weekdays, even though the weekend is just two days.
The first ten minutes of each trip are free. What is the percentage of free trips by Subscriber.Type? Calculate precisely, then visualize.
bikes$free <- bikes$Duration < 600
sum(subset(bikes, Subscriber.Type=="Casual")$free, na.rm=T)/nrow(subset(bikes, Subscriber.Type=="Casual"))*100
## [1] 11.2961
sum(subset(bikes, Subscriber.Type=="Registered")$free, na.rm=T)/nrow(subset(bikes, Subscriber.Type=="Registered"))*100
## [1] 48.44112
bikes %>%
ggplot(aes(x = Subscriber.Type, fill=free)) +
geom_bar()
bikes %>%
ggplot(aes(x = Subscriber.Type, fill=free)) +
geom_bar(position="fill")
Using methods from the dplyr package come up with summaries of the
bike trip data for the two types of subscribers: Over the course of the
day, calculate - how many trips are done,
- average length of trips (in minutes), - standard deviation of the trip
duration,
Draw a plot showing at least four of these variables. Describe patterns in the plot in one to two sentences.
library(dplyr)
bike.summary <- bikes %>%
group_by(Subscriber.Type, hour) %>%
summarize(
n=n(),
length=mean(Duration/60, na.rm=T),
sdlength=sd(Duration/60, na.rm=T)
)
## `summarise()` has grouped output by 'Subscriber.Type'. You can override using
## the `.groups` argument.
bike.summary %>%
ggplot(aes(x = length, y = sdlength, size = n)) +
geom_point() + facet_grid(~Subscriber.Type)
# subscribers have lots of short trips (almost all < 20 mins), casual renters have trips between 20 mins and 60 mins on average. With increasing average length the variability increases.
Write a function moment (x,k, na.rm=F)
to calculate the
(central) kth moment \(m_k\) of sample
x as given in \(m_k = 1/n \cdot \sum_i (x_i -
\mu)^k\), where \(\mu\) is the
sample mean of numeric vector x and \(n\) is its length.
Make sure that the function deals with the parameter
na.rm
appropriately.
Use your function in the following framework: Construct a dataset for
all combinations of Start.Station
and
End.Station
get the following statistics: 1) the number of
rentals/trips 2) the average duration of a trip 3) the median duration
of the trips 3) the second moment of the durations 4) the third moment
of the durations
Skewness \(\gamma\) of a distribution is defined as the ratio of the third moment divided by the second moment raised to the power of 1.5, i.e. \[ \gamma = \frac{m_3}{m_2^1.5}. \] For between station summaries that are based on at least 50 trips, plot the difference in median and mean (on the horizontal axis) and skewness (vertical axis) in a scatterplot. Describe the pattern.
# your code goes here
moment <- function (x, k, na.rm=FALSE) {
if (na.rm==T) {
x <- na.omit(x)
}
n <- length(x)
mu <- mean(x)
res <- sum( 1/n*(x - mu)^k )
return(res)
}
library(dplyr)
between.stations <- bikes %>%
group_by(Start.Station, End.Station) %>%
summarize(
n = length(Duration),
mean=mean(Duration, na.rm=T),
median=median(Duration, na.rm=T),
second=moment(Duration, 2, na.rm=T),
third=moment(Duration, 3, na.rm=T)
)
## `summarise()` has grouped output by 'Start.Station'. You can override using the
## `.groups` argument.
between.stations %>%
filter( n > 50) %>%
ggplot(aes(x = median-mean, y = third/second^(2/3))) +
geom_point()
Replace this text by your answer.