Hints

Hint 1

This is a hint for Jawad Bakhteiary (Bay Wheels, OSTA). A similar scheme can also be used in other projects.

Here is a skeleton of the code for producing the OSTA distributions for your bike-sharing service for January:

> wdir <- "<path>/EDA/projects/Bakhteiary/P2"
> setwd(wdir)
>
> # half-hour slot (1..48) of a time stamp of the form "date HH:MM:SS"
> halfHours <- function(DT){ t <- unlist(strsplit(strsplit(DT," ")[[1]][2],":"))
+   return(1+2*as.integer(t[1])+(as.integer(t[2])>29)) }

> T <- read.csv("201901-fordgobike-tripdata.csv",header=TRUE,sep=",")
> # dim(T)
> # names(T)
> # T[1:3,]
> # table(T$user_type)
> # for(i in 1:10) cat(i,T$start_time[i],halfHours(T$start_time[i]),"\n")
> S <- as.vector(sapply(T$start_time,halfHours))
> U <- T$user_type=="Customer"
> D <- as.integer(T$start_station_id)
> D[is.na(D)] <- 400
> F <- matrix(0,nrow=400,ncol=96)
> for(i in 1:nrow(T)) {st <- D[i]; hu <- 48*U[i]+S[i]; F[st,hu] <- F[st,hu]+1}

> R <- rowSums(F)
> which(R==max(R))
[1] 58
> col <- c(rep("red",48),rep("blue",48))
> plot(1:96,F[58,],type="h",lwd=4,col=col,main="Station 58 subscribers/customers departures")
> plot(1:48,F[58,49:96],type="h",lwd=4,col=rep("blue",48),main="Station 58 customers departures")

I assembled all OSTA distributions in the matrix F: rows = stations, columns = half-hours (1:48 for subscribers, 49:96 for customers). In the January data the largest station ID is 385, so I set the maximum number of stations to 399. Some trips have NULL as the start station ID; I assigned them to the extra station 400.
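
The same matrix can also be built without the explicit loop. The following is only a sketch on a made-up toy data frame (the four trips are invented for illustration), assuming that the time stamps have the fixed layout "YYYY-MM-DD HH:MM:SS" and that missing station IDs appear as the text "NULL". The names trips, slot, station, hu and Fv are hypothetical.

```r
# Toy stand-in for the trip file; the values are invented for illustration
trips <- data.frame(
  start_time = c("2019-01-31 17:57:44.6130", "2019-01-31 08:12:03.0000",
                 "2019-01-31 08:45:10.0000", "2019-01-31 23:59:59.0000"),
  start_station_id = c("58", "58", "NULL", "3"),
  user_type = c("Subscriber", "Customer", "Subscriber", "Customer"),
  stringsAsFactors = FALSE
)

# Half-hour slot 1..48, computed for all trips at once
# (assumes hours at characters 12-13 and minutes at 15-16)
hh <- as.integer(substr(trips$start_time, 12, 13))
mm <- as.integer(substr(trips$start_time, 15, 16))
slot <- 1L + 2L * hh + (mm > 29)

# Station ID; anything that is not a number (e.g. "NULL") goes to row 400
station <- suppressWarnings(as.integer(as.character(trips$start_station_id)))
station[is.na(station)] <- 400L

# Column index: 1:48 subscribers, 49:96 customers
hu <- 48L * (trips$user_type == "Customer") + slot

# Cross-tabulate stations by columns; fixed factor levels keep the 400 x 96 shape
counts <- table(factor(station, levels = 1:400), factor(hu, levels = 1:96))
Fv <- matrix(as.integer(counts), nrow = 400, ncol = 96)
```

On the real data, Fv should agree with the matrix F produced by the loop, provided the two assumptions above hold for the file.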

Still to do:

Hint 2

Matrix F contains a large number of units (distributions), too many to inspect one by one. Try to identify some interesting units. Unit 58 in the above skeleton is the most active one (maximum row sum). Another option is to cluster the units into a small number (5 to 10) of clusters. Because the activities of units can differ greatly, cluster them by their “shapes”, that is, by the corresponding probability distributions (the frequency distribution of a unit divided by its activity). For each cluster, compute its joint frequency distribution (= the sum of the distributions of the units belonging to the cluster) and display it.
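
One possible way to do this is sketched below on a synthetic matrix (random counts, not the real F). The choice of Euclidean distance, Ward's method and k = 5 is an assumption made for illustration; other dissimilarities and clustering methods may suit the data better. The names Fs, act, P, hc, cl and J are hypothetical.

```r
set.seed(1)
# Synthetic stand-in for F: 30 units x 96 half-hour columns (made-up counts)
Fs <- matrix(rpois(30 * 96, lambda = 5), nrow = 30, ncol = 96)
Fs[5, ] <- 0                       # an inactive unit, as can occur in F

# Activity of each unit; inactive units have no shape and are left out
act <- rowSums(Fs)
active <- which(act > 0)

# Shapes: each row divided by its activity, so every row sums to 1
P <- Fs[active, , drop = FALSE] / act[active]

# Hierarchical clustering of the shapes, cut into k = 5 clusters
hc <- hclust(dist(P), method = "ward.D2")
cl <- cutree(hc, k = 5)

# Joint frequency distribution of each cluster = sum of its units' rows
J <- rowsum(Fs[active, , drop = FALSE], group = cl)

# Display the first cluster in the same style as the station plots
col <- c(rep("red", 48), rep("blue", 48))
plot(1:96, J[1, ], type = "h", lwd = 4, col = col, main = "Cluster 1")
```

Here table(cl) gives the cluster sizes, and active[cl == 1] lists the units that belong to cluster 1.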

Hint 3

Most standard operations/functions in R work on whole vectors, so loops can often be avoided. In your case, try

dura <- as.POSIXct(DivvyTrips$end_time) - as.POSIXct(DivvyTrips$start_time)
head(dura)

See also 3.1 in https://csgillespie.github.io/efficientR/programming.html
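
When two POSIXct vectors are subtracted, R chooses the unit of the result (seconds, minutes, hours, ...) automatically. If a fixed unit is wanted, difftime can be used instead. A small sketch on invented time stamps follows; the data frame trips and the time zone "UTC" are assumptions made for illustration.

```r
# Two made-up trips; the second one crosses midnight
trips <- data.frame(
  start_time = c("2019-01-01 08:00:00", "2019-01-01 23:50:00"),
  end_time   = c("2019-01-01 08:12:30", "2019-01-02 00:05:00"),
  stringsAsFactors = FALSE
)

# Durations of all trips at once, always expressed in minutes
dura <- difftime(as.POSIXct(trips$end_time, tz = "UTC"),
                 as.POSIXct(trips$start_time, tz = "UTC"),
                 units = "mins")

# Plain numbers for further computation (histograms, summaries, ...)
minutes <- as.numeric(dura)
```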