Hints

Hint 1

This is a hint for Jawad Bakhteiary (Bay Wheels, OSTA). A similar scheme can also be used in other projects.

Here is a skeleton of the code for producing the OSTA distributions of your bike-sharing service for January:

> wdir <- "<path>/EDA/projects/Bakhteiary/P2"
> setwd(wdir)
>
> halfHours <- function(DT){ t <- unlist(strsplit(strsplit(DT," ")[[1]][2],":"))
+   return(1+2*as.integer(t[1])+(as.integer(t[2])>29)) }

> T <- read.csv("201901-fordgobike-tripdata.csv",header=TRUE,sep=",")
> # dim(T)
> # names(T)
> # T[1:3,]
> # table(T$user_type)
> # for(i in 1:10) cat(i,T$start_time[i],halfHours(T$start_time[i]),"\n")
> S <- as.vector(sapply(T$start_time,halfHours))
> U <- T$user_type=="Customer"
> D <- as.integer(as.character(T$start_station_id))   # as.character first: safe if the column was read as a factor
> D[is.na(D)] <- 400
> F <- matrix(0,nrow=400,ncol=96)
> for(i in 1:nrow(T)) {st <- D[i]; hu <- 48*U[i]+S[i]; F[st,hu] <- F[st,hu]+1}

> R <- rowSums(F)
> which(R==max(R))
[1] 58
> col <- c(rep("red",48),rep("blue",48))
> plot(1:96,F[58,],type="h",lwd=4,col=col,main="Station 58 subscribers/customers departures")
> plot(1:48,F[58,49:96],type="h",lwd=4,col=rep("blue",48),main="Station 58 customers departures")

I assembled all OSTA distributions in the matrix F: rows are stations, columns are half-hours (1:48 for subscribers, 49:96 for customers). In the January data the largest station ID is 385, so I reserved rows 1 to 399 for stations. Some trips have NULL as the start station ID; I assigned them to row 400.
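To make the column layout concrete, here is a minimal sketch that computes the half-hour slot and the resulting column of F for three made-up trips. The timestamps and the names hh, mm, slot and column are invented for illustration, and the substr() positions assume the "YYYY-MM-DD HH:MM:SS" layout of start_time.

```r
# Toy timestamps in the "YYYY-MM-DD HH:MM:SS" layout (made-up values)
start_time <- c("2019-01-31 00:10:05", "2019-01-31 08:45:00", "2019-01-31 23:59:59")
user_type  <- c("Subscriber", "Customer", "Customer")

# Half-hour slot for all trips at once:
# characters 12-13 hold the hour, characters 15-16 the minute
hh <- as.integer(substr(start_time, 12, 13))
mm <- as.integer(substr(start_time, 15, 16))
slot <- 1 + 2 * hh + (mm > 29)          # 1 = 00:00-00:29, ..., 48 = 23:30-23:59

# Customers are shifted by 48 columns, subscribers are not
column <- 48 * (user_type == "Customer") + slot
print(data.frame(start_time, user_type, slot, column))
```

The slot formula is the same as in halfHours above; only the string handling is done on the whole vector instead of one value at a time.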

Still to do:

  • check that no station has an ID larger than 399; otherwise add the necessary rows to F, plus one extra row for trips with a NULL station ID
  • include data for the other 11 months into F
  • add station names as rownames of F
  • finally, remove rows with 0 activity from F
  • also produce DURA distributions (one for subscribers, another for customers)
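Some of the steps in this list can be sketched on a toy matrix. This is only one possible approach: the stations lookup table, its names and the row label "unknown" are hypothetical, and in practice the ID/name pairs would have to be taken from the trip files.

```r
# Toy stand-in for F: 5 station rows + 1 catch-all row, 96 half-hour columns
F <- matrix(0, nrow = 6, ncol = 96)
F[2, 17] <- 3; F[4, 60] <- 1; F[6, 1] <- 2

# Hypothetical id/name lookup table (made-up values)
stations <- data.frame(id = c(2, 4, 8),
                       name = c("Alpha St", "Beta Ave", "Gamma Sq"),
                       stringsAsFactors = FALSE)

# 1. Grow F if some station id exceeds the rows reserved for stations
nSt <- nrow(F) - 1                       # the last row is the catch-all for NULL ids
if (max(stations$id) > nSt) {
  extra <- matrix(0, nrow = max(stations$id) - nSt, ncol = ncol(F))
  F <- rbind(F[1:nSt, , drop = FALSE], extra, F[nrow(F), , drop = FALSE])
}

# 2. Station names as row names; ids without a name keep their number
rn <- as.character(seq_len(nrow(F)))
rn[stations$id] <- stations$name
rn[nrow(F)] <- "unknown"
rownames(F) <- rn

# 3. Drop rows with no activity
F <- F[rowSums(F) > 0, , drop = FALSE]
print(rownames(F))
```

Keeping the catch-all row last means that station IDs can still be used directly as row indices until the zero rows are removed in the final step.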

Hint 2

Matrix F contains a large number of units (distributions), too many to inspect one by one. Try to identify some interesting units. Unit 58 in the above skeleton is the most active (max row sum). Another option is to cluster the units into a small number (5 to 10) of clusters. Because unit activities can differ greatly, cluster the units by their “shapes”, that is, by the corresponding probability distributions (each frequency distribution divided by its activity). For each cluster compute its joint frequency distribution (= sum of the distributions of the units belonging to the cluster) and display it.
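A minimal sketch of this idea on made-up data follows. The toy counts, Ward's method and k = 2 are arbitrary choices made for the illustration, not a recommendation; the names P, hc, cl and J are also invented here.

```r
# Toy frequency matrix: 6 units x 48 half-hours, built from two made-up shapes
slots   <- 1:48
morning <- exp(-(slots - 17)^2 / 20)     # peak in the 08:00-08:29 slot
evening <- exp(-(slots - 36)^2 / 20)     # peak in the 17:30-17:59 slot
F <- rbind(round(1000 * morning), round(50 * morning), round(300 * morning),
           round(800 * evening),  round(20 * evening), round(200 * evening))
rownames(F) <- paste0("u", 1:6)

# Shapes: each row divided by its activity (row sum), so every row sums to 1
P <- F / rowSums(F)

# Hierarchical clustering of the shapes (Euclidean distance, Ward's method)
hc <- hclust(dist(P), method = "ward.D2")
cl <- cutree(hc, k = 2)

# Joint frequency distribution of each cluster = sum of the rows of its units
J <- rowsum(F, group = cl)
print(cl)
print(rowSums(J))
```

Each row of J can then be displayed in the same way as a single unit, for example with plot(1:48, J[1,], type="h"). Units with very different activity end up in the same cluster as long as their shapes agree.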

Hint 3

Most standard operations and functions in R work on whole vectors, so loops can often be avoided. In your case, try

dura <- as.POSIXct(DivvyTrips$end_time) - as.POSIXct(DivvyTrips$start_time)   # end minus start, so durations are positive
head(dura)

See also 3.1 in https://csgillespie.github.io/efficientR/programming.html
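The same principle can be applied to the counting loop in the skeleton of Hint 1. The sketch below compares the loop with one loop-free variant on toy inputs; the small vectors D, S and U, the value of nSt and the names Floop, Fvec and cell are made up for the illustration, and tabulate() is only one of several ways to do this.

```r
# Toy inputs standing in for D (station row), S (half-hour slot), U (customer flag)
D <- c(3, 3, 1, 3, 2)
S <- c(17, 17, 36, 17, 5)
U <- c(FALSE, FALSE, TRUE, TRUE, FALSE)
nSt <- 4                                 # number of rows of the count matrix

# Loop version, as in the skeleton
Floop <- matrix(0, nrow = nSt, ncol = 96)
for (i in seq_along(D)) {
  hu <- 48 * U[i] + S[i]
  Floop[D[i], hu] <- Floop[D[i], hu] + 1
}

# Loop-free version: map each (row, column) pair to one cell number
# (column-major order, as R stores matrices) and count the cell numbers
cell <- D + nSt * (48 * U + S - 1)
Fvec <- matrix(tabulate(cell, nbins = nSt * 96), nrow = nSt, ncol = 96)

print(all(Floop == Fvec))
```

On a full month of trips the loop-free version avoids one R-level iteration per trip, which is where most of the time in the loop goes.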

ru/hse/eda22/stu/h2.txt · Last modified: 2022/12/14 16:01 by vlado
 
Except where otherwise noted, content on this wiki is licensed under the following license: CC Attribution-Noncommercial-Share Alike 3.0 Unported