How to Make Sense of 750,000 Bicycle Trips

Chicago’s bike sharing program, Divvy, recently published trip data for the 750,000 trips taken in 2013. They’re looking to use the data to answer burning questions like: Where are riders going? When are they going there? How far do they ride? What are the top stations? What interesting usage patterns emerge? What can the data reveal about how Chicago gets around on Divvy?

Certified data geek that I am, here’s how I went about analyzing Divvy’s massive stack of data. You can check out the final analysis at http://drewdepriest.com/divvy.

Step 01. Updated the CSV tables provided by Divvy:

Divvy_Stations_2013.csv – removed capacity column, added ID column (manually entered from “Trips” CSV)

Divvy_Trips_2013.csv – removed all columns except for start time, “from” station ID, and “to” station ID

Step 02. Wrote a static Java package to build a profile for each of the 757,911 trips (759,788 were provided; I excluded any with stations marked “#N/A”). EDIT: Going back through the data two weeks later, I realized I should have recognized that the “#N/A” stations all applied to “Congress & Ogden,” or Station ID 122. If I find more time to run these 1,877 trips at a later date, I’d love to do so. The data purist in me isn’t happy with this outcome, but the pragmatist believes an analysis of 99.75% of the total data set is still pretty accurate.

Each profile consists of the following fields (a rough sketch of the object appears after this list):

Start time

From station ID

To station ID

From station latitude

From station longitude

To station latitude

To station longitude

Distance (road distance, on a bicycle, between the “from” and “to” stations)
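
In plain Java terms, each profile boils down to a simple object along these lines (a rough sketch only; the field names are illustrative, not the actual ones from my package):

    // Illustrative sketch of a single trip profile; field names are examples.
    public class TripProfile {
        String startTime;      // e.g. "2013-06-27 12:11"
        int fromStationId;
        int toStationId;
        double fromLat;
        double fromLng;
        double toLat;
        double toLng;
        double distanceMiles;  // road distance by bicycle, filled in later (Step 07)

        public TripProfile(String startTime, int fromStationId, int toStationId,
                           double fromLat, double fromLng, double toLat, double toLng) {
            this.startTime = startTime;
            this.fromStationId = fromStationId;
            this.toStationId = toStationId;
            this.fromLat = fromLat;
            this.fromLng = fromLng;
            this.toLat = toLat;
            this.toLng = toLng;
        }
    }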

Step 03. Emailed the MapQuest Open Directions API Web Service team to request permission to run several thousand calls to their service. Many thanks to Joe Barbara at MapQuest for his blessing, provided I ran the calls serially and from only one box. He also asked that I attribute the data to MapQuest and OSM, which I am more than happy to do. They’re fantastic.

Step 04. Within the Java package, I built a mechanism to screen for duplicate routes (start/end station pairs) in the massive trip list. There are only 89,700 possible permutations of station pairs:

\frac{n!}{(n-k)!} = \frac{300!}{(300-2)!} = 300 \times 299 = 89{,}700

where n = 300 is the number of stations and k = 2 is the number of stations chosen for each route (a start and an end).

Thus, there’s no need to make a distance API call for every single trip – just one for each possible route. In total, there are only 44,180 unique routes in the Divvy data set for June through December 2013.
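
The screening itself is simple: key each trip on its ordered (from, to) station pair and keep only the pairs that haven’t been seen yet. A minimal sketch, reusing the TripProfile sketch above:

    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Collect the unique (from, to) station pairs so the distance API is called
    // once per route rather than once per trip. Order matters: A -> B can be a
    // different road distance than B -> A.
    public class RouteScreener {
        public static Set<String> uniqueRoutes(List<TripProfile> trips) {
            Set<String> routes = new HashSet<>();
            for (TripProfile t : trips) {
                routes.add(t.fromStationId + "->" + t.toStationId);
            }
            return routes; // 44,180 entries for the 2013 trip data
        }
    }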

Step 05. Made serial calls from Java to the MapQuest Open Directions API Web Service to obtain a road distance value for each possible route, staggered to run 5,000 at a time. As each batch of distance values came back, the program printed it to a CSV file; I manually stitched the batches together into one master distance file.
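
Each call is just an HTTP GET against the Open Directions route endpoint with a pair of lat/long values, and the road distance comes back in the JSON response. A rough sketch of one call (the API key is a placeholder and the JSON parsing is left out; check MapQuest’s documentation for the exact parameters):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Rough sketch of a single Open Directions call. "YOUR_KEY" is a placeholder,
    // and a real implementation should parse the JSON (the value lives under
    // route.distance) with a proper JSON library.
    public class DistanceLookup {
        static String routeJson(double fromLat, double fromLng,
                                double toLat, double toLng) throws Exception {
            String url = "http://open.mapquestapi.com/directions/v2/route"
                    + "?key=YOUR_KEY"
                    + "&routeType=bicycle&unit=m"
                    + "&from=" + fromLat + "," + fromLng
                    + "&to=" + toLat + "," + toLng;
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) {
                    body.append(line);
                }
            }
            return body.toString();
        }
    }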

Step 06. Some 27,000 of the trips came back with “null” values for distance. Even though these represent less than four percent of total trips, I decided to run them against MapQuest again in an effort to be as accurate as possible with the total distance. I deleted any row with a “null” distance value from the trip list and ran those trips through the MapQuest API again, essentially repeating Step 05 for 27,000 lat/long pairs.

Step 07. Once all the distances for each possible route are known, a separate algorithm reads back in the CSV created in Step 05. The function loops through all 757,911 trips and matches each route to its corresponding distance.
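
Conceptually this is just a hash lookup: load the master distance CSV into a map keyed on the (from, to) station pair, then walk the trip list once. A minimal sketch, assuming a headerless fromId,toId,distance layout and the TripProfile sketch above:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Load route distances keyed on "fromId->toId", then stamp each trip with its
    // distance in a single pass. Assumes a headerless fromId,toId,distance CSV.
    public class DistanceMatcher {
        static void matchDistances(List<TripProfile> trips, String distanceCsv) throws Exception {
            Map<String, Double> distances = new HashMap<>();
            try (BufferedReader in = new BufferedReader(new FileReader(distanceCsv))) {
                String line;
                while ((line = in.readLine()) != null) {
                    String[] cols = line.split(",");
                    distances.put(cols[0] + "->" + cols[1], Double.parseDouble(cols[2]));
                }
            }
            for (TripProfile t : trips) {
                Double d = distances.get(t.fromStationId + "->" + t.toStationId);
                if (d != null) {
                    t.distanceMiles = d;
                }
            }
        }
    }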

Step 08. We then print the entire list of 757,911 trips to two CSV files (a quick sketch of the LITE export follows the list):

  1. MASTER file containing all from/to station info for each trip
  2. LITE file containing only start time and distance
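
The export itself is vanilla java.io; the LITE file, for example, is just two columns per trip. A quick sketch, again using the TripProfile sketch above:

    import java.io.PrintWriter;
    import java.util.List;

    // Write the LITE file: one start time and one distance per trip.
    public class TripExporter {
        static void writeLite(List<TripProfile> trips, String path) throws Exception {
            try (PrintWriter out = new PrintWriter(path)) {
                out.println("start_time,distance_miles");
                for (TripProfile t : trips) {
                    out.println(t.startTime + "," + t.distanceMiles);
                }
            }
        }
    }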

Step 09. Wrote a Java class as part of the package that sent an email via JavaMail to my Gmail account once all API calls had been made and the results exported to CSV files. I then added a recipe to IFTTT (If This Then That) that detected the email from my Java app and, in response, sent a text message to my phone. Think of it as a super nerdy way to “set it and forget it.”
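
The notification piece is only a few lines with JavaMail. A stripped-down sketch, assuming Gmail’s SMTP server, with the addresses and app password as placeholders:

    import java.util.Properties;
    import javax.mail.Message;
    import javax.mail.Session;
    import javax.mail.Transport;
    import javax.mail.internet.InternetAddress;
    import javax.mail.internet.MimeMessage;

    // Send a "done" email via Gmail SMTP once the batch job finishes.
    // The addresses and password below are placeholders.
    public class DoneNotifier {
        static void notifyDone() throws Exception {
            Properties props = new Properties();
            props.put("mail.smtp.host", "smtp.gmail.com");
            props.put("mail.smtp.port", "587");
            props.put("mail.smtp.auth", "true");
            props.put("mail.smtp.starttls.enable", "true");

            Session session = Session.getInstance(props, new javax.mail.Authenticator() {
                protected javax.mail.PasswordAuthentication getPasswordAuthentication() {
                    return new javax.mail.PasswordAuthentication("me@gmail.com", "app-password");
                }
            });

            Message msg = new MimeMessage(session);
            msg.setFrom(new InternetAddress("me@gmail.com"));
            msg.setRecipient(Message.RecipientType.TO, new InternetAddress("me@gmail.com"));
            msg.setSubject("Divvy distance calls complete");
            msg.setText("All MapQuest calls finished and CSVs exported.");
            Transport.send(msg);
        }
    }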

Step 10. I then took each set of data and leveraged Python to upload everything to TempoDB, a Chicago startup that specializes in “Data as a Service” (DaaS). Tempo stores the data as a time-series database, which allows for lightning-fast processing and grouping by time period (minute, hour, day, etc.).

Step 11. For the weather data, I signed up for a 30-day trial of AccuWeather Professional and found daily data for high temperature and precipitation. I built new CSVs that fit Tempo’s format and uploaded two series: one of 187 daily high temperatures and one of 187 daily precipitation values.

Step 12. Still within Python, and leveraging a package called Pygal, I built SVG charts of each data set. Pygal lets you embed the resulting charts directly into a web page, which I did and published to http://drewdepriest.com/divvy.

[Figure: Divvy Data Challenge – analysis steps]
