In recent months we’ve put a real focus on our Unified API and open data policy, both on this blog and through the Hackathons and events that TfL staff have been involved with (such as the Urban Traffic Hackathon a few weeks ago). As we’d hoped, we’ve received lots of great feedback and questions from the developer community.
One question that has cropped up many times concerns customer volume and flow data: how can we help developers create apps that take into account how busy certain lines, stations, platforms and so on are likely to be when customers are planning a journey?
To provide an update on where we are with this data, TfL’s Data Services Manager Ryan Sweeney offers this summary, and asks for your feedback to help us ensure we’re providing data that is both relevant and useful:
TfL have participated in a number of Hackathons this autumn, focused on working with developers to create innovative new solutions that help make our customers’ journeys better. In order to facilitate these events, we’ve released a range of previously unseen data from across our many modes of transport, enabling those working with the data at these events to be truly creative and original in the solutions they develop.
One of the key new data sets we have released is customer volume and flow data across the whole of the TfL rail and bus network, collected from our automatic fare collection system.
The data released is a two-week sample from October, aggregated into 30-minute intervals. In addition to aggregation, we have also modified low values for data protection reasons. The data can be found in a zip file available to download from our data store, which includes a metadata spreadsheet listing and describing all of the fields:
30 minute entries.csv: This data shows the total number of gateline entries by station, day and 30-minute time interval.
30 minute exits.csv: This data shows the total number of gateline exits by station, day and 30-minute time interval.
Bus journeys: This data shows bus boardings by stop, route and time of journey.
Journey time.txt: This data shows the journeys made between station pairs and the time taken to make the journey.
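As a rough illustration of how the entries file might be used, the sketch below averages gateline entries per station and 30-minute interval across the days in a sample. The column names (`nlc`, `day`, `interval_start`, `entries`) and the rows are placeholders invented for the example; the real field names are those listed in the metadata spreadsheet.

```python
import csv
import io
from collections import defaultdict

# Placeholder rows for illustration only. Column names and values are
# assumptions, not the actual layout of "30 minute entries.csv".
SAMPLE = """nlc,day,interval_start,entries
S1,day-1,08:00,120
S1,day-2,08:00,140
S1,day-1,08:30,200
S2,day-1,08:00,30
"""


def mean_entries_by_interval(csv_file):
    """Return {(station, interval): mean entries per day observed}."""
    totals = defaultdict(int)   # summed entries per (station, interval)
    days = defaultdict(set)     # distinct days seen per (station, interval)
    for row in csv.DictReader(csv_file):
        key = (row["nlc"], row["interval_start"])
        totals[key] += int(row["entries"])
        days[key].add(row["day"])
    return {key: totals[key] / len(days[key]) for key in totals}


profile = mean_entries_by_interval(io.StringIO(SAMPLE))
for (station, interval), mean in sorted(profile.items()):
    print(station, interval, mean)
```

To run this against the downloaded file, an open file handle would be passed in place of the `io.StringIO` sample, with the column names adjusted to match the metadata. Because low values have been modified, averages for quiet stations and intervals should be treated as approximate.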
We’ve already seen a number of interesting and useful applications and concepts created using these data sources at the various Hackathons we have attended. We would like to understand, from a developer’s point of view, the answers to the following questions:
- What would you use this data for?
- How granular (time and location) does this data need to be for it to be beneficial? (Please note that some aggregation and modification to remove low values is required in order to openly release ticketing data)
- Is historic data useful?
- Are additional data sources needed to complement this to enable you to use it?
- Is the data easy to understand?
Please let us know your thoughts on these questions, and if this data promises to be beneficial to the developer community, we will investigate the possibility of making this data available on a periodic basis.
This data is great for getting a much more nuanced feel for customer flows and the general crowdedness of stations at certain points in time. Things like the journey times can help us understand traffic in stations, so we can get more accurate travel times in the journey planner. The metadata descriptions came in handy for understanding the data. Of course, entry/exit is one thing; getting this at a line + station level, including more detailed info on crowdedness in the tube itself, would bring major benefits. Ideally it would be as detailed as possible, though given the sampling constraints the 30-minute intervals are actually quite OK. Merging this set with metadata of the stations (size, facilities, number of gates, etc., provided by the Unified API) adds more interesting tidbits for comparisons.
Since this is a sample of only two weeks, we can’t yet fully assess effects such as seasonality.
Some stations were also closed at the time in October, so we don’t get a full overview of all stations at the moment.
Also, is there a reason the key in this dataset is NLC codes instead of the NAPTAN codes used in the Unified API? There doesn’t seem to be a good way of linking those two sets for tube/overground stations. (I did manage to find a linking dataset for the bus stops.)
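To make the linking problem concrete, here is a minimal sketch of what such a join could look like, assuming a lookup table from NLC code to NAPTAN ID were available. The lookup contents, the `attach_naptan` helper and the row layout are all hypothetical placeholders, not a published TfL mapping.

```python
# Hypothetical lookup from NLC code to NAPTAN ID. The values are
# placeholders; no such table ships with the sample data.
NLC_TO_NAPTAN = {"S1": "NAPTAN_A", "S2": "NAPTAN_B"}


def attach_naptan(rows, lookup):
    """Split rows into those with a known NAPTAN ID and those without."""
    matched, unmatched = [], []
    for row in rows:
        naptan = lookup.get(row["nlc"])
        if naptan is None:
            # Keep unmatched rows visible rather than dropping them silently.
            unmatched.append(row)
        else:
            matched.append({**row, "naptan": naptan})
    return matched, unmatched


rows = [
    {"nlc": "S1", "entries": 120},
    {"nlc": "S3", "entries": 45},  # no entry in the lookup
]
matched, unmatched = attach_naptan(rows, NLC_TO_NAPTAN)
```

Returning the unmatched rows separately makes gaps in the mapping easy to spot, for example stations that were closed during the sample period or codes that have no counterpart.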
It would also be great to see a live version of this data integrated into the Unified API.
On a side note: can we get these kinds of rich historic datasets on transport disruptions (incl. delays) as well?
Thanks!
Hi Yasir, thanks for your comment. We’re currently investigating how we can increase the availability of this data, and the feedback and suggestions here are really useful. We understand that line-level data would be beneficial; however, it is not recorded in the ticketing system, because the customer taps in at one station and taps out at another and could have taken any of a number of routes in between. Thanks for taking the time to investigate the data and reply to the post. Look out for more on this subject on this blog in the future.