For the past 6 months or so, I have been collecting GPS data of my bike rides via Runkeeper. Figuring I'd reached a data saturation point in my rides - I pretty much take the same route to and from work every day - I sought to do some analysis of them. Now, Runkeeper provides quite a bit of in-app analytics, but mostly as part of their premium service. Fortunately, they allow users to export their data as GPX files. With data in hand, I set out to carry out some basic analysis of my rides. However, I found that the data needed a bit of processing in order to enhance its usefulness.
My ability to work with this data is the product of other people sharing what they've learned about GPX data, Python programming, pandas, geopandas, folium, shapely and more. In this post I attempt to pay forward what I've learned from others and contribute back, in particular, to the community of geospatial Python practitioners. What follows is an explanation of how I used geopandas and shapely to transform my GPS track points from Runkeeper into a new, even more interesting dataset.
Exploring Runkeeper Data¶
A few weeks ago, I downloaded each individual gpx data file of my rides and picked a random trip to explore. If you've ever dealt with a gpx file, you might have noticed that it comes with several different layers. I'll be using a test set of my GPS data called gps_test_data.gpx throughout this post. Before going much further, I used fiona to take a peek inside my gpx file.
import fiona
fname = 'gps_test_data.gpx'
fiona.listlayers(fname)
So this gpx data actually appears to consist of 5 layers. In my case, the layers Runkeeper collected data for were:
tracks - a single line feature comprising the entire path you traveled
track_points - regularly collected x/y coordinates with elevation and a time stamp
After figuring out what layers in my gpx file I could work with, I took a look at each layer to think about what I might be able to do with that data. Using fiona, I loaded the tracks layer to see if it contained some interesting data.
# Open tracks layer
tracks_layer = fiona.open(fname, layer='tracks')
feature = tracks_layer[0]
# Print out the number of features in the tracks layer
print("# of features:\n--------------\n", len(list(tracks_layer.items())))
# Print out fields and attribute data
properties = feature['properties']
print("Feature attribute data:\n-----------------------")
for k, v in properties.items():
print('{}: {}'.format(k,v))
# Draw a quick plot of data using shapely
from shapely.geometry import shape
tracks_data = {'type': 'MultiLineString',
'coordinates': feature['geometry']['coordinates']}
tracks_shape = shape(tracks_data)
tracks_shape
Although the tracks
layer is good at showing the path traveled, there are no time stamps or elevation values for start, finish or anywhere in between. While I'm sure this layer could have some utility, nothing really stoked my curiosity.
On to track_points
. As I did with tracks
I peeked in at the number of features as well as the attributes and geometry of the first point.
# Open track points layer
track_pts_layer = fiona.open(fname, layer='track_points')
feature = track_pts_layer[0]
# Print out the number of features in the track_points layer
print("# of features:\n--------------\n", len(list(track_pts_layer.items())))
# Print out fields and attribute data
properties = feature['properties']
print("Feature attribute data:\n-----------------------")
for k, v in properties.items():
print('{}: {}'.format(k,v))
# Draw a quick plot of data using shapely
track_pt_data = {'type': 'Point',
'coordinates': feature['geometry']['coordinates']}
track_pt_shape = shape(track_pt_data)
track_pt_shape
I found the presence of 163 points in track_points encouraging. Better still were some of the attribute data collected for each point. A few that caught my eye included:
time: Time and date at which the feature was collected
ele: Elevation at which the feature was collected
track_seg_point_id: Spot in the order in which a feature was collected (e.g. 1st, 2nd, 3rd, etc.)
What to do?¶
track_points
contained snapshots of data - time, elevation, a sequence id, and location information - as collected at each point along my journey. What I didn't see were interesting values like change in elevation or speed throughout my ride. What if I could calculate the difference in elevation, travel time, and distance between consecutive points? And what if, since I was calculating these differences over consecutive points, I could represent each consecutive pair of points as an individual line segment? I figured with a few helpful Python tools, I might be able to address these "what-ifs".
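The core of the "consecutive differences" idea can be sketched with plain pandas. This toy example uses made-up elevation values, not my actual Runkeeper data:

```python
import pandas as pd

# Toy stand-in for a few track points' elevations (made-up values)
ele = pd.Series([100.0, 102.5, 102.0, 105.0])

# diff() subtracts each value from the one after it,
# giving the change between consecutive points
ele_change = ele.diff()
print(ele_change.tolist())  # [nan, 2.5, -0.5, 3.0]
```

The first value is NaN because the first point has no predecessor - the same off-by-one that shows up later when building segments between points.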
Thus far I've mostly worked through a basic exploration of my data. For the rest of this post, I'm going to walk through how I constructed line segments between consecutive points in track_points
and calculated change in elevation, change in distance, change in time, and speed using geopandas and shapely.
Setting up¶
Although I had already loaded a couple libraries to help read in and examine my data, I was going to need a few more to assist in my analysis:
import pandas as pd # To structure and manipulate data in a DataFrame format
import geopandas as gpd # To work with spatial data in a DataFrame
from geopandas import GeoDataFrame # To create a GeoDataFrame from a DataFrame
from shapely.geometry import LineString # To create line geometries that can be used in a GeoDataFrame
import folium # To generate a Leaflet-based map of my data throughout my analysis
Load track_points
¶
Earlier I had used fiona to load track_points so that I could take a quick look at the data. Because I wanted to do some further analysis, I reloaded track_points into a special type of DataFrame called a GeoDataFrame. Something I love about GeoDataFrames is that they allow you to carry out spatial operations on spatial data while still working with the familiar pandas API.
Earlier I defined the data file gps_test_data.gpx
as fname
. To load track_points
into a GeoDataFrame, gdf
, I passed fname
to geopandas' gpd.read_file()
. Because the gpx
file had several layers, I specified the layer
parameter as track_points
. I then sorted gdf
on track_seg_point_id
to ensure that the rows of the GeoDataFrame were in the correct order. I reset the index to ensure that the index increments by 1s starting at 0.
If you recall from earlier, when I first examined track_points there were several fields for which either the data were not meaningful or no data was collected. To make gdf a bit easier to deal with, I reduced the columns to just track_seg_point_id
, ele
, time
, and geometry
.
gdf = gpd.read_file(fname, layer = 'track_points')
gdf.sort_values('track_seg_point_id', inplace=True)
gdf.reset_index(drop=True, inplace=True)
gdf = gdf[['track_seg_point_id', 'ele', 'time', 'geometry']].copy()
gdf.head()
With gdf
looking much more manageable, this seemed like a good opportunity to take a beat and visualize track_points
. I did this using folium, but you can achieve similar results with cartopy or even just geopandas via gdf.plot()
.
Sidenote:¶
I'm not super comfortable with folium yet, so I won't go into great depth about how to use this library. I will point out that I cobbled together a little function to draw the markers as circles. Because of how close together the points were in my dataset, the default blue pins were not doing it for me. I've never quite grasped using custom markers in Leaflet and doing so in folium left me similarly vexed. As such, credit and gratitude to Collin Reinking for his helpful notebook on plotting points with folium.
def add_markers(mapobj, gdf):
    coords = []
    for i, row in gdf.iterrows():
        coords.append([row.geometry.y, row.geometry.x])
    for coord in coords:
        folium.CircleMarker(location = coord,
                            radius = 2.5,
                            fill = True,
                            fill_color = '#F50057',
                            fill_opacity = 0.75,
                            color = 'whitesmoke',
                            weight = 0.5).add_to(mapobj)
    return mapobj
f = folium.Figure(height = 400)
m = folium.Map([35.792809, -78.675724], zoom_start = 15, tiles='Cartodb dark_matter')
m.add_to(f)
add_markers(m, gdf)
Making a line segment¶
A line segment requires two pairs of XY coordinates. In gdf
the XY coordinates for a given row can be found in the geometry
column. My expectation was that for each row in gdf
, the geometry
column would contain a shapely point geometry with a single pair of XY coordinates. To confirm this, I took a look at the first two consecutive points in my data.
geom0 = gdf.loc[0]['geometry']
geom1 = gdf.loc[1]['geometry']
print("Geometry type:", str(type(geom0)))
print(f"geom0 coordinates: {geom0.x}, {geom0.y}")
print(f"geom1 coordinates: {geom1.x}, {geom1.y}")
With my expectations confirmed, I wanted to see if I could indeed generate a line segment using the XY coordinates for geom0
and geom1
. In thinking through how to create a line segment from two points, I referred to the shapely documentation on the LineString()
constructor. What I discovered was that in order to create a LineString I'd need to create two tuples, one for each XY coordinate pair. I'd then need to include those two tuples in a list. This list would then be passed to a shapely LineString()
constructor like so:
# Create LineString from coordinates
start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
line = LineString([start, end])
print(f"Geometry type: {str(type(line))}")
line
Wonderful! I had managed to take the XY coordinates from the first two points in my data and use them to create a line segment. What would I need to do to repeat this process over and over again for each consecutive pair of XY coordinates?
1. Create a function, make_lines()
that takes two sets of XY coordinates from a GeoDataFrame and adds them to a shapely LineString()
constructor. Returns a DataFrame.¶
def make_lines(gdf, df_out, i, geometry = 'geometry'):
    geom0 = gdf.loc[i][geometry]
    geom1 = gdf.loc[i + 1][geometry]
    start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
    line = LineString([start, end])
    # Create a DataFrame to hold the record
    data = {'id': i,
            'geometry': [line]}
    df_line = pd.DataFrame(data, columns = ['id', 'geometry'])
    # Add record to the DataFrame of compiled records
    df_out = pd.concat([df_out, df_line])
    return df_out
2. Loop through a GeoDataFrame, passing consecutive XY coordinate pairs to make_lines()
.¶
# Initialize an output DataFrame
df = pd.DataFrame(columns = ['id', 'geometry'])
# Loop through each row of the input point GeoDataFrame
x = 0
while x < len(gdf) - 1:
    df = make_lines(gdf, df, x)
    x = x + 1
df.head()
Having managed to create LineStrings
for each consecutive pair of points, I piggybacked on that process to calculate change in elevation and time between each point. I did this by expanding my make_lines()
function to carry out those calculations and place the results into columns within the output DataFrame.
def make_lines(gdf, df_out, i, elevation = 'ele', time = 'time', geometry = 'geometry'):
    # Get track coordinates
    geom0 = gdf.loc[i][geometry]
    geom1 = gdf.loc[i + 1][geometry]
    # Create LineString from coordinates
    start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
    line = LineString([start, end])
    # Calculate change in elevation
    elevation_change = gdf.loc[i + 1, elevation] - gdf.loc[i, elevation]
    # Calculate time between segments
    time_change = pd.to_datetime(gdf.loc[i + 1, time]) - pd.to_datetime(gdf.loc[i, time])
    time_change_seconds = time_change.total_seconds()
    # Create a DataFrame to hold the record
    data = {'id': i,
            'elevation_change': elevation_change,
            'time_change': time_change_seconds,
            'geometry': [line]}
    df_line = pd.DataFrame(data, columns = ['id', 'elevation_change', 'time_change', 'geometry'])
    # Add record to the DataFrame of compiled records
    df_out = pd.concat([df_out, df_line])
    return df_out
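One subtlety worth flagging when differencing timestamps: a pandas Timedelta's .seconds attribute returns only the seconds component of the delta (the whole-days part is dropped), while .total_seconds() returns the full duration. For consecutive GPS points a few seconds apart the two agree, but the distinction can bite on longer gaps:

```python
import pandas as pd

delta = pd.Timedelta(days=1, seconds=30)

# .seconds is just the seconds *component* -- the day is ignored
print(delta.seconds)          # 30

# .total_seconds() is the full duration expressed in seconds
print(delta.total_seconds())  # 86430.0
```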
With a function for creating the segments and doing calculations, all I needed to do was initialize a new DataFrame, df
, that matched the schema of the DataFrame being returned by make_lines()
. While df
starts as an empty DataFrame, it is populated as a while
-loop runs make_lines()
on each consecutive pair of points.
df = pd.DataFrame(columns = ['id', 'elevation_change', 'time_change', 'geometry'])
x = 0
while x < len(gdf) - 1:
    df = make_lines(gdf, df, x)
    x = x + 1
df.head()
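As an aside, the row-by-row while-loop can also be expressed as a vectorized operation. Here's a pandas-only sketch of the same pairing idea using shift(), with plain coordinate tuples standing in for shapely geometries (toy data for illustration, not my actual track):

```python
import pandas as pd

# Toy points: (x, y) tuples standing in for shapely Points
pts = pd.DataFrame({'coord': [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]})

# shift(-1) aligns each point with the one after it, so pairing the
# two columns yields every segment's start/end in a single pass;
# dropna() discards the last row, which has no following point
pairs = pd.DataFrame({'start': pts['coord'],
                      'end': pts['coord'].shift(-1)}).dropna()
segments = list(zip(pairs['start'], pairs['end']))
print(segments)  # [((0.0, 0.0), (1.0, 0.0)), ((1.0, 0.0), (1.0, 1.0))]
```

Each start/end pair could then be handed to the LineString() constructor just as in the loop version.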
Although the final version of df
had a geometry
column containing shapely LineStrings, the DataFrame still needed to be converted into a GeoDataFrame for both plotting and for calculating the length of each segment. To convert df
into a GeoDataFrame, gdf_line
, I passed the DataFrame and the coordinate reference system of the geometry
column to a geopandas GeoDataFrame()
:
crs = 'EPSG:4326'  # the older {'init': 'epsg:4326'} dict form is deprecated in recent geopandas/pyproj
gdf_line = GeoDataFrame(df, crs=crs)
gdf_line.head()
Although the geometry
column in df
and gdf_line
look the same, by converting df
to a GeoDataFrame, the geometry
column changed from a pandas Series to a geopandas GeoSeries.
print("Field types:")
print(f">> df['geometry']: {str(type(df['geometry']))}")
print(f">> gdf_line['geometry']: {str(type(gdf_line['geometry']))}")
The distinction between a Series
and a GeoSeries
may sound trivial, but this conversion meant I could use geopandas to carry out the spatial analysis missing in pandas. It also meant I could quickly map gdf_line
and make sure the results made visual sense.
folium.GeoJson(gdf_line).add_to(m)
m
We're in the home stretch here! While I had managed to create line segments and derived time and elevation change for each line segment, I was missing a key measure: speed.
Speed = Distance/Time
My desire was to calculate miles per hour for each segment. I already had time, albeit in seconds. I didn't have the distance along each segment. Fortunately, geopandas has a length property for calculating distance. The catch is that it calculates length in the units of the GeoDataFrame's coordinate reference system (CRS). Because gdf_line is based on WGS84, its units were in degrees. Luckily, I was able to use geopandas' to_crs() method to briefly transform gdf_line into a local, foot-based CRS, NAD83 / North Carolina. I calculated the length of each segment and converted the value into miles, miles_traveled. I then created a new column, mph, by converting time_change from seconds to hours and dividing miles_traveled by that value. As a final step, I re-transformed gdf_line back to WGS84.
# convert to GeoDataFrame to a crs from which we can derive linear measurements
gdf_line = gdf_line.to_crs(epsg = 2264)
# Calculate miles per hour for each segment. Distance is calculated in feet (the units of EPSG:2264),
# time in seconds. They are converted to miles and hours, respectively, when creating the speed (mph) column.
gdf_line['miles_traveled'] = gdf_line.length * 0.000189394
gdf_line['mph'] = gdf_line['miles_traveled'] / (gdf_line['time_change'] * 0.000277778)
# convert back to original crs
gdf_line = gdf_line.to_crs(epsg = 4326)
gdf_line.head()
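The two magic numbers above are just unit conversions: 0.000189394 is approximately 1/5280 (feet to miles) and 0.000277778 is approximately 1/3600 (seconds to hours). A quick pure-Python sanity check of the arithmetic, using a hypothetical segment:

```python
FEET_PER_MILE = 5280
SECONDS_PER_HOUR = 3600

# Hypothetical segment: 88 feet covered in 5 seconds
feet, seconds = 88.0, 5.0

miles = feet / FEET_PER_MILE        # distance in miles
hours = seconds / SECONDS_PER_HOUR  # duration in hours
mph = miles / hours
print(round(mph, 1))  # 12.0
```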
To check my work I plotted gdf_line
using folium. I was especially interested to see if the speed for each segment seemed in line with my own experience, so I symbolized by the mph
column using folium's color mapping functionality.
import branca.colormap as cm
import math
f2 = folium.Figure(height = 400)
m2 = folium.Map([35.792809, -78.675724], zoom_start = 15, tiles='Stamen Toner')
max_mph = math.ceil(gdf_line['mph'].max())
linear = cm.linear.YlOrRd.scale(0,max_mph)
gps_lyr = folium.GeoJson(gdf_line,
style_function = lambda feature: {
'color': linear(feature['properties']['mph']),
'weight': 3})
m2.add_child(linear)
m2.add_child(gps_lyr)
m2.add_to(f2)
Looking at the map and speaking from a position of having intimately experienced this ride, the map seems reasonable.
Finally, I put together a little function to export gdf_line
to a GeoJSON file using geopandas' to_file()
method.
import os
def export_gdf(gdf, filename):
    # Remove the file if it already exists; the GeoJSON driver won't overwrite it
    try:
        os.remove(filename)
    except OSError:
        pass
    gdf.to_file(filename, driver = "GeoJSON")
export_gdf(gdf_line, 'output.geojson')
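For reference, the exported file should be a GeoJSON FeatureCollection, with each segment as a LineString feature carrying its computed attributes. A stdlib-only sketch of that structure (a hand-built toy feature with made-up values, not actual output):

```python
import json

# Hand-built toy feature mirroring the structure to_file() produces
collection = {
    "type": "FeatureCollection",
    "features": [{
        "type": "Feature",
        "properties": {"id": 0, "elevation_change": 0.3,
                       "time_change": 4.0, "mph": 12.0},
        "geometry": {"type": "LineString",
                     "coordinates": [[-78.6757, 35.7928],
                                     [-78.6755, 35.7929]]}
    }]
}

# Round-trip through JSON text, as reading the exported file would
parsed = json.loads(json.dumps(collection))
print(parsed["features"][0]["geometry"]["type"])  # LineString
```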
In this post I walked through how I managed to take some GPS data and perform basic data manipulation and analysis using geopandas, shapely, pandas, and other Python libraries. No doubt there are more efficient ways to do what I've outlined here. In fact, as I've been writing this post, I have found ways to improve this process. I hope that the code and my thought process can be helpful for you in addressing your own work. Or perhaps it serves as a case study in your own exploration of Python geospatial development. Perhaps something else entirely! Whatever the case, thank you for reading. I don't yet have commenting worked out here, but feel free to reach out to me on Twitter, @maptastik.