For the past 6 months or so, I have been collecting GPS data of my bike rides via Runkeeper. Figuring I'd reached a data saturation point in my rides - I pretty much take the same route to and from work every day - I sought to do some analysis of them. Now, Runkeeper provides quite a bit of in-app analytics, but mostly as part of their premium service. Fortunately, they allow users to export their data as GPX files. With data in hand, I set out to carry out some basic analysis of my rides. However, I found that the data needed a bit of processing in order to enhance its usefulness.

My ability to work with this data is the product of other people sharing what they've learned about GPX data, Python programming, pandas, geopandas, folium, shapely, and more. In this post I attempt to pay forward what I've learned from others and contribute back, in particular, to the community of geospatial Python practitioners. What follows is an explanation of how I used geopandas and shapely to transform my GPS track points from Runkeeper into a new, even more interesting dataset.

Exploring Runkeeper Data

A few weeks ago, I downloaded each individual gpx data file of my rides and picked a random trip to explore. If you've ever dealt with a gpx file, you might have noticed that it comes with several different layers. I'll be using a test set of my GPS data called gps_test_data.gpx throughout this post. Before going much further, I used fiona to take a peek inside my gpx file.

In [1]:
import fiona

fname = 'gps_test_data.gpx'
fiona.listlayers(fname)
Out[1]:
['waypoints', 'routes', 'tracks', 'route_points', 'track_points']

So this gpx data actually appears to consist of 5 layers. In my case the layers Runkeeper collected data for were:

  • tracks - a single line feature comprising the entire path you traveled
  • track_points - regularly collected x/y coordinates with elevation and a time stamp.

After figuring out what layers in my gpx file I could work with, I took a look at each layer to think about what I might be able to do with that data. Using fiona, I loaded the tracks layer to see if it contained some interesting data.

In [2]:
# Open tracks layer
tracks_layer = fiona.open(fname, layer='tracks')
feature = tracks_layer[0]

# Print out the number of features in the tracks layer
print("# of features:\n--------------\n", len(list(tracks_layer.items())))

# Print out fields and attribute data
properties = feature['properties']
print("Feature attribute data:\n-----------------------")
for k, v in properties.items():
    print('{}: {}'.format(k,v))

# Draw a quick plot of data using shapely
from shapely.geometry import shape
tracks_data = {'type': 'MultiLineString',
               'coordinates': feature['geometry']['coordinates']}
tracks_shape = shape(tracks_data)
tracks_shape
# of features:
--------------
 1
Feature attribute data:
-----------------------
name: Cycling 12/29/17 4:42 pm
cmt: None
desc: None
src: None
link1_href: None
link1_text: None
link1_type: None
link2_href: None
link2_text: None
link2_type: None
number: None
type: None
Out[2]:

Although the tracks layer is good at showing the path traveled, there are no time stamps or elevation values for the start, the finish, or anywhere in between. While I'm sure this layer could have some utility, nothing really stoked my curiosity.

On to track_points. As I did with tracks, I peeked at the number of features as well as the attributes and geometry of the first point.

In [3]:
# Open track points layer
track_pts_layer = fiona.open(fname, layer='track_points')
feature = track_pts_layer[0]

# Print out the number of features in the track_points layer
print("# of features:\n--------------\n", len(list(track_pts_layer.items())))

# Print out fields and attribute data
properties = feature['properties']
print("Feature attribute data:\n-----------------------")
for k, v in properties.items():
    print('{}: {}'.format(k,v))

# Draw a quick plot of data using shapely
track_pt_data = {'type': 'Point',
               'coordinates': feature['geometry']['coordinates']}
track_pt_shape = shape(track_pt_data)
track_pt_shape
# of features:
--------------
 163
Feature attribute data:
-----------------------
track_fid: 0
track_seg_id: 0
track_seg_point_id: 0
ele: 125.7
time: 2017-12-29T21:42:18
magvar: None
geoidheight: None
name: None
cmt: None
desc: None
src: None
link1_href: None
link1_text: None
link1_type: None
link2_href: None
link2_text: None
link2_type: None
sym: None
type: None
fix: None
sat: None
hdop: None
vdop: None
pdop: None
ageofdgpsdata: None
dgpsid: None
Out[3]:

I found the presence of 163 points in track_points encouraging. Better still were some of the attributes collected for each point. A few that caught my eye included:

  • time: Time and date at which the feature was collected
  • ele: Elevation at which feature was collected
  • track_seg_point_id: Spot in the order in which a feature was collected (e.g., 1st, 2nd, 3rd)

What to do?

track_points contained snapshots of data - time, elevation, a sequence id, and location information - as collected at each point along my journey. What I didn't see were interesting values like change in elevation or speed throughout my ride. What if I could calculate the difference in elevation, travel time, and distance between consecutive points? And what if, since I was calculating these differences over consecutive points, I could represent each consecutive pair of points as an individual line segment? I figured with a few helpful Python tools, I might be able to address these "what-ifs".


Thus far I've mostly worked through a basic exploration of my data. For the rest of this post, I'm going to walk through how I constructed line segments between consecutive points in track_points and calculated change in elevation, change in distance, change in time, and speed using geopandas and shapely.

Setting up

Although I had already loaded a couple libraries to help read in and examine my data, I was going to need a few more to assist in my analysis:

In [5]:
import pandas as pd # To structure and manipulate data in a DataFrame format
import geopandas as gpd # To work with spatial data in a DataFrame
from geopandas import GeoDataFrame # To create a GeoDataFrame from a DataFrame
from shapely.geometry import LineString # To create line geometries that can be used in a GeoDataFrame
import folium # To generate a Leaflet-based map of my data throughout my analysis 

Load track_points

Earlier I had used fiona to load track_points so that I could take a quick look at the data. Because I wanted to do some further analysis, I reloaded track_points into a special type of DataFrame called a GeoDataFrame. Something I love about GeoDataFrames is that they let you carry out spatial operations on your data while still working with the familiar pandas API.
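To illustrate what I mean (a tiny hypothetical example, not part of the analysis that follows), you can call ordinary pandas methods and geopandas spatial methods on the same object:

# Hypothetical illustration of mixing pandas and geopandas operations on one GeoDataFrame.
# The two points are the first two track points from this dataset.
from shapely.geometry import Point
import geopandas as gpd

demo = gpd.GeoDataFrame({'ele': [125.7, 125.6]},
                        geometry=[Point(-78.668663, 35.787979), Point(-78.668841, 35.787985)],
                        crs={'init': 'epsg:4326'})
demo['ele'].mean()   # plain pandas aggregation on an attribute column
demo.total_bounds    # geopandas spatial summary on the same object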

Earlier I defined the data file gps_test_data.gpx as fname. To load track_points into a GeoDataFrame, gdf, I passed fname to geopandas' gpd.read_file(). Because the gpx file had several layers, I specified the layer parameter as track_points. I then sorted gdf on track_seg_point_id to ensure that the rows of the GeoDataFrame were in the correct order, and reset the index so that it increments by 1s starting at 0.

If you recall from earlier, when I first examined track_points there were several fields for which either the data were not meaningful or no data was collected. To make gdf a bit easier to deal with, I reduced the columns to just track_seg_point_id, ele, time, and geometry.

In [6]:
gdf = gpd.read_file(fname, layer = 'track_points')
gdf.sort_values('track_seg_point_id', inplace=True)
gdf.reset_index(drop=True, inplace=True)
gdf = gdf[['track_seg_point_id', 'ele', 'time', 'geometry']].copy()
gdf.head()
Out[6]:
track_seg_point_id ele time geometry
0 0 125.7 2017-12-29T21:42:18 POINT (-78.668663 35.787979)
1 1 125.6 2017-12-29T21:42:23 POINT (-78.668841 35.787985)
2 2 125.4 2017-12-29T21:42:25 POINT (-78.66897 35.788028)
3 3 125.2 2017-12-29T21:42:27 POINT (-78.669096 35.788059)
4 4 125.1 2017-12-29T21:42:29 POINT (-78.669222 35.788086)

With gdf looking much more manageable, this seemed like a good opportunity to take a beat and visualize track_points. I did this using folium, but you can achieve similar results with cartopy or even just geopandas via gdf.plot().
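For instance, a minimal static alternative to the folium map below might look something like this (a quick sketch using geopandas' built-in matplotlib plotting; the styling choices are arbitrary):

import matplotlib.pyplot as plt

# Quick static plot of the track points using geopandas' built-in plotting
ax = gdf.plot(markersize=5, color='crimson', figsize=(8, 8))
ax.set_title('track_points')
plt.show()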

Sidenote:

I'm not super comfortable with folium yet, so I won't go into great depth about how to use this library. I will point out that I cobbled together a little function to draw the markers as circles. Because of how close together the points were in my dataset, the default blue pins were not doing it for me. I've never quite grasped using custom markers in Leaflet, and doing so in folium left me similarly vexed. As such, credit and gratitude to Collin Reinking for his helpful notebook on plotting points with folium.

In [7]:
def add_markers(mapobj, gdf):
    coords = []
    for i, row in gdf.iterrows():
        coords.append([row.geometry.y, row.geometry.x])
    for coord in coords:
        folium.CircleMarker(location = coord,
                            radius = 2.5, 
                            fill = True,
                            fill_color = '#F50057',
                            fill_opacity = 0.75,
                            color = 'whitesmoke',
                            weight = 0.5).add_to(mapobj)
    return mapobj

f = folium.Figure(height = 400)
m = folium.Map([35.792809, -78.675724], zoom_start = 15, tiles='Cartodb dark_matter')
m.add_to(f)

add_markers(m, gdf)
Out[7]:

Making a line segment

A line segment requires two pairs of XY coordinates. In gdf, the XY coordinates for a given row can be found in the geometry column. My expectation was that for each row in gdf, the geometry column would contain a shapely point geometry with a single pair of XY coordinates. To confirm this, I took a look at the first two consecutive points in my data.

In [8]:
geom0 = gdf.loc[0]['geometry']
geom1 = gdf.loc[1]['geometry']

print("Geometry type:", str(type(geom0)))
print(f"geom0 coordinates: {geom0.x}, {geom0.y}")
print(f"geom1 coordinates: {geom1.x}, {geom1.y}")
Geometry type: <class 'shapely.geometry.point.Point'>
geom0 coordinates: -78.668663, 35.787979
geom1 coordinates: -78.668841, 35.787985

With my expectations confirmed, I wanted to see if I could indeed generate a line segment using the XY coordinates for geom0 and geom1. In thinking through how to create a line segment from two points, I referred to the shapely documentation on the LineString() constructor. What I discovered was that in order to create a LineString I'd need to create two tuples, one for each XY coordinate pair. I'd then need to include those two tuples in a list. This list would then be passed to a shapely LineString() constructor like so:

In [9]:
# Create LineString from coordinates
start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
line = LineString([start, end])
print(f"Geometry type: {str(type(line))}")
line
Geometry type: <class 'shapely.geometry.linestring.LineString'>
Out[9]:

Wonderful! I had managed to take the XY coordinates from the first two points in my data and use them to create a line segment. What would I need to do to repeat this process over and over again for each consecutive pair of XY coordinates?

1. Create a function, make_lines(), that takes two consecutive sets of XY coordinates from a GeoDataFrame, passes them to a shapely LineString() constructor, and returns a DataFrame.

In [10]:
def make_lines(gdf, df_out, i, geometry = 'geometry'):
    geom0 = gdf.loc[i][geometry]
    geom1 = gdf.loc[i + 1][geometry]
    
    start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
    line = LineString([start, end])
    
    # Create a DataFrame to hold record
    data = {'id': i,
            'geometry': [line]}
    df_line = pd.DataFrame(data, columns = ['id', 'geometry'])
    
    # Add record to the DataFrame of compiled records
    df_out = pd.concat([df_out, df_line])
    return df_out

2. Loop through a GeoDataFrame, passing consecutive XY coordinate pairs to make_lines().

In [11]:
# initialize an output DataFrame
df = pd.DataFrame(columns = ['id', 'geometry'])

# Loop through each row of the input point GeoDataFrame
x = 0
while x < len(gdf) - 1:
    df = make_lines(gdf, df, x)
    x = x + 1
    
df.head()
Out[11]:
id geometry
0 0 LINESTRING (-78.668663 35.787979, -78.668841 3...
0 1 LINESTRING (-78.668841 35.787985, -78.66897 35...
0 2 LINESTRING (-78.66897 35.788028, -78.669096 35...
0 3 LINESTRING (-78.669096 35.788059, -78.669222 3...
0 4 LINESTRING (-78.669222 35.788086, -78.669372 3...

Having managed to create LineStrings for each consecutive pair of points, I piggybacked on that process to calculate the change in elevation and time between each point. I did this by expanding my make_lines() function to carry out those calculations and place the results into columns in the output DataFrame.

In [12]:
def make_lines(gdf, df_out, i, elevation = 'ele', time = 'time', geometry = 'geometry'):
    # Get track coordinates
    geom0 = gdf.loc[i][geometry]
    geom1 = gdf.loc[i + 1][geometry]
    
    # Create LineString from coordinates
    start, end = [(geom0.x, geom0.y), (geom1.x, geom1.y)]
    line = LineString([start, end])
    
    # Calculate change in elevation
    elevation_change = gdf.loc[i + 1, elevation] - gdf.loc[i, elevation]
    
    # Calculate time elapsed between points
    time_change = pd.to_datetime(gdf.loc[i + 1, time], infer_datetime_format=True) - pd.to_datetime(gdf.loc[i, time], infer_datetime_format=True)
    time_change_seconds = time_change.seconds
    
    # Create a DataFrame to hold record
    data = {'id': i,
            'elevation_change': elevation_change,
            'time_change': time_change_seconds,
            'geometry': [line]}
    df_line = pd.DataFrame(data, columns = ['id', 'elevation_change', 'time_change','geometry'])
    
    # Add record to the DataFrame of compiled records
    df_out = pd.concat([df_out, df_line])
    return df_out

With a function for creating the segments and doing the calculations, all I needed to do was initialize a new DataFrame, df, that matched the schema of the DataFrame returned by make_lines(). While df starts as an empty DataFrame, it is populated as a while loop runs make_lines() on each consecutive pair of points.

In [13]:
df = pd.DataFrame(columns = ['id', 'elevation_change', 'time_change', 'geometry'])

x = 0
while x < len(gdf) - 1:
    df_input = make_lines(gdf, df, x)
    df = df_input
    x = x + 1

df.head()
Out[13]:
id elevation_change time_change geometry
0 0 -0.1 5 LINESTRING (-78.668663 35.787979, -78.668841 3...
0 1 -0.2 2 LINESTRING (-78.668841 35.787985, -78.66897 35...
0 2 -0.2 2 LINESTRING (-78.66897 35.788028, -78.669096 35...
0 3 -0.1 2 LINESTRING (-78.669096 35.788059, -78.669222 3...
0 4 -0.1 2 LINESTRING (-78.669222 35.788086, -78.669372 3...

Although the final version of df had a geometry column containing shapely LineStrings, the DataFrame still needed to be converted into a GeoDataFrame, both for plotting and for calculating the length of each segment. To convert df into a GeoDataFrame, gdf_line, I passed the DataFrame and the coordinate reference system of the geometry column to the geopandas GeoDataFrame() constructor:

In [14]:
crs = {'init': 'epsg:4326'}
gdf_line = GeoDataFrame(df, crs=crs)
gdf_line.head()
Out[14]:
id elevation_change time_change geometry
0 0 -0.1 5 LINESTRING (-78.668663 35.787979, -78.668841 3...
0 1 -0.2 2 LINESTRING (-78.668841 35.787985, -78.66897 35...
0 2 -0.2 2 LINESTRING (-78.66897 35.788028, -78.669096 35...
0 3 -0.1 2 LINESTRING (-78.669096 35.788059, -78.669222 3...
0 4 -0.1 2 LINESTRING (-78.669222 35.788086, -78.669372 3...

Although the geometry columns in df and gdf_line look the same, converting df to a GeoDataFrame changed the geometry column from a pandas Series to a geopandas GeoSeries.

In [15]:
print("Field types:")
print(f">> df['geometry']: {str(type(df['geometry']))}")
print(f">> gdf_line['geometry']: {str(type(gdf_line['geometry']))}")
Field types:
>> df['geometry']: <class 'pandas.core.series.Series'>
>> gdf_line['geometry']: <class 'geopandas.geoseries.GeoSeries'>

The distinction between a Series and a GeoSeries may sound trivial, but this conversion meant I could use geopandas to carry out the spatial analysis missing in pandas. It also meant I could quickly map gdf_line and make sure the results made visual sense.

In [16]:
folium.GeoJson(gdf_line).add_to(m)
m
Out[16]:

We're in the home stretch here! While I had managed to create line segments and derive the time and elevation change for each one, I was still missing a key measure: speed.

Speed = Distance/Time

My desire was to calculate miles per hour for each segment. I already had time, albeit in seconds. I didn't have the distance along each segment. Fortunately, geopandas has a length property for calculating distance. The catch was that it calculates length in the units of the GeoDataFrame's coordinate reference system (CRS). Because gdf_line is based on WGS84, its units were in degrees. Luckily, I was able to use geopandas' to_crs() method to briefly transform gdf_line into a local, foot-based CRS, NAD83 / North Carolina (EPSG:2264). I calculated the length of each segment and converted the value into miles, miles_traveled. I then created a new column, mph, by converting time_change from seconds to hours and dividing miles_traveled by that value. As a final step, I re-transformed gdf_line back to WGS84.

In [17]:
# convert to GeoDataFrame to a crs from which we can derive linear measurements
gdf_line = gdf_line.to_crs(epsg = 2264)
# calculate miles per hour for each segment. Distance is calculated in feet (EPSG:2264 is a foot-based CRS),
# time in seconds. They are converted to miles (1 ft = 1/5280 mi ≈ 0.000189394) and hours
# (1 s = 1/3600 hr ≈ 0.000277778), respectively, when creating the speed (mph) column.
gdf_line['miles_traveled'] = gdf_line.length * 0.000189394
gdf_line['mph'] = gdf_line['miles_traveled'] / (gdf_line['time_change'] * 0.000277778)
# convert back to original crs
gdf_line = gdf_line.to_crs(epsg = 4326)
gdf_line.head()
Out[17]:
id elevation_change time_change geometry miles_traveled mph
0 0 -0.1 5 LINESTRING (-78.668663 35.7879789999997, -78.6... 0.010007 7.20489
0 1 -0.2 2 LINESTRING (-78.668841 35.7879849999997, -78.6... 0.007829 14.0919
0 2 -0.2 2 LINESTRING (-78.66897 35.78802799999968, -78.6... 0.007393 13.3074
0 3 -0.1 2 LINESTRING (-78.669096 35.78805899999967, -78.... 0.007318 13.1725
0 4 -0.1 2 LINESTRING (-78.669222 35.78808599999969, -78.... 0.008659 15.5869
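As a quick back-of-the-envelope check on the first row above (not part of the original notebook): 0.010007 miles covered in 5 seconds works out to roughly 7.2 mph, which matches the mph column.

0.010007 / (5 * 0.000277778)  # ≈ 7.2 mph for the first segment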

To check my work I plotted gdf_line using folium. I was especially interested to see whether the speed for each segment seemed in line with my own experience, so I symbolized the segments by the mph column using folium's color mapping functionality.

In [18]:
import branca.colormap as cm
import math

f2 = folium.Figure(height = 400)
m2 = folium.Map([35.792809, -78.675724], zoom_start = 15, tiles='Stamen Toner')

max_mph = math.ceil(gdf_line['mph'].max())
linear = cm.linear.YlOrRd.scale(0,max_mph)
gps_lyr = folium.GeoJson(gdf_line,
                         style_function = lambda feature: {
                             'color': linear(feature['properties']['mph']),
                             'weight': 3})       

m2.add_child(linear)
m2.add_child(gps_lyr)
m2.add_to(f2)
Out[18]:

Looking at the map, and speaking as someone who has intimately experienced this ride, the results seem reasonable.

Finally, I put together a little function to export gdf_line to a GeoJSON file using geopandas' to_file() method.

In [19]:
import os

def export_gdf(gdf, filename):
    # Remove any existing copy of the output file before writing
    try:
        os.remove(filename)
    except OSError:
        pass
    gdf.to_file(filename, driver = "GeoJSON")

export_gdf(gdf_line, 'output.geojson')

In this post I walked through how I managed to take some GPS data and perform basic data manipulation and analysis using geopandas, shapely, pandas, and other Python libraries. No doubt there are more efficient ways to do what I've outlined here. In fact, as I've been writing this post, I have found ways to improve this process (one rough sketch is included as a postscript below). I hope that the code and my thought process can be helpful for you in addressing your own work. Or perhaps it serves as a case study in your own exploration of Python geospatial development. Perhaps something else entirely! Whatever the case, thank you for reading. I don't yet have commenting worked out here, but feel free to reach out to me on Twitter, @maptastik.
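Postscript: as an example of the sort of improvement mentioned above (a rough sketch I haven't folded back into the rest of the post), the row-by-row while loop could likely be replaced by pairing each point with the next one via slicing and building all of the segments at once:

# Hypothetical vectorized alternative to the make_lines() while loop.
# Assumes gdf is the sorted track_points GeoDataFrame from earlier in the post.
import pandas as pd
import geopandas as gpd
from shapely.geometry import LineString

pts = gdf.iloc[:-1].reset_index(drop=True)   # each segment's start point
nxt = gdf.iloc[1:].reset_index(drop=True)    # each segment's end point

# One LineString per consecutive pair of points
seg_geoms = [LineString([(a.x, a.y), (b.x, b.y)]) for a, b in zip(pts.geometry, nxt.geometry)]

segments = gpd.GeoDataFrame(
    {'id': pts['track_seg_point_id'],
     'elevation_change': nxt['ele'] - pts['ele'],
     'time_change': (pd.to_datetime(nxt['time']) - pd.to_datetime(pts['time'])).dt.seconds},
    geometry=seg_geoms,
    crs={'init': 'epsg:4326'})

segments.head()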