Preprocessing

filtering.filter(tdf[, max_speed_kmh, …])

Trajectory filtering.

compression.compress(tdf[, spatial_radius_km])

Trajectory compression.

detection.stops(tdf[, stop_radius_factor, …])

Stops detection.

clustering.cluster(tdf[, cluster_radius_km, …])

Clustering of locations.

skmob.preprocessing.filtering.filter(tdf, max_speed_kmh=500.0, include_loops=False, speed_kmh=5.0, max_loop=6, ratio_max=0.25)

Trajectory filtering.

For each individual in a TrajDataFrame, filter out the trajectory points that are considered noise or outliers [Z2015].

Parameters
  • tdf (TrajDataFrame) – the trajectories of the individuals.

  • max_speed_kmh (float, optional) – delete a trajectory point if the speed (in km/h) from the previous point is higher than max_speed_kmh. The default is 500.0.

  • include_loops (boolean, optional) – if True, trajectory points belonging to short and fast “loops” are removed. Specifically, points are removed if, within the next max_loop points, the individual has come back to within a distance of ratio_max * the maximum distance reached, and the average speed (in km/h) is higher than speed_kmh. The default is False.

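  • speed_kmh (float, optional) – the average speed threshold (in km/h) used to detect loops when include_loops is True. The default is 5.0 (walking speed).

  • max_loop (int, optional) – the number of subsequent points within which a loop is searched when include_loops is True. The default is 6.

  • ratio_max (float, optional) – the fraction of the maximum distance reached that defines how close the individual must come back for a loop to be detected when include_loops is True. The default is 0.25.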

Returns

the TrajDataFrame without the trajectory points that have been filtered out.

Return type

TrajDataFrame

Warning

If include_loops is True, the filter is very slow. Use it only if the raw data is very noisy.

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import filtering
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> # filter out all points with a speed (in km/h) from the previous point higher than 500 km/h
>>> ftdf = filtering.filter(tdf, max_speed_kmh=500.)
>>> print(ftdf.parameters)
{'filter': {'function': 'filter', 'max_speed_kmh': 500.0, 'include_loops': False, 'speed_kmh': 5.0, 'max_loop': 6, 'ratio_max': 0.25}}
>>> n_deleted_points = len(tdf) - len(ftdf) # number of deleted points
>>> print(n_deleted_points)
54
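
The speed criterion behind max_speed_kmh can be illustrated without the library. The following is a minimal sketch, not scikit-mobility's implementation: the names haversine_km and drop_fast_points are hypothetical, the input is assumed to be a time-sorted list of (lat, lng, datetime) tuples for a single individual, and the speed is assumed to be measured from the last point that was kept.

```python
import math
from datetime import datetime, timedelta

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def drop_fast_points(points, max_speed_kmh=500.0):
    """Keep a point only if the speed from the last kept point is <= max_speed_kmh.

    Hypothetical helper: `points` is a time-sorted list of (lat, lng, datetime).
    """
    if not points:
        return []
    kept = [points[0]]
    for lat, lng, t in points[1:]:
        prev_lat, prev_lng, prev_t = kept[-1]
        hours = (t - prev_t).total_seconds() / 3600.0
        if hours <= 0:
            # Assumption: points with a non-increasing timestamp are skipped,
            # because their speed is undefined.
            continue
        speed_kmh = haversine_km((prev_lat, prev_lng), (lat, lng)) / hours
        if speed_kmh <= max_speed_kmh:
            kept.append((lat, lng, t))
    return kept


t0 = datetime(2008, 10, 23, 5, 53, 5)
sample = [
    (39.984094, 116.319236, t0),
    (39.984198, 116.319322, t0 + timedelta(seconds=1)),
    (41.000000, 116.300000, t0 + timedelta(seconds=2)),  # implausible jump
    (39.984224, 116.319402, t0 + timedelta(seconds=6)),
]
kept = drop_fast_points(sample, max_speed_kmh=500.0)
print(len(sample) - len(kept))  # number of deleted points
```

Measuring the speed from the last kept point, rather than from the previous raw point, prevents a single outlier from also causing the removal of the valid point that follows it.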

References

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.compression.compress(tdf, spatial_radius_km=0.2)

Trajectory compression.

Reduce the number of points in a trajectory for each individual in a TrajDataFrame. All points within a radius of spatial_radius_km kilometers from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point [Z2015].

Parameters
  • tdf (TrajDataFrame) – the input trajectories of the individuals.

  • spatial_radius_km (float, optional) – the minimum distance (in km) between consecutive points of the compressed trajectory. The default is 0.2.

Returns

the compressed TrajDataFrame.

Return type

TrajDataFrame

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import compression
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> # compress the trajectory using a spatial radius of 0.2 km
>>> ctdf = compression.compress(tdf, spatial_radius_km=0.2)
>>> print('Points of the original trajectory:\t%s'%len(tdf))
Points of the original trajectory:  217653
>>> print('Points of the compressed trajectory:\t%s'%len(ctdf))
Points of the compressed trajectory:        6281
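
The compression rule described above (all points within spatial_radius_km of an initial point collapse into one point with the median coordinates and the initial point's time) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the library's code: haversine_km and compress_points are hypothetical names, the input is a time-sorted list of (lat, lng, datetime) tuples for one individual, and a group is assumed to end at the first point that falls outside the radius.

```python
import math
from datetime import datetime, timedelta
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def compress_points(points, spatial_radius_km=0.2):
    """Collapse consecutive points lying within the radius of an initial point.

    Hypothetical helper: `points` is a time-sorted list of (lat, lng, datetime).
    """
    compressed = []
    i = 0
    while i < len(points):
        lat0, lng0, t0 = points[i]
        j = i + 1
        # Extend the group while points stay within the radius of the initial point.
        while (j < len(points)
               and haversine_km((lat0, lng0), points[j][:2]) <= spatial_radius_km):
            j += 1
        group = points[i:j]
        compressed.append((
            median(p[0] for p in group),  # median latitude
            median(p[1] for p in group),  # median longitude
            t0,                           # time of the initial point
        ))
        i = j
    return compressed


t0 = datetime(2008, 10, 23, 5, 53, 5)
sample = [
    (39.9840, 116.3190, t0),
    (39.9842, 116.3192, t0 + timedelta(seconds=5)),
    (39.9844, 116.3194, t0 + timedelta(seconds=10)),
    (40.0000, 116.3190, t0 + timedelta(minutes=10)),  # more than 0.2 km away
    (40.0002, 116.3190, t0 + timedelta(minutes=11)),
]
compressed = compress_points(sample, spatial_radius_km=0.2)
print(len(sample), len(compressed))
```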

References

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.detection.stops(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True, no_data_for_minutes=1000000000000.0, min_speed_kmh=None)

Stops detection.

Detect the stops for each individual in a TrajDataFrame. A stop is detected when the individual spends at least minutes_for_a_stop minutes within a distance stop_radius_factor * spatial_radius km from a given trajectory point. The stop’s coordinates are the median latitude and longitude values of the points found within the specified distance [RT2004] [Z2015].

Parameters
  • tdf (TrajDataFrame) – the input trajectories of the individuals.

  • stop_radius_factor (float, optional) – if argument spatial_radius_km is None, the spatial_radius used is the value specified in the TrajDataFrame properties (“spatial_radius_km” assigned by a preprocessing.compression function) multiplied by this argument, stop_radius_factor. The default is 0.5.

  • minutes_for_a_stop (float, optional) – the minimum stop duration, in minutes. The default is 20.0.

  • spatial_radius_km (float or None, optional) – the radius of the ball enclosing all trajectory points within the stop location. The default is 0.2.

  • leaving_time (boolean, optional) – if True, a new column ‘leaving_datetime’ is added with the departure time from the stop location. The default is True.

  • no_data_for_minutes (float, optional) – if the number of minutes between two consecutive points is larger than no_data_for_minutes, then this is interpreted as missing data and does not count as a stop. The default is 1e12.

  • min_speed_kmh (float or None, optional) – if not None, remove the points at the end of a stop if their speed is larger than min_speed_kmh km/h. The default is None.

Returns

a TrajDataFrame with the coordinates (latitude, longitude) of the stop locations.

Return type

TrajDataFrame

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import detection
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> stdf = detection.stops(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True)
>>> print(stdf.head())
         lat         lng            datetime  uid    leaving_datetime
0  39.978030  116.327481 2008-10-23 06:01:37    1 2008-10-23 10:32:53
1  40.013820  116.306532 2008-10-23 11:10:19    1 2008-10-23 23:45:27
2  39.978419  116.326870 2008-10-24 00:21:52    1 2008-10-24 01:47:30
3  39.981166  116.308475 2008-10-24 02:02:31    1 2008-10-24 02:30:29
4  39.981431  116.309902 2008-10-24 02:30:29    1 2008-10-24 03:16:35
>>> print(stdf.parameters)
{'detect': {'function': 'stops', 'stop_radius_factor': 0.5, 'minutes_for_a_stop': 20.0, 'spatial_radius_km': 0.2, 'leaving_time': True, 'no_data_for_minutes': 1000000000000.0, 'min_speed_kmh': None}}
>>> print('Points of the original trajectory:\t%s'%len(tdf))
Points of the original trajectory:  217653
>>> print('Points of stops:\t\t\t%s'%len(stdf))
Points of stops:                    391
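
The stop criterion (at least minutes_for_a_stop minutes spent within a given distance of a trajectory point, with the stop placed at the median coordinates) can be sketched as follows. This is a simplified illustration, not the library's implementation. The names haversine_km and detect_stops and the single stop_radius_km argument are hypothetical; how the library derives the radius from stop_radius_factor and spatial_radius_km is not reproduced here. The sketch also assumes that the leaving time is the timestamp of the first point outside the radius (or of the last point, if the trajectory ends), and it ignores no_data_for_minutes and min_speed_kmh.

```python
import math
from datetime import datetime, timedelta
from statistics import median

EARTH_RADIUS_KM = 6371.0


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lng) pairs in degrees."""
    lat1, lng1, lat2, lng2 = map(math.radians, (*p, *q))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lng2 - lng1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def detect_stops(points, stop_radius_km=0.1, minutes_for_a_stop=20.0):
    """Return (lat, lng, arrival, leaving) tuples for one individual.

    Hypothetical helper: `points` is a time-sorted list of (lat, lng, datetime).
    """
    stops = []
    i = 0
    while i < len(points):
        lat0, lng0, t0 = points[i]
        j = i + 1
        # Collect the consecutive points that stay within the radius of point i.
        while (j < len(points)
               and haversine_km((lat0, lng0), points[j][:2]) <= stop_radius_km):
            j += 1
        # Assumption: leaving time is the first point outside the radius,
        # or the last available point when the trajectory ends.
        leaving = points[j][2] if j < len(points) else points[j - 1][2]
        if (leaving - t0).total_seconds() / 60.0 >= minutes_for_a_stop:
            group = points[i:j]
            stops.append((
                median(p[0] for p in group),
                median(p[1] for p in group),
                t0,
                leaving,
            ))
            i = j  # continue after the stop
        else:
            i += 1  # not a stop; try the next point as a candidate
    return stops


t0 = datetime(2008, 10, 23, 6, 1, 37)
sample = [
    (39.9780, 116.3270, t0),
    (39.9781, 116.3271, t0 + timedelta(minutes=10)),
    (39.9780, 116.3272, t0 + timedelta(minutes=25)),
    (40.0138, 116.3065, t0 + timedelta(minutes=40)),  # individual has moved away
    (40.0500, 116.3000, t0 + timedelta(minutes=45)),
]
stops = detect_stops(sample, stop_radius_km=0.1, minutes_for_a_stop=20.0)
print(stops)
```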

References

RT2004

Hariharan, R. & Toyama, K. (2004) Project Lachesis: parsing and modeling location histories. In International Conference on Geographic Information Science, 106-124, http://kentarotoyama.com/papers/Hariharan_2004_Project_Lachesis.pdf

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.clustering.cluster(tdf, cluster_radius_km=0.1, min_samples=1)

Clustering of locations.

Cluster the stops of each individual in a TrajDataFrame. Stops that fall in the same cluster correspond to visits to the same location at different times, based on spatial proximity [RT2004]. The clustering algorithm used is DBSCAN, as implemented in scikit-learn [DBSCAN].

Parameters
  • tdf (TrajDataFrame) – the input TrajDataFrame that should contain the stops, i.e., the output of a preprocessing.detection function.

  • cluster_radius_km (float, optional) – the parameter eps of the function sklearn.cluster.DBSCAN, in kilometers. The default is 0.1.

  • min_samples (int, optional) – the parameter min_samples of the function sklearn.cluster.DBSCAN indicating the minimum number of stops to form a cluster. The default is 1.

Returns

a TrajDataFrame with the additional column ‘cluster’ containing the cluster labels. The stops that belong to the same cluster have the same label. The labels are integers corresponding to the ranks of clusters according to the frequency of visitation (the most visited cluster has label 0, the second most visited has label 1, etc.).

Return type

TrajDataFrame

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import detection, clustering
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> # detect the stops first
>>> stdf = detection.stops(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True)
>>> # cluster the stops
>>> cstdf = clustering.cluster(stdf, cluster_radius_km=0.1, min_samples=1)
>>> print(cstdf.head())
         lat         lng            datetime  uid    leaving_datetime  cluster
0  39.978030  116.327481 2008-10-23 06:01:37    1 2008-10-23 10:32:53        0
1  40.013820  116.306532 2008-10-23 11:10:19    1 2008-10-23 23:45:27        1
2  39.978419  116.326870 2008-10-24 00:21:52    1 2008-10-24 01:47:30        0
3  39.981166  116.308475 2008-10-24 02:02:31    1 2008-10-24 02:30:29       42
4  39.981431  116.309902 2008-10-24 02:30:29    1 2008-10-24 03:16:35       41
>>> print(cstdf.parameters)
{'detect': {'function': 'stops', 'stop_radius_factor': 0.5, 'minutes_for_a_stop': 20.0, 'spatial_radius_km': 0.2, 'leaving_time': True, 'no_data_for_minutes': 1000000000000.0, 'min_speed_kmh': None}, 'cluster': {'function': 'cluster', 'cluster_radius_km': 0.1, 'min_samples': 1}}
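
The labeling convention described under Returns (label 0 for the most visited cluster, 1 for the second most visited, and so on) can be illustrated with a small standalone sketch. It is not the library's code: rank_labels_by_frequency is a hypothetical helper that takes arbitrary raw cluster labels for one individual, such as those produced by a clustering algorithm, and relabels them by visitation frequency. Ties are assumed to be broken by order of first appearance.

```python
from collections import Counter


def rank_labels_by_frequency(raw_labels):
    """Relabel clusters so that the most frequent one becomes 0, the next 1, etc.

    Hypothetical helper: `raw_labels` holds one raw cluster label per stop.
    """
    counts = Counter(raw_labels)
    # most_common() sorts by decreasing count; equal counts keep insertion order.
    rank_of = {label: rank for rank, (label, _) in enumerate(counts.most_common())}
    return [rank_of[label] for label in raw_labels]


raw = [7, 3, 7, 5, 7, 3]  # raw labels of six stops
ranked = rank_labels_by_frequency(raw)
print(ranked)
```

With this convention, small label values identify the locations an individual visits most often, which is why rarely visited stops in the example output carry large labels.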

References

DBSCAN

DBSCAN implementation, scikit-learn, https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html

RT2004

Hariharan, R. & Toyama, K. (2004) Project Lachesis: parsing and modeling location histories. In International Conference on Geographic Information Science, 106-124, http://kentarotoyama.com/papers/Hariharan_2004_Project_Lachesis.pdf