Preprocessing

filtering.filter(tdf[, max_speed_kmh, ...])

Trajectory filtering.

compression.compress(tdf[, spatial_radius_km])

Trajectory compression.

detection.stay_locations(tdf[, ...])

Stops detection.

clustering.cluster(tdf[, cluster_radius_km, ...])

skmob.preprocessing.filtering.filter(tdf, max_speed_kmh=500.0, include_loops=False, speed_kmh=5.0, max_loop=6, ratio_max=0.25)

Trajectory filtering.

For each individual in a TrajDataFrame, filter out the trajectory points that are considered noise or outliers [Z2015].

Parameters
  • tdf (TrajDataFrame) – the trajectories of the individuals.

  • max_speed_kmh (float, optional) – delete a trajectory point if the speed (in km/h) from the previous point is higher than max_speed_kmh. The default is 500.0.

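  • include_loops (boolean, optional) – if True, trajectory points belonging to short and fast “loops” are removed. Specifically, points are removed if, within the next max_loop points, the individual has come back to within a distance of ratio_max * (the maximum distance reached), AND the average speed (in km/h) is higher than speed_kmh. The default is False.

  • speed_kmh (float, optional) – the average speed (in km/h) above which a loop is removed; used only if include_loops is True. The default is 5.0 (walking speed).

  • max_loop (int, optional) – the number of subsequent points within which a loop is searched; used only if include_loops is True. The default is 6.

  • ratio_max (float, optional) – the fraction of the maximum distance reached to which the individual must come back for the points to count as a loop; used only if include_loops is True. The default is 0.25.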

Returns

the TrajDataFrame without the trajectory points that have been filtered out.

Return type

TrajDataFrame

Warning

If include_loops is True, the filter is very slow. Use it only if the raw data is really noisy.

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import filtering
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> # filter out all points with a speed (in km/h) from the previous point higher than 500 km/h
>>> ftdf = filtering.filter(tdf, max_speed_kmh=500.)
>>> print(ftdf.parameters)
{'filter': {'function': 'filter', 'max_speed_kmh': 500.0, 'include_loops': False, 'speed_kmh': 5.0, 'max_loop': 6, 'ratio_max': 0.25}}
>>> n_deleted_points = len(tdf) - len(ftdf) # number of deleted points
>>> print(n_deleted_points)
54
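
The speed rule described above can be illustrated without the library. The following is a minimal sketch, not skmob's implementation: the helper names haversine_km and drop_fast_points are hypothetical, the Earth radius is an assumed constant, and the sketch assumes that speed is measured from the last retained point and that points with a non-increasing timestamp are skipped.

```python
import math
from datetime import datetime

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def drop_fast_points(points, max_speed_kmh=500.0):
    """Hypothetical helper: drop points reached faster than max_speed_kmh.

    `points` is a time-sorted list of (lat, lng, datetime) tuples for one
    individual. Speed is measured from the last retained point (an assumption).
    """
    if not points:
        return []
    kept = [points[0]]
    for lat, lng, t in points[1:]:
        prev_lat, prev_lng, prev_t = kept[-1]
        hours = (t - prev_t).total_seconds() / 3600.0
        if hours <= 0:
            continue  # speed is undefined; skipping is a choice of this sketch
        speed_kmh = haversine_km(prev_lat, prev_lng, lat, lng) / hours
        if speed_kmh <= max_speed_kmh:
            kept.append((lat, lng, t))
    return kept


# Usage: the third point is an implausible jump and should be dropped.
points = [
    (39.984094, 116.319236, datetime(2008, 10, 23, 5, 53, 5)),
    (39.984198, 116.319322, datetime(2008, 10, 23, 5, 53, 6)),
    (41.000000, 118.000000, datetime(2008, 10, 23, 5, 53, 11)),
    (39.984224, 116.319402, datetime(2008, 10, 23, 5, 53, 16)),
]
cleaned = drop_fast_points(points, max_speed_kmh=500.0)
print(len(points) - len(cleaned))  # number of deleted points
```

Measuring from the last retained point, rather than from the previous raw point, keeps a single outlier from also causing the point after it to be discarded.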

References

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.compression.compress(tdf, spatial_radius_km=0.2)

Trajectory compression.

Reduce the number of points in a trajectory for each individual in a TrajDataFrame. All points within a radius of spatial_radius_km kilometers from a given initial point are compressed into a single point that has the median coordinates of all points and the time of the initial point [Z2015].

Parameters
  • tdf (TrajDataFrame) – the input trajectories of the individuals.

  • spatial_radius_km (float, optional) – the minimum distance (in km) between consecutive points of the compressed trajectory. The default is 0.2.

Returns

the compressed TrajDataFrame.

Return type

TrajDataFrame

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import compression
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> # compress the trajectory using a spatial radius of 0.2 km
>>> ctdf = compression.compress(tdf, spatial_radius_km=0.2)
>>> print('Points of the original trajectory:\t%s'%len(tdf))
Points of the original trajectory:  217653
>>> print('Points of the compressed trajectory:\t%s'%len(ctdf))
Points of the compressed trajectory:        6281
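
The compression rule described above (median coordinates, time of the initial point) can be sketched in plain Python. This is an illustration under stated assumptions, not skmob's code: the names haversine_km and compress_points are hypothetical, and the sketch assumes that only consecutive points are grouped, scanning forward from each initial point until one falls outside the radius.

```python
import math
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def compress_points(points, spatial_radius_km=0.2):
    """Hypothetical helper: merge runs of consecutive nearby points.

    `points` is a time-sorted list of (lat, lng, time) tuples for one
    individual. Each run of points within spatial_radius_km of the run's
    initial point becomes one point with the median coordinates of the run
    and the time of the initial point.
    """
    compressed = []
    i = 0
    while i < len(points):
        lat0, lng0, t0 = points[i]
        j = i + 1
        # extend the run while points stay within the radius of the initial point
        while j < len(points) and haversine_km(lat0, lng0, points[j][0], points[j][1]) <= spatial_radius_km:
            j += 1
        run = points[i:j]
        compressed.append((median([p[0] for p in run]), median([p[1] for p in run]), t0))
        i = j
    return compressed


# Usage: three points about 0.1 km apart, then one point about 1.1 km away.
points = [(0.0, 0.0, 0), (0.0, 0.001, 1), (0.0, 0.0005, 2), (0.0, 0.01, 3)]
print(compress_points(points, spatial_radius_km=0.2))
```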

References

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.detection.stay_locations(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True, no_data_for_minutes=1000000000000.0, min_speed_kmh=None)

Stops detection.

Detect the stay locations (or stops) for each individual in a TrajDataFrame. A stop is detected when the individual spends at least minutes_for_a_stop minutes within a distance stop_radius_factor * spatial_radius km from a given trajectory point. The stop’s coordinates are the median latitude and longitude values of the points found within the specified distance [RT2004] [Z2015].

Parameters
  • tdf (TrajDataFrame) – the input trajectories of the individuals.

  • stop_radius_factor (float, optional) – if spatial_radius_km is None, the radius used is the value stored in the TrajDataFrame's properties (“spatial_radius_km”, assigned by a preprocessing.compression function) multiplied by stop_radius_factor. The default is 0.5.

  • minutes_for_a_stop (float, optional) – the minimum stop duration, in minutes. The default is 20.0.

  • spatial_radius_km (float or None, optional) – the radius of the ball enclosing all trajectory points within the stop location. The default is 0.2.

  • leaving_time (boolean, optional) – if True, a new column ‘leaving_datetime’ is added with the departure time from the stop location. The default is True.

  • no_data_for_minutes (float, optional) – if the number of minutes between two consecutive points is larger than no_data_for_minutes, then this is interpreted as missing data and does not count as a stop. The default is 1e12.

  • min_speed_kmh (float or None, optional) – if not None, remove the points at the end of a stop if their speed is larger than min_speed_kmh km/h. The default is None.

Returns

a TrajDataFrame with the coordinates (latitude, longitude) of the stop locations.

Return type

TrajDataFrame

Examples

>>> import skmob
>>> import pandas as pd
>>> from skmob.preprocessing import detection
>>> # read the trajectory data (GeoLife)
>>> url = skmob.utils.constants.GEOLIFE_SAMPLE
>>> df = pd.read_csv(url, sep=',', compression='gzip')
>>> tdf = skmob.TrajDataFrame(df, latitude='lat', longitude='lon', user_id='user', datetime='datetime')
>>> print(tdf.head())
         lat         lng            datetime  uid
0  39.984094  116.319236 2008-10-23 05:53:05    1
1  39.984198  116.319322 2008-10-23 05:53:06    1
2  39.984224  116.319402 2008-10-23 05:53:11    1
3  39.984211  116.319389 2008-10-23 05:53:16    1
4  39.984217  116.319422 2008-10-23 05:53:21    1
>>> stdf = detection.stay_locations(tdf, stop_radius_factor=0.5, minutes_for_a_stop=20.0, spatial_radius_km=0.2, leaving_time=True)
>>> print(stdf.head())
         lat         lng            datetime  uid    leaving_datetime
0  39.978030  116.327481 2008-10-23 06:01:37    1 2008-10-23 10:32:53
1  40.013820  116.306532 2008-10-23 11:10:19    1 2008-10-23 23:45:27
2  39.978419  116.326870 2008-10-24 00:21:52    1 2008-10-24 01:47:30
3  39.981166  116.308475 2008-10-24 02:02:31    1 2008-10-24 02:30:29
4  39.981431  116.309902 2008-10-24 02:30:29    1 2008-10-24 03:16:35
>>> print(stdf.parameters)
{'detect': {'function': 'stay_locations', 'stop_radius_factor': 0.5, 'minutes_for_a_stop': 20.0, 'spatial_radius_km': 0.2, 'leaving_time': True, 'no_data_for_minutes': 1000000000000.0, 'min_speed_kmh': None}}
>>> print('Points of the original trajectory:\t%s'%len(tdf))
Points of the original trajectory:  217653
>>> print('Points of stops:\t\t\t%s'%len(stdf))
Points of stops:                    391
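
The stop definition above can be sketched in plain Python as follows. This is a simplified illustration, not skmob's implementation: haversine_km and detect_stops are hypothetical names, the stop radius is passed directly as an assumed radius_km argument instead of being derived from stop_radius_factor and spatial_radius_km, and the leaving time is assumed to be the timestamp of the first point outside the radius. The no_data_for_minutes and min_speed_kmh options are not modelled.

```python
import math
from datetime import datetime, timedelta
from statistics import median

EARTH_RADIUS_KM = 6371.0  # assumed mean Earth radius


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def detect_stops(points, radius_km=0.1, minutes_for_a_stop=20.0):
    """Hypothetical helper: return (lat, lng, arrival, leaving) per stop.

    `points` is a time-sorted list of (lat, lng, datetime) tuples for one
    individual.
    """
    stops = []
    i = 0
    while i < len(points):
        lat0, lng0, t0 = points[i]
        j = i + 1
        # collect the consecutive points that stay within radius_km of points[i]
        while j < len(points) and haversine_km(lat0, lng0, points[j][0], points[j][1]) <= radius_km:
            j += 1
        # leaving time: first point outside the radius, else the last point seen
        leaving = points[j][2] if j < len(points) else points[j - 1][2]
        minutes = (leaving - t0).total_seconds() / 60.0
        if minutes >= minutes_for_a_stop:
            run = points[i:j]
            stops.append((median([p[0] for p in run]), median([p[1] for p in run]), t0, leaving))
            i = j  # continue after the stop
        else:
            i += 1  # no stop here; try the next point as the reference
    return stops


# Usage: 25 minutes near the origin, then two distant points in quick succession.
start = datetime(2008, 10, 23, 6, 0, 0)
points = [
    (0.0, 0.0000, start),
    (0.0, 0.0002, start + timedelta(minutes=10)),
    (0.0, 0.0001, start + timedelta(minutes=25)),
    (0.0, 1.0000, start + timedelta(minutes=30)),
    (0.0, 2.0000, start + timedelta(minutes=35)),
]
print(detect_stops(points, radius_km=0.1, minutes_for_a_stop=20.0))
```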

References

RT2004

Hariharan, R. & Toyama, K. (2004) Project Lachesis: parsing and modeling location histories. In International Conference on Geographic Information Science, 106-124, http://kentarotoyama.com/papers/Hariharan_2004_Project_Lachesis.pdf

Z2015

Zheng, Y. (2015) Trajectory data mining: an overview. ACM Transactions on Intelligent Systems and Technology 6(3), https://dl.acm.org/citation.cfm?id=2743025

skmob.preprocessing.clustering.cluster(tdf, cluster_radius_km=0.1, min_samples=1)