In this post, I'd like to discuss my favorite Kaggle competition. Of the thirty-some competitions I've entered, my favorite was the AXA Driver Telematics Analysis. Although it was only my second or third competition, and I had only just finished my first online machine learning class, I ended up in 109th place out of 1,528. All of the code I used to generate the predictions can be found here.

AXA is an insurance company. The idea behind the competition was to identify driver behaviors, presumably in order to correlate them with risk. I think the idea is pretty exciting: instead of lumping people into groups based only on age, sex, and so on, let people prove their driving skills and give favorable rates to the safest drivers.

Data

The data in this case consisted of a number of folders, each corresponding to a single driver. For each driver, there were 200 files, each representing a single drive. Each of those drive files consisted of nothing more than a list of x and y coordinates. According to the data description, the lists of coordinates were randomly truncated at both ends and randomly rotated. Let's look at the first few points of data from a single drive:

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Load a single drive: drive 2 from driver 1's folder.
drive = pd.read_csv('../drivers/1/2.csv')
drive.head(10)
Out[1]:
x y
0 0.0 0.0
1 -11.5 2.9
2 -22.0 5.5
3 -31.9 6.7
4 -40.1 8.4
5 -46.8 9.6
6 -51.1 11.1
7 -53.2 11.9
8 -54.0 12.2
9 -54.0 12.2

Although units are not given for either the index or the x and y columns, assuming meters for x and y and one reading taken each second gives reasonable speeds. For example, the ~11.9 m covered between the first two points would correspond to roughly 27 mph (43 km/h).

Now a plot of the whole drive:

In [2]:
plt.plot(drive.x, drive.y)
plt.show()

The Challenge

Within each folder, the vast majority of the 200 drive files belong to the driver represented by that folder. However, a few files are included from other drivers. We were not told how many. Our task was to predict the probability of each drive file belonging to the driver represented by that folder. Performance was measured by area under the ROC curve.
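
For concreteness, here is a minimal sketch of how that metric is computed with scikit-learn's roc_auc_score; the labels and probabilities below are made up for illustration:

from sklearn.metrics import roc_auc_score

# Hypothetical labels (1 = drive belongs to the driver) and predicted probabilities.
y_true = [1, 1, 1, 0, 1, 0]
y_pred = [0.9, 0.8, 0.3, 0.4, 0.7, 0.2]
# AUC rewards ranking true drives above impostors; here one pair is misordered.
print(roc_auc_score(y_true, y_pred))  # 0.875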

Literature

As a scientist, any time I start a new project, one of the first things I do is a literature search. I assumed that studies like this had already been done, either by academics or by scientists in the insurance industry. Mainly, I was looking for the types of features that others had found useful for identifying drivers or particular driving behaviors. There were a number of interesting papers, although in some cases the type of data used was unavailable for this competition. Unfortunately, I couldn't find working links to all of the papers I had looked at before.

1) Driving Behavior Improvement and Driver Recognition Based on Real-Time Driving Information
2) Smoothing Methods to Minimize Impact of GPS Random Error
3) Driver Identification by Driving Style

Features

The reason I enjoyed this competition so much is that the data was fairly small and simple. There were not hundreds or thousands of fields in the data files, nor were there hundreds of gigabytes of images, as in some other competitions. With such simple data, the key was feature engineering: we needed to figure out how to extract as much information about driver behavior as possible, and it turns out there is quite a bit you can do with a list of coordinates.

I divided my features into two groups - those involving non-vector quantities, and those involving vectors.

For the non-vector features, I started by calculating the distance between successive points. Because the points were taken once each second, this distance also gives us the average speed over that second. Taking differences between speeds gives us accelerations, and differences between accelerations give us jerks. I thought jerk, the derivative of acceleration with respect to time, would be an interesting feature. It is not a term well known to most people, but it seemed relevant, given that there are many changes in acceleration during normal driving. Large-magnitude jerks, or a high frequency of them, might make for an uncomfortable ride. Of course, while I referred to these features as velocities, accelerations, and jerks in my original files, I am actually only using the magnitudes of these vectors.

Let's calculate velocities, accelerations, and jerks for the drive shown above:

In [3]:
# The distance between successive points is also the average speed over that second.
drive[['prev_x', 'prev_y']] = drive[['x', 'y']].shift(1)
drive['vel'] = ((drive['x'] - drive['prev_x']) ** 2 + (drive['y'] - drive['prev_y']) ** 2) ** 0.5
# First differences of speed give acceleration; second differences give jerk.
drive['accel'] = drive['vel'].diff()
drive['jerk'] = drive['accel'].diff()
drive.fillna(0, inplace=True)
drive.head()
Out[3]:
x y prev_x prev_y vel accel jerk
0 0.0 0.0 0.0 0.0 0.000000 0.000000 0.000000
1 -11.5 2.9 0.0 0.0 11.860017 0.000000 0.000000
2 -22.0 5.5 -11.5 2.9 10.817116 -1.042901 0.000000
3 -31.9 6.7 -22.0 5.5 9.972462 -0.844654 0.198247
4 -40.1 8.4 -31.9 6.7 8.374366 -1.598096 -0.753442

And now plot those values:

In [4]:
fig, ax = plt.subplots(1, 3, figsize=(12, 4))
ax[0].plot(drive['vel'])
ax[0].set_title('velocity')
ax[1].plot(drive['accel'])
ax[1].set_title('acceleration')
ax[2].plot(drive['jerk'])
ax[2].set_title('jerk')
plt.show()

After calculating these values, one can easily get the final distance from the origin, the total distance traveled, the total trip duration, the total moving time and stop time, and so on. One clue from reference 3 is that you can get more information about a driver by developing velocity or acceleration profiles for each drive. Basically, this involves creating a histogram of all the values and using each bin of the histogram as a separate feature. While all of these features are affected by the route, the speed limits along it, and the power of the car, they are also affected by the driver's style.
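
As a rough sketch of those summary features (the 0.5 m/s threshold for counting a second as "stopped" is my assumption for illustration):

# Per-drive summary features derived from the velocity series.
final_dist = (drive.x.iloc[-1] ** 2 + drive.y.iloc[-1] ** 2) ** 0.5  # distance from origin
total_dist = drive['vel'].sum()           # sum of per-second distances traveled
trip_time = len(drive)                    # seconds, at one reading per second
stop_time = (drive['vel'] < 0.5).sum()    # seconds spent (nearly) stationary; threshold assumed
moving_time = trip_time - stop_time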

The velocities for this drive are reasonable, but sometimes you will see very high values appear all of a sudden. These can arise from errors in the GPS measurements or from missing measurements. Sometimes, measurements would be missing for several seconds, perhaps because the car passed through a tunnel, but there would be no indication of the gap in the file. Such a gap shows up as a sudden, extremely high velocity, and those values must be filtered out.
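
One minimal way to handle this is to treat implausibly high speeds as missing and interpolate across them; the 60 m/s (about 134 mph) cutoff here is an assumed threshold, not a tuned value:

# Mark physically implausible speeds as missing, then interpolate across them.
glitch = drive['vel'] > 60.0   # assumed plausibility cutoff (~134 mph)
drive.loc[glitch, 'vel'] = np.nan
drive['vel'] = drive['vel'].interpolate().fillna(0)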

The acceleration and jerk plots are especially jumpy. Some smoothing could be used to eliminate the extreme values and high-frequency oscillations; however, you must be careful not to smooth too much, as some extreme values are real. For example, some drivers might drive very close to the car in front of them and then have to hit the brakes hard. If they do this often, it represents a real driving behavior that we want to capture.
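
As one gentle option, a short centered rolling mean smooths the series without erasing sustained events; the 3-second window is an assumption chosen to keep genuine hard-braking spikes mostly intact:

# Light smoothing: 3-second centered rolling mean on acceleration.
drive['accel_smooth'] = drive['accel'].rolling(window=3, center=True, min_periods=1).mean()
drive['jerk_smooth'] = drive['accel_smooth'].diff().fillna(0)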

To create the velocity profile, we need to see what a histogram of velocity looks like. Values in the histogram need to be adjusted to represent percentages of the total time so that longer drives can be compared with shorter ones.

In [5]:
# Normalize the histogram so each bin is a fraction of total drive time.
weights = np.ones(drive['vel'].size) / drive['vel'].size
plt.hist(drive['vel'], weights=weights)
plt.show()

In my original files, I used around 10 bins to create the profiles, but let's see what it looks like if we use more:

In [6]:
plt.hist(drive['vel'], bins=50, weights=weights)
plt.show()

Perhaps I was afraid of creating too many features, but in retrospect, it may have made sense to use more bins. The bins I used were 4.5 m/s wide, which corresponds to roughly 10 mph. That may not be fine-grained enough to distinguish certain driving behaviors, such as trying to stay about 4 mph above the speed limit to avoid a ticket. Some drivers might always drive right around the speed limit, others might vary a lot depending on traffic, and some may use cruise control.
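
As a sketch, fixing the bin edges up front turns each drive's velocities into a feature vector that is comparable across drives; the 0-45 m/s range and 50 bins here are illustrative choices, not the values I used:

# Velocity profile as features: fixed bin edges so every drive shares one feature space.
edges = np.linspace(0, 45, 51)           # 50 bins of 0.9 m/s each (assumed range)
counts, _ = np.histogram(drive['vel'], bins=edges)
vel_profile = counts / len(drive)        # fraction of drive time spent in each bin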

For the vector-based features, I start by calculating $(\Delta x, \Delta y)$ for each pair of successive points. From there, I can calculate centripetal accelerations and cross products, which can give some information about how fast a driver turns. The magnitude of the cross product is $$\lVert v_1 \rVert \lVert v_2 \rVert \sin \theta$$ This quantity is larger when the vectors are longer (driving faster) or when the angle is closer to 90 degrees, where the sine function peaks (a sharper turn). Because a sharper upcoming turn usually causes a driver to slow down, these two factors tend to offset, but a driver who likes to take turns at higher speed may show larger cross product values.
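
In code, the z-component of the 2-D cross product of successive displacement vectors gives this quantity directly, signed by turn direction; this is a sketch of the idea rather than my exact feature code:

# Displacement vectors between successive points.
dx = drive['x'].diff()
dy = drive['y'].diff()
# z-component of the cross product of each displacement with the next:
# |v1||v2|sin(theta), positive for left turns, negative for right turns.
cross = (dx.shift(1) * dy - dy.shift(1) * dx).fillna(0)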

By this point, I was pretty excited about this competition, and I was having a lot of fun trying to think of new features. For example, since I have always driven a manual transmission, I wondered if it was possible to distinguish manual from automatic. If you've ever driven a manual next to an automatic at about the same acceleration from a stop, you'll know that you usually fall behind by a couple of feet during gear shifts. Although I don't know where the drivers were located, I believe AXA is a European company, and manuals are more common in Europe than here in the US. After looking at several files, though, I was unable to find anything that indicated a manual transmission. I suppose the GPS accuracy and measurement frequency are not high enough to detect such small differences.

Also, one of my pet peeves while driving is when people stop at a light a whole car length behind the car in front of them, then proceed to creep up a foot or two every few seconds. As a manual driver, it is really annoying to have to put the car in gear every few seconds just to move up a bit! I notice that some people do this regularly, and while you can find drive files with stretches of data that look something like this, it is hard to distinguish from a traffic jam, a parking maneuver, or simply noise in the positional measurements.

Below are graphs of position and velocity for the final time points of the drive shown above:

In [7]:
fig, ax = plt.subplots(1, 2, figsize=(10, 4))
ax[0].plot(drive.x[480:], drive.y[480:])
ax[0].set_title('Position')
ax[1].plot(drive['vel'][480:])
ax[1].set_title('Velocity')
plt.show()

The velocity graph shows that most of this movement is fairly slow. In the lower middle part of the position graph, there is a spike that does not look like a physically possible movement and is probably just noise. The right side of the position graph, going up, could be a parking maneuver, given the very tight U-turn and the small spike at the very top that doubles back.

Semisupervised Approach

After generating features for all drive files, there was one more task before I could throw the data into a machine learning algorithm. Because we were only told that a majority of the files belonged to the driver in question, no labeled training data was provided. However, it seemed reasonable to create labeled data for each driver by assuming that all 200 files in a driver's folder belonged to him. Of course, there would be a few errors, but the hope was that the learning algorithm wouldn't be too confused as long as there were only a few mislabeled drives per folder. For negative samples, 200 files were randomly chosen from the set of all drives across all folders. The chance that any of these actually came from the driver in question, and were therefore mislabeled, is very low.
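
A sketch of how such a training set might be assembled, assuming a hypothetical dict features_by_driver that maps each driver ID to a DataFrame of per-drive feature rows (the names and structure here are my illustration, not my original code):

def build_training_set(driver_id, features_by_driver, n_neg=200, seed=0):
    # Positive examples: assume every drive in this driver's folder is his.
    pos = features_by_driver[driver_id].copy()
    pos['label'] = 1
    # Negative examples: drives sampled at random from all other drivers.
    others = pd.concat(df for d, df in features_by_driver.items() if d != driver_id)
    neg = others.sample(n=n_neg, random_state=seed)
    neg['label'] = 0
    return pd.concat([pos, neg], ignore_index=True)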

Machine Learning

For building a model, I switched to Python to use the great scikit-learn package. I tried a number of classification algorithms, including logistic regression, random forest, and gradient boosting. After a few experiments and some manual hyperparameter tuning, I found that a random forest with 500 trees, a max_depth of 12, and a min_samples_leaf of 5 gave the best results. For each driver, I had a file containing the features for the 200 drives in his folder, labeled 1, and for 200 randomly chosen drives, labeled 0. A separate random forest was trained for each driver in order to learn that driver's behavior. With this model and these parameters, I scored 0.91252, good for 109th place.
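
Roughly, the per-driver training looked like this; the plumbing is a sketch (build_training_set is the hypothetical helper above), but the hyperparameters are the ones quoted:

from sklearn.ensemble import RandomForestClassifier

# One model per driver, e.g. driver 1.
train = build_training_set(1, features_by_driver)
X, y = train.drop(columns='label'), train['label']

model = RandomForestClassifier(n_estimators=500, max_depth=12,
                               min_samples_leaf=5, random_state=0)
model.fit(X, y)

# Predicted probability that each of the driver's 200 drives really belongs to him.
probs = model.predict_proba(X[y == 1])[:, 1]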

I wondered if I could do a bit better without any huge changes. First, I fixed a couple of mistakes in the feature calculations. Then, I used many more bins for several of the feature profiles. Finally, I tried adding many more negative examples for each driver. Although I had thought to try this before, a mistake in my code had prevented the additional examples from being written to the file. I found that adding 1,000 randomly chosen drives from other drivers improved the score significantly.

In [8]:
from IPython.display import Image
Image("AXA_score.png")
Out[8]:

This score would have been good for 77th place. Not a bad improvement.

Final Thoughts

After creating all of these features, I realized that many of them can unintentionally pick up information about the path being driven rather than the driver's behavior. Of course, most people will drive many very similar paths. After the competition ended, it became clear that many of the best scores went to those who found ways to explicitly match paths among the drives in a folder. If a path was very similar to many others in the folder, it was highly likely that all of them belonged to that driver.

It is a bit odd to me that AXA chose to set up the competition the way it did. Matching similar paths is very powerful in this setting, but useless to the company if they are looking for driving behaviors. Perhaps it would have been better to give us 200 drive files per driver and a label representing whether or not the driver caused an accident in the past year. That would have forced us to look for real behaviors that are riskier.

