Hunting for the True Location, with Machine Learning

Some context first.

My company puts on a year-end function every year. It’s at some resort or other, and the important thing for this post is that we’re not told the location in advance. We only find out when the bus gets us there.

What we are told, about a month ahead of the event, is approximate distances from 3-4 locations. These are where the bus pickup sites are. The locations are:

  • Head Office
  • Near Clearwater Mall
  • Fourways
  • Centurion

The distances given aren’t correct. As a result, there are usually several attempts by various people to figure out in advance where the year-end function will be.

I thought I’d join in this year, using some machine learning on those distances.

Now, I should mention that this is a very poor use for ML, mainly because of a lack of data. I should have hundreds of data points for a decent prediction. I have 2 or 3 data points for each of 4 locations. Still, it’s what I have to work with.

First, the starting data. The distances for this year are:

  • Clearwater Mall: 63 KM
  • Centurion: 56 KM
  • Fourways: 43 KM
  • HQ: 20 KM
  • Cape Town: 1447 KM

I’m going to ignore Cape Town for training. A distance from there was only given once before, in 2015, so I have just one data point for it.

Plotting this on a map makes it clear that the distances have been ‘massaged’. (I’m plotting ‘as the crow flies’ rather than driving distance, for ease of plotting. I’ll use driving distances for the training.)

[Map: this year’s given distances]

Let’s look at previous years.

2016

Actual location: Seasons Sport and Spa (pin on the map below)

Actual distances calculated with Google Maps, driving distance, shortest route.

  • Clearwater Mall: Given distance – 80KM. Actual distance – 67KM
  • Centurion: Given distance – 88KM. Actual distance – 47KM
  • Fourways: Given distance – 68KM. Actual distance – 52KM
  • HQ: Given distance – 115KM. Actual distance – 75KM

[Map: YEF 2016]

2015

Actual location: Vaal River Country Lodge.

Actual distances calculated with Google Maps, driving distance, shortest route.

  • Clearwater Mall: Given distance – 51KM. Actual distance – 79KM
  • Centurion: Given distance – 110KM. Actual distance – 118KM
  • Fourways: Given distance – 89KM. Actual distance – 97KM

[Map: YEF 2015]

2014

Actual location: Askari Game Lodge.

Actual distances calculated with Google Maps, driving distance, shortest route.

  • Clearwater Mall: Given distance – 52KM. Actual distance – 79KM
  • HQ: Given distance – 90KM. Actual distance – 118KM

[Map: YEF 2014]

With that, I have the following training data:

Location          Given Distance (KM)   Error in Distance (Given – Actual, KM)
Clearwater Mall   80                    13
Clearwater Mall   51                    -28
Clearwater Mall   52                    -27
Centurion         88                    41
Centurion         110                   -8
Fourways          68                    -6
Fourways          89                    -8
HQ                115                   40
HQ                90                    -28

Now to stick those into a linear regression and see if I can predict the error on this year’s measurements.

I need to mention that with so little data, the accuracy of the linear regression is going to be very low. I’m as likely to get the correct results from linear regression as I am to get correct results from rolling a couple of d20s.

That said, onwards to untrustworthy results.

Once the starting values are loaded into R, creating a simple model is as easy as

m <- lm(Error ~ Location + Distance, data=YEF)

Then load up this year’s values into another data frame, and predict.

predict(m, YEFPredict)
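
For anyone who wants to try this at home, here is a minimal sketch of the whole fit-and-predict step. The numbers are copied from the training table and this year's given distances earlier in the post. The column names (Location, Distance, Error) are inferred from the model formula, and the way the data frames are built here is an assumption, since the original loading code isn't shown.

```r
# Training data, copied from the training table in the post.
# Column names are inferred from the model formula; how the original
# data frame was loaded is not shown, so this construction is an assumption.
YEF <- data.frame(
  Location = c("Clearwater Mall", "Clearwater Mall", "Clearwater Mall",
               "Centurion", "Centurion",
               "Fourways", "Fourways",
               "HQ", "HQ"),
  Distance = c(80, 51, 52, 88, 110, 68, 89, 115, 90),
  Error    = c(13, -28, -27, 41, -8, -6, -8, 40, -28)
)

# This year's given distances (Cape Town left out, as explained earlier).
YEFPredict <- data.frame(
  Location = c("Clearwater Mall", "Centurion", "Fourways", "HQ"),
  Distance = c(63, 56, 43, 20)
)

# A baseline intercept, an offset per location, and one shared slope on Distance.
m <- lm(Error ~ Location + Distance, data = YEF)

# Predicted error (Given - Actual) for each pickup point, in KM.
predicted_error <- setNames(predict(m, YEFPredict), YEFPredict$Location)
print(round(predicted_error, 1))
```

Because every location gets its own offset, the only thing the pickup points share in this model is the slope on Distance. All four of this year's given distances are also shorter than anything in the training data for the same location, so the model is extrapolating.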

The errors come out as:

  • Clearwater Mall: -12
  • Centurion: -18
  • Fourways: -35
  • HQ: -60
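
To put a rough number on how untrustworthy these predictions are, predict() can also return prediction intervals. The sketch below rebuilds the same model from the training table (the data frame construction and column names are assumptions inferred from the model formula) and asks for 95% intervals. With this little data, expect them to be far wider than the predicted errors themselves.

```r
# Same training data as the post's table; the data frame construction and
# column names are assumptions inferred from the model formula.
YEF <- data.frame(
  Location = c("Clearwater Mall", "Clearwater Mall", "Clearwater Mall",
               "Centurion", "Centurion",
               "Fourways", "Fourways",
               "HQ", "HQ"),
  Distance = c(80, 51, 52, 88, 110, 68, 89, 115, 90),
  Error    = c(13, -28, -27, 41, -8, -6, -8, 40, -28)
)
YEFPredict <- data.frame(
  Location = c("Clearwater Mall", "Centurion", "Fourways", "HQ"),
  Distance = c(63, 56, 43, 20)
)
m <- lm(Error ~ Location + Distance, data = YEF)

# Nine observations minus five coefficients (intercept, three location
# offsets, one slope) leaves very few residual degrees of freedom.
print(df.residual(m))

# 95% prediction intervals for this year's errors, in KM.
intervals <- predict(m, YEFPredict, interval = "prediction", level = 0.95)
rownames(intervals) <- YEFPredict$Location
print(round(intervals, 1))

# Width of each interval (upper bound minus lower bound), in KM.
interval_width <- intervals[, "upr"] - intervals[, "lwr"]
print(round(interval_width, 1))
```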

Subtracting each predicted error from the given distance (Given – Error) gives the final estimated distances:

  • Clearwater Mall: 75KM
  • Centurion: 74KM
  • Fourways: 78KM
  • HQ: 80KM
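
The arithmetic behind that last step, as a small sketch. Error was defined as Given – Actual, so the estimated actual distance is Given – Error. The vector names here are just for illustration.

```r
# This year's given distances and the predicted errors listed above, in KM.
given <- c("Clearwater Mall" = 63, "Centurion" = 56, "Fourways" = 43, "HQ" = 20)
predicted_error <- c("Clearwater Mall" = -12, "Centurion" = -18,
                     "Fourways" = -35, "HQ" = -60)

# Error = Given - Actual, so Actual = Given - Error.
estimated_actual <- given - predicted_error
print(estimated_actual)
```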

[Image: YEF 2017]

Maybe I should have stuck to using dice.

5 Comments

  1. Tinus the mastermind

    hahaha. Just like machines learn…The Social team has learned

    1. Gail (Post author)

      That’s OK. I’ll throw more intelligent algorithms at it next year. 🙂

  2. Brian K

    Jeez why would the source data be so bad? Somebody’s odometer is messed up.

    1. Gail (Post author)

      It’s intentionally wrong, to preserve the surprise of where the event is until the day.

  3. jonmcrawford

    haha, I had a similar experience when I asked our resident statistician to walk me through a 95% confidence level calculation on some data. He did all of it, answered my questions to the best of his ability, left me fairly befuddled. Only to then (after several hours over a few days helping me) follow up with “but your population is small enough that you’d be better off randomly picking 100 people and testing them rather than doing all of this”

