Making today worse so tomorrow seems better.

Estimating Key Collisions (Birthday Problem)


Thanks to a somewhat well known statistical behavior known as the Birthday Problem, collisions of randomly generated keys are much more likely than you would expect. You can estimate how likely with some quick Ruby.

I ran into real world application of the Birthday Problem while I was building the mBlox adapter for Vocel’s SMS product. Most APIs for sending SMS give you back a randomly generated tracking ID when you submit a message. This tracking ID can then be used to report the delivery status of the message.

The mBlox API doesn’t automatically return a tracking ID, but it allows you to set one of your own. Could we get away with just randomly generating our own without checking to see if we had used the ID yet?

After some research online, I built this little method to determine the chance of having at least one collision in a set of randomly distributed keys ‘k’ out of a range of ‘n’ possible values. While calculating the exact probability for large numbers isn’t easy, it turns out estimating it is.

The real problem with the mBlox API is that the maximum value for their “BatchID” parameter is 99,999,999. That would make a big salary, but it’s not a big keyspace. For the expected trial run of 70,000 messages, the chance of at least one collision is over 99.999%.

Clickatell (another SMS aggregator), let’s you specify a 32-character string as the tracking ID. Using that for 16-bytes of hex, that is a big enough space that the collision rate is effectively 0 even with over 100 Billion messages.

So the random ID plan didn’t work for mBlox, although we could use it for Clickatell. In the end we used a MySQL DB table with an autoincrement key as the tracking ID for mBlox messages. This avoids dupe-checking and is synchronized on all the machines in the cluster.