MD5 Hash Collision Probability (Using Birthday Paradox) Note: This post is more about math than coding.
In one of my projects I was considering to use MD5 hashing for both generating a unique key for the same text input and increasing search performance (search for the same text input before inserting) by indexing this hashed column in the database.

But the question was how likely was it to get a hash collision while generating MD5 hash keys for each input to the system. I had heard about MD5 being not secure enough for crypto applications, how it was cracked, how it was possible to get collisions and so on.

So what is this collision and how likely is it? Below you will find my answers to these questions.

What is MD5

MD5 is a hashing algorithm that generates unique (?!?) hash values for given inputs. Resulting MD5 string is a 128-bit (16-byte) string which is expressed as a 32 characters long hexadecimal number.

Hash Collision

Yes, you guessed that right. Hash collision means generating the same MD5 hash value for 2 different inputs. If the same hash value was created for a different input than the system would not allow the recording of that input although it was not a duplicate of any previous records which is totally an undesired behaviour.

In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are 366 possible birthdays, including February 29). However, 99% probability is reached with just 57 people, and 50% probability with 23 people. (wikipedia)

Logic:
In a group of 23 people comparing the birthday of first person to the others allows 22 chances for a matching birthday, the second person will have 21 chances, 3rd 20 chances and so on. Comparing every person to the others will create 253 match chances.
When it comes to calculate the probabilities it is easier to calculate the inverse probability and subtract it from 1, P(A) = 1 – P(A'). So, let's calculate the inverse probability for our birthday problem.
Calculation:
For “person 1”, probability of not having the same birthday as previous people in the group is 1 (365/365) since he is the first and is compared to no previous people. For the second person this probability is 364/365, for the third it is 363/365 and so on.

P(A') =  P(1)*P(2)*P(3)*….*P(23) = 365/365*364/365*363/365*…..*343/365 = 0,493

Therefore P(A) = 1- P(A') = 0,507 = %50,7

As you can see for a group of 23 people the probability of 2 people having the same birthday reaches just above %50.

MD5 Hash Collision Probability (Using Birthday Paradox)

In the light of Birthday Paradox (or Birthday Problem) probability calculations following results can be obtained for MD5 algorithm:

Number of hashed elements such that {probability of at least one hash collision = p}

 hash space size (2#bits) p=10?18 p=10?15 p=10?12 p=10?9 p=10?6 p=0.1% p=1% p=25% p=50% p=75% 3.4×1038 2.6×1010 8.2×1011 2.6×1013 8.2×1014 2.6×1016 8.3×1017 2.6×1018 1.4×1019 2.2×1019 3.1×1019

Full table for 16-512 bit range: http://en.wikipedia.org/wiki/Birthday_attack

Conclusion

It is not that easy to get hash collisions when using MD5 algorithm. Even after you have generated 26 trillion hash values, the probability of the next generated hash value to be the same as one of those 26 trillion previously generated hash values is 1/1trillion (1 out of 1 trillion).

Hope this helps someone.

Good luck,
Serdar.
UC9HCY8UB68Y