Note: This post is more about math than coding.

In one of my projects I was considering to use MD5 hashing for both generating a unique key for the same text input and increasing search performance (search for the same text input before inserting) by indexing this hashed column in the database.

But the question was how likely was it to get a hash collision while generating MD5 hash keys for each input to the system. I had heard about MD5 being not secure enough for crypto applications, how it was cracked, how it was possible to get collisions and so on.

So what is this collision and how likely is it? Below you will find my answers to these questions.

**What is MD5**

MD5 is a hashing algorithm that generates unique (?!?) hash values for given inputs. Resulting MD5 string is a 128-bit (16-byte) string which is expressed as a 32 characters long hexadecimal number.

**Hash Collision**

Yes, you guessed that right. Hash collision means generating the same MD5 hash value for 2 different inputs. If the same hash value was created for a different input than the system would not allow the recording of that input although it was not a duplicate of any previous records which is totally an undesired behaviour.

**Birthday Paradox**

In probability theory, the birthday problem or birthday paradox concerns the probability that, in a set of n randomly chosen people, some pair of them will have the same birthday. By the pigeonhole principle, the probability reaches 100% when the number of people reaches 367 (since there are 366 possible birthdays, including February 29). However, 99% probability is reached with just 57 people, and 50% probability with 23 people. (wikipedia)

*Logic:*

*Calculation:*

P(A') = P(1)*P(2)*P(3)*….*P(23) = 365/365*364/365*363/365*…..*343/365 = 0,493

Therefore P(A) = 1- P(A') = 0,507 = %50,7

As you can see for a group of 23 people the probability of 2 people having the same birthday reaches just above %50.

**MD5 Hash Collision Probability (Using Birthday Paradox)**

**
** In the light of Birthday Paradox (or Birthday Problem) probability calculations following results can be obtained for MD5 algorithm:

Number of hashed elements such that {probability of at least one hash collision = p}

hash space size (2#bits) | p=10^{?18} | p=10^{?15} | p=10^{?12} | p=10^{?9} | p=10^{?6} | p=0.1% | p=1% | p=25% | p=50% | p=75% |

3.4×10^{38} | 2.6×10^{10} | 8.2×10^{11} | 2.6×10^{13} | 8.2×10^{14} | 2.6×10^{16} | 8.3×10^{17} | 2.6×10^{18} | 1.4×10^{19} | 2.2×10^{19} | 3.1×10^{19} |

Full table for 16-512 bit range: http://en.wikipedia.org/wiki/Birthday_attack

**Conclusion**

It is not that easy to get hash collisions when using MD5 algorithm. Even after you have generated 26 trillion hash values, the probability of the next generated hash value to be the same as one of those 26 trillion previously generated hash values is 1/1trillion (1 out of 1 trillion).

Hope this helps someone.

Good luck,

Serdar.

UC9HCY8UB68Y