Now that we know the basics of the Birthday Problem, we can use this knowledge to understand the security of password hashing.

In the early days, passwords were stored in the server “as-is”. This means that if your username was juan and your password is Password123! then that information is stored in the server like this:

This is information is usually stored in a password file. If this password file is stolen, then it’s easy for another person to use this information and log in using your username and password and impersonate you.

Since the theft of a password file is harder to prevent, the passwords are not anymore stored “as-is” (also known as clear-text). The server will apply an algorithm to the original password which outputs a text called a hash. The algorithm is called a hash function. The hash is what’s put in the password file. A thief in possession of the password file will not be able to know the original password just by looking at it.

For example, the information above will now look like this:

juan,2c103f2c4ed1e59c0b4e2e01821770fa

where “2c103f2c4ed1e59c0b4e2e01821770fa” is the has value of the password “Password123!“.

The hash function I’m using is called the MD5 hash function. Given a password, it will produce a hash value. The set of all hash values is not infinite. In fact, the number of possible hash values is $2^{128}$ for md5. Due to this restriction, the birthday paradox will apply.

How the Birthday Paradox Applies

The birthday paradox tells us that given a hash function $f(x)$, the probability that at least two passwords hash to the same value is given by:

$\displaystyle 1-\frac{N\times N-1\times N-2\times \ldots \times N-k+1}{N^k}$

Since md5 hash function has $N=2^{128}$ possible values, the probability that two passwords hash to the same value is

$\displaystyle 1-\frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k}$

We want to compute for k so that this probability is at least 50%.

$\displaystyle 1-\frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k} \ge 0.5$

which is equivalent to

$\displaystyle \frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k} < 0.5$

Computing for $k$ when N is large is hard so we need to approximate this. To that end, we need some tools to help us.

We can write the probability in the following way:

$\displaystyle 1-\frac{N}{N}\times\frac{N-1}{N}\times\frac{N-2}{N}\times\frac{N-3}{N}\times\ldots\times\frac{N-k+1}{N}$
$= \displaystyle 1-\frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

Since N is large, the quantities

$\displaystyle \frac{1}{N}, \frac{2}{N}, \frac{3}{N}, \frac{k-1}{N}$

are very small. Because of this, we can use the approximation

$e^{-x} \approx 1-x$

The above approximation comes from the Taylor expansion of $e^{-x}$:

$\displaystyle e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \frac{x^4}{4!} \ldots$

If $x$ is small, the higher order terms like $x^2, x^3, x^4, \ldots$ vanish. Using this approximation, we can write the probability as:

$\displaystyle \frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

$\displaystyle = e^{-\frac{1}{N}}\cdot e^{-\frac{2}{N}}\cdot e^{-\frac{3}{N}}\cdot \ldots\cdot e^{-\frac{k-1}{N}}$

$\displaystyle = e^{-\frac{1+2+3+4+\ldots + k-1}{N}}$

Since

$\displaystyle \sum_1^n j = 1+2+3+4+ \ldots + n = \frac{n(n+1)}{2}$

we have

$e^{-\frac{1+2+3+4+\ldots + k-1}{N}} = e^{-k\cdot (k-1)/2N }$

Computing for k

Let’s compute k so that

$\displaystyle e^{-k\cdot (k-1)/2N} < 0.5$

Taking the natural logarithms of both sides

$\displaystyle \ln e^{-k\cdot (k-1)/2N} < \ln 0.5$

$\displaystyle \frac{-k\cdot (k-1)}{2N} < \ln 0.5$

$\displaystyle k^2 - k + 2N\ln 0.5 > 0$

Using the quadratic equation, we can solve for k:

$\displaystyle k > \frac{-(-1) \pm \sqrt{(-1)^2 -4(1)(2N\ln 0.5}}{2}$
$\displaystyle k > \frac{1 \pm \sqrt{1-8N\ln 0.5}}{2}$

When $N=2^{128}$, we have

$\displaystyle k > \frac{1 \pm 4.343876e+19}{2} \approx 10^{19}$

This is about 10 quintillion. What this means is that when $k > 10^{19}$, there is already a 50% chance that 2 passwords hash to the same value. In fact, the md5 was already cracked in 2004.