Now that we know the basics of the Birthday Problem, we can use this knowledge to understand the security of password hashing.
In the early days, passwords were stored on the server “as-is”. This means that if your username was juan and your password was Password123!, then the server stored exactly that pair: the username juan alongside the password Password123!.
This information is usually stored in a password file. If this password file is stolen, it is easy for another person to log in with your username and password and impersonate you.
Since the theft of a password file is hard to prevent, passwords are no longer stored “as-is” (also known as clear-text). Instead, the server applies an algorithm to the original password, which outputs a text called a hash. The algorithm is called a hash function. The hash is what’s put in the password file. A thief in possession of the password file will not be able to know the original password just by looking at it.
For example, the entry above now pairs the username juan with the hash 2c103f2c4ed1e59c0b4e2e01821770fa instead of the password itself.
“2c103f2c4ed1e59c0b4e2e01821770fa” is the hash value of the password “Password123!”.
So when you log in to the server using your password “Password123!”, the server will then run an algorithm that will hash your password and compare the result of the hashing to the one stored in the server, say “2c103f2c4ed1e59c0b4e2e01821770fa”. If they match, then it means that your password was correct and you are given access to the server.
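The login check described above can be sketched in a few lines of Python. This is only an illustration of the idea, assuming plain unsalted MD5 as in this article; the names `md5_hex`, `password_file` and `check_login` are made up for the example and are not part of any real server.

```python
import hashlib
import hmac

def md5_hex(password: str) -> str:
    """Return the MD5 hash of a password as 32 hexadecimal characters."""
    return hashlib.md5(password.encode("utf-8")).hexdigest()

# Hypothetical password file: it holds the hash, never the password itself.
password_file = {"juan": md5_hex("Password123!")}

def check_login(username: str, password: str) -> bool:
    """Hash the submitted password and compare it with the stored hash."""
    stored_hash = password_file.get(username)
    if stored_hash is None:
        return False  # unknown user
    # compare_digest compares the two strings in constant time
    return hmac.compare_digest(stored_hash, md5_hex(password))

print(check_login("juan", "Password123!"))  # correct password
print(check_login("juan", "password123"))   # wrong password
```

The server never needs the original password after registration; it only needs to recompute the hash and compare.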
The hash function I’m using is called the MD5 hash function. Given a password, it will produce a hash value. The set of all hash values is not infinite. In fact, since an MD5 hash is 128 bits long, the number of possible hash values is $2^{128}$ for MD5. Due to this restriction, the birthday paradox will apply.
How the Birthday Paradox Applies
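The birthday paradox tells us that, given a hash function with $N$ possible hash values, the probability that at least two of $k$ passwords hash to the same value is given by:

$$P(k) = 1 - \frac{N!}{(N-k)!\,N^k}$$

Since the MD5 hash function has $N = 2^{128}$ possible values, the probability that at least two of $k$ passwords hash to the same value is

$$P(k) = 1 - \frac{(2^{128})!}{(2^{128}-k)!\,(2^{128})^k}$$

We want to find $k$ so that this probability is at least 50%:

$$1 - \frac{N!}{(N-k)!\,N^k} \ge \frac{1}{2}$$

which is equivalent to

$$\frac{N!}{(N-k)!\,N^k} \le \frac{1}{2}$$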
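For a small number of possible values, the exact probability can be checked directly by multiplying the chances that each new value misses all the earlier ones. The sketch below is illustrative only: the function names are my own, and it assumes every hash value is equally likely. It uses the classic birthday case of 365 possible values as a sanity check.

```python
def collision_probability(k: int, n: int) -> float:
    """Exact probability that at least two of k values, drawn uniformly
    from n possibilities, are equal."""
    if k > n:
        return 1.0  # more values than slots: a collision is certain
    p_all_distinct = 1.0
    for i in range(k):
        # the (i+1)-th value must avoid the i values already taken
        p_all_distinct *= (n - i) / n
    return 1.0 - p_all_distinct

def smallest_k(n: int, target: float = 0.5) -> int:
    """Smallest k whose collision probability reaches the target."""
    k = 1
    while collision_probability(k, n) < target:
        k += 1
    return k

print(smallest_k(365))  # 23, the familiar birthday-problem answer
```

This brute-force search only works for small values; for a number as large as the MD5 output space, an approximation is needed.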
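Computing $N!$ when $N$ is large is hard, so we need to approximate this. To that end, we need some tools to help us.

We can write the probability in the following way:

$$P(k) = 1 - \left(1 - \frac{1}{N}\right)\left(1 - \frac{2}{N}\right)\cdots\left(1 - \frac{k-1}{N}\right)$$

Since $N$ is large, the quantities

$$\frac{1}{N},\ \frac{2}{N},\ \ldots,\ \frac{k-1}{N}$$

are very small. Because of this, we can use the approximation

$$1 - x \approx e^{-x}$$

The above approximation comes from the Taylor expansion of $e^{-x}$:

$$e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \cdots$$

If $x$ is small, the higher-order terms like $x^2$ become negligible. Using this approximation, we can write the probability as:

$$P(k) \approx 1 - e^{-1/N}\,e^{-2/N}\cdots e^{-(k-1)/N} = 1 - e^{-\frac{1 + 2 + \cdots + (k-1)}{N}} = 1 - e^{-\frac{k(k-1)}{2N}}$$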
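To get a feel for how good the exponential approximation is, the exact product and the approximation $1 - e^{-k(k-1)/(2N)}$ can be compared numerically. This is a minimal sketch with illustrative function names, assuming uniformly distributed values.

```python
import math

def exact_probability(k: int, n: int) -> float:
    """1 - (1 - 1/n)(1 - 2/n)...(1 - (k-1)/n)."""
    p_all_distinct = 1.0
    for i in range(1, k):
        p_all_distinct *= 1 - i / n
    return 1 - p_all_distinct

def approx_probability(k: int, n: int) -> float:
    """Approximation obtained by replacing each factor 1 - x with e^(-x)."""
    return 1 - math.exp(-k * (k - 1) / (2 * n))

# Compare the two for a small and a larger number of possible values.
for k, n in [(23, 365), (1000, 10**6)]:
    print(k, n, exact_probability(k, n), approx_probability(k, n))
```

The gap between the two shrinks as the number of possible values grows relative to $k$, which is the situation we are in with MD5.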
Computing for k
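Let’s compute $k$ so that

$$1 - e^{-\frac{k(k-1)}{2N}} = \frac{1}{2}, \quad\text{that is,}\quad e^{-\frac{k(k-1)}{2N}} = \frac{1}{2}$$

Taking the natural logarithms of both sides

$$-\frac{k(k-1)}{2N} = -\ln 2 \quad\Longrightarrow\quad k^2 - k - 2N\ln 2 = 0$$

Using the quadratic formula and keeping the positive root, we can solve for $k$:

$$k = \frac{1 + \sqrt{1 + 8N\ln 2}}{2}$$

When $N = 2^{128}$, we have

$$k \approx \sqrt{2\ln 2}\cdot 2^{64} \approx 2.17 \times 10^{19}$$

This is about 22 quintillion. What this means is that when $k \approx 2.17 \times 10^{19}$ passwords have been hashed, there is already a 50% chance that two of them hash to the same value. In fact, collisions in MD5 were already publicly demonstrated in 2004.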
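The value of $k$ for MD5 can be checked by evaluating the positive root of $k^2 - k - 2N\ln 2 = 0$ directly. The snippet below is a small sketch; `k_for_half` is a name chosen for this example, and the result is a floating-point estimate rather than an exact integer.

```python
import math

def k_for_half(n: int) -> float:
    """Positive root of k^2 - k - 2*n*ln(2) = 0, i.e. the k at which the
    approximate collision probability reaches 50%."""
    return (1 + math.sqrt(1 + 8 * n * math.log(2))) / 2

n_md5 = 2 ** 128           # number of possible MD5 hash values
k_md5 = k_for_half(n_md5)

print(f"{k_md5:.3e}")          # on the order of 2 x 10^19
print(k_md5 / 2 ** 64)         # close to sqrt(2 ln 2), about 1.177
print(k_for_half(365))         # close to 23, the birthday-problem answer
```

The ratio printed on the second line shows why the answer is often summarised as "about the square root of the number of possible values".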