## Implementing Parallel Algorithms Part 2

In the previous post, we implemented a very simple parallel program to add a set of numbers. In this post, we will implement parallel matrix multiplication.

We have shown a parallel algorithm to multiply 2 big matrices using message passing. The algorithm involved block sub-matrices to be passed from node to node and multiplied within a node until the answer is found.

There will be 2 square matrices A and B. In our example, the dimension of both A and B is 4×4. We will distribute the matrices evenly to 4 nodes.

$A = \left[ \begin{array}{cc|cc} a_{00} & a_{01} & a_{02} & a_{03}\\ a_{10} & a_{11} & a_{12} & a_{13}\\ \hline a_{20} & a_{21} & a_{22} & a_{23}\\ a_{30} & a_{31} & a_{32} & a_{33} \end{array} \right], B = \left[ \begin{array}{cc|cc} b_{00} & b_{01} & b_{02} & b_{03}\\ b_{10} & b_{11} & b_{12} & b_{13}\\ \hline b_{20} & b_{21} & b_{22} & b_{23}\\ b_{30} & b_{31} & b_{32} & b_{33} \end{array} \right]$

In this example, node 00 will have the following matrices:

$A_{00}=\begin{bmatrix} a_{00} & a_{01}\\ a_{10} & a_{11} \end{bmatrix}, B_{00}= \begin{bmatrix} b_{00} & b_{01}\\ b_{10} & b_{11} \end{bmatrix}$

const n = 4;
var vec = 1..n;
var blockSize = 2;

var A: [vec, vec] real;
var B: [vec, vec] real;
var C: [vec, vec] real;

coforall loc in Locales {
on loc {
var i = loc.id/2;
var j = loc.id%2;
var istart = i*blockSize;
var iend = istart + blockSize;
var jstart = j*blockSize;
var jend = jstart + blockSize;

for (r,s) in {istart + 1..iend, jstart + 1..jend} {
B(r,s) = r+s;
A(r,s) = 2*r + s;
}
}
}


Each node has limited memory physically exclusive to itself. In order for node A to have access to the contents of the memory of another node B, node B should pass the data to node A. Fortunately, Chapel can use a library called GASNet that allows each node to have a global view of the memory of all nodes participating in the computation.

In the code above, each node loads its own data. However, the GASNet library allows each node to access the matrix elements loaded by the other nodes. Consequently, we are able to reference the sub-matrix held by each node without doing fancy message passing. The algorithm is then a straightforward implementation of

$\displaystyle \mathbf{C}_{ij}=\sum_{k=0}^{2} \mathbf{A}_{ik} \mathbf{B}_{kj}$

where $\mathbf{A_{ij}}$, $\mathbf{B_{ij}}$ and $\mathbf{C_{ij}}$ are submatrices of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, respectively.

Below is the straightforward implementation of parallel block multiplication:

coforall loc in Locales {
on loc {
var i = loc.id/2;
var j= loc.id%2;
var istart = i*blockSize;
var iend = istart + blockSize;
var jstart = j*blockSize;
var jend = jstart + blockSize;
var r = { istart + 1..iend, jstart + 1..jend };
ref W = C[r].reindex( { 1..2,1..2 });

coforall k in 0..1 {
var U=get_block_matrix(A[vec,vec],i,k,blockSize);
var V=get_block_matrix(B[vec,vec],k,j,blockSize);
var P = mat_mul(U,V);
coforall (s,t) in { 1..2,1..2 } {
W(s,t) += P(s,t);
}
}
}
}


The procedure get_block_matrix will return the sub-matrix given the (i,j)th index and the block size.

proc get_block_matrix(A: [?D], i:int, j:int , blockSize:int) {
var r = { i*blockSize+1 .. i*blockSize
+  blockSize, j*blockSize
+ 1 .. j*blockSize + blockSize };
return A[r];
}


The procedure mat_mul will return the matrix product of two sub-matrices:

proc mat_mul(A: [?D1], B: [?D2]) {
var D3 = { 1..2, 1..2 };
var C: [D3] real;
var AA = A.reindex({1..2,1..2});
var BB = B.reindex({1..2,1..2});

for row in 1..2 {
for col in 1..2 {
var sum:real = 0;
for k in 1..2 {
sum += AA(row,k) * BB(k,col);
}
C(row,col) = sum;
}
}
return C;
}

writeln(C[vec,vec]);


To run this code, we need to set the following environment variables:

source $CHPL_HOME/util/setchplenv.bash export CHPL_COMM=gasnet export CHPL_LAUNCHER=amudprun export GASNET_SSH_SERVERS="127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1"  Compiling and running this program gives the output: #Compile chpl mat_mul.chpl -o mat_mul # Run using 4 nodes ./mat_mul -nl 4 A= 3.0 4.0 5.0 6.0 5.0 6.0 7.0 8.0 7.0 8.0 9.0 10.0 9.0 10.0 11.0 12.0 B= 2.0 3.0 4.0 5.0 3.0 4.0 5.0 6.0 4.0 5.0 6.0 7.0 5.0 6.0 7.0 8.0 C= 68.0 86.0 104.0 122.0 96.0 122.0 148.0 174.0 124.0 158.0 192.0 226.0 152.0 194.0 236.0 278.0  ## Implementing Parallel Algorithms Part 1 Now we know that parallel algorithms allow us to make our programs run faster. So how do we implement them? I have used mpich before, but that was more than a decade ago. Recently, I found myself looking for new ways of doing parallel programming. I discovered a very nice parallel programming language called Chapel. This is what we’ll use to implement parallel algorithms. ## Algorithm 1: Parallel sum of consecutive numbers from 1 to N To get the sum of numbers from 1 to N, where N is some integer is easy. There is a formula to do that: $\displaystyle \sum_{i=1}^N i = \frac{N(N+1)}{2}$ However for the sake of illustration, we are going to compute the sum of 1 to N using a cluster of machines. Here is the Chapel code to accomplish it inside the file add_parallel.chpl. config var N:int = 1000; var D:domain(1) = {0..numLocales -1}; var s:[D] int; var sum:int= 0; var bs = N/numLocales; coforall loc in Locales { on loc { var i = loc.id; var start = i*bs + 1; var end = start + bs -1; var _sum:int = 0; for j in start .. end { _sum += j; } writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum); s[i] = _sum; } } sum = + reduce s; writeln("sum: " + sum);  This program is compiled using the command: chpl add_parallel.chpl -o add_parallel  where add_parallel.chpl is the filename of the program and -o add_parallel specifies the filename of the binary produced after compilation. One line 1, we have defined the default value of N to be 1000. This can be overridden on the command line by specifying the --x parameter. The number of machines we are going to use is also specified on the command line using the -nl parameter. A sample invocation of this program is the following: ./add_parallel -nl 3 --N=120  The above command means that we want to run the add_parallel binary using 3 machines with the value of N=120. Executing the above command will give us: ./add_parallel -nl 3 --N=120 i= 0, start= 1, end=40, sum = 820 i= 1, start= 41, end=80, sum = 2420 i= 2, start= 81, end=120, sum = 4020 sum: 7260  ## How the program works The program will partition N into 3 blocks. This is specified on line 5 where we divided N by numLocales to get the block size. The numLocales will contains the value of the parameter -nl which in this example is 3. The code on line 6 tells chapel to execute a parallel for-loop executing the code inside on loc block on each Locale. A Locale has an id starting from 0. The Locale will determine it’s id and compute the starting and ending number to sum and store this value in the variable _sum. coforall loc in Locales { on loc { var i = loc.id; var start = i*bs + 1; var end = start + bs -1; var _sum:int = 0; for j in start .. end { _sum += j; } writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum); s[i] = _sum; } }  This _sum is stored in the array s. We take the sum of entries of the s array using the reduce keyword specifying + as the reduction operator. Finally we print the total sum across the machines. sum = + reduce s; writeln("sum: " + sum);  ## Conclusion We have seen that that we can implement parallel programs using Chapel programming language. In part 2, we will show how to do Parallel Matrix Multiplication using Chapel. ## An Interview Question: Using Integer Programming We can solve the Interview Question using a mathematical technique called Integer Programming. Let $d_1, d_2, \ldots, d_N$ be the variables representing diskette 1, diskette 2, diskette 3, etc. The values of the $d_k$ variables can only be 0 or 1. A 0 means the diskette is not used while a 1 means that it is used. Each file is saved to a certain diskette. We want to know to what diskette $d_i$ a given file $f_j$ is assigned. To represent this, we assign the variable $a_{ij}$ a value of 1 if file $f_j$ is assigned to diskette $d_i$. We will normalize the file sizes so that if $s_i$ is the size of $f_i$, the $s_i \le 1$. We do this by simply dividing all file sizes by the size of the diskette. For a given diskette $d_i$, the following constraint should be satisfied: $d_i - s_1a_{i1} - s_2a_{i2} - \ldots - s_N a_{iN} \ge 0$ for diskette $i = 1, 2, \ldots, N$ and $s_i$ are the normalized file sizes of file $f_i$ for $i=1,2,\ldots,N$. Since each file $f_j$ can only be assigned to one diskette, we have the following constraint: $a_{1j} + a_{2j} + \ldots + a_{Nj} = 1$ where $a_{1j}$ is the variable representing the “file $f_j$ is in diskette $d_1$“, etc. Finally, we have to constrain the value of $d_i$ to be either 0 or 1, that is, $d_i \le 1$ for all $i=1,2,\ldots,N$. ## Integer Programming Formulation Given the above information, we can formulate the Integer Programming problem as Minimize: $d_1 + d_2 + d_3 + \ldots + d_N$ subject to $\begin{array}{rl} d_1 - s_1a_{11} - s_2a_{12} - s_3a_{13} - \ldots - s_Na_{1N} &\ge 0\\ d_2 - s_1a_{21} - s_2a_{22} - s_3a_{23} - \ldots - s_Na_{2N} &\ge 0\\ :\\ d_N - s_1a_{N1} - s_2a_{N2} - s_3a_{N3} - \ldots - s_Na_{NN} &\ge 0\\ a_{11} + a_{21} + a_{31} + \ldots + a_{N1} &= 1\\ a_{12} + a_{22} + a_{32} + \ldots + a_{N2} &= 1\\ :\\ a_{1N} + a_{2N} + a_{3N} + \ldots + a_{NN} &= 1\\ d_1 &\le 1\\ d_2 &\le 1\\ :\\ d_n &\le 1 \end{array}$ ## Solving the Problem We will use R to solve this Integer Programming Formulation. Please see code below: library("lpSolve") NUMFILES=4 # Generate random file sizes between 1 and 10 FileSizes=ceiling(10*runif(NUMFILES)) x = -1*FileSizes/10 l=length(x) # Each files can be in any of the diskettes. Suppose there are N files, # to determine if a file j is in diskette i, the value of variable x_ij will # 1 if file j is in diskette i, and 0 otherwise. # Here we construct the coefficients of variables x_ij which are the # sizes of the files (normalized to 1) zz=c() for(i in 1:(l-1)){ zz=c(zz,x,rep(0,l*l)) } zz=c(zz,x) # Construct the coefficients of the indicator variables representing the # diskettes d_i zzmatrix=matrix(zz,ncol=l*l,byrow=T) CoefficientsOfDiskettes=c(); for(i in 1:l){ ttt=rep(0,l) ttt[i] = 1 CoefficientsOfDiskettes= c(CoefficientsOfDiskettes,ttt,zzmatrix[i,]) } # Construct the coefficients of x_ij for constant j. These variables # satisfy the equation \sum_{i=1}^N x_{ij} SumOfFileAcrossDiskettes=c() for(i in 1:l){ ttt=rep(0,l) ttt[i]=1 SumOfFileAcrossDiskettes=c(SumOfFileAcrossDiskettes,rep(ttt,l)) } # Prepend Coefficients of variables d_i. The value of these coefficients is 0. SumOfFileAcrossDiskettesMatrix=matrix(SumOfFileAcrossDiskettes,ncol=l*l,byrow=T) PrependCoefficientsOfDiskettes=c() for(i in 1:l){ PrependCoefficientsOfDiskettes=c(PrependCoefficientsOfDiskettes,c(rep(0,l),SumOfFileAcrossDiskettesMatrix[i,])) } # Construct coefficients of d_i to construct constraint d_i <= 1 DisketteConstraints=c() for(i in 1:l){ ttt=rep(0,l) ttt[i]=1 DisketteConstraints=c(DisketteConstraints,ttt,rep(0,l*l)) } # Construct matrix input of lpSolve const.mat=matrix(c(CoefficientsOfDiskettes,PrependCoefficientsOfDiskettes,DisketteConstraints),ncol=l*(l+1),byrow=T) print("Matrix Coefficients:") print(const.mat) # Construct inequalities/equalities const.dir=c(rep(">=",l),rep("=",l),rep("<=",l)) # Construct Right-Hand side const.rhs=c(rep(0,l),rep(1,l),rep(1,l)) # Construct Objective Function objective.in=c(rep(1,l),rep(0,l*l)) # Invoke lpSolve mylp=lp(direction="min",objective.in=objective.in,const.mat=const.mat,const.dir=const.dir,const.rhs=const.rhs,all.int=T) # Print Results print(paste("Number of Diskettes: ", sum(mylp$solution[1:l])))
tz=matrix(mylp\$solution,ncol=l,byrow=T)
print("File Sizes: ")
print(FileSizes)
for(i in 2:(l+1)){
files = which(tz[i,] == 1)
if(length(files) > 0){
print(paste("Files in diskette ", i-1))
print(files)
}
}



Most of the code above is setting up the matrix of coefficients. The line 70 then calls on lpSolve to compute the optimal values of the variables

## Program Output

Running this code we get the output

[1] "Matrix Coefficients:"
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,]    1    0    0    0   -1 -0.2 -0.1 -0.1    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[2,]    0    1    0    0    0  0.0  0.0  0.0   -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[3,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0
[4,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1
[5,]    0    0    0    0    1  0.0  0.0  0.0    1   0.0   0.0   0.0     1   0.0   0.0   0.0     1   0.0   0.0   0.0
[6,]    0    0    0    0    0  1.0  0.0  0.0    0   1.0   0.0   0.0     0   1.0   0.0   0.0     0   1.0   0.0   0.0
[7,]    0    0    0    0    0  0.0  1.0  0.0    0   0.0   1.0   0.0     0   0.0   1.0   0.0     0   0.0   1.0   0.0
[8,]    0    0    0    0    0  0.0  0.0  1.0    0   0.0   0.0   1.0     0   0.0   0.0   1.0     0   0.0   0.0   1.0
[9,]    1    0    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[10,]    0    1    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[11,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[12,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[1] "Number of Diskettes:  2"
[1] "File Sizes: "
[1] 10  2  1  1
[1] "Files in diskette  1"
[1] 2 3 4
[1] "Files in diskette  2"
[1] 1



## Interpreting the Result

Lines 2-14 of the output gives you the matrix of coefficients. Line 15 prints the number of diskettes needed to store the files. Line 17 prints the randomly generated file sizes from 1 to 10. Finally lines 18-21 prints which diskettes contain which files.

The space complexity of this solution is quite substantial. Given $N$ files, we need to specify $N^2 + N$ variables by $3\times N$ equations for a total of $(N^2 + N)\times 3N$ memory space for coefficients.

## An Interview Question

I was given this interview question and I’d like to share it to you. The setting is back in the days when the largest size of a hard disk was 1 GB and there were no CD writers yet and the only way to back up your data is through those 1.44 floppy disks. You want to back up your files but you want to minimize the number of floppy disks you need to use. Assume your most important files are in a single directory. How will you distribute the files across your disks in such a way that the number of disks you use is minimized ?

To make this simple, let’s assume the following:

– we will not take into account that every file you copy to the disk has a record of the metadata of the file and stored on the disk as well. This will eat up space as you put more files. For our purposes, we ignore this complexity.
– The size of each file is less than or equal to 1.44

First we need to have a list of those files including the size and sort the list according to size in descending order. If A is the list of files, we can apply this algorithm:

B := list of files to copy to current floppy disk
remaining_size := 1.44 MB
For file in A:
If remaining_size - file.size > 0:
A.remove(file)
Copy all files listed in B to disk
Empty B
Repeat process for remaining files in A


Although there are other better algorithms than the one above, this is the one I managed to to come up during the interview.

We now need to determine how fast our algorithm can run.

## Worst Case Complexity

How slow can this algorithm get ? If for any two files $F_i$ and $F_j$ in A we have $F_i + F_j > 1.44$, then all files will have their own diskette. If this is the case, for each file, our algorithm will execute step 4. For the first disk, it will execute the step $N$ times. For the second disk, it will execute the step $N-1$ times, for the third disk it will execute $N-2$ times, etc. the total number of times it executes step 4 is the total number of comparisons and is equal to the summation:

$\displaystyle \sum_{1=1}^{N} i$

which is equal to

$\displaystyle \frac{N(N+1)}{2}$

Therefore, in the worst case, the complexity is $O(N^2)$.

## Best Case Complexity

The best case is when all files fit in just one diskette. For this, the total number of comparisons is $N$

## Average Case Complexity

On the average, files have different sizes. We now compute the complexity on the assumption that the probability distribution is uniform.

If $k$ is the number of diskettes, the number of comparisons is a sequence of monotonic decreasing numbers $\{ a_1, a_2, a_3, \ldots, a_k \}$ taken at random from the set $\{ 1, 2, \ldots, N\}$. Each of the numbers $a_j$, $j\in \{1, 2, \ldots, k\}$ has a probability $1/N$ of being chosen. Let $X$ be a random variable such that

$\displaystyle Pr(X=a_j) = \frac{1}{N} \text{ for } j=1,2,\ldots,k$

then the number of comparisons $C$ is equal to

$\displaystyle C = \sum_{i=1}^k X = kX$

The expected value of $C$ is given by the

$E[C] = E[kX] = kE[X]$

However, the expected value of X is given by

$\displaystyle E[X] = \sum_{j=1}^N j\cdot Pr(X=j) = \frac{1}{N} \sum_{j=1}^N j = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}$

Therefore,

$\displaystyle E[C] = k\frac{N+1}{2}$

What remains is to determine the average value of $k$, which is the number of diskettes. If $M=1.44$ is the maximum file size, the average file size is $M/2$. The average total file size is then $NM/2$. The average number of diskettes is equal to the average total size divided by size of diskette, that is

$k = \displaystyle \frac{NM}{2}\frac{1}{M} = \frac{N}{2}$

This means that

$\displaystyle E[C] = \frac{N}{2} \frac{N+1}{2} = O(N^2)$

which is the same as the worst case complexity.

There is another way to solve this problem using Integer Programming.

Now that we know the basics of the Birthday Problem, we can use this knowledge to understand the security of password hashing.

In the early days, passwords were stored in the server “as-is”. This means that if your username was juan and your password is Password123! then that information is stored in the server like this:

juan,Password123!

Since the theft of a password file is harder to prevent, the passwords are not anymore stored “as-is” (also known as clear-text). The server will apply an algorithm to the original password which outputs a text called a hash. The algorithm is called a hash function. The hash is what’s put in the password file. A thief in possession of the password file will not be able to know the original password just by looking at it.

For example, the information above will now look like this:

juan,2c103f2c4ed1e59c0b4e2e01821770fa

where “2c103f2c4ed1e59c0b4e2e01821770fa” is the has value of the password “Password123!“.

The hash function I’m using is called the MD5 hash function. Given a password, it will produce a hash value. The set of all hash values is not infinite. In fact, the number of possible hash values is $2^{128}$ for md5. Due to this restriction, the birthday paradox will apply.

The birthday paradox tells us that given a hash function $f(x)$, the probability that at least two passwords hash to the same value is given by:

$\displaystyle 1-\frac{N\times N-1\times N-2\times \ldots \times N-k+1}{N^k}$

Since md5 hash function has $N=2^{128}$ possible values, the probability that two passwords hash to the same value is

$\displaystyle 1-\frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k}$

We want to compute for k so that this probability is at least 50%.

$\displaystyle 1-\frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k} \ge 0.5$

which is equivalent to

$\displaystyle \frac{2^{128}\times 2^{128}-1\times 2^{128}-2\times \ldots \times 2^{128}-k+1}{(2^{128})^k} < 0.5$

Computing for $k$ when N is large is hard so we need to approximate this. To that end, we need some tools to help us.

We can write the probability in the following way:

$\displaystyle 1-\frac{N}{N}\times\frac{N-1}{N}\times\frac{N-2}{N}\times\frac{N-3}{N}\times\ldots\times\frac{N-k+1}{N}$
$= \displaystyle 1-\frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

Since N is large, the quantities

$\displaystyle \frac{1}{N}, \frac{2}{N}, \frac{3}{N}, \frac{k-1}{N}$

are very small. Because of this, we can use the approximation

$e^{-x} \approx 1-x$

The above approximation comes from the Taylor expansion of $e^{-x}$:

$\displaystyle e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \frac{x^4}{4!} \ldots$

If $x$ is small, the higher order terms like $x^2, x^3, x^4, \ldots$ vanish. Using this approximation, we can write the probability as:

$\displaystyle \frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

$\displaystyle = e^{-\frac{1}{N}}\cdot e^{-\frac{2}{N}}\cdot e^{-\frac{3}{N}}\cdot \ldots\cdot e^{-\frac{k-1}{N}}$

$\displaystyle = e^{-\frac{1+2+3+4+\ldots + k-1}{N}}$

Since

$\displaystyle \sum_1^n j = 1+2+3+4+ \ldots + n = \frac{n(n+1)}{2}$

we have

$e^{-\frac{1+2+3+4+\ldots + k-1}{N}} = e^{-k\cdot (k-1)/2N }$

Computing for k

Let’s compute k so that

$\displaystyle e^{-k\cdot (k-1)/2N} < 0.5$

Taking the natural logarithms of both sides

$\displaystyle \ln e^{-k\cdot (k-1)/2N} < \ln 0.5$

$\displaystyle \frac{-k\cdot (k-1)}{2N} < \ln 0.5$

$\displaystyle k^2 - k + 2N\ln 0.5 > 0$

Using the quadratic equation, we can solve for k:

$\displaystyle k > \frac{-(-1) \pm \sqrt{(-1)^2 -4(1)(2N\ln 0.5}}{2}$
$\displaystyle k > \frac{1 \pm \sqrt{1-8N\ln 0.5}}{2}$

When $N=2^{128}$, we have

$\displaystyle k > \frac{1 \pm 4.343876e+19}{2} \approx 10^{19}$

This is about 10 quintillion. What this means is that when $k > 10^{19}$, there is already a 50% chance that 2 passwords hash to the same value. In fact, the md5 was already cracked in 2004.

There are only 365 days in a year (excluding leap year). Given that there are about 7.4 billion people on earth, this means that there are approximately 20 million people with the same birthday on any given day. You just divide 7,400,000,000 by 365 and you get 20 million. Happy Birthday to all 20 million people celebrating their birthday today!

Suppose you’re in a crowd, on a bus, in a restaurant, or stadium. There is a big chance you might be standing next to a person with the same birthday as you.

In fact, you only need about 23 people to have a 50/50 chance of two people having the same birthday! This may sound unbelievable since there are 365 days in a year but you only need 23 people to have a 50% chance of 2 people with the same birthday. How come?

This is called the Birthday Paradox and is very important in digital security, especially the password security.

Basic Counting

Probability is all about counting the possibilities. Let’s make it simple by using a dice as an example. We all know what a dice looks like.

When a balanced dice is thrown, it can land showing any one of its six sides. We refer to the result of throwing a dice as an outcome and we say that a dice has 6 possible outcomes. If a dice is balanced, every side is equally likely to show up. We define the probability of a face showing up as the number of times that face occurs in the possible outcomes divided by the total number of possible outcomes. For example, out of the 6 possible outcomes, the number “1” occurs only once. Since there are 6 possible outcomes, the probability of getting a 1 is, therefore:

$\displaystyle \text{Probability of getting a "1"} = 1/6$

Let’s add a second dice. To identify our two dice, let’s call one of them Dice A and the other Dice B. Let’s throw the dice together. When they land, dice A and dice B will show numbers. For this scenario, an outcome is now defined as the numbers that Dice A and Dice B show when they land. A possible outcome is Dice A shows a 1 and Dice B shows a 2. We can give this outcome a name and call it 1,2. We should remind ourselves that the first number is the result of Dice A and the second number is the result of Dice B. We can also refer to each outcome as a combination.

Here are the possible outcomes that the two dice will show:

If you count the number of combinations above, you’ll get 36. The reason it’s 36 is because dice A has 6 different outcomes and dice B has 6 different outcomes. Multiplying them together gives $6 \times 6=6^2 = 36$.

If you add a third dice, say dice C, the total number of combinations becomes:

$\displaystyle 6^3 = 216$.

In general, for N dice, the total number of combinations is

$\displaystyle 6^N$

How many combinations have at least 2 same numbers?

Since there are only 2 numbers for each combination, this question is also the same as “How many combinations show the same numbers?”. If you look at the diagonal, these are the combinations that have the same number for Dice A and Dice B.

If you count them, you’ll get 6. Therefore, the probability of getting at least two equal numbers (in our 2-Dice system) is

6/36

How many combinations show different numbers?

If you count all combinations outside the diagonal, you’ll get 30. Therefore, the probability of getting two different numbers is

30/36

Notice that the probability of getting at least 2 same numbers PLUS the probability of getting different numbers is equal to 1:

6/36 + 30/36 = 36/36 = 1

Knowing One gives you the other

If we know the probability of getting different numbers (30/36), then we can compute the probability of getting at least 2 same numbers simply by subtracting it from 1:

$\displaystyle \text{probability of getting at least 2 numbers same} = 1-30/36 = 1/6 = 0.167$

Avoid counting manually

When we counted the number of combinations which show different numbers, we counted it with our fingers. There is another way to count which is by doing it mentally. Since we are counting the number of ways that the 2-Dice system will show different numbers, we start by getting Dice A and asking how many different ways Dice A can land so that the number it shows is not equal to the number shown by Dice B. Since we have not yet thrown Dice B, then Dice A is allowed to show any number when it lands. This means there are 6 possible ways for Dice A to do this.

Number of ways Dice A can land = 6

Whatever number results in throwing Dice A, we cannot allow Dice B to have that number. This means that Dice B can only choose from 5 other numbers different from the one chosen by Dice A.

Number of ways Dice B can land = 5

If we multiply them, we get the number of combinations that Dice A and Dice B can land with different numbers:

6*5 = 30

This agrees with our manual counting.

At this point, pause and take note that the probability of getting at least 2 numbers the same for a 2-Dice system is 0.167. If we add more dice, this probability will increase. The question then is

How many dice do we need to throw so that the probability of getting 2 dice showing the same number is at least 50%?

Our 2-Dice example above shows that the probability of at least 2 dices showing the same number is 0.167, which is less than 50%. Let’s add a third dice and compute the probability.

How to compute the probability?

Let’s follow the pattern for the 2-Dice system. Since there are now 3 dice, the number of ways to get all numbers different is:

6*5*4

The total number of combinations of a 3-Dice system is

$\displaystyle 6^3$

Therefore, the probability of getting at least 2 dice with the same number is

$\displaystyle 1- \frac{6\times 5\times 4}{6^3} = 0.444$

This is still less than 50%.

Let’s now add a 4th dice and compute the probability using the same pattern:

$\displaystyle 1- \frac{6\times 5\times 4\times 3}{6^4} = 0.722$

This is greater than 50%! So the answer is we need 4 dice thrown so that the probability of getting at least 2 dice with the same number is at least 50%.

The general formula for the probability for a k-Dice system is:

$\displaystyle 1- \frac{ 6\times 5\times \ldots \times (6-k+1)}{6^k}$

How does this relate to the Birthday Problem?

Now that we have the foundations, it’s easy to translate Dice to people and numbers to birthdays. In our dice example, there are 6 different numbers (faces) per dice. Translating this to birthdays, each person can have 365 possible birthdays since there are 365 days in a year (not including leap year).

This is the analogy:

Dice -> 6 possible faces
Person -> 365 possible birthdays

We want to compute how many random persons we need so that the probability of at least two persons having the same birthday is at least 50%. Let k be the number of random persons. Following the same pattern as the Dice example, the formula to compute the probability, given k persons, is:

$\displaystyle \text{Probability of at least 2 persons with the same birthday} = 1-\frac{365 \times 364 \times 363 \times \ldots (365-k+1)}{365^k}$

If we compute starting from k=1 to k=30, we can construct the following table:

   probability
1  0.000000000
2  0.002739726
3  0.008204166
4  0.016355912
5  0.027135574
6  0.040462484
7  0.056235703
8  0.074335292
9  0.094623834
10 0.116948178
11 0.141141378
12 0.167024789
13 0.194410275
14 0.223102512
15 0.252901320
16 0.283604005
17 0.315007665
18 0.346911418
19 0.379118526
20 0.411438384
21 0.443688335
22 0.475695308
23 0.507297234
24 0.538344258
25 0.568699704
26 0.598240820
27 0.626859282
28 0.654461472
29 0.680968537
30 0.706316243


Below is the graph of the same data where we indicate at what number of persons the graph is greater than or equal to 50%. When the number of persons becomes 23, there is already a 50% chance that at least 2 of them have the same birthday!

## Divided We Compute, United We Reduce

Once upon a time, in a far away village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said “Sons, see that bundle of sticks? Each one of you try to break it. The one who can break it will inherit all my riches”. Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now get one stick and break it”. Without any effort, each son was able to break the stick. The old man said “You see, when you unite, no task will be difficult. The riches that I told you was a lie. We are broke. When i’m dead, make sure  you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks to be a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When each processor was given the task to solve the complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems and given to each processor, each processor is now able to solve the smaller problems quickly thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them in separate processors is called Parallel Computing. In this article, we will compute how fast a certain algorithm will run when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, ..., a_n \}$ that you want to sort based on pairwise comparison.  The sorted array is just one of the many permutations of the array $\{a_1, a_2, a_3,\ldots, a_n\}$. In fact, if you have $n$ different objects to sort, then there are exactly $n!$ ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example we have array A={ a,b,c }. To sort this, we first compare a with b and there are 2 ways this can go. Either $a \le b$ or $a > b$. If $a \le b$, we then compare b and c. This also give either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.

Since the height of this binary tree is lg(n!), then we have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] \le \lg n + \lg (n-1) \cdots \lg1 \le \underbrace{\lg n \cdots \lg n}_\text{ n times}$
$\lg(n!) \le n\cdot \lg n$

There are efficient algorithms that are able to sort of this complexity. For example, the merge sort has this complexity. Therefore, if you have an array of $2^{20}$ elements, then the complexity is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million. Could we do any better than this? We can either upgrade the cpu of the machine doing the sorting or use two or more machines to divide the work among those machines. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it to other processors.

Divide and Conquer

Assume we have an array $n=2^{20}$ elements that we need to sort and suppose we have two identical processors we can use. Divide the array into 2 equal sized arrays. Give the first array to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array for each processor. We then combine the result of processor 1 and processor 2 to one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.

If you count the total number of comparisons that you need to merge two sorted arrays, you will find that it takes $n-1$ comparisons. Therefore, the complexity of the merging process is $O(n)$.

Since each processor has $n/2$ sized subarrays, the sorting complexity is therefore $n/p \lg (n/p)$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle n/p \lg(n/p) + n$

In our example, $C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048$ comparisons compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $n/p \lg(n/p)$ dominates $n$, therefore the complexity of the parallel algorithm is $O(n/p \lg(n/p))$.

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y=2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p})$ gives us the value of $p$ such that the total comparisons is already linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to solve the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$

Since this is a non-linear equation, we can solve this using the Newton's method. It is a method to compute the roots by approximation given an initial value of the solution. Starting from a guess solution $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}$

where $g\prime ( p )$ is the first derivative of $g( p )$. Applying the rules of derivatives, we get

$\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p$

Substituting this to the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p2^p - n}{p2^p \ln 2 - 2^p}$

Below is an R code using newton’s method to compute the root of the equation $g(p)$.

g=function(n,p){
p* 2^p - n
}

gprime=function(n,p){
p*2^p *log(2) - 2^p
}

newton=function(p,n,iter){
tmp = p
for(i in 1:iter){
p=p-g(n,p)/gprime(n,p)

if(abs(p-tmp)< 0.0001){
break
}

print(p)
tmp=p
}
print(p)
}



Running this code, we get the value of $p = 16$:

> newton(15,n,100)
[1] 16.80905
[1] 16.08829
[1] 15.98603
[1] 16.00286
[1] 15.99944
[1] 16.00011
[1] 15.99998
[1] 16


Ignoring network latency, by distributing the input evenly into 16 processors, we get a running time of $O(n)$ time complexity for $n=2^{20}$ array of items. Therefore, instead of doing 20 million comparisons, you only need 1 million comparisons to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm than the exception. Learning to harness the power of multicores is becoming an extremely handy skill to have.