# Divided We Compute, United We Reduce

Once upon a time, in a far away village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said “Sons, see that bundle of sticks? Each one of you try to break it. The one who can break it will inherit all my riches”. Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now get one stick and break it”. Without any effort, each son was able to break the stick. The old man said “You see, when you unite, no task will be difficult. The riches that I told you was a lie. We are broke. When i’m dead, make sure  you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks to be a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When each processor was given the task to solve the complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems and given to each processor, each processor is now able to solve the smaller problems quickly thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them in separate processors is called Parallel Computing. In this article, we will compute how fast a certain algorithm will run when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, ..., a_n \}$ that you want to sort based on pairwise comparison.  The sorted array is just one of the many permutations of the array $\{a_1, a_2, a_3,\ldots, a_n\}$. In fact, if you have $n$ different objects to sort, then there are exactly $n!$ ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example we have array A={ a,b,c }. To sort this, we first compare a with b and there are 2 ways this can go. Either $a \le b$ or $a > b$. If $a \le b$, we then compare b and c. This also give either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.

Since the height of this binary tree is lg(n!), then we have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] \le \lg n + \lg (n-1) \cdots \lg1 \le \underbrace{\lg n \cdots \lg n}_\text{ n times}$
$\lg(n!) \le n\cdot \lg n$

There are efficient algorithms that are able to sort of this complexity. For example, the merge sort has this complexity. Therefore, if you have an array of $2^{20}$ elements, then the complexity is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million. Could we do any better than this? We can either upgrade the cpu of the machine doing the sorting or use two or more machines to divide the work among those machines. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it to other processors.

Divide and Conquer

Assume we have an array $n=2^{20}$ elements that we need to sort and suppose we have two identical processors we can use. Divide the array into 2 equal sized arrays. Give the first array to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array for each processor. We then combine the result of processor 1 and processor 2 to one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.

If you count the total number of comparisons that you need to merge two sorted arrays, you will find that it takes $n-1$ comparisons. Therefore, the complexity of the merging process is $O(n)$.

Since each processor has $n/2$ sized subarrays, the sorting complexity is therefore $n/p \lg (n/p)$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle n/p \lg(n/p) + n$

In our example, $C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048$ comparisons compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $n/p \lg(n/p)$ dominates $n$, therefore the complexity of the parallel algorithm is $O(n/p \lg(n/p))$.

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y=2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p})$ gives us the value of $p$ such that the total comparisons is already linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to solve the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$

Since this is a non-linear equation, we can solve this using the Newton's method. It is a method to compute the roots by approximation given an initial value of the solution. Starting from a guess solution $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}$

where $g\prime ( p )$ is the first derivative of $g( p )$. Applying the rules of derivatives, we get

$\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p$

Substituting this to the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p2^p - n}{p2^p \ln 2 - 2^p}$

Below is an R code using newton’s method to compute the root of the equation $g(p)$.

g=function(n,p){
p* 2^p - n
}

gprime=function(n,p){
p*2^p *log(2) - 2^p
}

newton=function(p,n,iter){
tmp = p
for(i in 1:iter){
p=p-g(n,p)/gprime(n,p)

if(abs(p-tmp)< 0.0001){
break
}

print(p)
tmp=p
}
print(p)
}



Running this code, we get the value of $p = 16$:

> newton(15,n,100)
[1] 16.80905
[1] 16.08829
[1] 15.98603
[1] 16.00286
[1] 15.99944
[1] 16.00011
[1] 15.99998
[1] 16


Ignoring network latency, by distributing the input evenly into 16 processors, we get a running time of $O(n)$ time complexity for $n=2^{20}$ array of items. Therefore, instead of doing 20 million comparisons, you only need 1 million comparisons to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm than the exception. Learning to harness the power of multicores is becoming an extremely handy skill to have.