Divided We Compute, United We Reduce

Once upon a time, in a faraway village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said, “Sons, see that bundle of sticks? Each one of you, try to break it. The one who can break it will inherit all my riches.” Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks, but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you, now take one stick and break it.” Without any effort, each son was able to break his stick. The old man said, “You see, when you unite, no task will be difficult. The riches that I told you about were a lie. We are broke. When I’m dead, make sure you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks as a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When each processor is given the task of solving the complex problem on its own, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems and each is given to a processor, each processor can solve its smaller problem quickly, thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them on separate processors is called parallel computing. In this article, we will compute how much faster a certain algorithm runs when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, \ldots, a_n \}$ that you want to sort based on pairwise comparison. The sorted array is just one of the many permutations of the array $\{a_1, a_2, a_3, \ldots, a_n\}$. In fact, if you have $n$ different objects to sort, then there are exactly $n!$ ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example, we have the array $A = \{a, b, c\}$. To sort this, we first compare $a$ with $b$, and there are two ways this can go: either $a \le b$ or $a > b$. If $a \le b$, we then compare $b$ and $c$. This also gives either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.

Since this binary tree has $n!$ leaves (one for each permutation), its height, which is the worst-case number of comparisons, is at least $\lg(n!)$. We then have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] = \lg n + \lg (n-1) + \cdots + \lg 1 \le \underbrace{\lg n + \lg n + \cdots + \lg n}_{n \text{ times}}$
$\lg(n!) \le n \cdot \lg n$
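We can sanity-check this bound numerically. The sketch below is my own illustration, not from the article; it uses Python's `math.lgamma` to compute $\lg(n!)$ for $n = 2^{20}$ without overflowing, and compares it against $n \lg n$:

```python
import math

n = 2**20
# lg(n!) via the log-gamma function: lgamma(n + 1) = ln(n!)
lg_factorial = math.lgamma(n + 1) / math.log(2)
bound = n * math.log2(n)  # n lg n = 20971520

print(lg_factorial <= bound)  # True: the bound lg(n!) <= n lg n holds
```

The exact value of $\lg(n!)$ is a bit below $n \lg n$ (Stirling's approximation puts it around $n\lg n - n\lg e$), but both grow as $\Theta(n \lg n)$.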

There are efficient algorithms that sort with this complexity; merge sort, for example, achieves it. Therefore, if you have an array of $2^{20}$ elements, then the number of comparisons is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million elements. Could we do any better than this? We can either upgrade the CPU of the machine doing the sorting or use two or more machines and divide the work among them. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming them out to other processors.

Divide and Conquer

Assume we have an array of $n = 2^{20}$ elements that we need to sort, and suppose we have two identical processors we can use. Divide the array into two equal-sized subarrays. Give the first subarray to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays so that each processor produces a sorted array. We then combine the results of processor 1 and processor 2 into one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two occupies the first slot in the big array. The next element of the array from which we took the smaller element now becomes the first element of that array. Repeat the process until all elements of both arrays have occupied slots in the big array. The diagram below illustrates the merging algorithm.

If you count the total number of comparisons needed to merge two sorted arrays, you will find that it takes at most $n-1$ comparisons. Therefore, the complexity of the merging process is $O(n)$.
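To see the $n-1$ bound concretely, here is a small Python sketch of the merge step that counts comparisons (the function and variable names are my own, for illustration):

```python
def merge(left, right):
    """Merge two sorted lists into one, counting element comparisons."""
    out, comparisons = [], 0
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    # one side is exhausted; the rest copies over without further comparisons
    out.extend(left[i:])
    out.extend(right[j:])
    return out, comparisons

merged, c = merge([1, 3, 5, 7], [2, 4, 6, 8])
print(merged, c)  # 8 elements merged in 7 comparisons, i.e. n - 1
```

The worst case of exactly $n-1$ comparisons happens when the two arrays interleave, as in this example; if one array's elements all precede the other's, far fewer comparisons are needed.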

Since each of the $p$ processors gets a subarray of $n/p$ elements, the sorting complexity per processor is $\frac{n}{p} \lg(\frac{n}{p})$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle \frac{n}{p} \lg\Big(\frac{n}{p}\Big) + n$

In our example, $C = \frac{2^{20}}{2} \lg\big(\frac{2^{20}}{2}\big) + 2^{20} = 11010048$ comparisons, compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $\frac{n}{p} \lg(\frac{n}{p})$ dominates $n$; therefore, the complexity of the parallel algorithm is $O(\frac{n}{p} \lg(\frac{n}{p}))$.
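The split-sort-merge scheme can be sketched in Python as follows. This is only a structural illustration, not a real speedup: it uses a thread pool (which, in CPython, will not actually sort the halves in parallel because of the GIL) and `heapq.merge` for the reduce step, and a smaller $n$ than the article's; all names and sizes here are my own choices.

```python
import heapq
import random
from concurrent.futures import ThreadPoolExecutor

n, p = 2**16, 2  # smaller than the article's 2^20, to keep the demo quick
data = [random.randrange(n) for _ in range(n)]

# map: hand each "processor" an equal-sized chunk to sort independently
chunks = [data[i * (n // p):(i + 1) * (n // p)] for i in range(p)]
with ThreadPoolExecutor(max_workers=p) as pool:
    sorted_chunks = list(pool.map(sorted, chunks))

# reduce: merge the sorted chunks into one big sorted array
result = list(heapq.merge(*sorted_chunks))
print(result == sorted(data))  # True
```

With real CPU-bound parallelism you would use separate processes or machines; the map/reduce structure stays the same.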

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y = 2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p})$ gives us the value of $p$ at which the total number of comparisons is already linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to find the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Dividing both sides by $n$ and simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$

Since this is a nonlinear equation, we can solve it using Newton's method, which approximates a root starting from an initial guess. Starting from a guess $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g(p_n)}{g'(p_n)}$

where $g'(p)$ is the first derivative of $g(p)$. Applying the rules of differentiation, we get

$\displaystyle g'(p) = p \cdot 2^p \ln 2 + 2^p$

Substituting this to the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p_n 2^{p_n} - n}{p_n 2^{p_n} \ln 2 + 2^{p_n}}$

Below is R code using Newton's method to compute the root of the equation $g(p) = 0$.

g <- function(n, p) {
  p * 2^p - n
}

gprime <- function(n, p) {
  # first derivative: g'(p) = p * 2^p * ln(2) + 2^p (note the plus sign)
  p * 2^p * log(2) + 2^p
}

newton <- function(p, n, iter) {
  tmp <- p
  for (i in 1:iter) {
    p <- p - g(n, p) / gprime(n, p)
    if (abs(p - tmp) < 0.0001) {
      break
    }
    print(p)
    tmp <- p
  }
  print(p)
}



Running this code with $n = 2^{20}$, we get the value $p = 16$:

> newton(15,n,100)
[1] 16.80905
[1] 16.08829
[1] 15.98603
[1] 16.00286
[1] 15.99944
[1] 16.00011
[1] 15.99998
[1] 16


Ignoring network latency, by distributing the input evenly across 16 processors, we get an $O(n)$ running time for an array of $n = 2^{20}$ items. Therefore, instead of doing about 20 million comparisons, you only need about 2 million: $2^{20}$ for the parallel sorting phase plus another $2^{20}$ for the merge.
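As a cross-check, here is a Python port of the same Newton iteration, using the derivative $g'(p) = p\,2^p \ln 2 + 2^p$ derived above; it also verifies that $p = 16$ satisfies $p \cdot 2^p = 2^{20}$ exactly (function names here are mine):

```python
import math

def newton(p, n, iters=100, tol=1e-4):
    """Newton's method for the root of g(p) = p * 2**p - n."""
    for _ in range(iters):
        g = p * 2**p - n
        gprime = p * 2**p * math.log(2) + 2**p
        p_next = p - g / gprime
        if abs(p_next - p) < tol:
            return p_next
        p = p_next
    return p

root = newton(15.0, 2**20)
print(round(root))          # 16
print(16 * 2**16 == 2**20)  # True: 16 * 65536 = 1048576
```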

In this age of multicore processors, parallel computing is fast becoming the norm rather than the exception. Learning to harness the power of multiple cores is becoming an extremely handy skill to have.

In my younger years as a novice programmer, I could comfortably sit in a corner and program the day away. Life was so focused back then. You arrived at nine and, without even noticing it, it was time to go home with a great sense of achievement, perhaps because you were able to fix a major bug or develop an elegant framework. Alas, those were the days! Multitasking is now the norm and not the exception.

These days, you arrive at 9 AM, do all sorts of work, and handle all sorts of interruptions. By six o’clock you feel you haven’t accomplished anything, what with all those meetings and interruptions every now and then. How do you measure your productivity, then, with all these tasks you need to do every day? Some of these tasks are not even done in a day or two, as you wait for data or input before you can proceed. The good thing is that there is already workflow technology to help you and the management keep track of the time you spend on your tasks, without you even consciously recording when you started a task and when you finished it.

Suppose a certain employee is multitasking among 3 projects A, B, and C. Each project is composed of one or more tasks. To represent task 1 of project A, we write A.1. The same goes for the other tasks of the other projects. When the employee works on a task, the system records the start time and the end time of the task. Most of the time, the employee switches between tasks, especially if one of them is more urgent, and then switches back to the previous task. We can represent this scenario using the diagram below:

In this example, you can see that the employee multitasked between tasks 1 and 2 of project A. We observe that the end time of task A.1 is greater than the start time of task A.2, which means that they overlap. The numbers 1, 2, and 1.5 represent the duration in hours of each segment. The question we want to answer is: what is the total time spent by the employee on each task? Since the tasks overlap for a duration of 2 hours, we can assume, for the sake of simplicity, that the employee did half of the work on task A.1 and half on task A.2 (as to the validity of this assumption, that is another story). Therefore, the total time allotted to each task is

$A.1 = 1 + 2/2 = 2$
$A.2 = 2/2 + 1.5 = 2.5$

Well, that was easy. Now, what happens if the employee’s multitasking graph looks like the one below? How will you compute the duration of each task?

The diagram shows 5 overlaps. The challenge is to create a program that takes as input the start and end times of each task and computes the total time spent on each task, on the assumption that the duration of every overlapping time interval is divided equally among the overlapping tasks.

The algorithm

Label the start times and end times of each task like in the figure below.

Sort these times from earliest to latest to get $t_1 < t_5 < t_3 < t_2 < t_6 < t_4$. Let us call the segment between any two consecutive $t_j$ a TimeSegment. Create two structures named Project and TimeSegment. The Project structure will hold the name of the project and its start and end times. The TimeSegment structure will hold the duration between two consecutive timestamps and the projects that intersect that time segment. We can then compute the duration of each task using the algorithm below:

times = [t1, t5, t3, t2, t6, t4]
times = sorted(times)
segmentList = []
t = times[0]
for tj in times[1:]:
    segment = TimeSegment(tj - t)
    # determine which projects intersect this time segment
    for p in projectList:
        if p.start <= t and p.end >= tj:
            segment.projects.append(p)
    segmentList.append(segment)
    t = tj

# divide each segment's duration equally among its overlapping projects
for ts in segmentList:
    for p in ts.projects:
        p.duration += ts.duration / float(len(ts.projects))

for p in projectList:
    print(p.name + " : " + str(p.duration))
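Putting it all together, below is a self-contained sketch of the script, with minimal Project and TimeSegment classes. The structure names come from the text; the fields and the three sample projects are my reconstruction to match the output listed below.

```python
class Project:
    def __init__(self, name, start, end):
        self.name = name
        self.start = start
        self.end = end
        self.duration = 0.0

class TimeSegment:
    def __init__(self, duration):
        self.duration = duration
        self.projects = []

# the three sample projects from the output below
projectList = [Project("P0", 0, 3), Project("P1", 2, 5), Project("P2", 1, 4)]

# collect every distinct start/end time and sort from earliest to latest
times = sorted({t for p in projectList for t in (p.start, p.end)})

segmentList = []
t = times[0]
for tj in times[1:]:
    segment = TimeSegment(tj - t)
    # a project covers the segment [t, tj] iff it starts at or before t
    # and ends at or after tj
    for p in projectList:
        if p.start <= t and p.end >= tj:
            segment.projects.append(p)
    segmentList.append(segment)
    t = tj

# each segment's duration is split equally among the projects overlapping it
for ts in segmentList:
    for p in ts.projects:
        p.duration += ts.duration / float(len(ts.projects))

for p in projectList:
    print(p.name + " : " + str(p.duration))
```

For these three projects the segments are [0,1], [1,2], [2,3], [3,4], and [4,5], covered by one, two, three, two, and one project respectively, which yields the fractional durations shown in the output.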


I created a Python script implementing this algorithm, with the following output:


project P0: start = 0, end = 3
project P1: start = 2, end = 5
project P2: start = 1, end = 4

Duration of Each Project:

P0 : 1.83333333333
P1 : 1.83333333333
P2 : 1.33333333333



In the next article, I’ll change the output of this program to a CSV file and create an R script that will automatically plot the output to clearly show the overlaps.