Divided We Compute, United We Reduce

Once upon a time, in a faraway village, lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said, “Sons, see that bundle of sticks? Each one of you, try to break it. The one who can break it will inherit all my riches.” Each son, being greedy, wanted all the riches for himself. So each of them tried to break the bundle of sticks, but none of them succeeded. The old man then asked his servant to untie the bundle and said to his sons, “Each of you now take one stick and break it.” Without any effort, each son was able to break his stick. The old man said, “You see, when you unite, no task will be difficult. The riches I told you about were a lie. We are broke. When I’m dead, make sure you unite so that you can survive.”

Fast forward to modern times. Think of the bundle of sticks as a complex problem that is itself composed of smaller problems, and the sons as the processors of your computer. When a single processor is given the whole complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems and each piece is given to a separate processor, each processor can solve its smaller problem quickly, thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them on separate processors is called parallel computing. In this article, we will compute how much faster a certain algorithm runs when parallelized. The problem we want to investigate is sorting an array of about a million (2^{20}) integers.

Efficient Sorting

Suppose you have an array \{ a_1, a_2, a_3, \ldots, a_n \} that you want to sort based on pairwise comparisons. The sorted array is just one of the many permutations of \{a_1, a_2, a_3,\ldots, a_n\}. In fact, if you have n distinct objects to sort, then there are exactly n! ways to arrange these objects, and only one of them is the sorted arrangement. You can imagine the sorting process as a decision tree. Say, for example, we have the array A=\{a, b, c\}. To sort it, we first compare a with b, and there are two ways this can go: either a \le b or a > b. If a \le b, we then compare b and c, which again gives either b \le c or b > c. As you can see from the diagram below, this is nothing but a decision tree.



Since this decision tree has n! leaves, its height, which is the worst-case number of comparisons, is at least \lg(n!). We then have

\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] = \lg n + \lg (n-1) + \cdots + \lg 1 \le \underbrace{\lg n + \lg n + \cdots + \lg n}_{n \text{ times}}
\lg(n!) \le n\cdot \lg n
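As a quick sanity check, here is a short R snippet (an illustrative addition, not part of the original post) comparing \lg(n!) with n \lg n for a few values of n; lfactorial(n) returns \ln(n!), so dividing by \ln 2 gives \lg(n!).

# compare lg(n!) with n*lg(n); lfactorial(n) = ln(n!), so divide by ln(2)
n = c(4, 8, 16, 1024)
data.frame(n = n, lg_factorial = lfactorial(n)/log(2), n_lg_n = n*log2(n))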

There are efficient algorithms that achieve this complexity; merge sort, for example, runs in O(n \lg n) comparisons. Therefore, if you have an array of 2^{20} elements, the number of comparisons is

2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520

That is, it takes about 20 million comparisons to sort an array of a million elements. Can we do better? We can either upgrade the CPU of the machine doing the sorting or use two or more machines and divide the work among them. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it out to other processors.
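This count is a one-liner in R (added here for illustration):

n = 2^20
n * log2(n)   # 20971520 comparisons for a sequential O(n lg n) sort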

Divide and Conquer

Assume we have an array of n=2^{20} elements that we need to sort, and suppose we have two identical processors at our disposal. Divide the array into two equal-sized subarrays. Give the first subarray to the first processor and the second to the second processor. Apply an efficient sorting algorithm to each subarray so that each processor produces a sorted array. We then combine the results of processor 1 and processor 2 into one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:



This is also known as the MapReduce pattern. Mapping is the process of assigning subsets of the input data to processors, where each processor computes a partial result. Reducing is the process of aggregating the results of the processors into the final solution of the problem.
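Here is a rough sketch of this map/reduce scheme in R (an illustrative addition; it assumes the parallel package that ships with R, and the helper names merge_sorted and parallel_sort2 are my own). The merge_sorted helper implements the merge step that is described in detail just below.

library(parallel)   # ships with R; mclapply forks on Unix-like systems (use parLapply on Windows)

# merge two already-sorted vectors into one sorted vector
merge_sorted = function(x, y){
	out = numeric(length(x) + length(y))
	i = 1
	j = 1
	for(k in seq_along(out)){
		if(j > length(y) || (i <= length(x) && x[i] <= y[j])){
			out[k] = x[i]
			i = i + 1
		} else {
			out[k] = y[j]
			j = j + 1
		}
	}
	out
}

# map: give one half of the array to each of two workers and sort it;
# reduce: merge the two sorted halves into the final sorted array
parallel_sort2 = function(a){
	halves = split(a, rep(1:2, each = ceiling(length(a)/2), length.out = length(a)))
	sorted = mclapply(halves, sort, mc.cores = 2)
	merge_sorted(sorted[[1]], sorted[[2]])
}

# example: sort a random array of 2^20 integers
# a = sample.int(10^6, 2^20, replace = TRUE)
# parallel_sort2(a)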

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two occupies the first slot in the big array, and the next element of the array from which it was taken becomes that array's new first element. Repeat the process until all elements of both arrays have been placed in the big array. The diagram below illustrates the merging algorithm.




If you count the total number of comparisons needed to merge two sorted arrays with a combined total of n elements, you will find that it takes at most n-1 comparisons. Therefore, the complexity of the merging process is O(n).
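To see this concretely, here is a small helper (an added illustration) that counts the comparisons performed while merging two sorted vectors; for two sorted halves totalling n elements it performs at most n - 1 comparisons.

# count the comparisons performed when merging two sorted vectors
merge_comparisons = function(x, y){
	i = 1
	j = 1
	comparisons = 0
	# one comparison per element placed, until one vector is exhausted
	while(i <= length(x) && j <= length(y)){
		comparisons = comparisons + 1
		if(x[i] <= y[j]) i = i + 1 else j = j + 1
	}
	comparisons
}

set.seed(1)
x = sort(sample.int(100, 8))
y = sort(sample.int(100, 8))
merge_comparisons(x, y)   # at most length(x) + length(y) - 1 = 15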

Since each of the p processors gets a subarray of n/p elements (n/2 in our two-processor example), the sorting complexity per processor is (n/p) \lg (n/p). Furthermore, since the merging process takes O(n) comparisons, the total complexity of the parallel sorting process is therefore

\displaystyle n/p \lg(n/p) + n

In our example, C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048 comparisons, compared to 2^{20} \lg(2^{20}) = 20971520 when run sequentially. For large values of n, the n/p \lg(n/p) term dominates n, so the complexity of the parallel algorithm is O(n/p \lg(n/p)).
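These counts are easy to reproduce in R (an added one-liner):

n = 2^20
p = 2
(n/p) * log2(n/p) + n   # 11010048, versus 20971520 for the sequential sort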

Can we do any better?

For a given value of n, what do you think is the value of p that reduces the running time to O(n)? If we take n=2^{20} and plot the complexity against p = \{ 2, 4, 8, 16, 32\}, we get the diagram below.
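The diagram can be reproduced with a few lines of base R (an added sketch; the log-scaled x axis is my own choice):

n = 2^20
p = 2^(1:5)                          # p = 2, 4, 8, 16, 32
f = (n/p) * log2(n/p)                # comparisons for the parallel sort (without the merge term)
plot(p, f, type = "b", log = "x",
     xlab = "number of processors p", ylab = "comparisons")
abline(h = n, lty = 2)               # horizontal reference line y = n = 2^20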



In this diagram, we also plotted the horizontal line y=2^{20}. The intersection of this line with the plot of \displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p}) gives us the value of p at which the total number of comparisons becomes linear in n, that is,

\displaystyle f( p ) = n
\displaystyle \frac{n}{p} \lg(\frac{n}{p})  = n

To get the value of p numerically, we have to find the root of the equation

\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0

Dividing through by n and simplifying,

\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0
\displaystyle \lg(\frac{n}{p}) = p
\displaystyle \frac{n}{p} = 2^p
\displaystyle p2^p - n = 0

Since this is a non-linear equation, we can solve it using Newton's method, a technique that approximates a root by iterative refinement starting from an initial guess. Starting from a guess p_1, the root can be approximated using the recursive formula

\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}

where g\prime ( p ) is the first derivative of g( p ). Applying the rules of differentiation, we get

\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p

Substituting this into the formula for Newton's method, we get

\displaystyle p_{n+1} = p_n - \frac{p_n 2^{p_n} - n}{p_n 2^{p_n} \ln 2 + 2^{p_n}}

Below is R code that uses Newton's method to compute the root of the equation g( p ) = 0.

# g(p) = p*2^p - n, whose root gives the desired number of processors p
g = function(n, p){
	p * 2^p - n
}

# first derivative of g with respect to p: g'(p) = p*2^p*ln(2) + 2^p
gprime = function(n, p){
	p * 2^p * log(2) + 2^p
}

# Newton's method starting from the initial guess p, for at most iter iterations
newton = function(p, n, iter){
	tmp = p
	for(i in 1:iter){
		p = p - g(n, p) / gprime(n, p)

		# stop once successive iterates agree to within 0.0001
		if(abs(p - tmp) < 0.0001){
			break
		}

		print(p)
		tmp = p
	}
	print(p)
}

Running this code with n = 2^{20} and an initial guess of 15, for example newton(15, 2^20, 100), the iterates converge to p = 16 within a handful of iterations.
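Indeed, p = 16 solves p 2^p = n exactly, since 16 \cdot 2^{16} = 2^4 \cdot 2^{16} = 2^{20}. A quick check in R (an added verification):

n = 2^20
16 * 2^16               # 1048576 = n, so p = 16 satisfies p * 2^p = n
(n/16) * log2(n/16)     # 1048576 = n, i.e. each processor's sort now takes about n comparisons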

Ignoring network latency, by distributing the input evenly across 16 processors we get an O(n) running time for an array of n=2^{20} items. Therefore, instead of the roughly 20 million comparisons of the sequential sort, each processor performs only about a million comparisons to sort a million objects.

In this age of multicore processors, parallel computing is fast becoming the norm rather than the exception. Learning to harness the power of multiple cores is becoming an extremely handy skill to have.

Do It Yourself Supercomputing in Linux Part 1

If you recently purchased a laptop or desktop computer, chances are you have a dual-core system. There is no indication that we will go back to single-core processors unless some technological breakthrough breaks the power barrier of CPUs. This means that more and more people will have a high-powered computer in their homes without any idea of how to harness that power.

What does it mean to have a dual processor? Your first impulse is probably to think that it speeds up the execution of your programs, and you would probably perceive a significant difference between the response times of your programs on a single core versus a dual core. But why do programs run faster on dual-processor systems? For one thing, the speed of each processor is faster compared to older single processors. The other reason is symmetric multiprocessing. An analogy for a dual-processor system is a bank with two tellers and a single line of customers: when a teller is done with one customer, the teller takes the next customer from the line. You can read about the performance of multiprocessor systems in this article.

There is another way to make use of multiple processors to make your programs run faster: parallel programming. Parallel computing has been around for a long time, but it has usually been confined to university laboratories or supercomputing facilities. It has not caught the interest of ordinary people because they have had no access to such machines. However, the future is already about multiprocessing, and more and more of these machines will reach the masses. This means that highly talented members of the masses will get to experiment with parallel computing on a day-to-day basis. We are in for another revolution in computing. Are you ready for this revolution?


Talk on Parallel Computing

Parallel computing used to be the province of scientists and engineers. The tools and algorithms to do it have always been available on the net. However, since the advent of multicore machines, there is now a compelling reason for all programmers to learn and master it. Parallel programming is the future of computing, and that future is now.

I remember that way back in 1998 I was introduced to parallel computing with MPI. I have also written some papers on parallel algorithms and am still active in parallel algorithms research. Now I am advocating parallel computing and teaching it to a lot of people.


Celebrating Labor Day

It’s Labor Day, and what more fitting way to celebrate Labor Day than to work!

The first order of business today was to set up MPI (Message Passing Interface) and demo programs on my computer. Since MPI is a message-passing platform for distributed computing, I needed to set it up on another laptop as well. I'm glad my friend Roli Balicas agreed to let me set up his laptop. He spent about 5 hours at our house. It does not really take that long to set up MPI on two dual-core laptops; we just spent most of the time eating pizza from Pizza Hut and watching Jet Li movies. We were able to watch two full movies while I was setting up MPI on the two laptops.

Roli’s laptop was running Ubuntu 7.10 and was not yet configured as a developer's laptop, so I had to install the xorg developer packages in order to make the MPE component of MPI compile with X enabled. After upgrading his laptop, I was able to recompile MPI and compile the Mandelbrot demo. Now the two laptops can run the demo.

After Roli left, I went back to my experiments with Model-Driven Architecture. I'm looking at AndroMDA as an MDA generator and MagicDraw as a UML tool. I'm enjoying my explorations of this subject. However, MDA seems to have a bleak future; a lot of heavyweight personalities don't see any future in MDA. See, for example, this. Whatever the case may be, the fact that I'm enjoying it is already enough for me.