## Implementing Parallel Algorithms Part 2

In the previous post, we implemented a very simple parallel program to add a set of numbers. In this post, we will implement parallel matrix multiplication.

We have shown a parallel algorithm to multiply 2 big matrices using message passing. The algorithm involved block sub-matrices to be passed from node to node and multiplied within a node until the answer is found.

There will be 2 square matrices A and B. In our example, the dimension of both A and B is 4×4. We will distribute the matrices evenly to 4 nodes.

$A = \left[ \begin{array}{cc|cc} a_{00} & a_{01} & a_{02} & a_{03}\\ a_{10} & a_{11} & a_{12} & a_{13}\\ \hline a_{20} & a_{21} & a_{22} & a_{23}\\ a_{30} & a_{31} & a_{32} & a_{33} \end{array} \right], B = \left[ \begin{array}{cc|cc} b_{00} & b_{01} & b_{02} & b_{03}\\ b_{10} & b_{11} & b_{12} & b_{13}\\ \hline b_{20} & b_{21} & b_{22} & b_{23}\\ b_{30} & b_{31} & b_{32} & b_{33} \end{array} \right]$

In this example, node 00 will have the following matrices:

$A_{00}=\begin{bmatrix} a_{00} & a_{01}\\ a_{10} & a_{11} \end{bmatrix}, B_{00}= \begin{bmatrix} b_{00} & b_{01}\\ b_{10} & b_{11} \end{bmatrix}$

const n = 4;
var vec = 1..n;
var blockSize = 2;

var A: [vec, vec] real;
var B: [vec, vec] real;
var C: [vec, vec] real;

coforall loc in Locales {
on loc {
var i = loc.id/2;
var j = loc.id%2;
var istart = i*blockSize;
var iend = istart + blockSize;
var jstart = j*blockSize;
var jend = jstart + blockSize;

for (r,s) in {istart + 1..iend, jstart + 1..jend} {
B(r,s) = r+s;
A(r,s) = 2*r + s;
}
}
}


Each node has limited memory physically exclusive to itself. In order for node A to have access to the contents of the memory of another node B, node B should pass the data to node A. Fortunately, Chapel can use a library called GASNet that allows each node to have a global view of the memory of all nodes participating in the computation.

In the code above, each node loads its own data. However, the GASNet library allows each node to access the matrix elements loaded by the other nodes. Consequently, we are able to reference the sub-matrix held by each node without doing fancy message passing. The algorithm is then a straightforward implementation of

$\displaystyle \mathbf{C}_{ij}=\sum_{k=0}^{2} \mathbf{A}_{ik} \mathbf{B}_{kj}$

where $\mathbf{A_{ij}}$, $\mathbf{B_{ij}}$ and $\mathbf{C_{ij}}$ are submatrices of $\mathbf{A}$, $\mathbf{B}$, and $\mathbf{C}$, respectively.

Below is the straightforward implementation of parallel block multiplication:

coforall loc in Locales {
on loc {
var i = loc.id/2;
var j= loc.id%2;
var istart = i*blockSize;
var iend = istart + blockSize;
var jstart = j*blockSize;
var jend = jstart + blockSize;
var r = { istart + 1..iend, jstart + 1..jend };
ref W = C[r].reindex( { 1..2,1..2 });

coforall k in 0..1 {
var U=get_block_matrix(A[vec,vec],i,k,blockSize);
var V=get_block_matrix(B[vec,vec],k,j,blockSize);
var P = mat_mul(U,V);
coforall (s,t) in { 1..2,1..2 } {
W(s,t) += P(s,t);
}
}
}
}


The procedure get_block_matrix will return the sub-matrix given the (i,j)th index and the block size.

proc get_block_matrix(A: [?D], i:int, j:int , blockSize:int) {
var r = { i*blockSize+1 .. i*blockSize
+  blockSize, j*blockSize
+ 1 .. j*blockSize + blockSize };
return A[r];
}


The procedure mat_mul will return the matrix product of two sub-matrices:

proc mat_mul(A: [?D1], B: [?D2]) {
var D3 = { 1..2, 1..2 };
var C: [D3] real;
var AA = A.reindex({1..2,1..2});
var BB = B.reindex({1..2,1..2});

for row in 1..2 {
for col in 1..2 {
var sum:real = 0;
for k in 1..2 {
sum += AA(row,k) * BB(k,col);
}
C(row,col) = sum;
}
}
return C;
}

writeln(C[vec,vec]);


To run this code, we need to set the following environment variables:

source \$CHPL_HOME/util/setchplenv.bash

export CHPL_COMM=gasnet
export CHPL_LAUNCHER=amudprun

export GASNET_SSH_SERVERS="127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1"


Compiling and running this program gives the output:

#Compile
chpl mat_mul.chpl -o mat_mul

# Run using 4 nodes
./mat_mul -nl 4

A=
3.0 4.0 5.0 6.0
5.0 6.0 7.0 8.0
7.0 8.0 9.0 10.0
9.0 10.0 11.0 12.0
B=
2.0 3.0 4.0 5.0
3.0 4.0 5.0 6.0
4.0 5.0 6.0 7.0
5.0 6.0 7.0 8.0
C=
68.0 86.0 104.0 122.0
96.0 122.0 148.0 174.0
124.0 158.0 192.0 226.0
152.0 194.0 236.0 278.0


## Implementing Parallel Algorithms Part 1

Now we know that parallel algorithms allow us to make our programs run faster. So how do we implement them?

I have used mpich before, but that was more than a decade ago. Recently, I found myself looking for new ways of doing parallel programming. I discovered a very nice parallel programming language called Chapel. This is what we’ll use to implement parallel algorithms.

## Algorithm 1: Parallel sum of consecutive numbers from 1 to N

To get the sum of numbers from 1 to N, where N is some integer is easy. There is a formula to do that:

$\displaystyle \sum_{i=1}^N i = \frac{N(N+1)}{2}$

However for the sake of illustration, we are going to compute the sum of 1 to N using a cluster of machines. Here is the Chapel code to accomplish it inside the file add_parallel.chpl.

config var N:int = 1000;
var D:domain(1) = {0..numLocales -1};
var s:[D] int;
var sum:int= 0;
var bs = N/numLocales;
coforall loc in Locales {
on loc {
var i = loc.id;
var start = i*bs + 1;
var end = start + bs -1;
var _sum:int = 0;
for j in start .. end {
_sum += j;
}
writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
s[i] = _sum;
}
}

sum = + reduce s;
writeln("sum: " + sum);


This program is compiled using the command:

chpl add_parallel.chpl -o add_parallel


where add_parallel.chpl is the filename of the program and -o add_parallel specifies the filename of the binary produced after compilation.

One line 1, we have defined the default value of N to be 1000. This can be overridden on the command line by specifying the --x parameter. The number of machines we are going to use is also specified on the command line using the -nl parameter. A sample invocation of this program is the following:

./add_parallel -nl 3 --N=120


The above command means that we want to run the add_parallel binary using 3 machines with the value of N=120.

Executing the above command will give us:

./add_parallel -nl 3 --N=120
i= 0, start= 1, end=40, sum = 820
i= 1, start= 41, end=80, sum = 2420
i= 2, start= 81, end=120, sum = 4020
sum: 7260


## How the program works

The program will partition N into 3 blocks. This is specified on line 5 where we divided N by numLocales to get the block size. The numLocales will contains the value of the parameter -nl which in this example is 3.

The code on line 6 tells chapel to execute a parallel for-loop executing the code inside on loc block on each Locale. A Locale has an id starting from 0. The Locale will determine it’s id and compute the starting and ending number to sum and store this value in the variable _sum.

coforall loc in Locales {
on loc {
var i = loc.id;
var start = i*bs + 1;
var end = start + bs -1;
var _sum:int = 0;
for j in start .. end {
_sum += j;
}
writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
s[i] = _sum;
}
}


This _sum is stored in the array s. We take the sum of entries of the s array using the reduce keyword specifying + as the reduction operator. Finally we print the total sum across the machines.

sum = + reduce s;
writeln("sum: " + sum);


## Conclusion

We have seen that that we can implement parallel programs using Chapel programming language. In part 2, we will show how to do Parallel Matrix Multiplication using Chapel.

## Divided We Compute, United We Reduce

Once upon a time, in a far away village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said “Sons, see that bundle of sticks? Each one of you try to break it. The one who can break it will inherit all my riches”. Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now get one stick and break it”. Without any effort, each son was able to break the stick. The old man said “You see, when you unite, no task will be difficult. The riches that I told you was a lie. We are broke. When i’m dead, make sure  you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks to be a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When each processor was given the task to solve the complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems and given to each processor, each processor is now able to solve the smaller problems quickly thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them in separate processors is called Parallel Computing. In this article, we will compute how fast a certain algorithm will run when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, ..., a_n \}$ that you want to sort based on pairwise comparison.  The sorted array is just one of the many permutations of the array $\{a_1, a_2, a_3,\ldots, a_n\}$. In fact, if you have $n$ different objects to sort, then there are exactly $n!$ ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example we have array A={ a,b,c }. To sort this, we first compare a with b and there are 2 ways this can go. Either $a \le b$ or $a > b$. If $a \le b$, we then compare b and c. This also give either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.

Since the height of this binary tree is lg(n!), then we have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] \le \lg n + \lg (n-1) \cdots \lg1 \le \underbrace{\lg n \cdots \lg n}_\text{ n times}$
$\lg(n!) \le n\cdot \lg n$

There are efficient algorithms that are able to sort of this complexity. For example, the merge sort has this complexity. Therefore, if you have an array of $2^{20}$ elements, then the complexity is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million. Could we do any better than this? We can either upgrade the cpu of the machine doing the sorting or use two or more machines to divide the work among those machines. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it to other processors.

Divide and Conquer

Assume we have an array $n=2^{20}$ elements that we need to sort and suppose we have two identical processors we can use. Divide the array into 2 equal sized arrays. Give the first array to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array for each processor. We then combine the result of processor 1 and processor 2 to one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.

If you count the total number of comparisons that you need to merge two sorted arrays, you will find that it takes $n-1$ comparisons. Therefore, the complexity of the merging process is $O(n)$.

Since each processor has $n/2$ sized subarrays, the sorting complexity is therefore $n/p \lg (n/p)$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle n/p \lg(n/p) + n$

In our example, $C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048$ comparisons compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $n/p \lg(n/p)$ dominates $n$, therefore the complexity of the parallel algorithm is $O(n/p \lg(n/p))$.

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y=2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p})$ gives us the value of $p$ such that the total comparisons is already linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to solve the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$

Since this is a non-linear equation, we can solve this using the Newton's method. It is a method to compute the roots by approximation given an initial value of the solution. Starting from a guess solution $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}$

where $g\prime ( p )$ is the first derivative of $g( p )$. Applying the rules of derivatives, we get

$\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p$

Substituting this to the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p2^p - n}{p2^p \ln 2 - 2^p}$

Below is an R code using newton’s method to compute the root of the equation $g(p)$.

g=function(n,p){
p* 2^p - n
}

gprime=function(n,p){
p*2^p *log(2) - 2^p
}

newton=function(p,n,iter){
tmp = p
for(i in 1:iter){
p=p-g(n,p)/gprime(n,p)

if(abs(p-tmp)< 0.0001){
break
}

print(p)
tmp=p
}
print(p)
}



Running this code, we get the value of $p = 16$:

> newton(15,n,100)
[1] 16.80905
[1] 16.08829
[1] 15.98603
[1] 16.00286
[1] 15.99944
[1] 16.00011
[1] 15.99998
[1] 16


Ignoring network latency, by distributing the input evenly into 16 processors, we get a running time of $O(n)$ time complexity for $n=2^{20}$ array of items. Therefore, instead of doing 20 million comparisons, you only need 1 million comparisons to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm than the exception. Learning to harness the power of multicores is becoming an extremely handy skill to have.

## Do It Yourself Supercomputing in Linux Part 1

If you recently purchased a laptop or desktop computer, chances are you have a dual core system. There does not seem to be any indication anymore of us going back to single core unless some technological breakthrough will break the power barrier of CPUs. This means that more and more people will have a high powered computer in their homes without any idea how to harness such power.

What does it mean to have a dual processor? On first impulse, you probably will think it will speed up the execution of your programs. You would probably perceive a significant difference between the response time of your programs in a single core versus a dual core. But why do programs run faster in dual processor systems? For one thing, the speed of each processor is faster as compared to older single processors. The other thing is due to symmetric multiprocessing. An analogy of the dual processor system is a bank with 2 tellers and having a single line for customers. When a teller is done with one customer, it will get the next customer from the line. You can read about the performance of multiprocessor systems in this article.

There is another way you can make use of multiprocessors to make your programs run faster. This is through parallel programming. Parallel computing has been around for a long time already but are usually confined to university laboratories or supercomputing facilities. It has not caught the interest of the common people because they have no access to such machines. However, the future is already about multi-processing. More and more of these machines will reach the masses. This means that highly talented members of the masses get to experiment with parallel computing on a day-to-day basis. We are in for another revolution in computing. Are you ready for this revolution?

## Talk on Parallel Computing

Parallel computing used to be the province of scientists and engineers. The tools and algorithms to do this has always been available on the net. However, since the advent of multicore machines, there is now a compelling reason for all programmers to learn and master it. Parallel programming is already the future of computing and there future is now.

I remember way back in 1998, I was introduced to parallel computing in MPI. I also wrote some papers on parallel algorithms and still active in parallel algorithms research. Now I am advocating and teaching a lot of people in parallel computing.

## Celebrating Labor Day

It’s labor day and what a fitting way to celebrate Labor day than to work!

First order of business today was to setup the MPI (Message Passing Interface) and demo programs in my computer. Since MPI is a shared-memory distributed computing platform, I need to set it up also in another laptop. I’m glad my friend Roli Balicas agreed to let me setup his laptop. He spent about 5 hours in our house. It does not really take a long time to setup MPI in two dual-core laptops. We just spent most of the time eating pizza from Pizza Hut and watching Jet Li movies. We were able to watch two full movies while I was also setting up MPI on the two laptops.

Roli’s laptop was running Ubuntu 7.10 and is not yet configured as a developer’s laptop so I had to install xorg developer packages in order to make the MPE of MPI compile with X enabled. After upgrading his laptop, I was able to recompile MPI and compile the Mandelbrot demo. Now the two laptops can run the demo.

After Roli left, I went back to my experiments on Model-Driven Architecture. I’m looking at AndroMDA, an MDA generator and MagicDraw as UML tool. I’m enjoying my explorations into this subject. However, MDA seems to have a bleak future. A lot of heavy-weight personalities don’t see any future on MDA. See for example this. Whatever may be the case, the fact that i’m enjoying it is already enough for me.