Now we know that parallel algorithms allow us to make our programs run faster. So how do we implement them?

I have used mpich before, but that was more than a decade ago. Recently, I found myself looking for new ways of doing parallel programming. I discovered a very nice parallel programming language called Chapel. This is what we’ll use to implement parallel algorithms.

## Algorithm 1: Parallel sum of consecutive numbers from 1 to N

To get the sum of numbers from 1 to N, where N is some integer is easy. There is a formula to do that:

However for the sake of illustration, we are going to compute the sum of 1 to N using a cluster of machines. Here is the Chapel code to accomplish it inside the file `add_parallel.chpl`

.

config var N:int = 1000; var D:domain(1) = {0..numLocales -1}; var s:[D] int; var sum:int= 0; var bs = N/numLocales; coforall loc in Locales { on loc { var i = loc.id; var start = i*bs + 1; var end = start + bs -1; var _sum:int = 0; for j in start .. end { _sum += j; } writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum); s[i] = _sum; } } sum = + reduce s; writeln("sum: " + sum);

This program is compiled using the command:

chpl add_parallel.chpl -o add_parallel

where `add_parallel.chpl`

is the filename of the program and `-o add_parallel`

specifies the filename of the binary produced after compilation.

One line 1, we have defined the default value of N to be 1000. This can be overridden on the command line by specifying the `--x`

parameter. The number of machines we are going to use is also specified on the command line using the `-nl`

parameter. A sample invocation of this program is the following:

./add_parallel -nl 3 --N=120

The above command means that we want to run the `add_parallel`

binary using 3 machines with the value of N=120.

Executing the above command will give us:

./add_parallel -nl 3 --N=120 i= 0, start= 1, end=40, sum = 820 i= 1, start= 41, end=80, sum = 2420 i= 2, start= 81, end=120, sum = 4020 sum: 7260

## How the program works

The program will partition N into 3 blocks. This is specified on line 5 where we divided N by `numLocales`

to get the block size. The `numLocales`

will contains the value of the parameter `-nl`

which in this example is 3.

The code on line 6 tells chapel to execute a parallel for-loop executing the code inside `on loc`

block on each Locale. A Locale has an id starting from 0. The Locale will determine it’s id and compute the `start`

ing and `end`

ing number to sum and store this value in the variable `_sum`

.

coforall loc in Locales { on loc { var i = loc.id; var start = i*bs + 1; var end = start + bs -1; var _sum:int = 0; for j in start .. end { _sum += j; } writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum); s[i] = _sum; } }

This `_sum`

is stored in the array `s`

. We take the sum of entries of the `s`

array using the `reduce`

keyword specifying `+`

as the reduction operator. Finally we print the total sum across the machines.

sum = + reduce s; writeln("sum: " + sum);

## Conclusion

We have seen that that we can implement parallel programs using Chapel programming language. In part 2, we will show how to do Parallel Matrix Multiplication using Chapel.