In the previous post, we implemented a very simple parallel program to add a set of numbers. In this post, we will implement parallel matrix multiplication.

We have shown a parallel algorithm to multiply 2 big matrices using message passing. The algorithm involved block sub-matrices to be passed from node to node and multiplied within a node until the answer is found.

There will be 2 square matrices A and B. In our example, the dimension of both A and B is 4×4. We will distribute the matrices evenly to 4 nodes.

In this example, node 00 will have the following matrices:

Let’s simulate each node loading entries of sub-matrices assigned to it.

const n = 4; var vec = 1..n; var blockSize = 2; var A: [vec, vec] real; var B: [vec, vec] real; var C: [vec, vec] real; coforall loc in Locales { on loc { var i = loc.id/2; var j = loc.id%2; var istart = i*blockSize; var iend = istart + blockSize; var jstart = j*blockSize; var jend = jstart + blockSize; for (r,s) in {istart + 1..iend, jstart + 1..jend} { B(r,s) = r+s; A(r,s) = 2*r + s; } } }

## Global Address Space

Each node has limited memory physically exclusive to itself. In order for node A to have access to the contents of the memory of another node B, node B should pass the data to node A. Fortunately, Chapel can use a library called GASNet that allows each node to have a global view of the memory of all nodes participating in the computation.

In the code above, each node loads its own data. However, the GASNet library allows each node to access the matrix elements loaded by the other nodes. Consequently, we are able to reference the sub-matrix held by each node without doing fancy message passing. The algorithm is then a straightforward implementation of

where , and are submatrices of , , and , respectively.

Below is the straightforward implementation of parallel block multiplication:

coforall loc in Locales { on loc { var i = loc.id/2; var j= loc.id%2; var istart = i*blockSize; var iend = istart + blockSize; var jstart = j*blockSize; var jend = jstart + blockSize; var r = { istart + 1..iend, jstart + 1..jend }; ref W = C[r].reindex( { 1..2,1..2 }); coforall k in 0..1 { var U=get_block_matrix(A[vec,vec],i,k,blockSize); var V=get_block_matrix(B[vec,vec],k,j,blockSize); var P = mat_mul(U,V); coforall (s,t) in { 1..2,1..2 } { W(s,t) += P(s,t); } } } }

The procedure `get_block_matrix`

will return the sub-matrix given the (i,j)th index and the block size.

proc get_block_matrix(A: [?D], i:int, j:int , blockSize:int) { var r = { i*blockSize+1 .. i*blockSize + blockSize, j*blockSize + 1 .. j*blockSize + blockSize }; return A[r]; }

The procedure `mat_mul`

will return the matrix product of two sub-matrices:

proc mat_mul(A: [?D1], B: [?D2]) { var D3 = { 1..2, 1..2 }; var C: [D3] real; var AA = A.reindex({1..2,1..2}); var BB = B.reindex({1..2,1..2}); for row in 1..2 { for col in 1..2 { var sum:real = 0; for k in 1..2 { sum += AA(row,k) * BB(k,col); } C(row,col) = sum; } } return C; } writeln(C[vec,vec]);

To run this code, we need to set the following environment variables:

source $CHPL_HOME/util/setchplenv.bash export CHPL_COMM=gasnet export CHPL_LAUNCHER=amudprun export GASNET_SSH_SERVERS="127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1"

Compiling and running this program gives the output:

#Compile chpl mat_mul.chpl -o mat_mul # Run using 4 nodes ./mat_mul -nl 4 A= 3.0 4.0 5.0 6.0 5.0 6.0 7.0 8.0 7.0 8.0 9.0 10.0 9.0 10.0 11.0 12.0 B= 2.0 3.0 4.0 5.0 3.0 4.0 5.0 6.0 4.0 5.0 6.0 7.0 5.0 6.0 7.0 8.0 C= 68.0 86.0 104.0 122.0 96.0 122.0 148.0 174.0 124.0 158.0 192.0 226.0 152.0 194.0 236.0 278.0