# A Very Brief Introduction to Parallel Computing

Imagine you were given a hundred 3-digit numbers to add. How much time would it take you to get the answer? If it takes you 30 seconds to add ten numbers (using a calculator), then it would take you about 300 seconds (or 5 minutes) to add all 100 numbers.

Now imagine there are a hundred people and each person has a number. How would a hundred people compute the sum of all the numbers? It seems like a recipe for disaster. However, we can do this:

1. Group the people in pairs. If there is an extra person with no pair, that person joins the nearest pair (making a group of 3).
2. Each group adds the numbers its members hold to get a sum S.
3. Each group nominates a representative who carries this new number S. The remaining members can sit down.
4. Repeat from step 1 until there is only one person remaining.
5. The last person remaining will have the sum of the 100 numbers.
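The steps above can be sketched in code. The sketch below is a single-machine simulation, not actual parallelism; the function name `pairwiseSum` and its return shape are ours, not part of the thought experiment.

```javascript
// Simulate the pairwise reduction above: each round, adjacent people pair up
// and add their numbers; a leftover person joins the nearest pair (a group
// of 3). One pass of the while loop corresponds to one iteration in the text.
function pairwiseSum(numbers) {
  let values = numbers.slice();
  let rounds = 0;
  while (values.length > 1) {
    const next = [];
    for (let i = 0; i + 1 < values.length; i += 2) {
      next.push(values[i] + values[i + 1]);
    }
    if (values.length % 2 === 1) {
      // The extra person joins the nearest group to make a group of 3.
      next[next.length - 1] += values[values.length - 1];
    }
    values = next;
    rounds += 1;
  }
  return { sum: values[0], rounds };
}
```

Because a leftover joins an existing group rather than waiting alone, the number of rounds comes out at most $\lceil\log_2 n\rceil$.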

At the beginning, we have 100 people in 50 groups of two. It takes 3 seconds for each group to add its two numbers. In the next iteration, only 50 people remain, who then pair up and add their numbers. Continuing in this way, the remaining number of people is halved each time until, in the 7th iteration, we get our answer. So if each iteration executes in 3 seconds, then it takes only 21 seconds for a hundred people to compute the sum of 100 numbers!

I mentioned that in the 7th iteration we are able to get the answer. There is a formula for the number of iterations, and it is

$\displaystyle \lceil\log_2(n)\rceil$

where $n$ is the number of people to start with and the symbol $\lceil\,\rceil$ is the ceiling function, which rounds the value of $\log_2(n)$ up to the nearest integer. If $n=100$, the number of iterations is

$\displaystyle \lceil\log_2 100\rceil = 7$

Having a hundred people compute the sum of 100 numbers might be practically infeasible; we might not have the space to accommodate them. However, even with only, say, 10 people, we can still get a faster computation.

## Map Reduce

Given a hundred numbers, we can split them into 10 groups of 10 and distribute the groups to 10 people. Each person then adds the numbers they have, and the partial sums are combined to get the sum of all 100 numbers in roughly 1/10th of the time.

Splitting the numbers into smaller subsets and distributing them to each person is called mapping. Combining the sums computed by each person into the total sum is called reduction. Together these two steps form map-reduce, and the scheme above is a simple example of it.
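A minimal sketch of this map-reduce scheme, again simulated sequentially; `chunk` and `mapReduceSum` are illustrative names, not a real framework's API.

```javascript
// Map step: split the numbers into one chunk per person.
function chunk(numbers, size) {
  const chunks = [];
  for (let i = 0; i < numbers.length; i += size) {
    chunks.push(numbers.slice(i, i + size));
  }
  return chunks;
}

// Each person sums their own chunk; the partial sums are then
// reduced into the final total.
function mapReduceSum(numbers, people) {
  const size = Math.ceil(numbers.length / people);
  const partialSums = chunk(numbers, size)
    .map(part => part.reduce((s, x) => s + x, 0)); // map: each person adds
  return partialSums.reduce((s, x) => s + x, 0);   // reduce: combine sums
}
```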

## A More Complex Example

Let $\mathbf A$ and $\mathbf B$ be two matrices. The product of matrices $\mathbf A$ and $\mathbf B$ is the matrix $\mathbf C$

$\displaystyle \begin{bmatrix} c_{00} & c_{01} & \cdots & c_{0,n-1}\\ c_{10} & c_{11} & \cdots & c_{1,n-1}\\ \vdots & \vdots & \ddots & \vdots\\ c_{n-1,0} & \cdots & \cdots & c_{n-1,n-1} \end{bmatrix} = \begin{bmatrix} a_{00} & a_{01} & \cdots & a_{0,n-1}\\ a_{10} & a_{11} & \cdots & a_{1,n-1}\\ \vdots & \vdots & \ddots & \vdots\\ a_{n-1,0} & \cdots & \cdots & a_{n-1,n-1} \end{bmatrix} \begin{bmatrix} b_{00} & b_{01} & \cdots & b_{0,n-1}\\ b_{10} & b_{11} & \cdots & b_{1,n-1}\\ \vdots & \vdots & \ddots & \vdots\\ b_{n-1,0} & \cdots & \cdots & b_{n-1,n-1} \end{bmatrix}$

such that

$\displaystyle c_{ij} = \sum_{k=0}^{n-1} a_{ik}b_{kj}$

where $c_{ij}$ is the entry of the matrix $\mathbf C$ in row $i$ and column $j$.

Here is a sequential algorithm to compute the matrix $\mathbf C$:

```javascript
// a and b are n x n matrices
// c[i][j] is initialized to 0 for all i, j
for (var i = 0; i < n; i++) {
  for (var j = 0; j < n; j++) {
    for (var k = 0; k < n; k++) {
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}
```


There are 3 loops in the above algorithm. The innermost loop executes n times to compute the sum of the element-wise products of row i and column j. In the diagram below, the inner loop does the following: multiply each element inside the box of matrix A with the corresponding element inside the box of matrix B, then take the sum. Since there are n such products, the number of addition operations is n.

Then you have to do this for each of the n columns of matrix B and each of the n rows of matrix A, as shown below, for a total of $n^3$ operations.

So if you want to multiply matrices with n=100, you will need to execute $100^3$ (or 1 million) operations!

Parallel Computing can help us here.

## Parallel Matrix Multiplication

The good thing about matrix multiplication is that we can multiply by blocks. For the sake of simplicity, suppose we have two square matrices of dimension 100. We can divide each matrix into small square sub-matrices of dimension 25:

$A= \left[\begin{array}{c|c|c|c} \mathbf A_{00} & \mathbf A_{01} & \mathbf A_{02} & \mathbf A_{03}\\ \hline \mathbf A_{10} & \mathbf A_{11} & \mathbf A_{12} & \mathbf A_{13}\\ \hline \mathbf A_{20} & \mathbf A_{21} & \mathbf A_{22} & \mathbf A_{23}\\ \hline \mathbf A_{30} & \mathbf A_{31} & \mathbf A_{32} & \mathbf A_{33} \end{array} \right] , B= \left[\begin{array}{c|c|c|c} \mathbf B_{00} & \mathbf B_{01} & \mathbf B_{02} & \mathbf B_{03}\\ \hline \mathbf B_{10} & \mathbf B_{11} & \mathbf B_{12} & \mathbf B_{13}\\ \hline \mathbf B_{20} & \mathbf B_{21} & \mathbf B_{22} & \mathbf B_{23}\\ \hline \mathbf B_{30} & \mathbf B_{31} & \mathbf B_{32} & \mathbf B_{33} \end{array} \right]$

where each $\mathbf A_{ij}$ and $\mathbf B_{ij}$ is a square matrix of dimension 25.

We can then compute the resulting sub-matrix of C using the usual formula:

$\displaystyle \mathbf C_{ij} = \sum_{k=0}^{p-1} \mathbf A_{ik}\mathbf B_{kj}$

where $p=100/25=4$.

For example, to compute $\mathbf C_{00}$ we have

$\mathbf C_{00} = \mathbf A_{00} \mathbf B_{00} + \mathbf A_{01} \mathbf B_{10} + \mathbf A_{02} \mathbf B_{20} + \mathbf A_{03} \mathbf B_{30}$
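We can verify this block formula numerically. The sketch below uses a 4×4 matrix split into 2×2 blocks (so there are only two terms, $\mathbf A_{00}\mathbf B_{00} + \mathbf A_{01}\mathbf B_{10}$, rather than the four in the 100×100 example); the helper names `matMul`, `matAdd`, and `block` are ours.

```javascript
// Plain sequential multiply and element-wise add for small matrices.
function matMul(a, b) {
  const n = a.length;
  const c = Array.from({ length: n }, () => new Array(n).fill(0));
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++)
      for (let j = 0; j < n; j++)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}
const matAdd = (a, b) => a.map((row, i) => row.map((v, j) => v + b[i][j]));

// Extract the n x n sub-matrix at block position (bi, bj).
const block = (m, bi, bj, n) =>
  Array.from({ length: n }, (_, r) => m[bi * n + r].slice(bj * n, bj * n + n));

const A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]];
const B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]];

// C00 = A00 * B00 + A01 * B10
const C00 = matAdd(
  matMul(block(A, 0, 0, 2), block(B, 0, 0, 2)),
  matMul(block(A, 0, 1, 2), block(B, 1, 0, 2))
);
// C00 equals the top-left 2x2 block of the full product A * B.
```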

We then use this structure to distribute our sub-matrices to different computers (or, in modern parlance, “compute nodes”). For this example, we need 4×4=16 compute nodes. Each compute node will contain 2 sub-matrices, one from A and one from B. For easy visualization, we can make a drawing of our compute nodes arranged in a square of side 4 and labelled as shown below:

Now that we have evenly distributed our matrix data to the nodes, the next question is: how do we compute? Notice that no single node contains all the data; each node has only a very limited subset.

### The Trick

We can get a clue by looking at the end result of the computation for Node 00. At the end of the computation, Node 00 should have the following result:

$\mathbf C_{00} = \mathbf A_{00} \mathbf B_{00} + \mathbf A_{01} \mathbf B_{10} + \mathbf A_{02} \mathbf B_{20} + \mathbf A_{03} \mathbf B_{30}$

Looking at the above formula, Node 00 only has the following data

$\mathbf A_{00} \text{ and } \mathbf B_{00}$

The rest of the sub-matrices are not in its memory. Therefore, Node 00 can only compute

$\mathbf A_{00} \times \mathbf B_{00}$

We are missing the following products:

$\mathbf A_{01}\mathbf B_{10}, \mathbf A_{02}\mathbf B_{20}, \text{ and } \mathbf A_{03}\mathbf B_{30}$

The matrix $\mathbf A_{01}$ is on Node 01 and the matrix $\mathbf B_{10}$ is on Node 10. So if Node 01 sends matrix $\mathbf A_{01}$ to Node 00 and Node 10 sends matrix $\mathbf B_{10}$ to Node 00, we can then get the product $\mathbf A_{01}\mathbf B_{10}$. In fact, if we do a slight initial rearrangement of the sub-matrices, like the one below, we can use the following algorithm to compute matrix $\mathbf C$:

1. Each node sends its current $A$ sub-matrix to the node on its left and receives a new $A$ sub-matrix from the node on its right. If there is no node on the left, it sends the sub-matrix to the last node in its row (wrap-around).
2. Each node sends its current $B$ sub-matrix to the node above it and receives a new $B$ sub-matrix from the node below it. If there is no node above, it sends the sub-matrix to the last node in its column (wrap-around).
3. Each node multiplies its new sub-matrices and adds the product to the sub-matrix of $C$ it keeps in memory.
4. Repeat until $p$ iterations have been performed, where $p^2$ is the number of nodes (here $p=4$).

The figure below describes this algorithm for Node 00.

Doing this for all nodes, we can visualize the parallel algorithm at work on all nodes as shown in the animation below:

This algorithm is one of my favorite algorithms since it looks like the pumping of blood into the heart.
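The whole scheme, the initial rearrangement plus the shift-and-multiply loop, is essentially Cannon's algorithm, and it can be simulated on a single machine. The sketch below uses our own names (`cannonMultiply`, `toBlocks`); the rearrangement is chosen so that node (i, j) starts with $\mathbf A_{i,(i+j) \bmod p}$ and $\mathbf B_{(i+j) \bmod p,\, j}$, and each iteration multiplies first and then shifts, which is equivalent to the steps above up to where the rearrangement places the blocks.

```javascript
const zeros = n => Array.from({ length: n }, () => new Array(n).fill(0));

// Plain sequential multiply and element-wise add, applied to blocks.
function matMul(a, b) {
  const n = a.length, c = zeros(n);
  for (let i = 0; i < n; i++)
    for (let k = 0; k < n; k++)
      for (let j = 0; j < n; j++)
        c[i][j] += a[i][k] * b[k][j];
  return c;
}
const matAdd = (a, b) => a.map((row, i) => row.map((v, j) => v + b[i][j]));

// Split an N x N matrix into a p x p grid of (N/p) x (N/p) blocks.
function toBlocks(m, p) {
  const n = m.length / p;
  return Array.from({ length: p }, (_, i) =>
    Array.from({ length: p }, (_, j) =>
      Array.from({ length: n }, (_, r) => m[i * n + r].slice(j * n, j * n + n))));
}

function cannonMultiply(A, B, p) {
  const a = toBlocks(A, p), b = toBlocks(B, p);
  const n = A.length / p;
  // Initial rearrangement: shift row i of A left by i and column j of B up
  // by j, so node (i, j) starts with A[i][(i+j) % p] and B[(i+j) % p][j].
  let as = a.map((row, i) => row.map((_, j) => a[i][(i + j) % p]));
  let bs = b.map((row, i) => row.map((_, j) => b[(i + j) % p][j]));
  const c = Array.from({ length: p }, () =>
    Array.from({ length: p }, () => zeros(n)));
  for (let step = 0; step < p; step++) {
    // Every node multiplies its current pair and accumulates into its C block.
    for (let i = 0; i < p; i++)
      for (let j = 0; j < p; j++)
        c[i][j] = matAdd(c[i][j], matMul(as[i][j], bs[i][j]));
    // Shift A one node left and B one node up, with wrap-around.
    as = as.map((row, i) => row.map((_, j) => as[i][(j + 1) % p]));
    bs = bs.map((row, i) => row.map((_, j) => bs[(i + 1) % p][j]));
  }
  // Reassemble the C blocks into one N x N matrix.
  const C = zeros(A.length);
  for (let i = 0; i < p; i++)
    for (let j = 0; j < p; j++)
      for (let r = 0; r < n; r++)
        for (let col = 0; col < n; col++)
          C[i * n + r][j * n + col] = c[i][j][r][col];
  return C;
}
```

For the post's example, `p = 4` and each block is 25×25; the simulation works for any `p` that divides the matrix dimension.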

### How fast is this algorithm?

Given two square matrices of dimension N, the number of sequential operations is $O(N^3)$. Using the parallel matrix multiplication above, we divide the matrices into sub-matrices of dimension $n$, so the resulting block matrix has dimension $N/n=p$. Each node computes its products in parallel using $O(n^3)$ operations per sub-matrix multiplication. Since there are $p$ multiplications per node, the number of operations per node is $O(pn^3)$. The ratio of these two quantities tells us how much faster the parallel algorithm is:

$\displaystyle \frac{N^3}{pn^3} = \frac{1}{p}\frac{N^3}{n^3} = \frac{1}{p}\cdot p^3 = p^2$

For our particular example, the theoretical speedup is $p^2 = 4^2 = 16$; that is, our parallel algorithm can compute the product 16 times faster than the sequential algorithm. Increasing p increases the speedup, but only up to a certain point: eventually the network becomes the bottleneck and communication makes things slower than the sequential algorithm.

In the next 3 posts, we will see how to program simple parallel algorithms and parallel matrix multiplication. We will also show how you can use OpenShift to prototype your parallel programs.