Implementing Parallel Algorithms Part 2

In the previous post, we implemented a very simple parallel program to add a set of numbers. In this post, we will implement parallel matrix multiplication.

We have already described a parallel algorithm to multiply 2 big matrices using message passing. The algorithm passes block sub-matrices from node to node and multiplies them within a node until the answer is found.

There will be 2 square matrices A and B. In our example, the dimension of both A and B is 4×4. We will distribute the matrices evenly to 4 nodes.

A = \left[  \begin{array}{cc|cc}  a_{00} & a_{01} & a_{02} & a_{03}\\   a_{10} & a_{11} & a_{12} & a_{13}\\ \hline  a_{20} & a_{21} & a_{22} & a_{23}\\  a_{30} & a_{31} & a_{32} & a_{33}  \end{array}  \right],    B = \left[  \begin{array}{cc|cc}  b_{00} & b_{01} & b_{02} & b_{03}\\   b_{10} & b_{11} & b_{12} & b_{13}\\ \hline  b_{20} & b_{21} & b_{22} & b_{23}\\  b_{30} & b_{31} & b_{32} & b_{33}  \end{array}  \right]

In this example, node 00 will have the following matrices:

A_{00}=\begin{bmatrix}  a_{00} & a_{01}\\  a_{10} & a_{11}  \end{bmatrix},  B_{00}=  \begin{bmatrix}  b_{00} & b_{01}\\  b_{10} & b_{11}  \end{bmatrix}

Let’s simulate each node loading entries of sub-matrices assigned to it.

const n = 4;
var vec = 1..n;
var blockSize = 2;

var A: [vec, vec] real;
var B: [vec, vec] real;
var C: [vec, vec] real;

coforall loc in Locales {
  on loc {
    var i = loc.id/2;
    var j = loc.id%2;
    var istart = i*blockSize;
    var iend = istart + blockSize;
    var jstart = j*blockSize;
    var jend = jstart + blockSize;

    for (r,s) in {istart + 1..iend, jstart + 1..jend} {
      B(r,s) = r+s;
      A(r,s) = 2*r + s;
    }
  }
}

Global Address Space

Each node has limited memory physically exclusive to itself. In order for node A to have access to the contents of the memory of another node B, node B should pass the data to node A. Fortunately, Chapel can use a library called GASNet that allows each node to have a global view of the memory of all nodes participating in the computation.

In the code above, each node loads its own data. However, the GASNet library allows each node to access the matrix elements loaded by the other nodes. Consequently, we are able to reference the sub-matrix held by each node without doing fancy message passing. The algorithm is then a straightforward implementation of

\displaystyle \mathbf{C}_{ij}=\sum_{k=0}^{1} \mathbf{A}_{ik} \mathbf{B}_{kj}

where \mathbf{A_{ij}}, \mathbf{B_{ij}} and \mathbf{C_{ij}} are submatrices of \mathbf{A}, \mathbf{B}, and \mathbf{C}, respectively.

Below is the straightforward implementation of parallel block multiplication:

coforall loc in Locales {
  on loc {
    var i = loc.id/2;
    var j= loc.id%2;
    var istart = i*blockSize;
    var iend = istart + blockSize;
    var jstart = j*blockSize;
    var jend = jstart + blockSize;
    var r = { istart + 1..iend, jstart + 1..jend };
    ref W = C[r].reindex( { 1..2,1..2 });
 
    for k in 0..1 {  // run the k loop serially: both k-terms update the same entries of W
      var U=get_block_matrix(A[vec,vec],i,k,blockSize);
      var V=get_block_matrix(B[vec,vec],k,j,blockSize);
      var P = mat_mul(U,V);
      coforall (s,t) in { 1..2,1..2 } {       
        W(s,t) += P(s,t);
      }
    }
  }
}

The procedure get_block_matrix will return the sub-matrix given the (i,j)th index and the block size.

proc get_block_matrix(A: [?D], i:int, j:int, blockSize:int) {
  var r = { i*blockSize + 1 .. i*blockSize + blockSize,
            j*blockSize + 1 .. j*blockSize + blockSize };
  return A[r];
}

The procedure mat_mul will return the matrix product of two sub-matrices:

proc mat_mul(A: [?D1], B: [?D2]) {
  var D3 = { 1..2, 1..2 };
  var C: [D3] real;
  var AA = A.reindex({1..2,1..2});
  var BB = B.reindex({1..2,1..2});

  for row in 1..2 {
    for col in 1..2 {
      var sum:real = 0;
      for k in 1..2 {
         sum += AA(row,k) * BB(k,col);
      }
      C(row,col) = sum;
    }
  }
  return C;
}

writeln(C[vec,vec]);

To run this code, we need to set the following environment variables:

source $CHPL_HOME/util/setchplenv.bash 

export CHPL_COMM=gasnet
export CHPL_LAUNCHER=amudprun

export GASNET_SSH_SERVERS="127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1"

Compiling and running this program gives the output:

# Compile
chpl mat_mul.chpl -o mat_mul

# Run using 4 nodes
./mat_mul -nl 4

A=
3.0 4.0 5.0 6.0
5.0 6.0 7.0 8.0
7.0 8.0 9.0 10.0
9.0 10.0 11.0 12.0
B=
2.0 3.0 4.0 5.0
3.0 4.0 5.0 6.0
4.0 5.0 6.0 7.0
5.0 6.0 7.0 8.0
C=
68.0 86.0 104.0 122.0
96.0 122.0 148.0 174.0
124.0 158.0 192.0 226.0
152.0 194.0 236.0 278.0

Implementing Parallel Algorithms Part 1

Now we know that parallel algorithms allow us to make our programs run faster. So how do we implement them?

I have used mpich before, but that was more than a decade ago. Recently, I found myself looking for new ways of doing parallel programming. I discovered a very nice parallel programming language called Chapel. This is what we’ll use to implement parallel algorithms.

Algorithm 1: Parallel sum of consecutive numbers from 1 to N

To get the sum of numbers from 1 to N, where N is some integer, is easy. There is a formula to do that:

\displaystyle \sum_{i=1}^N i = \frac{N(N+1)}{2}

However, for the sake of illustration, we are going to compute the sum of 1 to N using a cluster of machines. Here is the Chapel code to accomplish it, inside the file add_parallel.chpl.

config var N:int = 1000;
var D:domain(1) = {0..numLocales -1};
var s:[D] int;
var sum:int= 0;
var bs = N/numLocales;
coforall loc in Locales {
  on loc {
      var i = loc.id;
      var start = i*bs + 1;
      var end = start + bs -1;
      var _sum:int = 0;
      for j in start .. end {
        _sum += j;
      }
    writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
    s[i] = _sum;
  }
}

sum = + reduce s;
writeln("sum: " + sum);

This program is compiled using the command:

chpl add_parallel.chpl -o add_parallel

where add_parallel.chpl is the filename of the program and -o add_parallel specifies the filename of the binary produced after compilation.

On line 1, we have defined the default value of N to be 1000. This can be overridden on the command line by specifying the --N parameter. The number of machines we are going to use is also specified on the command line using the -nl parameter. A sample invocation of this program is the following:

./add_parallel -nl 3 --N=120

The above command means that we want to run the add_parallel binary using 3 machines with the value of N=120.

Executing the above command will give us:

./add_parallel -nl 3 --N=120
i= 0, start= 1, end=40, sum = 820
i= 1, start= 41, end=80, sum = 2420
i= 2, start= 81, end=120, sum = 4020
sum: 7260
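
As a sanity check, the closed-form formula agrees with the parallel result for N=120:

\displaystyle \sum_{i=1}^{120} i = \frac{120(121)}{2} = 7260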

How the program works

The program will partition N into 3 blocks. This is specified on line 5, where we divide N by numLocales to get the block size. The numLocales variable contains the value of the -nl parameter, which in this example is 3. (Note that this assumes N is divisible by numLocales; otherwise the last few numbers would be left out of the sum.)

The code on line 6 tells Chapel to execute a parallel for-loop, executing the code inside the on loc block on each locale. A locale has an id starting from 0. Each locale will determine its id, compute the starting and ending numbers to sum, and store the partial sum in the variable _sum.

coforall loc in Locales {
  on loc {
      var i = loc.id;
      var start = i*bs + 1;
      var end = start + bs -1;
      var _sum:int = 0;
      for j in start .. end {
        _sum += j;
      }
    writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
    s[i] = _sum;
  }
}

Each partial sum _sum is stored in the array s. We take the sum of the entries of the s array using the reduce keyword, specifying + as the reduction operator. Finally, we print the total sum across the machines.

sum = + reduce s;
writeln("sum: " + sum);

Conclusion

We have seen that we can implement parallel programs using the Chapel programming language. In part 2, we will show how to do parallel matrix multiplication using Chapel.

A Very Brief Introduction to Parallel Computing

Imagine you were given a hundred 3-digit numbers to add. How much time would it take you to get the answer? If it would take you 30 seconds to add ten numbers (using a calculator), then it would take you about 300 seconds (or 5 minutes) to add 100 numbers.

Now imagine there are a hundred people and each person has a number. How would a hundred people compute the sum of all numbers? Seems like a recipe for disaster. However, we can do this:

1. Group the people by two, if there is an extra person with no group, this person can join the nearest group (to make a group of 3 people).
2. Each group will add the numbers that they have to get the sum S.
3. Each group will nominate a representative that will carry this new number S. The remaining members can sit down.
4. Repeat step 1 until there is only one person remaining.
5. The last person remaining will have the sum of the 100 numbers.

At the beginning, we have 100 people in groups of two. It takes 3 seconds for them to add their respective numbers. In the next iteration, only 50 people remain to add their numbers. Continuing in this way, the number of remaining people is halved until, in the 7th iteration, we get our answer. So if each iteration executes in 3 seconds, it will take only 21 seconds for a hundred people to compute the sum of 100 numbers!
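
To make the halving concrete, here is a minimal sketch of this pairwise reduction in Python (assuming, for counting purposes, that a leftover person with no partner simply waits for the next round, which matches the \lceil\log_2(n)\rceil iteration count given below):

def tree_sum(numbers):
    values = list(numbers)
    rounds = 0
    while len(values) > 1:
        # Pair up neighbors and add; a leftover value waits for the next round.
        paired = [values[i] + values[i + 1] for i in range(0, len(values) - 1, 2)]
        if len(values) % 2 == 1:
            paired.append(values[-1])
        values = paired
        rounds += 1
    return values[0], rounds

print(tree_sum(range(1, 101)))  # (5050, 7)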

I mentioned that in the 7th iteration, we are able to get the answer. There is a formula to get the number of iterations and it is

\displaystyle \lceil\log_2(n)\rceil

where n is the number of people to start with and the symbol \lceil\rceil is the ceiling function which rounds off the result of the function \log_2(n) to the next highest integer. If n=100, the number of iterations is

\displaystyle \lceil\log_2 100\rceil = 7

Having a hundred people to compute the sum of 100 numbers might be practically infeasible. We might not have the space to accommodate them. However, if we only have say 10 people, we can still have a faster computation.

Map Reduce

Given a hundred numbers, we can group the numbers by 10 and distribute them to 10 people. Each person will then add the numbers they have and combine their sum with those of the other participants to get the sum of the 100 numbers in about 1/10th of the time.

Grouping the numbers into smaller subsets and distributing them to each person is called mapping. Combining the sums computed by each person into the total sum is called reduction. Together, this is called map-reduce, and the example given is a simple instance of it.
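
Here is a minimal sketch of this idea in Python, assuming the hundred numbers are simply 1 to 100:

from functools import reduce

numbers = list(range(1, 101))
# Map: split the numbers into 10 groups of 10, one per person.
chunks = [numbers[i:i + 10] for i in range(0, len(numbers), 10)]
# Each person adds the numbers in their own group.
partial_sums = [sum(chunk) for chunk in chunks]
# Reduce: combine the partial sums into the total.
total = reduce(lambda x, y: x + y, partial_sums)
print(total)  # 5050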

A More Complex Example

Let \mathbf A and \mathbf B be two matrices. The product of matrices \mathbf A and \mathbf B is the matrix \mathbf C

\displaystyle  \begin{bmatrix}  c_{00} & c_{01} & \cdots & c_{0n}\\  c_{10} & c_{11} & \cdots & c_{1n}\\  \cdots & \cdots & \cdots & \cdots\\  c_{n0} & \cdots & \cdots & c_{nn}  \end{bmatrix} =  \begin{bmatrix}  a_{00} & a_{01} & \cdots & a_{0n}\\  a_{10} & a_{11} & \cdots & a_{1n}\\  \cdots & \cdots & \cdots & \cdots\\  a_{n0} & \cdots & \cdots & a_{nn}  \end{bmatrix}  \begin{bmatrix}  b_{00} & b_{01} & \cdots & b_{0n}\\  b_{10} & b_{11} & \cdots & b_{1n}\\  \cdots & \cdots & \cdots & \cdots\\  b_{n0} & \cdots & \cdots & b_{nn}  \end{bmatrix}

such that

\displaystyle c_{ij} = \sum_{k=0}^{n-1} a_{ik}b_{kj}

where c_{ij} is the entry of the matrix C on row i and column j.

Here is a sequential algorithm to compute the matrix C:

// a and b are n x n matrices
// c[i][j] is initialized to 0 for all i,j
for(var i=0;i<n;i++){
  for(var j=0;j<n;j++){
    for(var k=0;k<n;k++){
      c[i][j] += a[i][k] * b[k][j];
    }
  }
}

There are 3 loops in the above algorithm. The innermost loop executes n times to compute the sum of the element-wise products of row i and column j. In the diagram below, the inner loop multiplies each element inside the box of matrix A with the corresponding element inside the box of matrix B and takes the sum. Since there are n such products, the number of addition operations is n.

Then you have to do this n times for each column of matrix B and n times for each row of matrix A, as shown below, for a total of n^3 operations.

So if you were to multiply matrices with n=100, you would need to execute 100^3 (or 1 million) operations!

Parallel Computing can help us here.

Parallel Matrix Multiplication

The good thing about matrix multiplication is that we can multiply by blocks. For the sake of simplicity, suppose we have 2 square matrices of dimension 100. We can divide the matrices into small square sub-matrices of dimension 25:

A=  \left[\begin{array}{c|c|c|c}  \mathbf A_{00} & \mathbf A_{01} & \mathbf A_{02} & \mathbf A_{03}\\  \hline  \mathbf A_{10} & \mathbf A_{11} & \mathbf A_{12} & \mathbf A_{13}\\  \hline  \mathbf A_{20} & \mathbf A_{21} & \mathbf A_{22} & \mathbf A_{23}\\  \hline  \mathbf A_{30} & \mathbf A_{31} & \mathbf A_{32} & \mathbf A_{33}  \end{array}  \right]  ,  B=  \left[\begin{array}{c|c|c|c}  \mathbf B_{00} & \mathbf B_{01} & \mathbf B_{02} & \mathbf B_{03}\\  \hline  \mathbf B_{10} & \mathbf B_{11} & \mathbf B_{12} & \mathbf B_{13}\\  \hline  \mathbf B_{20} & \mathbf B_{21} & \mathbf B_{22} & \mathbf B_{23}\\  \hline  \mathbf B_{30} & \mathbf B_{31} & \mathbf B_{32} & \mathbf B_{33}  \end{array}  \right]

where each \mathbf A_{ij} and \mathbf B_{ij} are square matrices of dimension 25.

We can then compute the resulting sub-matrix of C using the usual formula:

\displaystyle \mathbf C_{ij} = \sum_{k=0}^{p-1} \mathbf A_{ik}\mathbf B_{kj}

where p=100/25=4.

For example, to compute \mathbf C_{00} we have

\mathbf C_{00} = \mathbf A_{00} \mathbf B_{00} + \mathbf A_{01} \mathbf B_{10} + \mathbf A_{02} \mathbf B_{20} + \mathbf A_{03} \mathbf B_{30}

We then use this mechanism to distribute our sub-matrices to different computers (or in modern parlance “compute nodes”). For this example, we need 4×4=16 compute nodes. Each compute node will contain 2 sub-matrices, one for A and one for B. For easy visualization, we can make a drawing of our compute nodes arranged in a square of side 4 and labelled as shown below:

Now that we have evenly distributed our matrix data to each node, the next question is how do we compute? Notice that not one node contains all the data. Each node has a very limited subset of the data.

The Trick

We can get a clue by looking at the end result of the computation for Node 00. At the end of the computation, Node 00 should have the following result:

\mathbf C_{00} = \mathbf A_{00} \mathbf B_{00} + \mathbf A_{01} \mathbf B_{10} + \mathbf A_{02} \mathbf B_{20} + \mathbf A_{03} \mathbf B_{30}

Looking at the above formula, Node 00 only has the following data

\mathbf A_{00} \text{ and } \mathbf B_{00}

The rest of the sub-matrices are not in its memory. Therefore, Node 00 can only compute

\mathbf A_{00} \times \mathbf B_{00}

We are missing the following products:

\mathbf A_{01}\mathbf B_{10}, \mathbf A_{02}\mathbf B_{20}, \text{ and } \mathbf A_{03}\mathbf B_{30}

The matrix \mathbf A_{01} is with Node 01 and the matrix \mathbf B_{10} is with Node 10. So if Node 01 can send matrix \mathbf A_{01} to Node 00 and Node 10 can send matrix \mathbf B_{10} to Node 00, we can then get the product \mathbf A_{01}\mathbf B_{10}. In fact, if we do a slight rearrangement like the one below, we can use the following algorithm to compute matrix \mathbf C:

1. Each node will send the current A sub-matrix to the node on the left and receive a new A sub-matrix from the node on the right. If there is no node on the left, it will send the sub-matrix to the last node on its row.
2. Each node will send the B sub-matrix to the node on top and receive a new B sub-matrix from the node below it. If there is no node on top, it will send the sub-matrix to the last node on its column.
3. Multiply the new sub-matrices and add the result to the current value of the sub-matrix of C that it keeps in memory.
4. Repeat steps 1 to 3 until the number of iterations equals p, where p^2 is the number of nodes.
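
Here is a minimal single-process sketch of this block-shift scheme in Python with NumPy, where a nested list of blocks stands in for the grid of nodes and the initial rearrangement shifts row i of the A blocks left by i and column j of the B blocks up by j (a 2x2 grid of 2x2 blocks keeps the example small):

import numpy as np

N, n = 4, 2          # matrix dimension and block dimension
p = N // n           # the grid of nodes is p x p

A = np.arange(N * N, dtype=float).reshape(N, N)
B = A.T.copy()

# Ab[i][j] and Bb[i][j] are the sub-matrices initially held by node ij.
Ab = [[A[i*n:(i+1)*n, j*n:(j+1)*n].copy() for j in range(p)] for i in range(p)]
Bb = [[B[i*n:(i+1)*n, j*n:(j+1)*n].copy() for j in range(p)] for i in range(p)]
Cb = [[np.zeros((n, n)) for _ in range(p)] for _ in range(p)]

# Initial rearrangement: shift row i of A left by i, column j of B up by j.
Ab = [[Ab[i][(j + i) % p] for j in range(p)] for i in range(p)]
Bb = [[Bb[(i + j) % p][j] for j in range(p)] for i in range(p)]

for _ in range(p):
    # Step 3: each node multiplies its current blocks and accumulates into C.
    for i in range(p):
        for j in range(p):
            Cb[i][j] += Ab[i][j] @ Bb[i][j]
    # Steps 1 and 2: pass A blocks one node to the left, B blocks one node up.
    Ab = [[Ab[i][(j + 1) % p] for j in range(p)] for i in range(p)]
    Bb = [[Bb[(i + 1) % p][j] for j in range(p)] for i in range(p)]

assert np.allclose(np.block(Cb), A @ B)   # the blocks assemble to the true product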

The figure below describes this algorithm for Node 00.

Doing this for all nodes, we can visualize the parallel algorithm at work on all nodes as shown in the animation below:

This algorithm is one of my favorite algorithms since it looks like the pumping of blood into the heart.

How fast is this algorithm?

Given 2 square matrices of dimension N, the number of sequential computations is O(N^3). Using the parallel matrix multiplication above, we divide the matrices into sub-matrices of dimension n. The resulting block matrix is of dimension N/n = p. Each node computes its sub-matrix products in parallel using O(n^3) operations per sub-matrix multiplication. Since there are p multiplications per node, the number of operations per node is O(pn^3). The ratio of these two quantities will give us how fast the parallel algorithm is:

\displaystyle \frac{N^3}{pn^3} = \frac{1}{p}\frac{N^3}{n^3} = \frac{1}{p}\cdot p^3 = p^2

For our particular example, the theoretical speedup is p^2 = 4^2 = 16, that is, our parallel algorithm can compute the product 16 times faster than the sequential algorithm. If we increase p, we can increase the speedup. However, we can only increase p up to a certain point, after which the network becomes the bottleneck and makes things slower than the sequential algorithm.

In the next post, we will see how to program parallel algorithms.

Have You Seen A Half-Balloon?

Balloons come in all shapes and sizes, but the shape that comes to mind when I hear the word balloon is a spherical or round shape. Given a spherical balloon, create an imaginary partition that divides the balloon into left and right hemispheres. This effectively divides the air molecules into left and right. The number of molecules on the left is more or less the same as the number of molecules on the right. This is what we see in real life.

As we pump air into the balloon, a given air molecule can randomly go left or right of the partition. If all air molecules for some reason go left (or right), then we have a situation called a “half-balloon”.

So why aren’t we seeing half-balloons? It’s because that situation is highly improbable. For illustration purposes, imagine there are 6 air molecules. Each molecule has two equally likely choices: be on the left or on the right. Now imagine each air molecule has chosen its “side”. A possible combination or configuration (a term we will use moving forward) is 3 molecules choosing Left and 3 choosing Right:

L  L  L  R  R  R

where L stands for Left and R stands for Right.

In fact, the number of such configurations is equal to

\displaystyle \underbrace{2\times 2\times \ldots \times 2}_{\text{6 times}} = 2^6 = 64

We enumerate below the list of possible air molecule configurations:


# This is the code in R
# x=c("L","R")
# expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
# d1=expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
   m1 m2 m3 m4 m5 m6
1   L  L  L  L  L  L
2   R  L  L  L  L  L
3   L  R  L  L  L  L
4   R  R  L  L  L  L
5   L  L  R  L  L  L
6   R  L  R  L  L  L
7   L  R  R  L  L  L
8   R  R  R  L  L  L
9   L  L  L  R  L  L
10  R  L  L  R  L  L
11  L  R  L  R  L  L
12  R  R  L  R  L  L
13  L  L  R  R  L  L
14  R  L  R  R  L  L
15  L  R  R  R  L  L
16  R  R  R  R  L  L
17  L  L  L  L  R  L
18  R  L  L  L  R  L
19  L  R  L  L  R  L
20  R  R  L  L  R  L
21  L  L  R  L  R  L
22  R  L  R  L  R  L
23  L  R  R  L  R  L
24  R  R  R  L  R  L
25  L  L  L  R  R  L
26  R  L  L  R  R  L
27  L  R  L  R  R  L
28  R  R  L  R  R  L
29  L  L  R  R  R  L
30  R  L  R  R  R  L
31  L  R  R  R  R  L
32  R  R  R  R  R  L
33  L  L  L  L  L  R
34  R  L  L  L  L  R
35  L  R  L  L  L  R
36  R  R  L  L  L  R
37  L  L  R  L  L  R
38  R  L  R  L  L  R
39  L  R  R  L  L  R
40  R  R  R  L  L  R
41  L  L  L  R  L  R
42  R  L  L  R  L  R
43  L  R  L  R  L  R
44  R  R  L  R  L  R
45  L  L  R  R  L  R
46  R  L  R  R  L  R
47  L  R  R  R  L  R
48  R  R  R  R  L  R
49  L  L  L  L  R  R
50  R  L  L  L  R  R
51  L  R  L  L  R  R
52  R  R  L  L  R  R
53  L  L  R  L  R  R
54  R  L  R  L  R  R
55  L  R  R  L  R  R
56  R  R  R  L  R  R
57  L  L  L  R  R  R
58  R  L  L  R  R  R
59  L  R  L  R  R  R
60  R  R  L  R  R  R
61  L  L  R  R  R  R
62  R  L  R  R  R  R
63  L  R  R  R  R  R
64  R  R  R  R  R  R

Looking at the data above, we can see that there are 6 configurations that contain only 1 “L”:

# This is the code in R
# cc=c()
# for(i in 1:64){
#  tmp=d1[i,]
#  cc=c(cc,sum(tmp == "L"))
# }
# d1$count = cc
# d1[d1$count == 1, ]
   m1 m2 m3 m4 m5 m6 count
32  R  R  R  R  R  L     1
48  R  R  R  R  L  R     1
56  R  R  R  L  R  R     1
60  R  R  L  R  R  R     1
62  R  L  R  R  R  R     1
63  L  R  R  R  R  R     1

This means that the probability of getting a configuration having one molecule on the left and 5 molecules on the right is:

\displaystyle \text{Probability of 1 molecule Left and 5 molecules Right} = \frac{6}{2^6} = 0.093750

If we continue summarizing our data so that we get the number of combinations containing 2, 3, 4, 5, and 6 “L”s, we can generate what is known as a probability distribution of air molecules on the left hemisphere:

# cc=c()
# for(i in 0:6){
#  cc=c(cc,length(attributes(d1[d1$count == i, ])$row.names))
# }
# data.frame(X=0:6,count=cc,probability=cc/2^6)
  X count probability
1 0     1    0.015625
2 1     6    0.093750
3 2    15    0.234375
4 3    20    0.312500
5 4    15    0.234375
6 5     6    0.093750
7 6     1    0.015625

The graph of this probability distribution is shown below:

# Here is the code that generated the plot above
barplot(cc/2^6,names.arg=0:6,main="Probability Distribution \nNumber of Air Molecules on the Left Hemisphere")

The graph above tells us that the most probable configuration of balloon molecules is half of them on the left and the other half on the right, as shown by the high probability of the value X = 3. It also tells us that the probability of all molecules choosing the Left side is 0.015625. When the number of air molecules is very large, this probability becomes extremely small, as we will show below.

The Mathematics

At this point, manually counting the number of rows will not be practical anymore for a large number of molecules. We need to use some mathematical formula to compute the combinations. We don’t really care which molecules chose left or right. We just care about the number of molecules on the left. Given N molecules, there are 2^N possible configurations of air molecules. Of these configurations, there are

\displaystyle {N \choose m } = \frac{N!}{m! (N-m)!}

combinations that have m molecules on the left. Therefore, the probability of a configuration having m molecules on the left is

P(m) = \displaystyle \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}

This is a valid probability distribution since

\displaystyle \sum_{m=0}^N P(m) = \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N} = 1

To show this, we will use the Binomial Theorem

\displaystyle \sum_{m=0}^N {N \choose m} x^m = (1+x)^N

If we let x=1, the Binomial Theorem gives us

\displaystyle \sum_{m=0}^N {N \choose m} = (1+1)^N = 2^N

Therefore

\begin{array}{rl}  \displaystyle \sum_{m=0}^N P(m) &= \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}\\  &= \displaystyle \frac{1}{2^N} \sum_{m=0}^N {N\choose m}\\  &= \displaystyle \frac{1}{2^N} 2^N\\  &= 1  \end{array}
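
As a quick numerical check, the closed form reproduces the N = 6 table we generated earlier (here in Python, though the same computation works in R with choose):

from math import comb

N = 6
for m in range(N + 1):
    print(m, comb(N, m), comb(N, m) / 2**N)
# counts 1, 6, 15, 20, 15, 6, 1 with probabilities 0.015625, 0.09375, ..., 0.015625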

A Mole of Air

One mole of air contains 6.022 \times 10^{23} molecules. This means the probability that all molecules are on the left side of the balloon is

\displaystyle \frac{1}{\displaystyle 2^{6.022 \times 10^{23}}} < 1/1024^{10^{22}} < 1/(10^3)^{10^{22}}=10^{-30,000,000,000,000,000,000,000}

This is a very small number, with about 3 \times 10^{22} zeros to the right of the decimal point. If you were to write these zeros on pieces of paper with a thickness of 0.05 millimeters, you would need to stack them up to a height 74 times the distance from Earth to Pluto in kilometers!

An Interview Question: Using Integer Programming

We can solve the Interview Question using a mathematical technique called Integer Programming. Let d_1, d_2, \ldots, d_N be the variables representing diskette 1, diskette 2, diskette 3, etc. The values of the d_k variables can only be 0 or 1. A 0 means the diskette is not used while a 1 means that it is used.

Each file is saved to a certain diskette. We want to know to which diskette d_i a given file f_j is assigned. To represent this, we assign the variable a_{ij} a value of 1 if file f_j is assigned to diskette d_i, and 0 otherwise.

We will normalize the file sizes so that if s_i is the size of f_i, then s_i \le 1. We do this by simply dividing all file sizes by the size of the diskette. For a given diskette d_i, the following constraint should be satisfied:

d_i - s_1a_{i1} - s_2a_{i2} - \ldots - s_N a_{iN} \ge 0

for each diskette i = 1, 2, \ldots, N, where s_j is the normalized size of file f_j for j=1,2,\ldots,N.

Since each file f_j can only be assigned to one diskette, we have the following constraint:

a_{1j} + a_{2j} + \ldots + a_{Nj} = 1

where a_{1j} is the variable representing “file f_j is in diskette d_1”, etc.

Finally, we have to constrain the value of d_i to be either 0 or 1, that is,

d_i \le 1

for all i=1,2,\ldots,N.

Integer Programming Formulation

Given the above information, we can formulate the Integer Programming problem as

Minimize:

d_1 + d_2 + d_3 + \ldots + d_N

subject to

\begin{array}{rl}  d_1 - s_1a_{11} - s_2a_{12} - s_3a_{13} - \ldots - s_Na_{1N} &\ge 0\\  d_2 - s_1a_{21} - s_2a_{22} - s_3a_{23} - \ldots - s_Na_{2N} &\ge 0\\  :\\  d_N - s_1a_{N1} - s_2a_{N2} - s_3a_{N3} - \ldots - s_Na_{NN} &\ge 0\\  a_{11} + a_{21} + a_{31} + \ldots + a_{N1} &= 1\\  a_{12} + a_{22} + a_{32} + \ldots + a_{N2} &= 1\\  :\\  a_{1N} + a_{2N} + a_{3N} + \ldots + a_{NN} &= 1\\  d_1 &\le 1\\  d_2 &\le 1\\  :\\  d_N &\le 1  \end{array}
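
Before turning to the R solution below, here is a compact sketch of the same formulation in Python using the PuLP library (PuLP is not used in this post, and the file sizes here are made up for illustration):

from pulp import LpProblem, LpMinimize, LpVariable, lpSum

sizes = [1.0, 0.2, 0.1, 0.1]   # hypothetical normalized file sizes
N = len(sizes)
prob = LpProblem("diskettes", LpMinimize)
d = [LpVariable(f"d{i}", cat="Binary") for i in range(N)]
a = [[LpVariable(f"a{i}_{j}", cat="Binary") for j in range(N)] for i in range(N)]

prob += lpSum(d)     # objective: minimize the number of diskettes used
for i in range(N):   # capacity constraint for diskette i
    prob += d[i] - lpSum(sizes[j] * a[i][j] for j in range(N)) >= 0
for j in range(N):   # each file goes on exactly one diskette
    prob += lpSum(a[i][j] for i in range(N)) == 1

prob.solve()
print(sum(int(v.value()) for v in d))   # 2 diskettes for these sizes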

Solving the Problem

We will use R to solve this Integer Programming Formulation. Please see code below:

library("lpSolve")
NUMFILES=4

# Generate random file sizes between 1 and 10
FileSizes=ceiling(10*runif(NUMFILES))
x = -1*FileSizes/10
l=length(x)

# Each file can be in any of the diskettes. Suppose there are N files;
# to determine if file j is in diskette i, the value of variable x_ij will be
# 1 if file j is in diskette i, and 0 otherwise.
# Here we construct the coefficients of the variables x_ij, which are the
# sizes of the files (normalized to 1)
zz=c()
for(i in 1:(l-1)){
  zz=c(zz,x,rep(0,l*l))
}
zz=c(zz,x)

# Construct the coefficients of the indicator variables representing the 
# diskettes d_i
zzmatrix=matrix(zz,ncol=l*l,byrow=T)
CoefficientsOfDiskettes=c();
for(i in 1:l){
 ttt=rep(0,l)
 ttt[i] = 1
 CoefficientsOfDiskettes= c(CoefficientsOfDiskettes,ttt,zzmatrix[i,])
}

# Construct the coefficients of x_ij for constant j. These variables 
# satisfy the equation \sum_{i=1}^N x_{ij} = 1
SumOfFileAcrossDiskettes=c()
for(i in 1:l){
  ttt=rep(0,l)
  ttt[i]=1
  SumOfFileAcrossDiskettes=c(SumOfFileAcrossDiskettes,rep(ttt,l))
}

# Prepend Coefficients of variables d_i. The value of these coefficients is 0. 
SumOfFileAcrossDiskettesMatrix=matrix(SumOfFileAcrossDiskettes,ncol=l*l,byrow=T)
PrependCoefficientsOfDiskettes=c()
for(i in 1:l){
 PrependCoefficientsOfDiskettes=c(PrependCoefficientsOfDiskettes,c(rep(0,l),SumOfFileAcrossDiskettesMatrix[i,]))
}

# Construct coefficients of d_i to construct constraint d_i <= 1
DisketteConstraints=c()
for(i in 1:l){
 ttt=rep(0,l)
 ttt[i]=1
 DisketteConstraints=c(DisketteConstraints,ttt,rep(0,l*l))
}

# Construct matrix input of lpSolve
const.mat=matrix(c(CoefficientsOfDiskettes,PrependCoefficientsOfDiskettes,DisketteConstraints),ncol=l*(l+1),byrow=T)

print("Matrix Coefficients:")
print(const.mat)

# Construct inequalities/equalities
const.dir=c(rep(">=",l),rep("=",l),rep("<=",l))

# Construct Right-Hand side
const.rhs=c(rep(0,l),rep(1,l),rep(1,l))

# Construct Objective Function
objective.in=c(rep(1,l),rep(0,l*l))

# Invoke lpSolve
mylp=lp(direction="min",objective.in=objective.in,const.mat=const.mat,const.dir=const.dir,const.rhs=const.rhs,all.int=T)

# Print Results
print(paste("Number of Diskettes: ", sum(mylp$solution[1:l])))
tz=matrix(mylp$solution,ncol=l,byrow=T)
print("File Sizes: ")
print(FileSizes)
for(i in 2:(l+1)){
   files = which(tz[i,] == 1)
   if(length(files) > 0){
      print(paste("Files in diskette ", i-1))
      print(files)
   }
}

Most of the code above is setting up the matrix of coefficients. The call to lp() near the end then asks lpSolve to compute the optimal values of the variables.

Program Output

Running this code we get the output

[1] "Matrix Coefficients:"
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
 [1,]    1    0    0    0   -1 -0.2 -0.1 -0.1    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
 [2,]    0    1    0    0    0  0.0  0.0  0.0   -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0     0   0.0   0.0   0.0
 [3,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0
 [4,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1
 [5,]    0    0    0    0    1  0.0  0.0  0.0    1   0.0   0.0   0.0     1   0.0   0.0   0.0     1   0.0   0.0   0.0
 [6,]    0    0    0    0    0  1.0  0.0  0.0    0   1.0   0.0   0.0     0   1.0   0.0   0.0     0   1.0   0.0   0.0
 [7,]    0    0    0    0    0  0.0  1.0  0.0    0   0.0   1.0   0.0     0   0.0   1.0   0.0     0   0.0   1.0   0.0
 [8,]    0    0    0    0    0  0.0  0.0  1.0    0   0.0   0.0   1.0     0   0.0   0.0   1.0     0   0.0   0.0   1.0
 [9,]    1    0    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[10,]    0    1    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[11,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[12,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[1] "Number of Diskettes:  2"
[1] "File Sizes: "
[1] 10  2  1  1
[1] "Files in diskette  1"
[1] 2 3 4
[1] "Files in diskette  2"
[1] 1

Interpreting the Result

The first part of the output shows the matrix of coefficients. The output then prints the number of diskettes needed to store the files and the randomly generated file sizes from 1 to 10. Finally, it prints which diskettes contain which files.

The space complexity of this solution is quite substantial. Given N files, we need N^2 + N variables and 3N constraints, for a coefficient matrix with (N^2 + N)\times 3N entries. For N = 4, this is a 12 \times 20 matrix with 240 entries, as seen in the output above.

An Interview Question

I was given this interview question and I'd like to share it with you. The setting is back in the days when the largest hard disk was 1 GB, there were no CD writers yet, and the only way to back up your data was through those 1.44 MB floppy disks. You want to back up your files, but you want to minimize the number of floppy disks you need to use. Assume your most important files are in a single directory. How will you distribute the files across your disks in such a way that the number of disks you use is minimized?

Click here to see the Integer Programming Solution.

To make this simple, let’s assume the following:

– We will not take into account that every file you copy to a disk also stores a record of its metadata on the disk, which eats up space as you put in more files. For our purposes, we ignore this complexity.
– The size of each file is less than or equal to 1.44 MB.

First, we need a list of those files, including their sizes, sorted by size in descending order. If A is the sorted list of files, we can apply this algorithm:

1. B := list of files to copy to the current floppy disk
2. remaining_size := 1.44 MB
3. For file in A:
4.     If remaining_size - file.size >= 0:
5.         B.add(file)
6.         remaining_size := remaining_size - file.size
7.         A.remove(file)
8. Copy all files listed in B to disk
9. Empty B
10. Repeat process for remaining files in A

Although there are better algorithms than the one above, this is the one I managed to come up with during the interview.
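
For reference, here is a runnable Python sketch of the same greedy idea, with hypothetical file sizes in MB:

def pack(files, capacity=1.44):
    # Sort by size in descending order, then fill one diskette at a time.
    remaining = sorted(files, reverse=True)
    diskettes = []
    while remaining:
        space, contents = capacity, []
        for size in remaining[:]:   # iterate over a copy while removing
            if space - size >= 0:
                contents.append(size)
                space -= size
                remaining.remove(size)
        diskettes.append(contents)
    return diskettes

print(pack([1.0, 0.2, 0.1, 0.1, 1.4, 0.7]))
# [[1.4], [1.0, 0.2, 0.1, 0.1], [0.7]]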

We now need to determine how fast our algorithm can run.

Worst Case Complexity

How slow can this algorithm get? If for any two files F_i and F_j in A we have F_i + F_j > 1.44, then every file gets its own diskette. In this case, our algorithm executes step 4 once for each remaining file: for the first disk it executes the step N times, for the second disk N-1 times, for the third disk N-2 times, and so on. The total number of times it executes step 4, which is the total number of comparisons, is equal to the summation:

\displaystyle \sum_{i=1}^{N} i

which is equal to

\displaystyle \frac{N(N+1)}{2}

Therefore, in the worst case, the complexity is O(N^2).

Best Case Complexity

The best case is when all files fit in just one diskette. For this, the total number of comparisons is N.

Average Case Complexity

On average, files have different sizes. We now compute the complexity under the assumption that the file sizes are uniformly distributed.

If k is the number of diskettes, the number of comparisons is a sequence of monotonically decreasing numbers \{ a_1, a_2, a_3, \ldots, a_k \} taken at random from the set \{ 1, 2, \ldots, N\}. Each of the numbers a_j, j\in \{1, 2, \ldots, k\}, has a probability 1/N of being chosen. Let X be a random variable such that

\displaystyle Pr(X=a_j) = \frac{1}{N} \text{ for } j=1,2,\ldots,k

then the number of comparisons C is equal to

\displaystyle C = \sum_{i=1}^k X = kX

The expected value of C is given by

E[C] = E[kX] = kE[X]

However, the expected value of X is given by

\displaystyle E[X] = \sum_{j=1}^N j\cdot Pr(X=j) = \frac{1}{N} \sum_{j=1}^N j = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}

Therefore,

\displaystyle E[C] = k\frac{N+1}{2}

What remains is to determine the average value of k, which is the number of diskettes. If M=1.44 is the maximum file size, the average file size is M/2. The average total file size is then NM/2. The average number of diskettes is equal to the average total size divided by size of diskette, that is

k = \displaystyle \frac{NM}{2}\frac{1}{M} = \frac{N}{2}

This means that

\displaystyle E[C] = \frac{N}{2} \frac{N+1}{2} = O(N^2)

which is the same as the worst case complexity.

There is another way to solve this problem using Integer Programming.

Asynchronous Versus Synchronous: A Performance Perspective

I just had my coffee from my favorite coffee shop down the corner. I'm such a regular at that coffee shop that the baristas know me, so I can sit down after I order and do something else while they make my coffee and serve it to me when it's ready. If I order from other branches of this coffee shop, I have to wait in a corner, unable to do anything productive. So while I was seated comfortably waiting for my coffee, a thought came to me: this is a great example of synchronous versus asynchronous service invocation.

A synchronous service invocation is like buying your coffee and having to wait in the corner for the baristas to finish it before you can sit down and do something useful. An asynchronous service invocation is getting seated and doing something else while waiting for your coffee to be served. My instinct tells me that asynchronous invocation performs better than synchronous invocation, especially when the service takes a long time to complete and the waiting time is better spent doing something else. But how can we show that this is the case?

We can model the synchronous/asynchronous invocation as a Markov Chain and compute a few metrics from our model. If you’re not familiar with this approach, you can refer to this article MODELING QUEUING SYSTEMS USING MARKOV CHAINS.

First, we model the asynchronous invocation. We can identify each state of our system by 2 slots, each of which contains a number. The first slot is the number of responses waiting for the client to process, and the second slot is the number of requests waiting for the server to process. To simplify the modeling, we assume the following:

  1. The maximum number of requests that the server can handle at any given time is 2. This means that a client cannot send another request if the second number is equal to 2.
  2. When the server responds, the first number increments by 1. Once the client receives a response, it stops whatever it is doing and processes that response. When the client finishes processing the response, the first number goes back down to zero. As a consequence, this assumption says that the maximum value of the first slot is 1 at any given time.

With these assumptions, we can draw the Markov Diagram of the asynchronous system to come up with the below diagram:

Explanation of the Diagram

The system initially starts at state 00, where there are no requests and no responses. It then transitions to state 01 when the client sends a request, which increments the server's pending requests to 1. From this state, the system can transition to either state 02 or state 10. State 02 occurs when the client sends another request without waiting for a response. State 10 is when the server is able to process the request and return a response, thereby decrementing its pending items and incrementing the client's number of responses to process.

At state 02, the client can no longer send requests since the maximum number of requests that the server can handle is 2. At this state, the system can only transition to state 11, where the server has processed one of the requests (thus decrementing the second slot from 2 to 1 and incrementing the first slot from 0 to 1).

State 11 can only transition to 01 since the client will stop whatever it is doing to process the single response it has received (thus decrementing the first slot from 1 to 0).

The numbers a,b,c are the rates at which the transitions will occur. For example, the rate at which the system will transition from state 00 to state 01 is b.

In the remainder of this article, we assume that the client can request at the rate of 60 requests per unit time and can process the response at 60 per unit time. We also assume that the server can process requests at 30 per unit time. We therefore have the following:

a=60,  b=60,  c=30

Computing the Probabilities

At steady state, the flow going into each state is equal to the flow out of that state. For example, for state 00, the following is true:

\displaystyle bP_{00} = aP_{10}

Doing this for all states, we get the following balance equations:

\begin{array}{rlr}  \displaystyle  bP_{00} &= aP_{10} & (1)\\  bP_{00} + a P_{11} &= bP_{01} + cP_{01} & (2)\\  bP_{01} &= cP_{02} & (3)\\  aP_{10} &= cP_{01} & (4)\\  aP_{11} &= cP_{02} & (5)  \end{array}

Since the sum of the probabilities should be equal to 1, we have our last equation:

P_{00} + P_{01} + P_{02} + P_{10} + P_{11} = 1

We have 6 equations in 5 unknowns, so one of the equations above is redundant. In fact, you can show that equation (2) above is implied by the others.

We can form the matrix equation of the system of equations above and solve for the probabilities:

\begin{bmatrix}  -b & 0 & 0 & a & 0 \\  0 & b & -c & 0 & 0 \\  0 & -c & 0 & a & 0 \\  0 & 0 & -c & 0 & a \\  1 & 1 & 1 & 1 & 1  \end{bmatrix}  \begin{bmatrix}  P_{00}\\  P_{01}\\  P_{02}\\  P_{10}\\  P_{11}  \end{bmatrix}  =  \begin{bmatrix}  0\\  0\\  0\\  0\\  1  \end{bmatrix}

Solving this matrix equation when a=60, b=60 and c=30, we can find the probabilities:

\begin{bmatrix}  P_{00}\\  P_{01}\\  P_{02}\\  P_{10}\\  P_{11}  \end{bmatrix}  =  \displaystyle  \frac{1}{ab^2+abc+ac^2+b^2c+bc^2}  \begin{bmatrix}  ac^2\\  abc\\  ab^2\\  bc^2\\  b^2c  \end{bmatrix}  =  \begin{bmatrix}  1/10 \\ 1/5 \\2/5\\ 1/10\\ 1/5  \end{bmatrix}
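
If you want to reproduce these numbers, here is a small sketch in Python with NumPy that solves the same matrix equation:

import numpy as np

a, b, c = 60.0, 60.0, 30.0
M = np.array([[-b,  0,  0, a, 0],
              [ 0,  b, -c, 0, 0],
              [ 0, -c,  0, a, 0],
              [ 0,  0, -c, 0, a],
              [ 1,  1,  1, 1, 1]])
rhs = np.array([0, 0, 0, 0, 1.0])
print(np.linalg.solve(M, rhs))   # [0.1 0.2 0.4 0.1 0.2]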

Utilization and Throughput: Asynchronous Case

The client utilization is defined to be the probability that the client is busy. In the same way, the server utilization is the probability that the server is busy. Looking at the diagram, the client is busy sending requests at state 00 and state 01. It is busy processing responses at state 10 and 11.  On the other hand, the server is busy processing requests at state 01 and 02.

Therefore, the client utilization is  equal to

\begin{array}{rl}  U_{\text{client}} &= P_{00} + P_{01} + P_{10} + P_{11}\\  &= 1/10 + 1/5 + 1/10 + 1/5\\  &= 3/5\\  &= 60\%  \end{array}

The server utilization is equal to

\begin{array}{rl}  U_{\text{server}} &= P_{01} + P_{02}\\  &= 1/5 + 2/5 \\  &= 3/5 \\  &= 60\%  \end{array}

The system throughput is the number of requests the client is able to submit per unit time and is equal to

\begin{array}{rl}  X &= 60P_{00} + 60 P_{01} \\  &= 60*1/10 + 60 * 1/5 \\  &= 18  \end{array}

Comparison with Synchronous Invocation

For the synchronous invocation, the client will submit a request at state 00. The server will then receive this request and process it immediately at state 01. The client will get the response and process it at state 10 and do the loop again at state 00. Please see the Markov diagram below describing this process.

We can solve for the probabilities of this Markov chain by solving the balance equations:

\begin{bmatrix}  b & 0 & -a \\  0 & c & -a \\  1 & 1 & 1  \end{bmatrix}  \begin{bmatrix}  P_{00}\\  P_{01}\\  P_{10}  \end{bmatrix}  =  \begin{bmatrix}  0\\0\\1  \end{bmatrix}

Solving for the probabilities, we get

\begin{bmatrix}  P_{00}\\  P_{01}\\  P_{10}  \end{bmatrix}  =  \displaystyle  \frac{1}{ab + ac + bc}  \begin{bmatrix}  ac\\  ab\\  bc  \end{bmatrix}  =  \begin{bmatrix}  1/4\\ 1/2\\ 1/4  \end{bmatrix}
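
The same numerical check for the synchronous chain (again a Python/NumPy sketch):

import numpy as np

a, b, c = 60.0, 60.0, 30.0
M = np.array([[b, 0, -a],
              [0, c, -a],
              [1, 1,  1]])
print(np.linalg.solve(M, np.array([0, 0, 1.0])))   # [0.25 0.5 0.25]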

At state 00, the client is busy sending requests at the rate of 60 requests per unit time, therefore the system throughput is

\begin{array}{rl}  X &= 60 P_{00} \\  &= 60/4 \\  &= 15  \end{array}

which is less than the Asynchronous case. In addition, the client utilization is

\begin{array}{rl}  U_{\text{client}} &= P_{00} + P_{10}\\  &= 1/4 + 1/4\\  &= 1/2  \end{array}

This is lower than the Asynchronous case where the client utilization is 60%.

Therefore, the asynchronous service invocation is indeed better than the synchronous one: it has a higher throughput and a higher utilization compared to the synchronous case.

P.S. I used Mathics to solve the matrix equations. Mathics is an open source alternative to Mathematica. It features Mathematica-compatible syntax and functions so if you’re a user of Mathematica you can run basic commands in Mathics. If you’re a Mathics user, you can easily switch to Mathematica as well. Please visit http://mathics.github.io/ to know more about Mathics.