## Quantum Searching

Imagine a shuffled deck of 52 cards from which you are asked to find the ace of spades. The most natural approach is to take the topmost card and check whether it is the ace of spades. If not, set it aside, take the new topmost card, and repeat until you find it. If you are lucky, the ace of spades is the first card and you are done. If you are unlucky, it is at the bottom and it will take you 52 peeks before you find the card.

If we scatter the cards face down on the floor and peek at cards picked at random, setting each one aside after peeking, then on average we will need about $52/2$ peeks before we find the ace of spades.

With quantum computing we can do even better!

For the sake of demonstration, let’s say we have 8 cards as shown in the figure below. We want to find the ace of spades.

Here are the steps:

1. Label each card with a number from 0 to 7, in random order. The labels represent the states of the cards. Suppose the ace of spades ends up with the label 6, so the state $\left|6\right\rangle$ represents the ace of spades.

2. Let $\left|\phi\right\rangle$ be the superposition of states:

$\displaystyle \left|\phi\right\rangle = \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} \left|x\right\rangle = \frac{1}{2^{n/2}} \Big( \left|0\right\rangle + \ldots + \left|7\right\rangle \Big) = \frac{1}{\sqrt{8}} \left[ \begin{array}{c} 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1\\ 1 \end{array} \right]$

3. Let $f$ be a function such that

$f(x) = \begin{cases} 1 & x \text{ corresponds to ace of spades}\\ 0 & \text{ otherwise} \end{cases}$

Define the operator $\mathbf{V}$

$\mathbf{V} = \mathbf{I} - 2\left|a\right\rangle \left\langle a\right|$

where $a$ is the unique value for which

$f(a) = 1$

In our example, $a = 6$. Therefore,

$\mathbf{V} = \begin{bmatrix} 1.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00 & 0.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & -1.00 & 0.00 \\ 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 0.00 & 1.00 \\ \end{bmatrix}$

Observe that the matrix element $\mathbf{V}_{6,6} = -1$ (indexing from 0).

4. Define the operator $\mathbf{W}$:

$\mathbf{W} = 2\left|\phi\right\rangle \left\langle\phi\right| - \mathbf{I} = \begin{bmatrix} -0.75 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & -0.75 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & -0.75 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & -0.75 & 0.25 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & -0.75 & 0.25 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.75 & 0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.75 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.75 \\ \end{bmatrix}$

5. Compute the operator $\mathbf{WV}$:

$\mathbf{WV} = \begin{bmatrix} -0.75 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.25 & 0.25 \\ 0.25 & -0.75 & 0.25 & 0.25 & 0.25 & 0.25 & -0.25 & 0.25 \\ 0.25 & 0.25 & -0.75 & 0.25 & 0.25 & 0.25 & -0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & -0.75 & 0.25 & 0.25 & -0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & -0.75 & 0.25 & -0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.75 & -0.25 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.75 & 0.25 \\ 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & 0.25 & -0.25 & -0.75 \\ \end{bmatrix}$

6. Apply $\mathbf{WV}$ to $\left|\phi\right\rangle$ a number of times, about $\pi/4 \cdot 2^{n/2} = \pi/4 \cdot \sqrt{8} \approx 2.2$, which rounds to 2:

$\mathbf{WV}\left|\phi\right\rangle = \begin{bmatrix} 0.18 \\ 0.18 \\ 0.18 \\ 0.18 \\ 0.18 \\ 0.18 \\ 0.88 \\ 0.18 \\ \end{bmatrix}, \mathbf{(WV)^2}\left|\phi\right\rangle = \begin{bmatrix} -0.09 \\ -0.09 \\ -0.09 \\ -0.09 \\ -0.09 \\ -0.09 \\ 0.97 \\ -0.09 \\ \end{bmatrix}$

As you can see, the amplitude of the state $\left|6\right\rangle$ becomes very close to 1 (about 0.97), so the probability of measuring it, $0.97^2 \approx 0.95$, is also very close to 1.

Making a measurement of the input register at this point will give us the state $\left|6\right\rangle$ with a probability very close to 1.

We have just demonstrated the quantum search algorithm!
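The eight-card walk-through above can be checked numerically. Below is a minimal sketch in plain Python (no quantum libraries; the names `phi`, `V`, `W`, and `apply` are mine) that builds the vectors and matrices from steps 2 to 5 and applies $\mathbf{WV}$ twice:

```python
import math

N = 8                                   # 2**n with n = 3
a = 6                                   # index of the ace of spades

# |phi>: uniform superposition, stored as a plain list of amplitudes
phi = [1 / math.sqrt(N)] * N

# V = I - 2|a><a|  (flips the sign of the marked amplitude)
V = [[(1 if i == j else 0) - (2 if i == j == a else 0) for j in range(N)]
     for i in range(N)]

# W = 2|phi><phi| - I  (every entry 2/N, minus 1 on the diagonal)
W = [[2 / N - (1 if i == j else 0) for j in range(N)] for i in range(N)]

def apply(M, v):
    """Matrix-vector product."""
    return [sum(M[i][j] * v[j] for j in range(N)) for i in range(N)]

state = phi
for step in range(2):                   # ~ pi/4 * sqrt(8) ~ 2 iterations
    state = apply(W, apply(V, state))
    print([round(x, 2) for x in state])
```

The two printed vectors match the amplitudes shown in step 6, with the amplitude at index 6 ending up near 0.97.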

## Why it works

Let $f$ be a function such that

$f(x) = \begin{cases} 1 & x \text{ is the item we are looking for}\\ 0 & \text{ otherwise} \end{cases}$

Define the operator $\mathbf{U}_f$ whose action on an $n$-qubit input register and a 1-qubit output register is

$\mathbf{U}_f(\left|x\right\rangle \otimes \left|y\right\rangle) = \left|x\right\rangle\otimes\left|y\oplus f(x)\right\rangle$

Prepare the $n$-qubit input register and the 1-qubit output register in the following state:

$\underbrace{\left|0\right\rangle\ldots \left|0\right\rangle}_{\text{n times}} \otimes \left|1\right\rangle$

Applying the Hadamard operator on the input and output qubits gives us a superposition of $N=2^n$ states:

$\begin{array}{rl} \displaystyle \mathbf{H}^{\otimes n}\otimes\mathbf{H}(\left|0\ldots 0\right\rangle\otimes \left|1\right\rangle )&= \mathbf{H}^{\otimes n}\left|0\ldots 0\right\rangle\otimes \mathbf{H}\left|1\right\rangle\\ &= \displaystyle \frac{1}{2^{n/2}}\sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \frac{1}{\sqrt{2}} \left(\left|0\right\rangle - \left|1\right\rangle\right) \end{array}$

Next apply the operator $\mathbf{U}_f$ to get

$\begin{array}{rl} \displaystyle \mathbf{U}_f \left[\frac{1}{2^{n/2}}\sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \frac{1}{\sqrt{2}} \left(\left|0\right\rangle - \left|1\right\rangle\right)\right] &= \displaystyle \frac{1}{\sqrt{2}} \frac{1}{2^{n/2}}\left[\sum_{x=0}^{2^n-1} \mathbf{U}_f \left|x\right\rangle \otimes \left|0\right\rangle - \sum_{x=0}^{2^n-1} \mathbf{U}_f \left|x\right\rangle \otimes \left|1\right\rangle\right] \\ &= \displaystyle \frac{1}{\sqrt{2}} \frac{1}{2^{n/2}}\left[\underbrace{\sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \left|0\oplus f(x) \right\rangle }_{\text{first underbrace}} - \underbrace{\sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \left|1\oplus f(x) \right\rangle}_{\text{second underbrace}}\right] \end{array}$

Since $f(x) = 1$ only at $x=a$, the first underbrace will expand to

$\displaystyle \sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \left|0\oplus f(x) \right\rangle = \left|0\right\rangle\otimes\left|0\right\rangle + \ldots + \underbrace{\left|a\right\rangle\otimes\left|1\right\rangle} + \ldots + \left|2^n-1\right\rangle \otimes \left|0\right\rangle$

The second underbrace will expand to

$\displaystyle \sum_{x=0}^{2^n-1} \left|x\right\rangle \otimes \left|1\oplus f(x) \right\rangle = \left|0\right\rangle\otimes\left|1\right\rangle + \ldots + \underbrace{\left|a\right\rangle\otimes\left|0\right\rangle} + \ldots + \left|2^n-1\right\rangle \otimes \left|1\right\rangle\\$

The underbraces in the above summations can be swapped so that we get

$\displaystyle \frac{1}{\sqrt{2}} \frac{1}{2^{n/2}} \left[ \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle \otimes \left|0\right\rangle - \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle \otimes \left|1\right\rangle\right]$

$\begin{array}{rl} &= \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle \otimes \frac{1}{\sqrt{2}} \left( \left|0\right\rangle - \left|1\right\rangle\right)\\ &= \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle \otimes \mathbf{H}\left|1\right\rangle \end{array}$

This means that the action of $\mathbf{U}_f$ leaves the output qubit unentangled with the input register. We can ignore the output qubit moving forward and focus our attention on the input register.
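This phase-kickback step can also be checked with a small simulation. The sketch below (assuming $n=3$ and a marked item $a=6$, as in the card example) stores the joint amplitudes `psi[x][y]` of the input and output registers and applies $\mathbf{U}_f$ directly as the XOR rule $\left|x\right\rangle\left|y\right\rangle \mapsto \left|x\right\rangle\left|y\oplus f(x)\right\rangle$:

```python
import math

n, a = 3, 6
N = 2 ** n
f = lambda x: 1 if x == a else 0

# joint amplitudes psi[x][y] for |x> (input) tensor |y> (output),
# starting in |phi> tensor H|1> = sum_x |x>/sqrt(N) tensor (|0>-|1>)/sqrt(2)
psi = [[1 / math.sqrt(2 * N), -1 / math.sqrt(2 * N)] for _ in range(N)]

# U_f |x>|y> = |x>|y xor f(x)>: swap the two output amplitudes when f(x) = 1
for x in range(N):
    if f(x):
        psi[x][0], psi[x][1] = psi[x][1], psi[x][0]

# the output qubit still factors out as H|1>; the input register has
# simply picked up the phase (-1)^f(x)
signs = [round(psi[x][0] * math.sqrt(2 * N)) for x in range(N)]
print(signs)
```

The printed signs are $+1$ everywhere except at $x=6$, exactly the $(-1)^{f(x)}$ phase derived above.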

We view the current state of the input qubit as the result of some operator $\mathbf{V}$ defined by

$\mathbf{V} \left|\phi\right\rangle = \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle$

where

$\left|\phi\right\rangle = \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} \left|x\right\rangle$

We can derive the expression for $\mathbf{V}$ by expanding $\mathbf{V} \left|\phi\right\rangle$, noting that $f(x) = 1$ only at $x=a$:

$\begin{array}{rl} \mathbf{V} \left|\phi\right\rangle &= \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} (-1)^{f(x)} \left|x\right\rangle\\ &= \displaystyle \frac{1}{2^{n/2}} \left[\left|0\right\rangle + \ldots - \left|a\right\rangle +\ldots + \left|2^n-1\right\rangle\right]\\ &= \displaystyle \frac{1}{2^{n/2}} \left[\left|0\right\rangle + \ldots + \left|a\right\rangle +\ldots + \left|2^n-1\right\rangle - 2\left|a\right\rangle\right]\\ &= \displaystyle \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} \left|x\right\rangle - \frac{2}{2^{n/2}} \left|a\right\rangle\\ &= \left|\phi\right\rangle - 2\left\langle a|\phi\right\rangle \left|a\right\rangle\\ &= \Big[\mathbf{I} - 2\left|a\right\rangle \left\langle a\right|\Big] \left|\phi\right\rangle\\ \end{array}$

which means

$\mathbf{V} = \mathbf{I} - 2\left|a\right\rangle \left\langle a\right|$

The two vectors $\left|\phi\right\rangle$ and $\left|a\right\rangle$ determine a plane P. Let $\left|a_{\perp}\right\rangle$ be the vector in the plane P perpendicular to $\left|a\right\rangle$. We can write $\left|\phi\right\rangle$ as

$\left|\phi\right\rangle = \phi_1 \left|a_\perp\right\rangle + \phi_2\left|a\right\rangle$

Applying $\mathbf{V}$ to $\left|\phi\right\rangle$ gives us:

$\begin{array}{rl} \mathbf{V}\left|\phi\right\rangle &= \left(\mathbf{I} - 2\left|a\right\rangle \left\langle a\right| \right) \left( \phi_1 \left|a_\perp\right\rangle + \phi_2\left|a\right\rangle \right)\\ &= \phi_1 \left|a_\perp\right\rangle + \phi_2\left|a\right\rangle - 2\phi_2\left|a\right\rangle\\ &= \phi_1 \left|a_\perp\right\rangle - \phi_2\left|a\right\rangle \end{array}$

The effect of the operator $\mathbf{V}$ is therefore to reflect the vector $\left|\phi\right\rangle$ with respect to the vector $\left|a_\perp\right\rangle$.

Now, we want to reflect this vector $\mathbf{V}\left|\phi\right\rangle$ with respect to $\left|\phi\right\rangle$. To accomplish this, we define the operator $\mathbf{W}$ by

$\mathbf{W} = 2\left|\phi\right\rangle \left\langle\phi\right| - \mathbf{I}$

Apply this operator to $\mathbf{V}\left|\phi\right\rangle$:

$\begin{array}{rl} \mathbf{WV}\left|\phi\right\rangle &= \Big[ 2\left|\phi\right\rangle \left\langle\phi\right| - \mathbf{I} \Big]\mathbf{V}\left|\phi\right\rangle\\ &= 2\left|\phi\right\rangle \left\langle\phi\right|\mathbf{V}\left|\phi\right\rangle -\mathbf{V}\left|\phi\right\rangle \end{array}$

If we express $\mathbf{V}\left|\phi\right\rangle$ as a linear combination of $\left|\phi\right\rangle$ and a vector $\left|\phi_\perp\right\rangle$ in the plane P perpendicular to $\left|\phi\right\rangle$,

$\mathbf{V}\left|\phi\right\rangle = \alpha \left|\phi\right\rangle + \beta \left|\phi_\perp\right\rangle$

We have,

$\begin{array}{rl} \mathbf{WV}\left|\phi\right\rangle &= \Big[ 2\left|\phi\right\rangle \left\langle\phi\right| - \mathbf{I} \Big]\mathbf{V}\left|\phi\right\rangle\\ &= 2\left|\phi\right\rangle \left\langle\phi\right|\mathbf{V}\left|\phi\right\rangle -\mathbf{V}\left|\phi\right\rangle\\ &= 2\alpha\left|\phi\right\rangle - \alpha \left|\phi\right\rangle - \beta \left|\phi_\perp\right\rangle\\ &= \alpha\left|\phi\right\rangle - \beta \left|\phi_\perp\right\rangle \end{array}$

which demonstrates that $\mathbf{W}$ reflects $\mathbf{V}\left|\phi\right\rangle$ with respect to $\left|\phi\right\rangle$.

Therefore, the effect of $\mathbf{WV}$ on $\left|\phi\right\rangle$ is to rotate it by an angle $\gamma$ counter-clockwise.

We can compute this $\gamma$ by taking the inner product of $\mathbf{WV}\left|\phi\right\rangle$ and $\left|\phi\right\rangle$. First, let's find an expression for $\mathbf{WV}\left|\phi\right\rangle$:

$\begin{array}{rl} \mathbf{WV}\left|\phi\right\rangle &= \mathbf{W} \left( \mathbf{I} - 2\left|a\right\rangle \langle a |\right) \left|\phi\right\rangle\\ &= \mathbf{W} \left(\left|\phi\right\rangle - 2 \underbrace{\langle a| \phi \rangle} \left|a\right\rangle \right) \end{array}$

The quantity $\langle a| \phi \rangle$ is

$\langle a| \phi \rangle = \displaystyle \langle a| \left( \frac{1}{2^{n/2}} \sum_{x=0}^{2^n-1} \left|x\right\rangle \right) = \frac{1}{2^{n/2}} = \cos \theta$

where $\theta$ is the angle between $\left|a\right\rangle$ and $\left|\phi\right\rangle$.

The complementary angle $\rho = 90°-\theta$ is the angle between $\left|\phi\right\rangle$ and $\left|a_\perp\right\rangle$. Using the identity $\cos\theta = \sin(90° - \theta)$, we can compute $\rho$:

$\cos \theta = \sin \rho = \displaystyle \frac{1}{2^{n/2}}$

Since $\displaystyle \frac{1}{2^{n/2}}$ is very small if n is large,

$\sin \rho = \displaystyle \frac{1}{2^{n/2}} \approx \rho$

Continuing,

$\begin{array}{rl} \mathbf{WV}\left|\phi\right\rangle &= \mathbf{W} \left( \mathbf{I} - 2\left|a\right\rangle \langle a |\right) \left|\phi\right\rangle\\ &= \mathbf{W} \left(\left|\phi\right\rangle - 2 \sin\rho \left|a\right\rangle \right)\\ &= (2\left|\phi\right\rangle \langle \phi| - \mathbf{I}) \left(\left|\phi\right\rangle - 2 \sin\rho \left|a\right\rangle \right)\\ &= 2\left|\phi\right\rangle - \left|\phi\right\rangle - 4\sin\rho \left|\phi\right\rangle \langle \phi |a\rangle + 2 \sin\rho \left|a\right\rangle\\ &= \left|\phi\right\rangle - 4\sin^2\rho \left|\phi\right\rangle + 2 \sin\rho \left|a\right\rangle\\ &= (1- 4\sin^2\rho) \left|\phi\right\rangle + 2 \sin\rho \left|a\right\rangle \end{array}$

The inner product of $\mathbf{WV}\left|\phi\right\rangle$ and $\left|\phi\right\rangle$ is given by

$\begin{array}{rl} \langle\phi|\mathbf{WV}\left|\phi\right\rangle &= \langle\phi|\left[ (1- 4\sin^2\rho) \left|\phi\right\rangle + 2 \sin\rho \left|a\right\rangle\right]\\ &= (1- 4\sin^2\rho) + 2\sin^2\rho\\ &= 1- 2\sin^2\rho\\ &= \cos 2\rho \end{array}$

Therefore, the angle $\gamma$ between these two vectors is $2\rho \approx \displaystyle \frac{2}{2^{n/2}}$. How many times do we have to apply the operator $\mathbf{WV}$ to rotate $\left|\phi\right\rangle$ through a total angle of $\pi/2$?

$\begin{array}{rl} m\cdot 2\rho &= \displaystyle \frac{\pi}{2}\\ \displaystyle m \frac{2}{2^{n/2}}&= \displaystyle \frac{\pi}{2}\\ m &=\displaystyle \frac{\pi\cdot 2^{n/2}}{4} \end{array}$

Therefore, we need to apply the operator $\mathbf{WV}$ $O(\sqrt{N})$ times to find our value (where $N=2^n$).
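Because $\mathbf{V}$ is just a sign flip of the marked amplitude and $\mathbf{W}$ is an inversion about the mean, we can sanity-check this iteration count for a larger register without building any matrices. A sketch, assuming a 10-qubit register ($N = 1024$) with the marked state again at index 6:

```python
import math

def grover_probability(n_qubits, target):
    """Probability of measuring the marked state after ~ pi/4 * sqrt(N)
    applications of WV, starting from the uniform superposition."""
    N = 2 ** n_qubits
    amp = [1 / math.sqrt(N)] * N               # |phi>
    steps = round(math.pi / 4 * math.sqrt(N))  # m = pi * 2^(n/2) / 4
    for _ in range(steps):
        amp[target] = -amp[target]             # V: flip the marked amplitude
        mean = sum(amp) / N
        amp = [2 * mean - x for x in amp]      # W: inversion about the mean
    return amp[target] ** 2

print(grover_probability(10, 6))
```

For $N = 1024$ only about $\pi\sqrt{1024}/4 \approx 25$ iterations are needed, after which the probability of measuring the marked state exceeds 0.99; the 3-qubit case reproduces the $\approx 0.95$ probability from the card example.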

## Prototyping Parallel Programs on OpenShift

In the previous posts, we were introduced to parallel computing and saw how it can speed up computation, and by what factor, depending on the algorithm. We also introduced Chapel, a language for parallel computation developed by Cray. Using Chapel, we first developed a very simple parallel algorithm to compute the sum of a list of numbers, and then we learned how to implement parallel matrix multiplication. In this post, we will talk about how to run our parallel programs on a cluster of machines. The source code referenced here can be found on GitHub.

To run our parallel program, we would normally set up bare-metal servers or virtual machines (VMs), install Chapel, and clone our program to N instances. A shared filesystem is mounted on a pre-defined directory on each instance so that all instances can load data from the same location. Any output from each instance is also written to this filesystem (although we must ensure the instances write to different files or directories to avoid corrupting data).

Here is how our setup will look:

We will also need to setup the following:

• SSH server – this will run as an ordinary user on port 2222. The Chapel platform will connect to port 2222 on each machine.

We will need to export the following environment variables to enable parallel computation across different machines. The purpose of those variables can be found here. The variable GASNET_SSH_SERVERS is a space-separated list of IP addresses or hostnames of our nodes. In our setup, we will use the following:

export CHPL_COMM=gasnet
export CHPL_LAUNCHER=amudprun
export GASNET_SSH_SERVERS="node00 node01 node10 node11"


where node00, node01, node10, node11 are hostnames of our machines/nodes.

## How to Run a Parallel Program

We will run our parallel matrix multiplication on our cluster. To launch the program, we log in to one of the nodes, change directory to /home/chapel, which contains the mat_mul binary, and issue the following command:

./mat_mul -nl 4


## Running in OpenShift

If provisioning a bare-metal or VM environment for parallel computing is costly, we can turn to containers for a much cheaper and faster way to provision. To make our lives easier, we will use the OpenShift Origin platform to run our parallel computing environment.

In the succeeding sections, we assume that we have a working installation of OpenShift. See the instructions here on how to set up OpenShift on VirtualBox or here to install a single-node OpenShift in AWS.

### Containerizing the “Node”

We will have to build a container image of the bare metal or VM we mentioned above. The format we will use is Docker. First, we clone the project from github:

git clone https://github.com/corpbob/chapel-openshift-demo.git


Change directory to chapel-openshift-demo and build the image:

docker build -t chapel .


Create the chapel project on OpenShift. We use the admin user in this example:

oc login -u admin
oc new-project chapel


We need to know the IP address of the docker-registry service in OpenShift, so we execute the following command:

[root@openshift chapel-openshift-demo]# oc describe svc docker-registry -n default
Name:			docker-registry
Namespace:		default
Labels:			docker-registry=default
Selector:		docker-registry=default
Type:			ClusterIP
IP:			172.30.1.1
Port:			5000-tcp	5000/TCP
Endpoints:		172.17.0.7:5000
Session Affinity:	ClientIP
No events.


The IP: line of the output gives us the IP address of the docker-registry service. We will use this to push the Chapel image we just created to the registry. First we log in to the docker registry:

docker login -u admin -p $(oc whoami -t) 172.30.1.1:5000

Tag the chapel image using the following format:

<registry_ip>:<port>/<project_name>/<image_name>:<version>

For our example, we use the following command to tag the image:

docker tag chapel 172.30.1.1:5000/chapel/chapel:v0.1

We can now push the image to the internal docker registry:

docker push 172.30.1.1:5000/chapel/chapel

### Importing the Chapel Template

In the previous section, we built the container image of Chapel and pushed it to the private docker registry. In this section, we will import a template that will do the following:

• Set up the Chapel containers inside OpenShift

• Create a file that will dynamically generate the variable GASNET_SSH_SERVERS containing the IP addresses of the Chapel pods that will be used in the parallel computation.

The name of the template is chapel.yml. Import the template using the command:

oc create -f chapel.yml

We need to give the default service account the view role so that it can read the IP addresses of the pods associated with chapel. To do this, execute the command:

oc policy add-role-to-user view system:serviceaccount:chapel:default

After this, we can now create the chapel application:

oc new-app chapel

This will automatically trigger a deployment of 4 chapel instances with a shared volume mounted at /home/chapel/mnt. The "hook-post" pod is a separate instance of the chapel image that will execute the following commands:

echo "export GASNET_MASTERIP=\$MY_NODE_IP" > /home/chapel/mnt/exports && \
echo "export GASNET_SSH_OPTIONS=\"-p 2222\"" >> /home/chapel/mnt/exports && \
for pod in `oc get pods -l app=chapel | grep chapel | awk '{print $1}'`; \
do \
  oc describe pod $pod | grep ^IP: | awk '{print $2}'; \
done | \
awk 'BEGIN { x="" } \
{ x = x $1 " " } \
END { print "export GASNET_SSH_SERVERS=\"" x "\"" }' >> \
/home/chapel/mnt/exports


The output of the above command is a file named exports that looks like the following:

### Running the Sample Program

We now go to the web console -> Applications -> Pods. Select any of the pods and click Terminal. In the /home/chapel directory, there is a file named run-test.sh. This file contains the following commands:

export GASNET_SPAWNFN=S
source /home/chapel/mnt/exports

./hello6-taskpar-dist -nl $*

The commands above execute the pre-compiled chapel binary hello6-taskpar-dist, which was compiled when we built the container image earlier. Executing this file gives us:

# the parameter 4 tells chapel to use 4 pods to execute the command.
sh-4.2$ ./run-test.sh 4
Warning: Permanently added '[172.17.0.16]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[172.17.0.18]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[172.17.0.15]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[172.17.0.17]:2222' (ECDSA) to the list of known hosts.
Hello, world! (from locale 0 of 4 named chapel-1-fgg4b)
Hello, world! (from locale 2 of 4 named chapel-1-pt3v9)
Hello, world! (from locale 1 of 4 named chapel-1-rg668)
Hello, world! (from locale 3 of 4 named chapel-1-rw2rc)


### Running the Parallel Matrix Multiplication

Copy the file mat_mul.chpl to the pod and compile it:

chpl mat_mul.chpl -o mnt/mat_mul


The command above will place the resulting binary inside the directory /home/chapel/mnt. This will be accessible from all pods.

Finally, execute the parallel matrix multiplication:

./run.sh mnt/mat_mul 4


## Conclusion

I have not tried this in a real production environment or even on a bare-metal installation of OpenShift. This seems to be a promising use of OpenShift, but I still have to find out.

## Asynchronous Versus Synchronous: A Performance Perspective

I just had my coffee from my favorite coffee shop around the corner. I'm such a regular at that coffee shop that the baristas know me, so after I order I can sit down and do something else while they make my coffee, and they serve it to me when it's ready. If I order from other branches of this coffee shop, I have to wait for my coffee in a corner, unable to do anything productive. So while I was seated comfortably waiting for my coffee, a thought came to me: this is a great example of synchronous versus asynchronous service invocation.

A synchronous service invocation is like buying your coffee and having to wait in the corner for the baristas to finish it before you can sit down and do something useful. An asynchronous service invocation is getting seated and doing something else while waiting for your coffee to be served. My instinct tells me that asynchronous invocation performs better than synchronous invocation, especially when the service takes a long time to complete and the waiting time is better spent doing something else. But how can we show that this is the case?

We can model the synchronous/asynchronous invocation as a Markov Chain and compute a few metrics from our model. If you’re not familiar with this approach, you can refer to this article MODELING QUEUING SYSTEMS USING MARKOV CHAINS.

First, we model the asynchronous invocation. We can identify each state of our system by two slots, each containing a number. The first slot is the number of responses waiting for the client to process and the second slot is the number of requests waiting for the server to process. To simplify the modeling, we assume the following:

1. The maximum number of requests that the server can handle at any given time is 2, which means that the client can no longer send a request when the second number is equal to 2.
2. When the server responds, the first number increments by 1. Once the client receives a response, it stops whatever it is doing and processes that response. When the client finishes processing the response, the first number goes back to zero. As a consequence, this assumption says that the maximum value of the first slot is 1 at any given time.

With these assumptions, we can draw the Markov Diagram of the asynchronous system to come up with the below diagram:

Explanation of the Diagram

The system starts at state 00, where there are no requests and no responses. It transitions to state 01 when the client sends a request, which increments the server's pending requests to 1. From this state, the system can transition to either state 02 or state 10. State 02 occurs when the client sends another request without waiting for a response. State 10 occurs when the server processes the request and returns a response, thereby decrementing its pending items and incrementing the client's number of responses to process.

At state 02, the client can no longer send requests since the maximum number of requests the server can handle is 2. From this state, the system can only transition to state 11, where the server has processed one of the requests (thus decrementing the second slot from 2 to 1 and incrementing the first slot from 0 to 1).

State 11 can only transition to 01 since the client will stop whatever it is doing to process the single response it has received (thus decrementing the first slot from 1 to 0).

The numbers $a,b,c$ are the rates at which the transitions will occur. For example, the rate at which the system will transition from state 00 to state 01 is $b$.

In the remainder of this article, we assume that the client can request at the rate of 60 requests per unit time and can process the response at 60 per unit time. We also assume that the server can process requests at 30 per unit time. We therefore have the following:

$a=60, b=60, c=30$

Computing the Probabilities

At steady state, the flow going into each state is equal to the flow out of that state. For example, for state 00, the following is true:

$\displaystyle bP_{00} = aP_{10}$

Doing this for all states, we get the following balance equations:

$\begin{array}{rlr} \displaystyle bP_{00} &= aP_{10} & (1)\\ bP_{00} + a P_{11} &= bP_{01} + cP_{01} & (2)\\ bP_{01} &= cP_{02} & (3)\\ aP_{10} &= cP_{01} & (4)\\ aP_{11} &= cP_{02} & (5) \end{array}$

Since the sum of the probabilities should be equal to 1, we have our last equation:

$P_{00} + P_{01} + P_{02} + P_{10} + P_{11} = 1$

We have 6 equations in 5 unknowns, so one of the equations above must be redundant. In fact, you can show that equation 2 above is redundant.

We can form the matrix equation of the system of equations above and solve for the probabilities:

$\begin{bmatrix} -b & 0 & 0 & a & 0 \\ 0 & b & -c & 0 & 0 \\ 0 & -c & 0 & a & 0 \\ 0 & 0 & -c & 0 & a \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 1 \end{bmatrix}$

Solving this matrix equation when a=60, b=60 and c=30, we can find the probabilities:

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \displaystyle \frac{1}{ab^2+abc+ac^2+b^2c+bc^2} \begin{bmatrix} ac^2\\ abc\\ ab^2\\ bc^2\\ b^2c \end{bmatrix} = \begin{bmatrix} 1/10 \\ 1/5 \\2/5\\ 1/10\\ 1/5 \end{bmatrix}$
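These probabilities can be reproduced mechanically. Below is a sketch in plain Python (the post itself used Mathics; this version uses exact fractions and textbook Gauss-Jordan elimination, and the function name `solve` is mine) that solves the matrix equation and also recomputes the utilization and throughput figures discussed in the next section:

```python
from fractions import Fraction

def solve(A, rhs):
    """Solve A x = rhs exactly by Gauss-Jordan elimination."""
    n = len(A)
    M = [[Fraction(v) for v in row] + [Fraction(rhs[i])]
         for i, row in enumerate(A)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]        # move a nonzero pivot up
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col] / M[col][col]
                M[r] = [x - f * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

a, b, c = 60, 60, 30    # request, send, and service rates from the post

# rows: balance equations (1), (3), (4), (5) and the normalization row
A = [[-b,  0,  0, a, 0],
     [ 0,  b, -c, 0, 0],
     [ 0, -c,  0, a, 0],
     [ 0,  0, -c, 0, a],
     [ 1,  1,  1, 1, 1]]
P00, P01, P02, P10, P11 = solve(A, [0, 0, 0, 0, 1])
print(P00, P01, P02, P10, P11)

util_client = P00 + P01 + P10 + P11     # states where the client is busy
throughput = a * (P00 + P01)            # requests submitted per unit time
print(util_client, throughput)
```

The same `solve` routine handles the 3-state synchronous system later in the post by swapping in its smaller balance matrix.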

Utilization and Throughput: Asynchronous Case

The client utilization is defined as the probability that the client is busy; likewise, the server utilization is the probability that the server is busy. Looking at the diagram, the client is busy sending requests at states 00 and 01 and busy processing responses at states 10 and 11. The server, on the other hand, is busy processing requests at states 01 and 02.

Therefore, the client utilization is  equal to

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{01} + P_{10} + P_{11}\\ &= 1/10 + 1/5 + 1/10 + 1/5\\ &= 3/5\\ &= 60\% \end{array}$

The server utilization is equal to

$\begin{array}{rl} U_{\text{server}} &= P_{01} + P_{02}\\ &= 1/5 + 2/5 \\ &= 3/5 \\ &= 60\% \end{array}$

The system throughput is the number of requests the client is able to submit per unit time and is equal to

$\begin{array}{rl} X &= 60P_{00} + 60 P_{01} \\ &= 60*1/10 + 60 * 1/5 \\ &= 18 \end{array}$

Comparison with Synchronous Invocation

For the synchronous invocation, the client will submit a request at state 00. The server will then receive this request and process it immediately at state 01. The client will get the response and process it at state 10 and do the loop again at state 00. Please see the Markov diagram below describing this process.

We can solve the probabilities of this Markov chain by solving the balance equations:

$\begin{bmatrix} b & 0 & -a \\ 0 & c & -a \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \begin{bmatrix} 0\\0\\1 \end{bmatrix}$

Solving for the probabilities, we get

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \displaystyle \frac{1}{ab + ac + bc} \begin{bmatrix} ac\\ ab\\ bc \end{bmatrix} = \begin{bmatrix} 1/4\\ 1/2\\ 1/4 \end{bmatrix}$

At state 00, the client is busy sending requests at the rate of 60 requests per unit time, therefore the system throughput is

$\begin{array}{rl} X &= 60 P_{00} \\ &= 60/4 \\ &= 15 \end{array}$

which is less than the Asynchronous case. In addition, the client utilization is

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{10}\\ &= 1/4 + 1/4\\ &= 1/2 \end{array}$

This is lower than the Asynchronous case where the client utilization is 60%.

Therefore, the asynchronous service invocation is indeed better than the synchronous one: it has a higher throughput (18 versus 15 requests per unit time) and a higher client utilization (60% versus 50%).

P.S. I used Mathics to solve the matrix equations. Mathics is an open source alternative to Mathematica. It features Mathematica-compatible syntax and functions so if you’re a user of Mathematica you can run basic commands in Mathics. If you’re a Mathics user, you can easily switch to Mathematica as well. Please visit http://mathics.github.io/ to know more about Mathics.

## When Average Is Not Enough: Thoughts on Designing for Capacity

Designing a system from scratch to handle a workload you don't know is a challenge. If you provision too much hardware, you might be wasting money. If you provision too little, your users will complain about how slow the system is.

If you’re given only a rate, like 6000 hits/hour, you don’t know how these are distributed in a minute by minute or per second interval. We can make a guess and say that there are about 100 hits per minute or 1.67 hits/sec. If hits come uniformly at that rate, then we can design a system that can handle 2 hits/sec and all users will be happy since all requests will be served quickly and no queueing of requests. But we know it’s not going to happen. There will be some interval where the number of hits is less than 3 and some more than 3.

Theoretically, requests to our server arrive at random. Let's imagine 60 bins representing the seconds in one minute, and imagine that requests are balls we throw into the bins. Each ball is equally likely to land in any bin; it's even possible for all the balls to land in one bin!

After throwing the balls into bins, let’s see what we have.

As you can see, some bins contain more than the average number of balls per bin ($100/60 \approx 1.67$). Therefore, if we design our system based on the average, users whose requests fall in the busier bins will have a bad experience. We need to find how many requests per second our server must handle so that almost all users have a good experience (without overspending).

To determine how many requests per second we need to support, we need the probability of getting 4, 5, 6, or more requests per second. We will compute the probability starting from 3 requests per second and increment by one until we get a low enough probability. If we design the system for a rate that has a low probability of occurring, we would be spending money on something that rarely happens.

Computing the Probability Distribution

We can view the distribution of balls into bins in another way. Imagine labeling each ball with a number from 1 to 60. Each number has an equal chance to be picked. The meaning of this labeling is this: the number that was assigned to the ball is the bin (time bucket) it belongs to. After labeling all balls, what you have is a distribution of balls into bins.

Since each ball can be labeled in 60 different ways and there are 100 balls, the number of ways we can label 100 different balls is therefore

$\displaystyle 60^{100}$

Pick a number from 1 to 60, say number 1. Suppose exactly 2 balls out of the 100 are labeled with number 1. In how many ways can this happen? Choose the first ball to label: there are 100 ways to choose it. Choose the second ball: there are now 99 ways. We therefore have $100 \times 99 = 9900$ ways to select 2 balls and label them 1. Since we don’t really care in what order we picked the balls, we divide 9900 by the number of possible arrangements of ball 1 and ball 2, which is 2! (where the exclamation mark stands for “factorial”). So far, the number of ways to label exactly 2 balls with the same number is

$\displaystyle \frac{100 \times 99}{2!}$

Since these are the only balls with label 1, the third ball can be labeled anything except number 1, so there are 59 ways to label ball 3. In the same way, there are 59 ways to label ball 4. Continuing this reasoning until ball 100, the total number of ways to label exactly 2 balls with number 1 and the rest with anything else is therefore:

$\displaystyle \frac{100 \times 99}{2!} \times 59^{98}$

Notice that the exponent of 59 is 98 since there are 98 balls starting from ball 3 to ball 100.

Therefore, the probability that a given bin contains exactly two balls is

$\displaystyle \frac{100 \times 99}{2!} \times \frac{59^{98}}{60^{100}} = 0.2648$
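The numbers involved here are enormous, so it is worth cross-checking the value. A quick sketch in Python (my addition; the post itself uses R) does the computation with exact integers before the final division:

```python
from math import comb

# C(100, 2) ways to choose the pair labeled 1, times 59^98 labelings
# for the remaining balls, out of 60^100 equally likely labelings.
p2 = comb(100, 2) * 59**98 / 60**100
print(round(p2, 4))  # 0.2648
```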

We can also write this as

$\displaystyle \frac{100!}{2! \times 98!} \times \frac{(60-1)^{98}}{60^{100}} = \binom{100}{2} \frac{(60-1)^{98}}{60^{100}}$

In general, if $m$ is the number of balls, $n$ the number of bins, and $k$ the number of balls with the same label, then the probability that a given bin contains exactly $k$ balls is given by

$\displaystyle \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}}$

where

$\displaystyle \binom{m}{k} = \frac{m!}{k!(m-k)!}$

is the binomial coefficient.

It turns out that this is a probability distribution, since the sum of all probabilities from k=0 to k=m is equal to 1. That is,

$\displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} = 1$

To see this, recall from the Binomial Theorem that

$\displaystyle \big( x + y \big)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k}y^k$

If we let $x = n-1$ and $y = 1$ (applying the theorem with $m$ in place of $n$), we can write the above sum as

$\displaystyle \begin{array}{ll} \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} &= \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}\cdot 1^k}{n^{m}}\\ &= \displaystyle\frac{(n-1+1)^m}{n^{m}}\\ &= \displaystyle\frac{n^m}{n^m}\\ &= \displaystyle 1 \end{array}$
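As a numerical sanity check of this identity, the sum can be evaluated with exact rational arithmetic; here is a small Python sketch (my addition, not from the post):

```python
from fractions import Fraction
from math import comb

m, n = 100, 60
# sum_{k=0}^{m} C(m,k) * (n-1)^(m-k) / n^m, computed exactly
total = sum(Fraction(comb(m, k) * (n - 1) ** (m - k), n ** m)
            for k in range(m + 1))
print(total)  # 1
```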

Here is a graph of this probability distribution.

Here’s the plot data:

| k | probability |
|--:|------------:|
| 1 | 0.315663315854 |
| 2 | 0.264836171776 |
| 3 | 0.146632456689 |
| 4 | 0.060268424995 |
| 5 | 0.019612775592 |
| 6 | 0.005263315484 |
| 7 | 0.001197945897 |
| 8 | 0.000236035950 |
| 9 | 0.000040895118 |
| 10 | 0.000006307552 |

We can see that for k=9, the probability is about 0.004%. Anything beyond that we can call rare, and there is no need to spend money on it.

Just For Fun

What’s the probability that a given bin is empty, that is, there are no balls in it?

Other Probability Distributions

Our computation above was based on a uniform probability distribution. However, there are other distributions that are more suitable for modeling the arrival of requests. One of the most widely used is the Poisson distribution.

R Code

The R code to generate the simulation:

# 16 panels, each an independent simulation
par(mfrow=c(4,4))
f=function(){
# throw 100 balls uniformly at random into bins 1..60
x=sample(1:60, 100, replace=TRUE)
# plot the number of balls (requests) in each one-second bin
plot(tabulate(x, nbins=60), type="h", ylab="tx", xlab="secs")
}

for(i in 1:16){
f()
}


The R code to generate the probability distribution:

# number of ways to label exactly s of the m balls with a fixed number,
# and the remaining m-s balls with any of the other n-1 numbers:
# m*(m-1)*...*(m-s+1)/s! * (n-1)^(m-s)
p=function(m,n,s){
prod(seq(m,m-s+1))/factorial(s)*(n-1)^(m-s)
}

# divide by n^m = 60^100 total labelings to get probabilities for k=1..10
tt=c()
for(i in 1:10){
tt=c(tt,p(100,60,i)/60^100)
}
plot(tt,type="h",xlab="Number of Balls",ylab="Probability")


Dedication

This post is dedicated to my friend Ernesto Adorio, a mathematician. He loves combinatorial mathematics.

Rest in peace my friend! I miss bouncing ideas with you.

## Divided We Compute, United We Reduce

Once upon a time, in a faraway village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said, “Sons, see that bundle of sticks? Each one of you, try to break it. The one who can break it will inherit all my riches.” Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks, but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now take one stick and break it.” Without any effort, each son was able to break his stick. The old man said, “You see, when you unite, no task will be difficult. The riches I told you about were a lie. We are broke. When I’m dead, make sure you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks as a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When a single processor is given the task of solving the complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems that are given to each processor, each processor can solve its smaller problem quickly, thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them in separate processors is called Parallel Computing. In this article, we will compute how fast a certain algorithm will run when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, \ldots, a_n \}$ that you want to sort based on pairwise comparison. The sorted array is just one of the many permutations of the array $\{a_1, a_2, a_3,\ldots, a_n\}$. In fact, if you have $n$ different objects to sort, there are exactly $n!$ ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example, we have the array $A = \{ a, b, c \}$. To sort it, we first compare $a$ with $b$, and there are 2 ways this can go: either $a \le b$ or $a > b$. If $a \le b$, we then compare $b$ and $c$. This also gives two outcomes, either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.
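To make the $n!$ count concrete, here is a tiny Python illustration (my addition) that lists every ordering of a 3-element array; exactly one of the $3! = 6$ permutations is the sorted state:

```python
from itertools import permutations

arrangements = list(permutations(["a", "b", "c"]))
print(len(arrangements))  # 6, i.e. 3!

# exactly one permutation is already sorted
sorted_count = sum(1 for p in arrangements if list(p) == sorted(p))
print(sorted_count)  # 1
```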

Since this decision tree has $n!$ leaves, its height, and hence the worst-case number of comparisons, is at least $\lg(n!)$. We also have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] = \lg n + \lg (n-1) + \cdots + \lg 1 \le \underbrace{\lg n + \cdots + \lg n}_{n \text{ times}}$
$\lg(n!) \le n\cdot \lg n$
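This bound is easy to check numerically, since $\lg(n!) = \ln\Gamma(n+1)/\ln 2$; a short Python check (my addition):

```python
import math

n = 2**20
lg_factorial = math.lgamma(n + 1) / math.log(2)  # lg(n!)
n_lg_n = n * math.log2(n)                        # n lg n = 20971520

print(lg_factorial < n_lg_n)  # True: lg(n!) <= n lg n
```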

There are efficient algorithms that sort with this complexity; merge sort, for example. Therefore, if you have an array of $2^{20}$ elements, the number of comparisons is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million elements. Could we do any better? We could upgrade the CPU of the machine doing the sorting, or use two or more machines to divide the work among them. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it out to other processors.

Divide and Conquer

Assume we have an array of $n=2^{20}$ elements that we need to sort, and suppose we have two identical processors we can use. Divide the array into 2 equal-sized arrays. Give the first half to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array on each processor. We then combine the results of processor 1 and processor 2 into one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.

If you count the total number of comparisons that you need to merge two sorted arrays, you will find that it takes $n-1$ comparisons. Therefore, the complexity of the merging process is $O(n)$.
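The post does not show code for the merge, but it can be sketched in a few lines of Python (my addition), with a counter confirming the at-most-$(n-1)$-comparisons claim:

```python
def merge(left, right):
    """Merge two sorted lists into one, counting pairwise comparisons."""
    merged, comparisons = [], 0
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1               # one comparison per element placed
        if left[i] <= right[j]:
            merged.append(left[i]); i += 1
        else:
            merged.append(right[j]); j += 1
    # one list is exhausted; the remainder needs no comparisons
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons

result, c = merge([1, 3, 5, 7], [2, 4, 6, 8])
print(result)  # [1, 2, 3, 4, 5, 6, 7, 8]
print(c)       # 7, i.e. n - 1 for n = 8
```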

Since each of the $p$ processors sorts an $n/p$-sized subarray (here $n/2$), the sorting complexity per processor is $\frac{n}{p} \lg (n/p)$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle n/p \lg(n/p) + n$

In our example, $C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048$ comparisons compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $n/p \lg(n/p)$ dominates $n$, therefore the complexity of the parallel algorithm is $O(n/p \lg(n/p))$.
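The arithmetic is easy to verify (a quick Python check, my addition):

```python
from math import log2

n, p = 2**20, 2
sequential = n * log2(n)                # n lg n
parallel = (n // p) * log2(n // p) + n  # (n/p) lg(n/p) + n

print(int(sequential))  # 20971520
print(int(parallel))    # 11010048
```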

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y=2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p})$ gives us the value of $p$ such that the total comparisons is already linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to solve the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$

Since this is a non-linear equation, we can solve it using Newton's method, which approximates a root starting from an initial guess. Starting from a guess $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}$

where $g\prime ( p )$ is the first derivative of $g( p )$. Applying the rules of derivatives, we get

$\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p$

Substituting this to the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p_n 2^{p_n} - n}{p_n 2^{p_n} \ln 2 + 2^{p_n}}$

Below is R code using Newton's method to compute the root of the equation $g(p) = 0$.

# g(p) = p * 2^p - n
g=function(n,p){
p*2^p - n
}

# derivative g'(p) = p * 2^p * ln(2) + 2^p
gprime=function(n,p){
p*2^p*log(2) + 2^p
}

newton=function(p,n,iter){
tmp = p
for(i in 1:iter){
p = p - g(n,p)/gprime(n,p)

# stop once successive iterates agree to within the tolerance
if(abs(p-tmp) < 0.0001){
break
}

print(p)
tmp = p
}
print(p)
}



Running this code with $n = 2^{20}$ and an initial guess of 15, the printed iterates converge to the root $p = 16$.
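The root can also be confirmed exactly: $16 \cdot 2^{16} = 2^{20}$, so $p = 16$ solves $p\,2^p = n$. Below is a Python version of the same Newton iteration (my re-derivation, using the derivative $g'(p) = p\,2^p \ln 2 + 2^p$ computed earlier):

```python
import math

def newton(p, n, iters=100):
    """Solve g(p) = p * 2**p - n = 0 by Newton's method."""
    for _ in range(iters):
        g = p * 2**p - n
        gprime = p * 2**p * math.log(2) + 2**p
        p_next = p - g / gprime
        if abs(p_next - p) < 1e-9:
            return p_next
        p = p_next
    return p

n = 2**20
print(16 * 2**16 == n)          # True: exact root
print(round(newton(15, n), 6))  # 16.0
```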


Ignoring network latency, by distributing the input evenly across 16 processors we get an $O(n)$ running time for an array of $n = 2^{20}$ items. Instead of roughly 20 million comparisons, we need only about 2 million comparison steps (about $n$ for the parallel sorting phase plus $n$ for the merge) to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm rather than the exception. Learning to harness the power of multiple cores is an extremely handy skill to have.

## Performance of Shortest Queue Discipline

We computed the performance of the random-line queueing discipline in the previous article. However, in real life, we don’t really toss a coin to pick which of the two lines we join. Most of the time, we go to the shortest line for service. Using the modeling techniques we have learned, we will compute the performance of this queue and compare it to the other queueing disciplines we have computed so far.

Markov Diagram

As usual, the arrival rate of customers is 2 customers per minute. Each customer is served in an average of 20 seconds, which means that an average of 3 customers can be serviced per minute.

## Computing Performance of Symmetric Multi-Processing Scheduling

You might have noticed that when you do a transaction in a bank, you usually see a single line being served by 2 or more tellers. You have probably perceived that this way of lining up customers is better than each teller having its own line: it lessens the chance of being stuck for a long time when the person before you takes a long time to complete a transaction.

In computer systems, you can view the customers as tasks to be executed by 2 or more processors. Each processor picks, from a common line, a task to execute. In this article, we will analyze the performance of symmetric multiprocessing using continuous-time Markov chain analysis. We will also be using the tool available from www.extremecomputing.org to help us compute the probabilities.