Parallel Computing in the Small

As I watched my one-month-old baby sleeping, a thought ran through my mind: I wonder what technology will look like when she's my age. There is a lot going on in technology today. I remember doing Artificial Intelligence and Machine Learning about 15 years ago, when they were mostly confined to university settings. They are now the new normal. Some technologies, like Big Data and Blockchain, are quite recent, yet they have become mainstream very quickly. I wonder what it will all look like when my daughter grows up.

There is a technology I’m currently watching. It’s still in its infancy, but it is really fascinating. It is based on a remarkably counter-intuitive theory of the universe that governs atoms and subatomic particles. It is called Quantum Computing. Given the speed at which science fiction becomes reality, it’s possible that my baby girl will someday be programming on a quantum computer. But for now, we can only hope.

I have not seen a quantum computer, and based on what I have read so far, building one is still an engineering challenge. It will probably be a few more years before they are mass-produced, but it is something to look forward to. The good thing is, we can already talk about quantum algorithms! So let’s sharpen our pencils and get a few sheets of paper, for we are going to talk about this new way of computing.

In classical computing, a bit has a value that is either one or zero. In quantum computing, a bit can have a value that is a superposition of 1 and 0. We call this a QBit. The values 1 and 0 are now symbolized as \left| 1\right\rangle and \left| 0\right\rangle respectively. These are unit vectors in what is known as a Hilbert space. They are also known as kets, a name derived from the word bracket, which we’ll talk more about later. An example of a vector space you are already familiar with is the space of colors: each color on a screen is a combination of red, green and blue. For example, the color yellow is a combination of red and green. It’s the same with the QBit. A general QBit vector is of the form

\displaystyle \left| \Psi \right\rangle = a\left| 0\right\rangle + b\left| 1\right\rangle

such that \mid a\mid^2 + \mid b\mid^2 = 1. The numbers a and b are complex numbers and the symbol \mid a\mid^2 is defined as

\displaystyle \mid a\mid^2 = a^*a

where a^* is the complex conjugate of a. A QBit vector represents the state of the QBit. We shall use the terms vector and state interchangeably from here on.

However, you won’t be able to see a QBit in its superposition state. If you want to know the state of a QBit, you have to measure it. When you measure it, it collapses to either \left| 0\right\rangle or \left| 1\right\rangle, but you do have a clue as to what the outcome will be. The coefficients a and b give you the probabilities of the measurement result: you will find the QBit in state \left| 0\right\rangle with probability \mid a\mid^2 and in state \left| 1\right\rangle with probability \mid b\mid^2.
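
For example, take the equal-superposition state

\displaystyle \left| \Psi \right\rangle = \frac{1}{\sqrt{2}}\left| 0\right\rangle + \frac{1}{\sqrt{2}}\left| 1\right\rangle

Here a = b = 1/\sqrt{2}, so \mid a\mid^2 = \mid b\mid^2 = 1/2 and the condition \mid a\mid^2 + \mid b\mid^2 = 1 is satisfied. Measuring this QBit gives \left| 0\right\rangle half of the time and \left| 1\right\rangle the other half.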

If the state of a QBit is a superposition of the basis states \left| 0\right\rangle and \left| 1\right\rangle, does this mean our computation is also random? Does it mean we get different answers every time?

It turns out that we can get a definite answer! Let’s illustrate this by a simple quantum computation.

Suppose I have a number between 0 and 1 million. What is my number? You can guess my number and I will only tell you whether you got it right or not. I won’t tell you if my number is higher or lower than your guess. How many tries do you think you need in order to guess my number? I’m sure it will take you many tries. In the worst case, you’ll need about 1 million tries.

For a quantum computer, you only need one try to guess it, and that’s what we’re going to see. However, bear with me, since we have to build up our arsenal of tools first before we can even talk about it. Let’s start with what a QBit looks like as a matrix.

The basis QBits \left| 0\right\rangle and \left| 1\right\rangle can be written as 2×1 matrices (column vectors):

\displaystyle \left| 0 \right\rangle =  \begin{bmatrix}  1\\  0  \end{bmatrix}

\displaystyle \left| 1 \right\rangle =  \begin{bmatrix}  0\\  1  \end{bmatrix}

Using this representation, we can write a general QBit as:

\displaystyle \mid\Psi\rangle = a\mid 0\rangle + b\mid 1\rangle = a  \begin{bmatrix}  1\\  0  \end{bmatrix}  + b  \begin{bmatrix}  0\\  1  \end{bmatrix}  = \begin{bmatrix}  a\\  b  \end{bmatrix}

So now we know that every QBit is a linear combination of the ket vectors \left| 0\right\rangle and \left| 1\right\rangle. These ket vectors have corresponding “bra” vectors \langle 0 \mid and \langle 1\mid. In matrix representation, the bra basis vectors are defined as:

\displaystyle \langle 0\mid  = \begin{bmatrix}  1 & 0  \end{bmatrix}

and

\displaystyle \langle 1\mid =  \begin{bmatrix}  0 & 1  \end{bmatrix}

For a general QBit \mid\Psi\rangle = a\mid 0\rangle + b\mid 1\rangle, the corresponding bra vector is

\displaystyle  \langle \Psi \mid = a^*\langle 0\mid + b^*\langle 1\mid =  \begin{bmatrix}  a^* & b^*  \end{bmatrix}

There is an operation between bra and ket vectors called the inner product. On the basis states it is defined as:

\begin{array}{rl}  \displaystyle \langle 0\mid 0 \rangle &= 1\\  \displaystyle \langle 1\mid 1 \rangle &= 1\\  \displaystyle \langle 0\mid 1 \rangle &= \langle 1\mid 0\rangle = 0  \end{array}

Using the above rules, we can define the inner product of any two QBits \mid \Psi\rangle = a\mid 0\rangle + b\mid 1\rangle and \mid\Phi\rangle = c\mid 0\rangle + d\mid 1\rangle as

\begin{array}{rl}  \displaystyle \langle \Psi\mid\Phi\rangle &= \big(a^*\langle 0\mid + b^*\langle 1\mid\big)\big(c\mid 0\rangle + d\mid 1\rangle\big)\\  &= \displaystyle a^*c \langle 0\mid 0\rangle + a^*d\langle 0\mid 1\rangle + b^*c\langle 1\mid 0\rangle + b^*d\langle 1\mid 1\rangle\\  &= a^*c + b^*d  \end{array}
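
For example, taking \mid \Psi\rangle = \frac{1}{\sqrt{2}}\big(\mid 0\rangle + \mid 1\rangle\big) and \mid\Phi\rangle = \frac{1}{\sqrt{2}}\big(\mid 0\rangle - \mid 1\rangle\big), we get

\displaystyle \langle \Psi\mid\Phi\rangle = \frac{1}{\sqrt{2}}\cdot\frac{1}{\sqrt{2}} + \frac{1}{\sqrt{2}}\cdot\Big(-\frac{1}{\sqrt{2}}\Big) = \frac{1}{2} - \frac{1}{2} = 0

so these two superposition states are orthogonal, just like the basis states themselves. They are, in fact, the two states the Hadamard operator below produces from the basis states.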

Using the above definition of an inner product, the norm of a vector is defined as

\displaystyle \mid\left|\Psi\right\rangle\mid = \displaystyle \sqrt{\langle \Psi \mid \Psi \rangle} =\displaystyle \sqrt{a^*a + b^*b}

Geometrically, the norm is the length of the ket \mid \Psi\rangle. For a properly normalized QBit, \langle \Psi \mid \Psi \rangle = \mid a\mid^2 + \mid b\mid^2 = 1, so every valid QBit state has length 1.

Quantum computations are done using quantum operators. They are also called quantum gates. Operators are linear mappings that take a state to another state. That is, an operator \mathbf{A} acts on a\left| 0 \right\rangle + b\left| 1\right\rangle to give another vector c\left| 0 \right\rangle + d\left| 1\right\rangle. We write this as

\displaystyle \mathbf{A}(a\left| 0\right\rangle + b\left| 1\right\rangle) = c\left| 0\right\rangle + d\left| 1\right\rangle

Since the basis vectors \left| 0\right\rangle and \left| 1\right\rangle are themselves vectors, operators can also act on them. An example of a very important operator and its action on the basis vectors is the Hadamard operator:

\begin{array}{rl}  \displaystyle \mathbf{H} \left| 0 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle + \left| 1\right\rangle)\\  \displaystyle \mathbf{H} \left| 1 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle - \left| 1\right\rangle)  \end{array}

By knowing the action of an operator on the basis vectors, we can determine the matrix representation of the operator. For the Hadamard operator, the actions on the basis vectors in matrix form are:

\displaystyle \mathbf{H} \left| 0 \right\rangle = \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle + \left| 1\right\rangle) = \frac{1}{\sqrt{2}}  \begin{bmatrix}  1\\  1  \end{bmatrix}

and

\displaystyle \mathbf{H} \left| 1 \right\rangle = \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle - \left| 1\right\rangle) = \frac{1}{\sqrt{2}}  \begin{bmatrix}  1\\  -1  \end{bmatrix}

Therefore the matrix representation of the Hadamard operator is

\displaystyle \mathbf{H}=  \frac{1}{\sqrt{2}}  \begin{bmatrix}  1 & 1 \\  1 & -1  \end{bmatrix}
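
As a quick check, multiplying this matrix by the column vector for \left| 0 \right\rangle reproduces the action we started from:

\displaystyle \mathbf{H}\left| 0\right\rangle = \frac{1}{\sqrt{2}}  \begin{bmatrix}  1 & 1 \\  1 & -1  \end{bmatrix}  \begin{bmatrix}  1\\  0  \end{bmatrix} = \frac{1}{\sqrt{2}}  \begin{bmatrix}  1\\  1  \end{bmatrix} = \frac{1}{\sqrt{2}}(\left| 0 \right\rangle + \left| 1\right\rangle)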

The identity operator is defined as the operator that maps a vector into itself and is symbolized as \mathbf{I}:

\displaystyle \mathbf{I}\left|\Psi\right\rangle = \left|\Psi\right\rangle

The NOT operator is defined as

\begin{array}{rl}  \mathbf{X}\left|0\right\rangle &= \left|1\right\rangle\\  \mathbf{X}\left|1\right\rangle &= \left|0\right\rangle  \end{array}

What it does is transform \left|0\right\rangle to \left|1\right\rangle and vice versa, just like the classical NOT gate does to bits.
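
Reading off its action on the basis vectors, just as we did for the Hadamard operator, gives the matrix representation of the NOT operator:

\displaystyle \mathbf{X} =  \begin{bmatrix}  0 & 1\\  1 & 0  \end{bmatrix}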

Operators are linear. We can use linearity to derive the resulting vector when an operator acts on a general QBit:

\displaystyle \mathbf{A}(a\left| 0\right\rangle + b\left| 1\right\rangle) = a\mathbf{A} \left| 0\right\rangle + b\mathbf{A} \left| 1\right\rangle

If \mathbf{A} is the Hadamard operator, that is, \mathbf{A} = \mathbf{H}, then the above operation becomes:

\displaystyle \mathbf{H}(a\left| 0\right\rangle + b\left| 1\right\rangle) = a\mathbf{H} \left| 0\right\rangle + b\mathbf{H} \left| 1\right\rangle

Since we know that:

\begin{array}{rl}  \displaystyle \mathbf{H} \left| 0 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle + \left| 1\right\rangle)\\  \displaystyle \mathbf{H} \left| 1 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle - \left| 1\right\rangle)  \end{array}

We can then write this as:

\begin{array}{rl}  \displaystyle \mathbf{H}(a\left| 0\right\rangle + b\left| 1\right\rangle) &= \displaystyle a\mathbf{H} \left| 0\right\rangle + b\mathbf{H} \left| 1\right\rangle\\  &= \displaystyle a\frac{1}{\sqrt{2}}\big(\left| 0\right\rangle + \left| 1\right\rangle\big) + b\frac{1}{\sqrt{2}}\big(\left| 0\right\rangle - \left| 1\right\rangle\big)\\  &= \displaystyle \frac{a+b}{\sqrt{2}} \left| 0\right\rangle + \frac{a-b}{\sqrt{2}} \left| 1\right\rangle  \end{array}
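
A nice consequence we can check directly from this result is that applying the Hadamard operator twice brings the QBit back to where it started:

\displaystyle \mathbf{H}\Big[\mathbf{H}(a\left| 0\right\rangle + b\left| 1\right\rangle)\Big] = \frac{a+b}{\sqrt{2}}\mathbf{H}\left| 0\right\rangle + \frac{a-b}{\sqrt{2}}\mathbf{H}\left| 1\right\rangle = \frac{(a+b)+(a-b)}{2}\left| 0\right\rangle + \frac{(a+b)-(a-b)}{2}\left| 1\right\rangle = a\left| 0\right\rangle + b\left| 1\right\rangle

In other words, \mathbf{H}^2 = \mathbf{I}.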

Let’s see what operators look like when they act on bra vectors. Let’s start with the Hadamard operator acting on the basis vectors above.

\begin{array}{rl}  \displaystyle \mathbf{H} \mid 0 \rangle &= \displaystyle \frac{1}{\sqrt{2}}(\mid 0 \rangle + \mid 1\rangle) \rightarrow \langle 0\mid \mathbf{H^\dagger} = \displaystyle \frac{1}{\sqrt{2}}(\langle 0 \mid + \langle 1\mid)\\  \displaystyle \mathbf{H} \mid 1 \rangle &= \displaystyle \frac{1}{\sqrt{2}}(\mid 0 \rangle - \mid 1\rangle) \rightarrow \langle 1\mid \mathbf{H^\dagger} = \displaystyle \frac{1}{\sqrt{2}}(\langle 0 \mid - \langle 1\mid)  \end{array}

where \mathbf{H^\dagger} denotes the operator corresponding to the Hadamard operator acting on bras.

The matrix representation of the Hadamard operator acting on bras is therefore:

\mathbf{H^\dagger} = \frac{1}{\sqrt{2}}  \begin{bmatrix}  1 & 1\\  1 & -1  \end{bmatrix} = \mathbf{H}

In general, the matrix representation of \mathbf{H^\dagger} is the conjugate transpose of the matrix of \mathbf{H}. In simple terms, this means take the transpose of \mathbf{H} and apply the complex conjugates to each element. It’s amazing that the Hadamard operator is equal to its conjugate transpose. The types of operators that behave that way are called Hermitian operators.

Let \mathbf{A} be an operator and \left| \Psi\right\rangle a vector, and let \left|\Phi\right\rangle = \mathbf{A}\left|\Psi\right\rangle. Since a quantum operator must take valid states to valid states, it has to preserve the (squared) norm:

\displaystyle \langle\Phi\mid\Phi\rangle = \langle \Psi\mid \mathbf{A^\dagger}\mathbf{A}\mid \Psi\rangle = \langle\Psi\mid\Psi\rangle

This means that

\displaystyle \mathbf{A^\dagger}\mathbf{A} = \mathbf{I}

The kind of operator that satisfies the above relationship is called Unitary.
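
We can verify directly that the Hadamard matrix satisfies this condition:

\displaystyle \mathbf{H^\dagger}\mathbf{H} = \frac{1}{2}  \begin{bmatrix}  1 & 1\\  1 & -1  \end{bmatrix}  \begin{bmatrix}  1 & 1\\  1 & -1  \end{bmatrix} = \frac{1}{2}  \begin{bmatrix}  2 & 0\\  0 & 2  \end{bmatrix} = \mathbf{I}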

Now that we have gained some of the tools, we can describe how a quantum computation is set up. We need QBits for the input and QBits for the output. The simplest setup is one input QBit and one output QBit. Let \left| x\right\rangle and \left| y\right\rangle be the input and output QBits, respectively. The general state of this quantum system is a superposition of the following states:

\left| 0 \right\rangle \otimes \left| 0 \right\rangle, \left| 0 \right\rangle \otimes \left| 1 \right\rangle, \left| 1 \right\rangle \otimes \left| 0 \right\rangle, \left| 1 \right\rangle \otimes \left| 1 \right\rangle

which could also be written as

\left|00\right\rangle, \left|01\right\rangle, \left|10\right\rangle, \left|11\right\rangle,

or, if we take them as binary representation of numbers, we can write them concisely as

\left|0\right\rangle, \left|1\right\rangle,\left|2\right\rangle,\left|3\right\rangle,

Using the above basis vectors, we can write a general 2-QBit system as:

\displaystyle \left|\Psi\right\rangle = c_0\left|0\right\rangle + c_1\left|1\right\rangle + c_2\left|2\right\rangle + c_3\left|3\right\rangle = \sum_{i=0}^{2^n-1} c_i\left|i\right\rangle

where the numbers c_i are complex and n=2. The summation is up to 2^n-1 since the maximum number n QBits can represent is 2^n-1, and the coefficients satisfy \sum_i \mid c_i\mid^2 = 1, just as in the single-QBit case. The summation on the right also gives us the general form of a state for an n-QBit system.

As we saw earlier, quantum operators (or gates) are what we use to compute on given QBit inputs. Let \mathbf{U}_f represent a unitary operator implementing the function f. The function f takes numbers from the domain \{0,1\} to \{0,1\}, that is,

\displaystyle f: \{0,1\}\rightarrow \{0,1\}

For a two-QBit system, the form of quantum computation is

\displaystyle \mathbf{U}_f\left|x\right\rangle \left|y\right\rangle = \left|x\right\rangle \left|y\oplus f(x)\right\rangle

where the symbol \oplus represents the XOR (exclusive OR) operation, that is, addition modulo 2. The table below gives the result of the XOR operation for every combination of values in \{0,1\}:

\begin{array}{cc}  0 \oplus 0 &= 0\\  0 \oplus 1 &= 1\\  1 \oplus 0 &= 1\\  1 \oplus 1 &= 0  \end{array}

A simple but important example of a unitary operator acting on two QBits is the \mathbf{C_{\text{NOT}}} gate, defined as:

\displaystyle \mathbf{C_{\text{NOT}}}\left|x\right\rangle\left|y\right\rangle = \left|x\right\rangle\left|y \oplus x\right\rangle

Here is the complete action of the \mathbf{C_{\text{NOT}}}:

\begin{tabular}{c|c|c}  \text{Control } x & \text{Target } y & \text{Result } y \oplus x\\  0 & 0 & 0\\  0 & 1 & 1\\  1 & 0 & 1\\  1 & 1 & 0  \end{tabular}

If you study the table above, you’ll notice that if the first QBit is in the state \left|0\right\rangle, the second QBit stays the same. However, if the first QBit is in \left|1\right\rangle, the second QBit is flipped. The first QBit controls the flipping of the second QBit and is therefore called the control QBit, while the second is the target. That’s why the gate is called \mathbf{C_{\text{NOT}}}, which stands for Controlled-NOT.
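
Just as with the single-QBit operators, we can read off a matrix representation from the table. In the basis \left|00\right\rangle, \left|01\right\rangle, \left|10\right\rangle, \left|11\right\rangle, the \mathbf{C_{\text{NOT}}} gate is:

\displaystyle \mathbf{C_{\text{NOT}}} =  \begin{bmatrix}  1 & 0 & 0 & 0\\  0 & 1 & 0 & 0\\  0 & 0 & 0 & 1\\  0 & 0 & 1 & 0  \end{bmatrix}

The last two columns are swapped relative to the identity, reflecting the fact that \left|10\right\rangle and \left|11\right\rangle are exchanged.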

Let’s now go back to our original problem. There is a positive number a that is less than 2^n. We want to find this number using quantum computation. We are given a black box operator \mathbf{U}_f whose action on the input and output QBits is

\displaystyle \mathbf{U}_f \left|x\right\rangle\left|y\right\rangle = \left|x\right\rangle\left| y \oplus f(x)\right\rangle

where

\displaystyle f(x) = x_0 a_0 \oplus x_1 a_1 \oplus \cdots \oplus x_{n-1}a_{n-1} = x \cdot a

where a is the unknown number we are trying to guess, and x_i and a_i are the i-th bits of x and a, respectively.

So how many times do we have to apply \mathbf{U}_f in order to guess the number a?

To solve this, we need n input QBits and 1 output QBit. We will initialize the input QBits to \left|0\right\rangle and the output QBit to \left|1\right\rangle. So the initial configuration is

\displaystyle \underbrace{\left|000...0\right\rangle}_{\text{n number of zeroes}}\otimes\left|1\right\rangle

Notice how we separate the input and output qubits.

Next we apply the Hadamard operator to each QBit:

\displaystyle \mathbf{H}^{\otimes n}\otimes \mathbf{H} \left|000...0\right\rangle\left|1\right\rangle = \mathbf{H}^{\otimes n} \left|000...0\right\rangle\otimes \mathbf{H}\left|1\right\rangle

We have seen the Hadamard operator earlier. For the sake of convenience we’ll repeat it here:

\begin{array}{rl}  \displaystyle \mathbf{H} \left| 0 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle + \left| 1 \right\rangle)\\  \displaystyle \mathbf{H} \left| 1 \right\rangle &= \displaystyle \frac{1}{\sqrt{2}}(\left| 0 \right\rangle - \left| 1\right\rangle)  \end{array}

We can generalize this by letting x be either 0 or 1:

\displaystyle \mathbf{H}\left| x\right\rangle = \frac{1}{\sqrt{2}}\big(\left|0\right\rangle + (-1)^x \left|1\right\rangle\big)

We can use this to compute the action on two QBits and generalize:

\begin{array}{rl}  \displaystyle \mathbf{H}^{\otimes 2}\left| x\right\rangle &= \mathbf{H}^{\otimes 2}\left| x_1x_0\right\rangle = \mathbf{H}\left| x_1\right\rangle \otimes \mathbf{H}\left| x_0\right\rangle\\  &= \displaystyle \frac{1}{\sqrt{2}}\big(\left|0\right\rangle + (-1)^{x_1} \left|1\right\rangle\big) \otimes \frac{1}{\sqrt{2}}\big(\left|0\right\rangle + (-1)^{x_0} \left|1\right\rangle\big)\\  &= \displaystyle \big(\frac{1}{\sqrt{2}}\big)^2 \big( \left|00\right\rangle + (-1)^{x_0} \left|01\right\rangle + (-1)^{x_1}\left|10\right\rangle + (-1)^{x_1 + x_0}\left|11\right\rangle\big)\\  &= \displaystyle \big(\frac{1}{\sqrt{2}}\big)^2 \big( (-1)^{0\cdot x_1 + 0\cdot x_0} \left|00\right\rangle + (-1)^{0\cdot x_1 + 1\cdot x_0} \left|01\right\rangle + (-1)^{1\cdot x_1 + 0\cdot x_0}\left|10\right\rangle + (-1)^{1\cdot x_1 + 1\cdot x_0}\left|11\right\rangle\big)\\  &= \displaystyle \big(\frac{1}{\sqrt{2}}\big)^2 \big( (-1)^{0\cdot x_1 + 0\cdot x_0} \left|0\right\rangle + (-1)^{0\cdot x_1 + 1\cdot x_0} \left|1\right\rangle + (-1)^{1\cdot x_1 + 0\cdot x_0}\left|2\right\rangle + (-1)^{1\cdot x_1 + 1\cdot x_0}\left|3\right\rangle\big)\\  &= \displaystyle \big(\frac{1}{\sqrt{2}}\big)^2 \sum_{y=0}^{2^2-1} (-1)^{x\cdot y}\left|y\right\rangle  \end{array}

In general, writing x \cdot y = x_0y_0 \oplus x_1y_1 \oplus \cdots \oplus x_{n-1}y_{n-1} for the bitwise dot product of x and y,

\displaystyle \mathbf{H}^{\otimes n}\left| x\right\rangle = \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{x\cdot y}\left|y\right\rangle

Using the above formula, the action of \mathbf{H}^{\otimes n} on \left|0\right\rangle is

\displaystyle \mathbf{H}^{\otimes n} \left|0\right\rangle = \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \left|y\right\rangle
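
For n = 2, for instance, this is the equal superposition of all four basis numbers:

\displaystyle \mathbf{H}^{\otimes 2} \left|00\right\rangle = \frac{1}{2}\big(\left|0\right\rangle + \left|1\right\rangle + \left|2\right\rangle + \left|3\right\rangle\big)

so the input register now holds every candidate number at once, each with amplitude 1/2.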

The next step is to apply our operator \mathbf{U}_{f} defined above. Note that \frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big) is just the state of the output QBit after the Hadamard step, and \mathbf{U}_f acts on the input and output QBits together:

\begin{array}{rl}  \displaystyle \mathbf{U}_f\mathbf{H}^{\otimes n}\left| 0\right\rangle \otimes \frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big) &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \mathbf{U}_f\left|y\right\rangle \otimes \frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big)\\  &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \frac{1}{\sqrt{2}} \big( \mathbf{U}_f\left|y\right\rangle \otimes  \left|0\right\rangle - \mathbf{U}_f\left|y\right\rangle \otimes \left|1\right\rangle\big)\\  &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \frac{1}{\sqrt{2}} \big( \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle - \left|y\right\rangle \otimes \left|1\oplus f(y)\right\rangle\big)\\  \end{array}

Let’s break this down a bit. Since the value of f(y) is either 0 or 1, we have 2 cases:

When f(y) = 0,

\displaystyle \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle = \left|y\right\rangle \otimes  \left|0\right\rangle
\displaystyle \left|y\right\rangle \otimes  \left|1\oplus f(y)\right\rangle = \left|y\right\rangle \otimes  \left|1\right\rangle

which means

\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \frac{1}{\sqrt{2}} \big( \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle - \left|y\right\rangle \otimes \left|1\oplus f(y)\right\rangle\big) = \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \left|y\right\rangle \otimes \frac{1}{\sqrt{2}} \big(  \left|0\right\rangle - \left|1\right\rangle\big)

When f(y) = 1,

\displaystyle \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle = \left|y\right\rangle \otimes  \left|1\right\rangle
\displaystyle \left|y\right\rangle \otimes  \left|1\oplus f(y)\right\rangle = \left|y\right\rangle \otimes  \left|0\right\rangle

which means

\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \frac{1}{\sqrt{2}} \big( \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle - \left|y\right\rangle \otimes \left|1\oplus f(y)\right\rangle\big) = \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1) \left|y\right\rangle \otimes \frac{1}{\sqrt{2}} \big(  \left|0\right\rangle - \left|1\right\rangle\big)

Combining these 2 we have

\begin{array}{rl}  \displaystyle \mathbf{U}_f\mathbf{H}^{\otimes n}\left| 0\right\rangle \otimes \frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big)  &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} \frac{1}{\sqrt{2}} \big( \left|y\right\rangle \otimes  \left|0\oplus f(y)\right\rangle - \left|y\right\rangle \otimes \left|1\oplus f(y)\right\rangle\big)\\  &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \otimes \frac{1}{\sqrt{2}} \big(  \left|0\right\rangle - \left|1\right\rangle\big)  \end{array}

Finally, applying the Hadamard operators one more time gives:

\begin{array}{rl}  \displaystyle \mathbf{H}^{\otimes n}\otimes\mathbf{H}\Big[\mathbf{U}_f\mathbf{H}^{\otimes n}\left| 0\right\rangle \otimes \frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big)\Big] &=\mathbf{H}^{\otimes n} \mathbf{U}_f\mathbf{H}^{\otimes n}\left| 0\right\rangle \otimes \mathbf{H}\Big[\frac{1}{\sqrt{2}} \big( \left|0\right\rangle - \left|1\right\rangle\big)\Big]\\  &= \underbrace{\mathbf{H}^{\otimes n} \Big[\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \Big]}_{A} \otimes \underbrace{\mathbf{H}\Big[\frac{1}{\sqrt{2}} \big(  \left|0\right\rangle - \left|1\right\rangle\big)\Big]}_{B}  \end{array}

Expanding underbrace B gives us:

\begin{array}{rl}  \displaystyle \mathbf{H}\Big[\frac{1}{\sqrt{2}} \big(  \left|0\right\rangle - \left|1\right\rangle\big)\Big] &= \displaystyle \frac{1}{\sqrt{2}}\big(\mathbf{H}\left|0\right\rangle - \mathbf{H}\left|1\right\rangle\big)\\  &= \displaystyle \frac{1}{\sqrt{2}}\Big[\frac{1}{\sqrt{2}} \big( \left|0\right\rangle + \left|1\right\rangle \big) - \frac{1}{\sqrt{2}}\big( \left|0\right\rangle - \left|1\right\rangle\big) \Big]\\  &= \displaystyle \frac{1}{2}\Big[ 2\left|1\right\rangle \Big]\\  &= \left|1\right\rangle  \end{array}

Expanding underbrace A above:

\begin{array}{rl}  \mathbf{H}^{\otimes n} \Big[\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \Big] &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)}\mathbf{H}^{\otimes n} \left|y\right\rangle  \end{array}

We have found earlier that

\displaystyle \mathbf{H}^{\otimes n}\left| y\right\rangle = \frac{1}{2^{n/2}}\sum_{x=0}^{2^n-1} (-1)^{x\cdot y}\left|x\right\rangle

Substituting this into the above and noting that

\displaystyle f(y) = y\cdot a = \sum_{i=0}^{n-1} y_ia_i \mod 2,

we get:

\begin{array}{rl}  \mathbf{H}^{\otimes n} \Big[\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \Big] &= \displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)}\mathbf{H}^{\otimes n} \left|y\right\rangle \\  &= \displaystyle \frac{1}{2^{n}}\sum_{y=0}^{2^n-1} \sum_{x=0}^{2^n-1} (-1)^{f(y)}(-1)^{y\cdot x}\left|x\right\rangle\\  &= \displaystyle \frac{1}{2^{n}}\sum_{y=0}^{2^n-1} \sum_{x=0}^{2^n-1} (-1)^{y\cdot a}(-1)^{y\cdot x}\left|x\right\rangle\\  &= \displaystyle \frac{1}{2^{n}}\sum_{x=0}^{2^n-1} \underbrace{\sum_{y=0}^{2^n-1} (-1)^{(a+x)\cdot y}}\left|x\right\rangle\\  \end{array}

where we have interchanged the order of the summation. To simplify the term with the underbrace, notice that

\begin{array}{rl}  \displaystyle \Big[1+ (-1)^{z_0}\Big]\Big[1+(-1)^{z_1}\Big] &= 1 + (-1)^{z_0} + (-1)^{z_1} + (-1)^{z_0+z_1}\\  &= \displaystyle (-1)^{0\cdot z_0 + 0\cdot z_1} + (-1)^{1\cdot z_0 + 0\cdot z_1} + (-1)^{0\cdot z_0 + 1\cdot z_1} + (-1)^{1\cdot z_0+ 1\cdot z_1}\\  &= \displaystyle(-1)^{0\cdot z} + (-1)^{1\cdot z} + (-1)^{2\cdot z} + (-1)^{3\cdot z}\\  &= \displaystyle \sum_{y=0}^{2^2-1} (-1)^{y\cdot z}  \end{array}

We can therefore write the factor in the underbrace as

\begin{array}{rl}  \mathbf{H}^{\otimes n} \Big[\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \Big] &= \displaystyle \frac{1}{2^{n}}\sum_{x=0}^{2^n-1} \underbrace{\sum_{y=0}^{2^n-1} (-1)^{(a+x)\cdot y}}\left|x\right\rangle\\  &= \displaystyle \frac{1}{2^{n}}\sum_{x=0}^{2^n-1} \prod_{i=0}^{n-1} \Big[ \underbrace{1 +(-1)^{(a_i+x_i)}}\Big] \left|x\right\rangle\\  \end{array}

Notice that

\displaystyle 1 +(-1)^{(a_i+x_i)} = \begin{cases}  0 & x_i\ne a_i\\  2 & x_i = a_i  \end{cases}

that is, the product \prod_{i=0}^{n-1} \Big[1 + (-1)^{(a_i+x_i)}\Big] = 0 unless x=a, in which case it equals 2^n. Therefore

\begin{array}{rl}  \mathbf{H}^{\otimes n} \Big[\displaystyle \frac{1}{2^{n/2}}\sum_{y=0}^{2^n-1} (-1)^{f(y)} \left|y\right\rangle \Big]  &= \displaystyle \frac{1}{2^{n}}\sum_{x=0}^{2^n-1} \prod_{i=0}^{n-1} \Big[ \underbrace{1 +(-1)^{(a_i+x_i)}}\Big] \left|x\right\rangle\\  &= \displaystyle \frac{1}{2^{n}} 2^n \left|a\right\rangle\\  &= \left|a\right\rangle  \end{array}

Therefore, the quantum computation gives the answer in one application of \mathbf{U}_f:

\displaystyle \mathbf{H}^{\otimes n+1}\mathbf{U}_f \mathbf{H}^{\otimes n+1} \left|00\ldots 0\right\rangle\left|1\right\rangle = \left| a\right\rangle \left| 1\right\rangle
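
To make this concrete, take n = 2 and suppose the hidden number is a = 2 (binary 10), so that f(x) = x_1. The result above says that after a single call to \mathbf{U}_f the input register ends up in the state

\displaystyle \left|a\right\rangle = \left|2\right\rangle = \left|10\right\rangle

and measuring the two input QBits returns the bits 1 and 0, revealing a = 2 with certainty. Classically, pinning down both bits of a would require two separate evaluations of f. This procedure is known as the Bernstein-Vazirani algorithm.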

So what just happened? We began with a superposition of states and finished with the correct answer in one application of the black box operator \mathbf{U}_f. That’s the power of Quantum Computing!

Prototyping Parallel Programs on OpenShift

In the previous posts, we were introduced to parallel computing and saw how it can speed up computation and by what factor (depending on the algorithm). We also introduced Chapel, a language for parallel computation, developed by Cray. Using Chapel, we first developed a very simple parallel algorithm to compute the sum of a list of numbers. Then we learned how to implement parallel matrix multiplication. In this post, we will talk about how to run our parallel programs using a cluster of machines. The source code referenced here can be found on github.

To run our parallel program, we will need to set up a bare-metal server or a Virtual Machine (VM), install Chapel and clone it to N instances. A shared filesystem will be mounted on a pre-defined directory on each instance. This filesystem will allow all instances to load data from the same location. Any output from each instance will also be written to this filesystem (although we must ensure they write to different files or directories to avoid corruption of data).

Here is what our setup will look like:

We will also need to set up the following:

  • SSH server – this will run as an ordinary user using port 2222. The Chapel platform will connect to port 2222 on each machine.
  • Configure passwordless SSH login.

We will need to export the following environment variables to enable parallel computation across different machines. The purpose of those variables can be found here. The variable GASNET_SSH_SERVERS is a space-separated list of IP addresses or hostnames of our nodes. In our setup, we will use the following:

export CHPL_COMM=gasnet
export CHPL_LAUNCHER=amudprun
export GASNET_SSH_SERVERS="node00 node01 node10 node11"

where node00, node01, node10, node11 are hostnames of our machines/nodes.

How to Run a Parallel Program

We will run our parallel matrix multiplication program on our cluster. To launch it, we log in to one of the nodes, change directory to /home/chapel, which contains the mat_mul binary, and issue the following command:

./mat_mul -nl 4

Running in OpenShift

If having to provision a bare-metal or VM environment for parallel computing is costly, we can turn to containers for a much cheaper and faster way to provision. To make our lives easier, we will use the OpenShift Origin platform to run our parallel computing environment.

In the succeeding sections, we assume that we have a working installation of OpenShift. See the instructions here on how to set up OpenShift on VirtualBox or here to install a single-node OpenShift in AWS.

Containerizing the “Node”

We will have to build a container image of the bare metal or VM we mentioned above. The format we will use is Docker. First, we clone the project from github:

git clone https://github.com/corpbob/chapel-openshift-demo.git

Change directory to chapel-openshift-demo and build the image:

docker build -t chapel .

Create the chapel project on OpenShift. We use the admin user in this example:

oc login -u admin
oc new-project chapel

We need to know the IP address of the docker-registry service in OpenShift. We execute the following command:

[root@openshift chapel-openshift-demo]# oc describe svc docker-registry -n default
Name:			docker-registry
Namespace:		default
Labels:			docker-registry=default
Selector:		docker-registry=default
Type:			ClusterIP
IP:			172.30.1.1
Port:			5000-tcp	5000/TCP
Endpoints:		172.17.0.7:5000
Session Affinity:	ClientIP
No events.

The IP: line of the output gives us the IP address of the docker-registry service (172.30.1.1). We will use this to push the Chapel image we just created to the registry. First we log in to the docker registry:

docker login -u admin -p $(oc whoami -t) 172.30.1.1:5000

Tag the chapel image using the following format

<registry_ip>:<port>/<project_name>/<image_name>:<version>

For our example, we use the following command to tag the image:

docker tag chapel 172.30.1.1:5000/chapel/chapel:v0.1

We can now push the image to the internal docker registry:

docker push 172.30.1.1:5000/chapel/chapel

Importing the Chapel Template

In the previous section, we built the container image of Chapel and pushed it to the private docker registry. In this section, we will import a template that will do the following:

  • Set up the Chapel containers inside OpenShift
  • Create a file that will dynamically generate the variable GASNET_SSH_SERVERS containing the IP addresses of the Chapel pods that will be used in the parallel computation.

The name of the template is chapel.yml.

Import the template using the command

oc create -f chapel.yml

We need to give the default service account the view role so that it can read the IP addresses of the pods associated with chapel. To do this, execute the command:

oc policy add-role-to-user view system:serviceaccount:chapel:default 

After this we can now create the chapel application:

oc new-app chapel

This will automatically trigger a deployment of 4 chapel instances with a shared volume mounted at /home/chapel/mnt.

The “hook-post” pod is a separate instance of the chapel image that will execute the following commands

echo "export GASNET_MASTERIP=\$MY_NODE_IP" > /home/chapel/mnt/exports && \
echo "export GASNET_SSH_OPTIONS=\"-p 2222\"" >> /home/chapel/mnt/exports && \
for pod in `oc get pods -l app=chapel|grep chapel|awk '{print $1}'`; \
do \
  oc describe pod $pod |grep ^IP:|awk '{print $2}'; \
done| \
awk 'BEGIN { x="" } \
      {x = x$1" "} \
     END {print "export GASNET_SSH_SERVERS=\""x"\""}' >> \
/home/chapel/mnt/exports

The output of the above commands is a file named exports that sets GASNET_MASTERIP, GASNET_SSH_OPTIONS and GASNET_SSH_SERVERS for all the pods.
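
Here is a sketch of what that file might contain. The master IP below is made up for illustration; the pod IPs are the ones that appear in the run output further down:

# sketch of /home/chapel/mnt/exports (IP addresses are illustrative)
export GASNET_MASTERIP=10.0.2.15
export GASNET_SSH_OPTIONS="-p 2222"
export GASNET_SSH_SERVERS="172.17.0.15 172.17.0.16 172.17.0.17 172.17.0.18"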

Running the Sample Program

We now go to the web console -> Applications -> Pods. Select any of the pods and click Terminal. In the /home/chapel directory, there is a file named run-test.sh. This file contains the following commands:

export GASNET_SPAWNFN=S                                                                                                                          
source /home/chapel/mnt/exports                                                                                                                  
                                                                                                                                                 
./hello6-taskpar-dist -nl $* 

The commands above execute the pre-compiled Chapel binary hello6-taskpar-dist, which was compiled when we built the container image earlier. Executing this file gives us:

# the parameter 4 tells chapel to use 4 pods to execute the command.
sh-4.2$ ./run-test.sh 4                                                                                                                               
Warning: Permanently added '[172.17.0.16]:2222' (ECDSA) to the list of known hosts.                                                              
Warning: Permanently added '[172.17.0.18]:2222' (ECDSA) to the list of known hosts.                                                              
Warning: Permanently added '[172.17.0.15]:2222' (ECDSA) to the list of known hosts.                                                              
Warning: Permanently added '[172.17.0.17]:2222' (ECDSA) to the list of known hosts.                                                              
Hello, world! (from locale 0 of 4 named chapel-1-fgg4b)                                                                                          
Hello, world! (from locale 2 of 4 named chapel-1-pt3v9)                                                                                          
Hello, world! (from locale 1 of 4 named chapel-1-rg668)                                                                                          
Hello, world! (from locale 3 of 4 named chapel-1-rw2rc)                                                                                          

Running the Parallel Matrix Multiplication

Copy the file mat_mul.chpl to the pod and compile it:

chpl mat_mul.chpl -o mnt/mat_mul

The command above will place the resulting binary inside the directory /home/chapel/mnt. This will be accessible from all pods.

Finally, execute the parallel matrix multiplication:

./run.sh mnt/mat_mul 4

Conclusion

I have not tried this in a real production environment or even on a bare-metal installation of OpenShift. This seems to be a promising use of OpenShift, but I have yet to put it to the test.

Implementing Parallel Algorithms Part 2

In the previous post, we implemented a very simple parallel program to add a set of numbers. In this post, we will implement parallel matrix multiplication.

We have shown a parallel algorithm to multiply 2 big matrices using message passing. The algorithm involves passing block sub-matrices from node to node and multiplying them within a node until the answer is found.

There will be 2 square matrices A and B. In our example, the dimension of both A and B is 4×4. We will distribute the matrices evenly to 4 nodes.

A = \left[  \begin{array}{cc|cc}  a_{00} & a_{01} & a_{02} & a_{03}\\   a_{10} & a_{11} & a_{12} & a_{13}\\ \hline  a_{20} & a_{21} & a_{22} & a_{23}\\  a_{30} & a_{31} & a_{32} & a_{33}  \end{array}  \right],    B = \left[  \begin{array}{cc|cc}  b_{00} & b_{01} & b_{02} & b_{03}\\   b_{10} & b_{11} & b_{12} & b_{13}\\ \hline  b_{20} & b_{21} & b_{22} & b_{23}\\  b_{30} & b_{31} & b_{32} & b_{33}  \end{array}  \right]

In this example, node 00 will have the following matrices:

A_{00}=\begin{bmatrix}  a_{00} & a_{01}\\  a_{10} & a_{11}  \end{bmatrix},  B_{00}=  \begin{bmatrix}  b_{00} & b_{01}\\  b_{10} & b_{11}  \end{bmatrix}

Let’s simulate each node loading entries of sub-matrices assigned to it.

const n = 4;
var vec = 1..n;
var blockSize = 2;

var A: [vec, vec] real;
var B: [vec, vec] real;
var C: [vec, vec] real;

coforall loc in Locales {
  on loc {
    var i = loc.id/2;
    var j = loc.id%2;
    var istart = i*blockSize;
    var iend = istart + blockSize;
    var jstart = j*blockSize;
    var jend = jstart + blockSize;

    for (r,s) in {istart + 1..iend, jstart + 1..jend} {
      B(r,s) = r+s;
      A(r,s) = 2*r + s;
    }
  }
}

Global Address Space

Each node has limited memory physically exclusive to itself. In order for node A to have access to the contents of the memory of another node B, node B should pass the data to node A. Fortunately, Chapel can use a library called GASNet that allows each node to have a global view of the memory of all nodes participating in the computation.

In the code above, each node loads its own data. However, the GASNet library allows each node to access the matrix elements loaded by the other nodes. Consequently, we are able to reference the sub-matrix held by each node without doing fancy message passing. The algorithm is then a straightforward implementation of

\displaystyle \mathbf{C}_{ij}=\sum_{k=0}^{1} \mathbf{A}_{ik} \mathbf{B}_{kj}

where \mathbf{A_{ij}}, \mathbf{B_{ij}} and \mathbf{C_{ij}} are submatrices of \mathbf{A}, \mathbf{B}, and \mathbf{C}, respectively.
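
For our 2×2 block decomposition, for example, the block held by node 00 is computed as

\displaystyle \mathbf{C}_{00} = \mathbf{A}_{00}\mathbf{B}_{00} + \mathbf{A}_{01}\mathbf{B}_{10}

which is exactly what the loop over k accumulates in the code below.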

Below is the straightforward implementation of parallel block multiplication:

coforall loc in Locales {
  on loc {
    var i = loc.id/2;
    var j= loc.id%2;
    var istart = i*blockSize;
    var iend = istart + blockSize;
    var jstart = j*blockSize;
    var jend = jstart + blockSize;
    var r = { istart + 1..iend, jstart + 1..jend };
    ref W = C[r].reindex( { 1..2,1..2 });
 
    for k in 0..1 {  // accumulate both block products into W; serial to avoid a data race on W
      var U=get_block_matrix(A[vec,vec],i,k,blockSize);
      var V=get_block_matrix(B[vec,vec],k,j,blockSize);
      var P = mat_mul(U,V);
      coforall (s,t) in { 1..2,1..2 } {       
        W(s,t) += P(s,t);
      }
    }
  }
}

The procedure get_block_matrix will return the sub-matrix given the (i,j)th index and the block size.

proc get_block_matrix(A: [?D], i:int, j:int, blockSize:int) {
  var r = { i*blockSize + 1 .. i*blockSize + blockSize,
            j*blockSize + 1 .. j*blockSize + blockSize };
  return A[r];
}

The procedure mat_mul will return the matrix product of two sub-matrices:

proc mat_mul(A: [?D1], B: [?D2]) {
  var D3 = { 1..2, 1..2 };
  var C: [D3] real;
  var AA = A.reindex({1..2,1..2});
  var BB = B.reindex({1..2,1..2});

  for row in 1..2 {
    for col in 1..2 {
      var sum:real = 0;
      for k in 1..2 {
         sum += AA(row,k) * BB(k,col);
      }
      C(row,col) = sum;
    }
  }
  return C;
}

writeln(C[vec,vec]);

To run this code, we need to set the following environment variables:

source $CHPL_HOME/util/setchplenv.bash 

export CHPL_COMM=gasnet
export CHPL_LAUNCHER=amudprun

export GASNET_SSH_SERVERS="127.0.0.1 127.0.0.1 127.0.0.1 127.0.0.1"

Compiling and running this program gives the output:

#Compile
chpl mat_mul.chpl -o mat_mul

# Run using 4 nodes  
./mat_mul -nl 4

A=
3.0 4.0 5.0 6.0
5.0 6.0 7.0 8.0
7.0 8.0 9.0 10.0
9.0 10.0 11.0 12.0
B=
2.0 3.0 4.0 5.0
3.0 4.0 5.0 6.0
4.0 5.0 6.0 7.0
5.0 6.0 7.0 8.0
C=
68.0 86.0 104.0 122.0
96.0 122.0 148.0 174.0
124.0 158.0 192.0 226.0
152.0 194.0 236.0 278.0

Implementing Parallel Algorithms Part 1

Now we know that parallel algorithms allow us to make our programs run faster. So how do we implement them?

I have used mpich before, but that was more than a decade ago. Recently, I found myself looking for new ways of doing parallel programming. I discovered a very nice parallel programming language called Chapel. This is what we’ll use to implement parallel algorithms.

Algorithm 1: Parallel sum of consecutive numbers from 1 to N

Getting the sum of the numbers from 1 to N, where N is some integer, is easy. There is a formula for that:

\displaystyle \sum_{i=1}^N i = \frac{N(N+1)}{2}

However, for the sake of illustration, we are going to compute the sum of 1 to N using a cluster of machines. Here is the Chapel code to accomplish it, inside the file add_parallel.chpl.

config var N:int = 1000;
var D:domain(1) = {0..numLocales -1};
var s:[D] int;
var sum:int= 0;
var bs = N/numLocales;
coforall loc in Locales {
  on loc {
      var i = loc.id;
      var start = i*bs + 1;
      var end = start + bs -1;
      var _sum:int = 0;
      for j in start .. end {
        _sum += j;
      }
    writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
    s[i] = _sum;
  }
}

sum = + reduce s;
writeln("sum: " + sum);

This program is compiled using the command:

chpl add_parallel.chpl -o add_parallel

where add_parallel.chpl is the filename of the program and -o add_parallel specifies the filename of the binary produced after compilation.

On line 1, we have defined the default value of N to be 1000. This can be overridden on the command line by specifying the --N parameter. The number of machines we are going to use is also specified on the command line using the -nl parameter. A sample invocation of this program is the following:

./add_parallel -nl 3 --N=120

The above command means that we want to run the add_parallel binary using 3 machines with the value of N=120.

Executing the above command will give us:

./add_parallel -nl 3 --N=120
i= 0, start= 1, end=40, sum = 820
i= 1, start= 41, end=80, sum = 2420
i= 2, start= 81, end=120, sum = 4020
sum: 7260
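
As a sanity check, the closed-form formula gives the same answer:

\displaystyle \sum_{i=1}^{120} i = \frac{120 \cdot 121}{2} = 7260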

How the program works

The program will partition N into 3 blocks. This is specified on line 5, where we divide N by numLocales to get the block size. The built-in numLocales contains the value of the -nl parameter, which in this example is 3.

The code on line 6 tells Chapel to execute a parallel for-loop, running the code inside the on loc block on each Locale. A Locale has an id starting from 0. Each Locale determines its id, computes the starting and ending numbers it has to sum, and stores the partial sum in the variable _sum.

coforall loc in Locales {
  on loc {
      var i = loc.id;
      var start = i*bs + 1;
      var end = start + bs -1;
      var _sum:int = 0;
      for j in start .. end {
        _sum += j;
      }
    writeln("i= " + i + ", start= "+ start + ", end=" + end + ", sum = " + _sum);
    s[i] = _sum;
  }
}

Each partial sum is stored in the array s. We then take the sum of the entries of s using the reduce keyword, specifying + as the reduction operator. Finally, we print the total sum across the machines.

sum = + reduce s;
writeln("sum: " + sum);

Conclusion

We have seen that we can implement parallel programs using the Chapel programming language. In Part 2, we will show how to do parallel matrix multiplication using Chapel.

Divided We Compute, United We Reduce

Once upon a time, in a faraway village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said, “Sons, see that bundle of sticks? Each one of you try to break it. The one who can break it will inherit all my riches.” Each son, being greedy, wanted all the riches for himself. So each one of them tried to break the bundle of sticks, but none of them succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now take one stick and break it.” Without any effort, each son was able to break his stick. The old man said, “You see, when you unite, no task will be difficult. The riches I told you about were a lie. We are broke. When I’m dead, make sure you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks as a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When a single processor is given the task of solving the complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems that are given to separate processors, each processor solves its smaller problem quickly, thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them in separate processors is called Parallel Computing. In this article, we will compute how fast a certain algorithm will run when parallelized. The problem we want to investigate is sorting an array of a million (2^{20}) integers.

Efficient Sorting

Suppose you have an array \{ a_1, a_2, a_3, ..., a_n \} that you want to sort based on pairwise comparisons. The sorted array is just one of the many permutations of the array \{a_1, a_2, a_3,\ldots, a_n\}. In fact, if you have n different objects to sort, then there are exactly n! ways to arrange these objects, and one of them is the sorted state. You can imagine the sorting process as a decision tree. Say, for example, we have the array A=\{ a,b,c \}. To sort this, we first compare a with b, and there are 2 ways this can go: either a \le b or a > b. If a \le b, we then compare b and c. This also gives either b \le c or b > c. As you can see from the diagram below, this is nothing but a decision tree.



The decision tree has at least n! leaves, one for each possible ordering, and a binary tree of height h has at most 2^h leaves, so the height of this binary tree is at least \lg(n!). Then we have

\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] = \lg n + \lg (n-1) + \cdots + \lg 1 \le \underbrace{\lg n + \cdots + \lg n}_\text{ n times}
\lg(n!) \le n\cdot \lg n

There are efficient algorithms that sort with this complexity; merge sort, for example. Therefore, if you have an array of 2^{20} elements, then the number of comparisons is about

2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520

that is, it takes about 20 million comparisons to sort an array of 1 million integers. Could we do any better than this? We could either upgrade the CPU of the machine doing the sorting or use two or more machines and divide the work among them. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming them out to other processors.

Divide and Conquer

Assume we have an array of n=2^{20} elements that we need to sort, and suppose we have two identical processors we can use. Divide the array into 2 equal-sized arrays. Give the first half to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array on each processor. We then combine the results of processor 1 and processor 2 into one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:



This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.




If you count the total number of comparisons needed to merge two sorted arrays, you will find that it takes at most n-1 comparisons. Therefore, the complexity of the merging process is O(n).

Since each of the p processors gets a subarray of about n/p elements, the sorting complexity on each processor is (n/p) \lg (n/p). Furthermore, since the merging process takes O(n) comparisons, the total complexity of the parallel sorting process is therefore

\displaystyle n/p \lg(n/p) + n

In our example, C=2^{20}/2 \lg(2^{20}/2) + 2^{20}=  11010048 comparisons compared to 2^{20} \lg(2^{20}) = 20971520 when run sequentially. For large values of n, n/p \lg(n/p) dominates n, therefore the complexity of the parallel algorithm is O(n/p \lg(n/p)).
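
In other words, for this input size the two-processor version needs roughly half the comparisons of the sequential one,

\displaystyle \frac{20971520}{11010048} \approx 1.9

a speedup of about 1.9.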

Can we do any better?

For a given value of n, what do you think is the value of p that reduces the running time to O(n)? If we take n=2^{20} and plot complexity against p = \{ 2, 4, 8, 16, 32\} we get the diagram below.



In this diagram, we also plotted the horizontal line y=2^{20}. The intersection of this line with the plot of \displaystyle f(p) = \frac{n}{p} \lg(\frac{n}{p}) gives us the value of p such that the total comparisons is already linear, that is,

\displaystyle f( p ) = n
\displaystyle \frac{n}{p} \lg(\frac{n}{p})  = n

To get the value of p numerically, we have to solve the root of the equation

\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0

Simplifying,

\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0
\displaystyle \lg(\frac{n}{p}) = p
\displaystyle \frac{n}{p} = 2^p
\displaystyle p2^p - n = 0

Since this is a non-linear equation, we can solve it using Newton's method, which approximates a root starting from an initial guess. Starting from a guess solution p_1, the root can be approximated using the recursive formula

\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g\prime ( p_n)}

where g\prime ( p ) is the first derivative of g( p ) . Applying the rules of derivatives, we get

\displaystyle g\prime ( p ) = p\cdot 2^p \ln 2 + 2^p

Substituting this to the formula for Newton's method, we get

\displaystyle p_{n+1} = p_n - \frac{p2^p - n}{p2^p \ln 2 + 2^p}

Below is R code that uses Newton’s method to compute the root of the equation g(p).

# g(p) = p*2^p - n, whose root we want to find
g=function(n,p){
	p*2^p - n
}

# derivative of g with respect to p
gprime=function(n,p){
	p*2^p*log(2) + 2^p
}

newton=function(p,n,iter){
	tmp = p
	for(i in 1:iter){
		p=p-g(n,p)/gprime(n,p)
		
		if(abs(p-tmp)< 0.0001){
			break		
		}

		print(p)
		tmp=p
	}
	print(p)
}

Running this code, we get the value of p = 16:

> newton(15,n,100)
[1] 16.80905
[1] 16.08829
[1] 15.98603
[1] 16.00286
[1] 15.99944
[1] 16.00011
[1] 15.99998
[1] 16
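
We can verify this answer directly, since

\displaystyle 16 \cdot 2^{16} = 2^4 \cdot 2^{16} = 2^{20} = n

so p = 16 solves p2^p = n exactly.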

Ignoring network latency, by distributing the input evenly across 16 processors we get a running time of O(n) for an array of n=2^{20} items. Therefore, instead of doing around 20 million comparisons, the work comes down to the order of a million comparisons to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm rather than the exception. Learning to harness the power of multicore machines is becoming an extremely handy skill to have.

Do It Yourself Supercomputing in Linux Part 1

If you recently purchased a laptop or desktop computer, chances are you have a dual-core system. There is no indication that we are going back to single core unless some technological breakthrough breaks the power barrier of CPUs. This means that more and more people will have a high-powered computer in their homes without any idea how to harness such power.

What does it mean to have a dual processor? On first impulse, you probably think it will speed up the execution of your programs. You would probably perceive a significant difference between the response time of your programs on a single core versus a dual core. But why do programs run faster on dual-processor systems? For one thing, the speed of each processor is faster compared to older single processors. The other reason is symmetric multiprocessing. An analogy for the dual-processor system is a bank with 2 tellers and a single line of customers: when a teller is done with one customer, the teller takes the next customer from the line. You can read about the performance of multiprocessor systems in this article.

There is another way you can make use of multiple processors to make your programs run faster: parallel programming. Parallel computing has been around for a long time, but it has usually been confined to university laboratories or supercomputing facilities. It has not caught the interest of ordinary people because they had no access to such machines. However, the future is about multiprocessing. More and more of these machines will reach the masses, which means that highly talented members of the masses get to experiment with parallel computing on a day-to-day basis. We are in for another revolution in computing. Are you ready for this revolution?


Talk on Parallel Computing

Parallel computing used to be the province of scientists and engineers. The tools and algorithms to do it have always been available on the net. However, since the advent of multicore machines, there is now a compelling reason for all programmers to learn and master it. Parallel programming is the future of computing, and that future is now.

I remember way back in 1998, I was introduced to parallel computing with MPI. I also wrote some papers on parallel algorithms and am still active in parallel algorithms research. Now I am advocating for parallel computing and teaching it to a lot of people.
