## Divided We Compute, United We Reduce

Once upon a time, in a faraway village lay a dying old man. He called his sons to his deathbed and spoke to them one last time. He said, “Sons, see that bundle of sticks? Each one of you, try to break it. The one who can break it will inherit all my riches.” Each son, being greedy, wanted all the riches for himself. So each of them tried to break the bundle of sticks, but none succeeded. The old man asked his servant to untie the bundle and said to his sons, “Each one of you now take one stick and break it.” Without any effort, each son was able to break his stick. The old man said, “You see, when you unite, no task will be difficult. The riches I told you about were a lie. We are broke. When I’m dead, make sure you unite so that you can survive.”

Fast forward to modern times. You can think of the bundle of sticks as a complex problem that is itself composed of smaller problems. The sons are the processors of your computer. When a single processor is given the whole complex problem, it fails to solve it in a reasonable amount of time. When the complex problem is decomposed into smaller problems, one per processor, each processor solves its smaller problem quickly, thereby solving the big problem quickly as well.

The process of decomposing a problem into smaller problems and solving them on separate processors is called parallel computing. In this article, we will compute how much faster a certain algorithm runs when parallelized. The problem we want to investigate is sorting an array of a million ($2^{20}$) integers.

Efficient Sorting

Suppose you have an array $\{ a_1, a_2, a_3, \ldots, a_n \}$ that you want to sort based on pairwise comparisons. The sorted array is just one of the many permutations of $\{a_1, a_2, a_3, \ldots, a_n\}$. In fact, if you have $n$ different objects to sort, there are exactly $n!$ ways to arrange them, and one of those arrangements is the sorted state. You can imagine the sorting process as a decision tree. Say, for example, we have the array $A = \{a, b, c\}$. To sort it, we first compare $a$ with $b$, and there are two ways this can go: either $a \le b$ or $a > b$. If $a \le b$, we then compare $b$ and $c$. This also gives either $b \le c$ or $b > c$. As you can see from the diagram below, this is nothing but a decision tree.

Since this decision tree is a binary tree with $n!$ leaves, its height is at least $\lg(n!)$. We then have

$\lg(n!) = \lg\Big[ n \cdot (n - 1) \cdot (n-2) \cdots 1\Big] = \lg n + \lg (n-1) + \cdots + \lg 1 \le \underbrace{\lg n + \lg n + \cdots + \lg n}_{n\text{ times}}$
$\lg(n!) \le n\cdot \lg n$

There are efficient algorithms that sort with this complexity; merge sort is one of them. Therefore, if you have an array of $2^{20}$ elements, the number of comparisons is

$2^{20} \cdot \lg(2^{20}) = 2^{20} \cdot (20) = 20971520$

that is, it takes about 20 million comparisons to sort an array of 1 million elements. Can we do better than this? We can either upgrade the CPU of the machine doing the sorting or use two or more machines to divide the work among them. In this article, we are going to investigate the impact of dividing the work into smaller chunks and farming it out to other processors.
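To make the $n \lg n$ bound concrete, here is a short Python sketch (Python is used purely for illustration) that counts the pairwise comparisons merge sort actually performs. A smaller $n = 2^{16}$ keeps the run fast; the corresponding bound is $n \lg n = 2^{16} \cdot 16 = 1{,}048{,}576$:

```python
import random

def merge(left, right):
    """Merge two sorted lists, counting pairwise comparisons."""
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            out.append(left[i]); i += 1
        else:
            out.append(right[j]); j += 1
    out.extend(left[i:])   # one side is exhausted; no more comparisons needed
    out.extend(right[j:])
    return out, comparisons

def merge_sort(a):
    """Sort a list, returning (sorted_list, total_comparisons)."""
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, c1 = merge_sort(a[:mid])
    right, c2 = merge_sort(a[mid:])
    out, c3 = merge(left, right)
    return out, c1 + c2 + c3

n = 2 ** 16
data = [random.randrange(n) for _ in range(n)]
result, count = merge_sort(data)
print(count <= n * 16)   # True: the comparison count never exceeds n·lg n
```

On random inputs the measured count comes out somewhat below the bound, since the bound describes the worst case.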

Divide and Conquer

Assume we have an array of $n = 2^{20}$ elements that we need to sort, and suppose we have two identical processors we can use. Divide the array into two equal-sized subarrays. Give the first half to the first processor and the other half to the second processor. Apply an efficient sorting algorithm to the subarrays to produce a sorted array on each processor. We then combine the results of processor 1 and processor 2 into one big array by merging the two sorted arrays. The diagram below illustrates the process of computation:

This is also known as the MapReduce algorithm. Mapping is the process of assigning subsets of the input data to processors where each processor computes the partial result. Reducing is the process of aggregating the results of each processor to the final solution of the problem.

The process of merging is straightforward. Given two sorted arrays, begin by comparing the first element of each array. The smaller of the two will then occupy the first slot in the big array. The second element of the array from which we took the smallest element will now become the first element of that array. Repeat the process until all elements of both arrays have already occupied slots in the big array. The diagram below illustrates the algorithm of merging.

If you count the total number of comparisons needed to merge two sorted arrays, you will find that it takes at most $n-1$ comparisons, where $n$ is the total number of elements. Therefore, the complexity of the merging process is $O(n)$.
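A minimal Python sketch of this merging procedure (illustrative only) makes the count easy to see. Merging two sorted 4-element lists whose values interleave perfectly costs exactly $n - 1 = 7$ comparisons:

```python
def merge(a, b):
    """Merge two sorted lists into one sorted list, counting comparisons."""
    out, i, j, comparisons = [], 0, 0, 0
    while i < len(a) and j < len(b):
        comparisons += 1
        if a[i] <= b[j]:
            out.append(a[i]); i += 1
        else:
            out.append(b[j]); j += 1
    out.extend(a[i:])   # one list is exhausted; copy the rest without comparing
    out.extend(b[j:])
    return out, comparisons

merged, c = merge([1, 3, 5, 7], [2, 4, 6, 8])
print(merged, c)   # [1, 2, 3, 4, 5, 6, 7, 8] 7
```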

Since each of the $p$ processors sorts a subarray of size $n/p$, the sorting phase has complexity $(n/p) \lg (n/p)$. Furthermore, since the merging process takes $O(n)$ comparisons, the total complexity of the parallel sorting process is therefore

$\displaystyle n/p \lg(n/p) + n$

In our example, $C=2^{20}/2 \lg(2^{20}/2) + 2^{20}= 11010048$ comparisons compared to $2^{20} \lg(2^{20}) = 20971520$ when run sequentially. For large values of $n$, $n/p \lg(n/p)$ dominates $n$, therefore the complexity of the parallel algorithm is $O(n/p \lg(n/p))$.

Can we do any better?

For a given value of $n$, what do you think is the value of $p$ that reduces the running time to $O(n)$? If we take $n=2^{20}$ and plot complexity against $p = \{ 2, 4, 8, 16, 32\}$ we get the diagram below.

In this diagram, we also plotted the horizontal line $y = 2^{20}$. The intersection of this line with the plot of $\displaystyle f(p) = \frac{n}{p} \lg\Big(\frac{n}{p}\Big)$ gives us the value of $p$ at which the total number of comparisons becomes linear, that is,

$\displaystyle f( p ) = n$
$\displaystyle \frac{n}{p} \lg(\frac{n}{p}) = n$

To get the value of $p$ numerically, we have to find the root of the equation

$\displaystyle g( p ) = \frac{n}{p} \lg(\frac{n}{p}) - n = 0$

Simplifying,

$\displaystyle \frac{1}{p} \lg(\frac{n}{p}) - 1 = 0$
$\displaystyle \lg(\frac{n}{p}) = p$
$\displaystyle \frac{n}{p} = 2^p$
$\displaystyle p2^p - n = 0$
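Before solving this numerically, note that for our particular $n = 2^{20}$ the root can be spotted by inspection: $16 \cdot 2^{16} = 2^4 \cdot 2^{16} = 2^{20}$, so $p = 16$ is an exact root. A one-line Python check (purely illustrative):

```python
n = 2 ** 20
print(16 * 2 ** 16 == n)   # True: p = 16 solves p·2^p = n exactly
```

For a general $n$, no such closed form is available, which is where a numerical root finder comes in.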

Since this is a nonlinear equation, we can solve it using Newton's method, which approximates a root starting from an initial guess. Starting from a guess $p_1$, the root can be approximated using the recursive formula

$\displaystyle p_{n+1} = p_n - \frac{g( p_n)}{g'( p_n)}$

where $g'( p )$ is the first derivative of $g( p )$. Applying the rules of differentiation, we get

$\displaystyle g'( p ) = p\cdot 2^p \ln 2 + 2^p$

Substituting this into the formula for Newton's method, we get

$\displaystyle p_{n+1} = p_n - \frac{p_n 2^{p_n} - n}{p_n 2^{p_n} \ln 2 + 2^{p_n}}$

Below is R code that uses Newton's method to compute the root of the equation $g( p ) = 0$.

```r
g = function(n, p) {
  p * 2^p - n
}

# derivative of g with respect to p: d/dp [p * 2^p] = p * 2^p * ln(2) + 2^p
gprime = function(n, p) {
  p * 2^p * log(2) + 2^p
}

newton = function(p, n, iter) {
  tmp = p
  for (i in 1:iter) {
    p = p - g(n, p) / gprime(n, p)
    if (abs(p - tmp) < 0.0001) {
      break
    }
    print(p)
    tmp = p
  }
  print(p)
}
```



Running this code with $n = 2^{20}$ and an initial guess of 15, for example `newton(15, 2^20, 100)`, the iterates converge to $p = 16$.


Ignoring network latency, by distributing the input evenly across 16 processors we get an $O(n)$ running time for an array of $n = 2^{20}$ items. Therefore, instead of doing 20 million comparisons, each processor only needs about 1 million comparisons to sort 1 million objects.

In this age of multicore processors, parallel computing is fast becoming the norm rather than the exception. Learning to harness the power of multiple cores is an extremely handy skill to have.

## Divine Comedy*

Last night I had a dream. In my dream, an angel showed me heaven and hell. The first place we went to was a large hall filled with round tables. Around each table were 5 souls, all of them wearing white. The place was quiet, and the only sound you could hear was the clicking of chopsticks as each soul ate spaghetti from his or her plate. The spaghetti was unlimited and placed in a big bowl at the center of each table. The chopsticks were placed in such a way that there was one stick between each pair of plates. No soul could eat unless he or she had taken both the left and right chopsticks. If a soul was not eating, he or she spent the time thinking.

I asked the angel “What is this place?” The angel said “This is hell.” I asked the angel “But why is this hell? This seems to be a very peaceful place with people doing nothing but eat and think.” Then the angel pointed me to one table and said “What do you see?”. I looked at the table and I saw souls starving but never dying. As I looked closer, I noticed that all of them were able to pick their right chopsticks but not the left chopsticks. Suddenly I realized that many tables have souls starving forever.

I was about to ask the angel why they starved when the angel suddenly showed me another place. People were also wearing white, and they also had a bowl of spaghetti at the center of each table. Two things were noticeably different from hell. Each table was inside its own room, and each room had a waiting area that could accommodate 4 souls. There were 5 seats per table, but the maximum number of souls per table was only 4. I observed the entire hall and did not see a single soul starve. Everyone had a happy look on their face and was either eating or thinking. I asked the angel “What is this place?” The angel said “This is heaven. You can see that there are at most 4 souls per table. A soul that is not seated at the table is in the waiting area.”

Suddenly, I woke up from my dream without any idea of what it means. What could the dream mean? I spent the whole day thinking about it.

Interpretation of the Dream

The chopsticks used by the souls are binary semaphores since no two of them can grab the same stick at any time. To model the behavior of the souls in hell, we need an array of 5 semaphores (chopsticks) and the following algorithm:

```
# loop forever
think()
wait(left chopstick)
wait(right chopstick)
eat()
signal(left chopstick)
signal(right chopstick)
```


The figure below shows a round table with 5 plates. The chopsticks are colored green and placed in between the plates. For a soul to eat, he/she should have both chopsticks first.

Since the chopsticks are semaphores, then no two souls can be holding the same chopstick at the same time. To see this, let’s make use of the following properties from the previous post:

1. $\displaystyle \#CS = \#wait - \#signal$. For chopstick i, the number of souls holding this chopstick at any time is equal to $\#CS$.
2. $\displaystyle S.V = 1 - \#CS$. Using this property to compute $\#CS$, we get

$\displaystyle \#CS = 1 - S.V$

Since $S.V \ge 0$, then $\#CS \le 1$; that is, at most one soul is holding chopstick $i$ at any given time.

The reason the souls starved is that it is possible for all of them to grab their right chopsticks at the same time, leaving no soul with a left chopstick to use.

The setup in heaven is subtly different. Instead of having 5 souls eating on the table at the same time, at least one of them waits for his turn in the waiting area (which is guarded by a room semaphore). To model this, we need an array of 5 semaphores (chopsticks) and a room semaphore initialized to 4. The following is the algorithm used by the souls in heaven:

```
# loop forever
think()
wait(room)
wait(left chopstick)
wait(right chopstick)
eat()
signal(left chopstick)
signal(right chopstick)
signal(room)
```
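The heaven protocol maps directly onto counting semaphores. Below is a minimal Python sketch of it (the names `soul` and `meals` and the serving count are my own illustrative choices): each chopstick is a binary semaphore, and the room semaphore is initialized to 4, so the protocol never deadlocks and every soul finishes eating:

```python
import threading

N = 5
chopstick = [threading.Semaphore(1) for _ in range(N)]  # binary semaphores
room = threading.Semaphore(N - 1)                       # at most 4 souls in the room
meals = [0] * N

def soul(i, servings=10):
    for _ in range(servings):
        # think()
        room.acquire()                    # wait(room)
        chopstick[i].acquire()            # wait(left chopstick)
        chopstick[(i + 1) % N].acquire()  # wait(right chopstick)
        meals[i] += 1                     # eat()
        chopstick[(i + 1) % N].release()  # signal(right chopstick)
        chopstick[i].release()            # signal(left chopstick)
        room.release()                    # signal(room)

threads = [threading.Thread(target=soul, args=(i,)) for i in range(N)]
for t in threads: t.start()
for t in threads: t.join()
print(meals)   # [10, 10, 10, 10, 10]: no soul starves
```

Dropping the two `room` lines gives exactly the hell protocol, which can deadlock when every soul grabs one chopstick at the same time.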



The diagram below shows a round table inside a room. The plates and chopsticks are arranged as before, but at least one seat is empty as indicated by the red plate corresponding to it. Any soul that is not inside the room is in the waiting area as shown by the stick figure.

Initially, all the souls are in the waiting area, and the door to the room is guarded by the room semaphore. Since the initial value of the room semaphore is 4, at most 4 souls can enter the room at a time. Inside the room, the souls take their seats. Since at most 4 souls can be in the room, at least one seat is not taken. No soul is ever left starved in this system. Otherwise, some soul[i] would be blocked in one of three ways:

1. Blocked on his/her left chopstick. This means there is a soul[i-1] to his/her left that is using chopstick[i] as his/her right chopstick. By the assumption of progress, soul[i-1] will eventually execute signal(right chopstick). Hence, soul[i] will be able to pick up his/her left chopstick.
2. Blocked on his/her right chopstick. Soul[i] is not able to pick up his/her right chopstick because soul[i+1] is using it as his/her left chopstick and will never release it until he/she has eaten. But the only way for soul[i+1] to eat is to pick up his/her right chopstick, which, by induction, is also being used by another soul to the right of soul[i+1]. However, since at least one seat is vacant, there is a right chopstick that is not being used by any soul, and hence by progress at least one soul can eat, and eventually soul[i] can eat.
3. Blocked from entering the room. If we use a first-come, first-served policy, all souls can eventually enter the room.

By using an extra locking mechanism, the souls in heaven are infinitely happier than the ones in hell.

* Disclaimer: This is a work of fiction and should not be taken literally aside from its algorithmic content. In the concurrency literature, this is called the Dining Philosophers problem. For more information, please consult the book “Principles of Concurrent and Distributed Programming” by M. Ben-Ari.

## Semaphorically Speaking

When I was a child, our family lived about 60 kilometers away from the city. Once or twice a month we went to the city, a trip that took about an hour and a half one way. On the way, we had to pass a stretch of very narrow road along the side of a very deep cliff. Many people had already lost their lives there when the buses they rode plunged over the cliff. To prevent this, a mechanism was set up that allowed only one bus at a time to enter that section of road, whether going one way or the other. For each end of the road to determine whether a bus was traversing at any moment, they used a telephone to signal the other party so that the other buses could wait before taking their turn. It was always a frightening experience each time we passed that road. In computer science, we can view the dangerous narrow section of road as a critical section and the buses traversing it as processes. The telephone system that signals each end of the road to guarantee mutual exclusion is called a semaphore.

We see semaphores everyday. The traffic light is a semaphore that protects the vehicles from bumping each other in that critical section of the road called the intersection. Semaphores are also used in the navy. Below is an example of a semaphore used in the US Navy.

In the last post, we talked about concurrency and an algorithm to ensure the correctness of concurrent algorithms. In this post, we will present another mechanism to ensure the correctness of concurrent programs. That mechanism is called a semaphore and was first proposed by Edsger Dijkstra. A semaphore* is a data structure S composed of an integer V and a set L, modified by two atomic operations. The first operation is called wait and is defined as follows:

```
wait(S)
  if S.V > 0
    S.V <- S.V - 1
  else
    S.L <- S.L union {p}
    p.state <- blocked
```


The wait operation is called by a process $p$ on a semaphore before it enters the critical section. If the value of S.V is greater than zero, S.V is decremented by 1 and the process $p$ is allowed to continue its execution. Otherwise, the process $p$ is blocked and is put in the set S.L.

The signal function is defined as

```
signal(S)
  if S.L = empty set
    S.V <- S.V + 1
  else
    select a process q in S.L
    S.L <- S.L - {q}
    q.state <- ready
```



When a process $p$ is done in its critical section, it will call the signal operation on the semaphore S. If the set S.L has no blocked processes, it will increment the value of S.V by 1, otherwise a process $q$ is selected from the set S.L and its state is set to “ready”.

To use the semaphore construct, a typical concurrent program will be written like the one below:

```
# loop forever
noncriticalSection()
wait(S)
criticalSection()
signal(S)
```
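As a concrete sketch of this pattern (Python, purely illustrative), the following program protects a shared counter with a semaphore whose initial value is 1. The `in_cs` variable instruments $\#CS$ so we can observe that it never exceeds 1:

```python
import threading

S = threading.Semaphore(1)   # S.V = 1 initially
in_cs = 0                    # instruments #CS, the number of threads in the CS
max_in_cs = 0
counter = 0

def process(iterations=1000):
    global in_cs, max_in_cs, counter
    for _ in range(iterations):
        # non-critical section
        S.acquire()          # wait(S)
        in_cs += 1
        max_in_cs = max(max_in_cs, in_cs)
        counter += 1         # critical section: update shared data
        in_cs -= 1
        S.release()          # signal(S)

threads = [threading.Thread(target=process) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(max_in_cs, counter)   # 1 4000: at most one thread in the CS, no lost update
```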


Simple Properties of Semaphores

In this section, we list some simple properties of semaphores that we can use in proving correctness. By properties, we mean mathematical statements that remain invariant in every state of the computation.

1. From the definition of the wait and signal functions, we can see that the value of S.V never goes negative, that is

$\displaystyle S.V \ge 0$

2. If initially $S.V = k$, then we can compute the subsequent values of S.V by counting the number of calls to wait and signal (with some qualifications). If $k$ processes call the wait operation, the value of S.V goes down to 0. If all of these processes then call signal, the value of S.V goes back to $k$. If, however, a process calls wait when the value of S.V is already 0, S.V simply remains zero and the process is blocked. We say that this call to wait has failed, and we do not count it. Likewise, if a process calls signal and sees that S.L is not empty, the value of S.V remains the same, and this call to signal is not counted either. Having laid down the rules for counting calls to wait and signal, we can see that

$\displaystyle S.V = k - \#wait + \#signal$

where $\#wait$ and $\#signal$ are the number of calls to wait and signal, respectively.

3. At anytime, the number of processes executing the critical section is found by counting the number of processes entering the critical section by a successful call to wait, minus the number of processes leaving the critical section by a successful call to signal:

$\displaystyle \#CS = \#wait - \#signal$

Correctness of the Semaphore Solution for Two Processes

As we have discussed in the last post, to prove correctness we need to show that the algorithm is mutually exclusive, free from deadlock and starvation.

1. Mutual Exclusion – Let S.V = 1 initially. At any time during the execution, the number of processes in the critical section is given by property 3 above, that is, $\#CS = \#wait - \#signal$. From property 2,

$\displaystyle S.V = 1 - \#CS$
$\displaystyle S.V + \#CS = 1$
$\displaystyle \#CS = 1 - S.V$

Since by property 1 $S.V \ge 0$, therefore $\#CS \le 1$. That is, there can be at most one process in the critical section at anytime.

2. Freedom from deadlock – The two processes can deadlock only if the value of S.V = 0 and neither is in the critical section ($\#CS = 0$). By properties 2 and 3,

$\displaystyle S.V = 1 - \#CS$

Since S.V = 0 and $\#CS = 0$, substituting these into the equation above gives us

$\displaystyle 1 = 0,$

which is a contradiction. Therefore, deadlock cannot occur.

3. Freedom from Starvation – Suppose a process $q$, upon calling wait, sees that $S.V = 0$ and is blocked forever in S.L. By properties 2 and 3,

$\displaystyle \#CS = 1 - S.V = 1 - 0 = 1$

which means that the other process $p$ is in the critical section. By assumption of progress, process $p$ will eventually call signal and will pick one process from S.L to enter the critical section. Since there is only one process that is blocked, and that is $q$, process $q$ can then enter its critical section, contradicting the assumption that it is starved.

* For more information, consult the book “Principles of Concurrent and Distributed Programming” by M. Ben-Ari.

## A Mutually Beneficial Arrangement

It’s been said that the programmers of today are the pizza delivery boys of tomorrow. This happened the last time the world witnessed a severe financial crisis. Usually, those who are not able to cope with the current trends in IT are the first ones to go. One trend that is here to stay is the increasing number of cores in our computers. To maximize the power of multicore architectures, the programmers of today should be able to do concurrent programming correctly. Concurrent programming is the art of using the extra CPU capacity of your computer by exploiting parallelism to speed up your applications. This is a non-trivial task and requires a different mindset. In this series of articles, we will explore what concurrent programming is and its mathematical basis.

The section of code that updates or reads shared data is called the critical section. This is the portion of the code where we need to ensure that only one thread is executing at a time. To do this, we employ a mechanism for the two threads to know whose turn it is to enter the critical section. Below is code that tries to solve this problem. First, we define a global variable shared by both threads. This variable, called turn, specifies which thread can enter the critical section. The value of turn is either 1 or 2. A value of 1 means that thread 1 can enter the critical section, while a value of 2 means that thread 2 can enter the critical section.


```
global turn = 1
```


This code is to be executed by thread 1:

```
# loop forever
1 execute non-critical section
2 await turn = 1
3 execute critical section
4 turn = 2
```



The code to be executed by thread 2 is similar:

```
# loop forever
1 execute non-critical section
2 await turn = 2
3 execute critical section
4 turn = 1
```
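The two thread bodies above can be sketched in Python (illustrative only; real code would not busy-wait like this). With the shared `turn` variable, the threads strictly alternate in the critical section:

```python
import threading

turn = 1
log = []   # records which thread was in the critical section

def thread1(rounds=5):
    global turn
    for _ in range(rounds):
        # non-critical section
        while turn != 1:   # await turn = 1 (busy wait)
            pass
        log.append(1)      # critical section
        turn = 2

def thread2(rounds=5):
    global turn
    for _ in range(rounds):
        while turn != 2:   # await turn = 2 (busy wait)
            pass
        log.append(2)      # critical section
        turn = 1

t1 = threading.Thread(target=thread1)
t2 = threading.Thread(target=thread2)
t1.start(); t2.start()
t1.join(); t2.join()
print(log)   # [1, 2, 1, 2, 1, 2, 1, 2, 1, 2]: strict alternation
```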


The critical section must be coded so that it takes as short a time as possible. We shall assume that any thread that enters the critical section eventually exits from it. This is called the progress assumption.

We can imagine the execution of this program by the two threads to be described by a state specified by three numbers:

1. the pointer to the line that thread 1 is to execute.
2. the pointer to the line that thread 2 is to execute.
3. the value of the turn variable.

For example, the initial state is described by the triple (1,1,1). This means that thread 1’s pointer is at line 1, thread 2’s pointer is at line 1, and the value of the turn variable is 1. Let us first make it clear that when the pointer is at line 1, the thread has not yet executed that line. It only means that if the CPU scheduler allows the thread to run, it will execute line 1. For example, if the pointer of thread 1 is at line 4, it does not mean that the value of turn is already 2. It only means that when thread 1 goes from line 4 to line 1, the value of turn will have been set to 2.

We can draw the entire execution of the two threads using a state diagram as shown below:

Let’s try to understand this diagram. The execution begins with the state (1,1,1), which means that thread 1’s pointer is at line 1, thread 2’s pointer is at line 1, and the value of turn is 1. This state can go either to (2,1,1) or to (1,2,1), depending on which thread the CPU scheduler runs first. The state transitions are indicated by arrows. From the diagram, you will see that when the value of turn = 1, thread 2 cannot go past line 2. In the same way, when the value of turn = 2, thread 1 cannot go past line 2.

Correctness Properties of Concurrent Programs

Writing concurrent programs is not easy. You have to demonstrate that your program is correct. What does correct mean? A concurrent program is correct if:

1. Mutual exclusion holds. At most one thread is in the critical section at any time.
2. No deadlock occurs. This means that no thread holds a resource that is needed by another thread, and vice versa. If this occurs, then each thread will be waiting for the other thread to release the resource and the application will appear to hang.
3. No starvation occurs. If a thread wants to enter the critical section, then it should eventually be able to do so.

Is our algorithm correct? First, the algorithm satisfies mutual exclusion. There is no state in which both thread pointers point to line 3, which is the critical section. This means you cannot see states (3,3,1) or (3,3,2). Second, there is no deadlock. The two threads do not get stuck in their pre-protocols. Suppose that the value of turn = 1 and both threads execute line 2. Since turn = 1, thread 1 will enter its critical section. By the assumption of progress, it will exit its critical section and set turn = 2. This will then allow thread 2 to enter its critical section. Therefore, no thread gets stuck trying to enter its critical section.
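Because the state space is finite and small, these claims can be checked mechanically. The following Python sketch enumerates every state reachable from (1,1,1) by breadth-first search and confirms that no reachable state has both pointers at line 3 (mutual exclusion), and that every reachable state has at least one successor (no deadlock):

```python
from collections import deque

def step(state):
    """Return all successor states of (pc1, pc2, turn)."""
    pc1, pc2, turn = state
    successors = []
    for tid, pc in ((1, pc1), (2, pc2)):
        new_pc, new_turn = pc, turn
        if pc == 1:               # line 1: non-critical section
            new_pc = 2
        elif pc == 2:             # line 2: await turn = tid
            if turn != tid:
                continue          # blocked; this thread cannot move
            new_pc = 3
        elif pc == 3:             # line 3: critical section
            new_pc = 4
        else:                     # line 4: hand over the turn and loop
            new_pc, new_turn = 1, 2 if tid == 1 else 1
        if tid == 1:
            successors.append((new_pc, pc2, new_turn))
        else:
            successors.append((pc1, new_pc, new_turn))
    return successors

# breadth-first search over the reachable states
start = (1, 1, 1)
seen, queue = {start}, deque([start])
while queue:
    for nxt in step(queue.popleft()):
        if nxt not in seen:
            seen.add(nxt)
            queue.append(nxt)

print(any(s[0] == 3 and s[1] == 3 for s in seen))   # False: mutual exclusion holds
print(all(len(step(s)) > 0 for s in seen))          # True: no reachable state is stuck
```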

There is, however, starvation. There is no assumption of progress on the non-critical section. This means that if turn = 1 and thread 1 gets stuck in its non-critical section, thread 2 is prevented from ever entering its critical section. Therefore, the algorithm is not correct, as it does not satisfy all three conditions.

In the next post, we will examine a mechanism, called a semaphore, that is able to solve this problem.