## Have You Seen A Half-Balloon?

Balloons come in all shapes and sizes, but the shape that comes to mind when I hear the word balloon is a sphere. Given a spherical balloon, create an imaginary partition that divides the balloon into left and right hemispheres. This effectively sorts the air molecules into left and right. The number of molecules on the left is more or less the same as the number of molecules on the right. This is what we see in real life.

As we pump air into the balloon, a given air molecule can randomly go left or right of the partition. If all air molecules for some reason go left (or right), then we have a situation called a “half-balloon”.

So why aren’t we seeing half-balloons? It’s because that situation is highly improbable. For illustration purposes, imagine there are 6 air molecules. Each molecule has two equally likely choices: be on the left or on the right. Now imagine each air molecule has chosen its “side”. One possible combination, or configuration (the term we will use moving forward), is that 3 molecules choose Left and 3 choose Right:

L  L  L  R  R  R


Where L stands for Left and R stands for Right.

In fact, the total number of possible configurations is equal to

$\displaystyle \underbrace{2\times 2\times \ldots \times 2}_{\text{6 times}} = 2^6 = 64$

We enumerate below the list of possible air molecule configurations:


# This is the code in R
# x=c("L","R")
# expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
# d1=expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
m1 m2 m3 m4 m5 m6
1   L  L  L  L  L  L
2   R  L  L  L  L  L
3   L  R  L  L  L  L
4   R  R  L  L  L  L
5   L  L  R  L  L  L
6   R  L  R  L  L  L
7   L  R  R  L  L  L
8   R  R  R  L  L  L
9   L  L  L  R  L  L
10  R  L  L  R  L  L
11  L  R  L  R  L  L
12  R  R  L  R  L  L
13  L  L  R  R  L  L
14  R  L  R  R  L  L
15  L  R  R  R  L  L
16  R  R  R  R  L  L
17  L  L  L  L  R  L
18  R  L  L  L  R  L
19  L  R  L  L  R  L
20  R  R  L  L  R  L
21  L  L  R  L  R  L
22  R  L  R  L  R  L
23  L  R  R  L  R  L
24  R  R  R  L  R  L
25  L  L  L  R  R  L
26  R  L  L  R  R  L
27  L  R  L  R  R  L
28  R  R  L  R  R  L
29  L  L  R  R  R  L
30  R  L  R  R  R  L
31  L  R  R  R  R  L
32  R  R  R  R  R  L
33  L  L  L  L  L  R
34  R  L  L  L  L  R
35  L  R  L  L  L  R
36  R  R  L  L  L  R
37  L  L  R  L  L  R
38  R  L  R  L  L  R
39  L  R  R  L  L  R
40  R  R  R  L  L  R
41  L  L  L  R  L  R
42  R  L  L  R  L  R
43  L  R  L  R  L  R
44  R  R  L  R  L  R
45  L  L  R  R  L  R
46  R  L  R  R  L  R
47  L  R  R  R  L  R
48  R  R  R  R  L  R
49  L  L  L  L  R  R
50  R  L  L  L  R  R
51  L  R  L  L  R  R
52  R  R  L  L  R  R
53  L  L  R  L  R  R
54  R  L  R  L  R  R
55  L  R  R  L  R  R
56  R  R  R  L  R  R
57  L  L  L  R  R  R
58  R  L  L  R  R  R
59  L  R  L  R  R  R
60  R  R  L  R  R  R
61  L  L  R  R  R  R
62  R  L  R  R  R  R
63  L  R  R  R  R  R
64  R  R  R  R  R  R


Looking at the data above, we can see that there are 6 configurations that contain only 1 “L”:

# This is the code in R
# cc=c()
# for(i in 1:64){
#  tmp=d1[i,]
#  cc=c(cc,sum(tmp == "L"))
# }
# d1$count = cc
# d1[d1$count == 1, ]
m1 m2 m3 m4 m5 m6 count
32  R  R  R  R  R  L     1
48  R  R  R  R  L  R     1
56  R  R  R  L  R  R     1
60  R  R  L  R  R  R     1
62  R  L  R  R  R  R     1
63  L  R  R  R  R  R     1


This means that the probability of getting a configuration having one molecule on the left and 5 molecules on the right is:

$\displaystyle \text{Probability of 1 molecule Left and 5 molecules Right} = \frac{6}{2^6} = 0.093750$

If we continue summarizing our data so that we get the number of combinations containing 2, 3, 4, 5, and 6 “L”s, we can generate what is known as a probability distribution of air molecules on the left hemisphere:

# cc=c()
# for(i in 0:6){
#  cc=c(cc,length(attributes(d1[d1$count == i, ])$row.names))
# }
# data.frame(X=0:6,count=cc,probability=cc/2^6)
X count probability
1 0     1    0.015625
2 1     6    0.093750
3 2    15    0.234375
4 3    20    0.312500
5 4    15    0.234375
6 5     6    0.093750
7 6     1    0.015625


The graph of this probability distribution is shown below:

# Here is the code that generated the plot above
barplot(cc/2^6,names.arg=0:6,main="Probability Distribution \nNumber of Air Molecules on the Left Hemisphere")


The graph above tells us that the most probable configuration of balloon molecules is that half of them are on the right and the other half on the left, as shown by the high probability of the value X = 3. It also tells us that the probability of all molecules choosing the Left side is equal to 0.015625. When the number of air molecules is very large, this probability will turn out to be extremely small, as we will show below.

## The Mathematics

At this point, manually counting the number of rows will not be practical anymore for a large number of molecules. We need to use some mathematical formula to compute the combinations. We don’t really care which molecules chose left or right. We just care about the number of molecules on the left. Given N molecules, there are $2^N$ possible configurations of air molecules. Of these configurations, there are

$\displaystyle {N \choose m } = \frac{N!}{m! (N-m)!}$

combinations that have $m$ molecules on the left. Therefore, the probability of a configuration having $m$ molecules on the left is

$P(m) = \displaystyle \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}$

This is a valid probability mass function since

$\displaystyle \sum_{m=0}^N P(m) = \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N} = 1$

To show this, we will use the Binomial Theorem

$\displaystyle \sum_{m=0}^N {N \choose m} x^m = (1+x)^N$

If we let $x=1$, the Binomial Theorem gives us

$\displaystyle \sum_{m=0}^N {N \choose m} = (1+1)^N = 2^N$

Therefore

$\begin{array}{rl} \displaystyle \sum_{m=0}^N P(m) &= \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}\\ &= \displaystyle \frac{1}{2^N} \sum_{m=0}^N {N\choose m}\\ &= \displaystyle \frac{1}{2^N} 2^N\\ &= 1 \end{array}$
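The formula can be checked numerically. The following is a minimal sketch in R, assuming $N = 6$ as in the enumeration example; the variable names are illustrative only.

```r
# Sketch: P(m) = choose(N, m) / 2^N for N molecules (N = 6 mirrors the example).
N <- 6
m <- 0:N
P <- choose(N, m) / 2^N

# Tabulate the distribution; it should match the table built by enumeration.
dist <- data.frame(X = m, count = choose(N, m), probability = P)
print(dist)

# The probabilities sum to 1, as the Binomial Theorem argument predicts.
print(sum(P))

# Base R's binomial density with p = 1/2 gives the same values.
print(all.equal(P, dbinom(m, size = N, prob = 0.5)))
```

Using `choose()` avoids building all $2^N$ rows, which is what makes larger values of $N$ tractable.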

## A Mole of Air

One mole of air contains $6.022 \times 10^{23}$ molecules. This means the probability that all molecules are on the left side of the balloon is

$\displaystyle \frac{1}{\displaystyle 2^{6.022 \times 10^{23}}} < 1/1024^{10^{22}} < 1/(10^3)^{10^{22}}=10^{-30,000,000,000,000,000,000,000}$

This number is so small that it has about 30 billion trillion ($3\times 10^{22}$) zeros to the right of the decimal point. If you wrote these zeros on sheets of paper 0.05 millimeters thick, you would need to stack the sheets to a height 74 times the distance from Earth to Pluto!
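A number this small cannot be represented directly in double precision, so a quick check has to work with its logarithm. The sketch below assumes one mole of molecules and an even left/right split; it only illustrates how loose the bound above is.

```r
# Sketch: work with log10 of the probability, since 0.5^(6.022e23) underflows.
N <- 6.022e23                 # molecules in one mole
log10_p <- -N * log10(2)      # log10 of (1/2)^N
print(log10_p)                # roughly -1.8e23

# The bound used in the text, 10^(-3e22), is far looser than the actual value.
print(log10_p < -3e22)

# Direct evaluation is useless in double precision: the result underflows to 0.
print(0.5^N)
```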

## An Interview Question: Using Integer Programming

We can solve the Interview Question using a mathematical technique called Integer Programming. Let $d_1, d_2, \ldots, d_N$ be the variables representing diskette 1, diskette 2, diskette 3, etc. The values of the $d_k$ variables can only be 0 or 1. A 0 means the diskette is not used while a 1 means that it is used.

Each file is saved to a certain diskette. We want to know to what diskette $d_i$ a given file $f_j$ is assigned. To represent this, we assign the variable $a_{ij}$ a value of 1 if file $f_j$ is assigned to diskette $d_i$, and 0 otherwise.

We will normalize the file sizes so that if $s_i$ is the size of $f_i$, then $s_i \le 1$. We do this by simply dividing all file sizes by the size of the diskette. For a given diskette $d_i$, the following constraint should be satisfied:

$d_i - s_1a_{i1} - s_2a_{i2} - \ldots - s_N a_{iN} \ge 0$

for each diskette $i = 1, 2, \ldots, N$, where $s_j$ is the normalized size of file $f_j$ for $j=1,2,\ldots,N$.

Since each file $f_j$ can only be assigned to one diskette, we have the following constraint:

$a_{1j} + a_{2j} + \ldots + a_{Nj} = 1$

where $a_{1j}$ is the variable representing the “file $f_j$ is in diskette $d_1$“, etc.

Finally, we have to constrain the value of $d_i$ to be either 0 or 1. Since the variables are required to be non-negative integers, it is enough to add the upper bound

$d_i \le 1$

for all $i=1,2,\ldots,N$.

## Integer Programming Formulation

Given the above information, we can formulate the Integer Programming problem as

Minimize:

$d_1 + d_2 + d_3 + \ldots + d_N$

subject to

$\begin{array}{rl} d_1 - s_1a_{11} - s_2a_{12} - s_3a_{13} - \ldots - s_Na_{1N} &\ge 0\\ d_2 - s_1a_{21} - s_2a_{22} - s_3a_{23} - \ldots - s_Na_{2N} &\ge 0\\ :\\ d_N - s_1a_{N1} - s_2a_{N2} - s_3a_{N3} - \ldots - s_Na_{NN} &\ge 0\\ a_{11} + a_{21} + a_{31} + \ldots + a_{N1} &= 1\\ a_{12} + a_{22} + a_{32} + \ldots + a_{N2} &= 1\\ :\\ a_{1N} + a_{2N} + a_{3N} + \ldots + a_{NN} &= 1\\ d_1 &\le 1\\ d_2 &\le 1\\ :\\ d_N &\le 1 \end{array}$

## Solving the Problem

We will use R to solve this Integer Programming Formulation. Please see code below:

library("lpSolve")
NUMFILES=4

# Generate random file sizes between 1 and 10
FileSizes=ceiling(10*runif(NUMFILES))
x = -1*FileSizes/10
l=length(x)

# Each file can be in any of the diskettes. Suppose there are N files.
# To determine if a file j is in diskette i, the value of variable x_ij will
# be 1 if file j is in diskette i, and 0 otherwise.
# Here we construct the coefficients of variables x_ij which are the
# sizes of the files (normalized to 1)
zz=c()
for(i in 1:(l-1)){
zz=c(zz,x,rep(0,l*l))
}
zz=c(zz,x)

# Construct the coefficients of the indicator variables representing the
# diskettes d_i
zzmatrix=matrix(zz,ncol=l*l,byrow=T)
CoefficientsOfDiskettes=c();
for(i in 1:l){
ttt=rep(0,l)
ttt[i] = 1
CoefficientsOfDiskettes= c(CoefficientsOfDiskettes,ttt,zzmatrix[i,])
}

# Construct the coefficients of x_ij for constant j. These variables
# satisfy the equation \sum_{i=1}^N x_{ij} = 1
SumOfFileAcrossDiskettes=c()
for(i in 1:l){
ttt=rep(0,l)
ttt[i]=1
SumOfFileAcrossDiskettes=c(SumOfFileAcrossDiskettes,rep(ttt,l))
}

# Prepend Coefficients of variables d_i. The value of these coefficients is 0.
SumOfFileAcrossDiskettesMatrix=matrix(SumOfFileAcrossDiskettes,ncol=l*l,byrow=T)
PrependCoefficientsOfDiskettes=c()
for(i in 1:l){
PrependCoefficientsOfDiskettes=c(PrependCoefficientsOfDiskettes,c(rep(0,l),SumOfFileAcrossDiskettesMatrix[i,]))
}

# Construct coefficients of d_i to construct constraint d_i <= 1
DisketteConstraints=c()
for(i in 1:l){
ttt=rep(0,l)
ttt[i]=1
DisketteConstraints=c(DisketteConstraints,ttt,rep(0,l*l))
}

# Construct matrix input of lpSolve
const.mat=matrix(c(CoefficientsOfDiskettes,PrependCoefficientsOfDiskettes,DisketteConstraints),ncol=l*(l+1),byrow=T)

print("Matrix Coefficients:")
print(const.mat)

# Construct inequalities/equalities
const.dir=c(rep(">=",l),rep("=",l),rep("<=",l))

# Construct Right-Hand side
const.rhs=c(rep(0,l),rep(1,l),rep(1,l))

# Construct Objective Function
objective.in=c(rep(1,l),rep(0,l*l))

# Invoke lpSolve
mylp=lp(direction="min",objective.in=objective.in,const.mat=const.mat,const.dir=const.dir,const.rhs=const.rhs,all.int=T)

# Print Results
print(paste("Number of Diskettes: ", sum(mylp$solution[1:l])))
# Reshape the solution: row 1 holds the d_i, rows 2..(l+1) hold the x_ij per diskette
tz=matrix(mylp$solution,ncol=l,byrow=T)
print("File Sizes: ")
print(FileSizes)
for(i in 2:(l+1)){
files = which(tz[i,] == 1)
if(length(files) > 0){
print(paste("Files in diskette ", i-1))
print(files)
}
}



Most of the code above sets up the matrix of coefficients. Line 70 then calls lpSolve to compute the optimal values of the variables.

## Program Output

Running this code we get the output

[1] "Matrix Coefficients:"
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,]    1    0    0    0   -1 -0.2 -0.1 -0.1    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[2,]    0    1    0    0    0  0.0  0.0  0.0   -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[3,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1     0   0.0   0.0   0.0
[4,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0    -1  -0.2  -0.1  -0.1
[5,]    0    0    0    0    1  0.0  0.0  0.0    1   0.0   0.0   0.0     1   0.0   0.0   0.0     1   0.0   0.0   0.0
[6,]    0    0    0    0    0  1.0  0.0  0.0    0   1.0   0.0   0.0     0   1.0   0.0   0.0     0   1.0   0.0   0.0
[7,]    0    0    0    0    0  0.0  1.0  0.0    0   0.0   1.0   0.0     0   0.0   1.0   0.0     0   0.0   1.0   0.0
[8,]    0    0    0    0    0  0.0  0.0  1.0    0   0.0   0.0   1.0     0   0.0   0.0   1.0     0   0.0   0.0   1.0
[9,]    1    0    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[10,]    0    1    0    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[11,]    0    0    1    0    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[12,]    0    0    0    1    0  0.0  0.0  0.0    0   0.0   0.0   0.0     0   0.0   0.0   0.0     0   0.0   0.0   0.0
[1] "Number of Diskettes:  2"
[1] "File Sizes: "
[1] 10  2  1  1
[1] "Files in diskette  1"
[1] 2 3 4
[1] "Files in diskette  2"
[1] 1



## Interpreting the Result

Lines 2-14 of the output give the matrix of coefficients. Line 15 prints the number of diskettes needed to store the files. Line 17 prints the randomly generated file sizes from 1 to 10. Finally, lines 18-21 print which diskettes contain which files.

The space complexity of this solution is quite substantial. Given $N$ files, we need to specify $N^2 + N$ variables by $3\times N$ equations for a total of $(N^2 + N)\times 3N$ memory space for coefficients.

## An Interview Question

I was given this interview question and I’d like to share it with you. The setting is back in the days when the largest hard disk was 1 GB, there were no CD writers yet, and the only way to back up your data was on 1.44 MB floppy disks. You want to back up your files while minimizing the number of floppy disks you use. Assume your most important files are in a single directory. How will you distribute the files across your disks so that the number of disks you use is minimized?

To make this simple, let’s assume the following:

– We will not take into account that every file copied to a disk also stores a metadata record on that disk, which eats up space as you add more files. For our purposes, we ignore this complexity.
– The size of each file is less than or equal to 1.44 MB.

First we need to have a list of those files including the size and sort the list according to size in descending order. If A is the list of files, we can apply this algorithm:

B := list of files to copy to current floppy disk (initially empty)
remaining_size := 1.44 MB
For file in A:
    If remaining_size - file.size >= 0:
        B.add(file)
        remaining_size := remaining_size - file.size
        A.remove(file)
Copy all files listed in B to disk
Empty B
Repeat process for remaining files in A


Although there are better algorithms than the one above, this is the one I managed to come up with during the interview.
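The procedure can be sketched in R as follows. This is only an illustration of the greedy idea under the stated assumptions (sizes in MB, no file larger than one disk); `pack_files` and the sample sizes are hypothetical names and values, not part of the original interview answer.

```r
# Sketch of the greedy procedure: fill one disk at a time, scanning the
# remaining files from largest to smallest. pack_files() is a hypothetical name.
pack_files <- function(sizes, capacity = 1.44) {
  stopifnot(all(sizes <= capacity))             # assumption from the text
  remaining <- sort(sizes, decreasing = TRUE)   # list A, largest first
  disks <- list()
  while (length(remaining) > 0) {
    free <- capacity
    chosen <- logical(length(remaining))        # marks files going to list B
    for (i in seq_along(remaining)) {
      # Small tolerance guards against floating-point round-off.
      if (free - remaining[i] >= -1e-9) {
        chosen[i] <- TRUE
        free <- free - remaining[i]
      }
    }
    disks[[length(disks) + 1]] <- remaining[chosen]
    remaining <- remaining[!chosen]
  }
  disks
}

sizes <- c(1.0, 0.2, 0.1, 0.1, 0.7, 0.5)   # made-up file sizes in MB
disks <- pack_files(sizes)
print(length(disks))
print(disks)
```

Each pass of the `while` loop corresponds to filling one floppy disk, and each evaluation of the `if` condition is one comparison in the complexity analysis.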

We now need to determine how fast our algorithm can run.

## Worst Case Complexity

How slow can this algorithm get? If for any two files $F_i$ and $F_j$ in A we have $F_i + F_j > 1.44$, then every file will have its own diskette. In this case, each time it fills a disk, our algorithm executes step 4 once for every file still in A. For the first disk, it executes the step $N$ times. For the second disk, it executes the step $N-1$ times, for the third disk $N-2$ times, and so on. The total number of times it executes step 4 is the total number of comparisons and is equal to the summation:

$\displaystyle \sum_{i=1}^{N} i$

which is equal to

$\displaystyle \frac{N(N+1)}{2}$

Therefore, in the worst case, the complexity is $O(N^2)$.
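The count can be confirmed empirically. The sketch below re-implements the greedy loop only to count how often the size check (step 4) runs; `count_comparisons` is a hypothetical helper, and the file sizes are chosen to trigger the worst and best cases.

```r
# Sketch: count how many times the size check (step 4) runs.
count_comparisons <- function(sizes, capacity = 1.44) {
  stopifnot(all(sizes <= capacity))
  remaining <- sort(sizes, decreasing = TRUE)
  comparisons <- 0
  while (length(remaining) > 0) {
    free <- capacity
    keep <- logical(0)                 # TRUE for files left for a later disk
    for (s in remaining) {
      comparisons <- comparisons + 1
      fits <- (free - s >= -1e-9)
      if (fits) free <- free - s
      keep <- c(keep, !fits)
    }
    remaining <- remaining[keep]
  }
  comparisons
}

N <- 10
worst <- rep(1.0, N)   # any two files together exceed 1.44 MB
best  <- rep(0.1, N)   # all files fit on one disk
print(count_comparisons(worst))   # N * (N + 1) / 2 = 55
print(count_comparisons(best))    # N = 10
```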

## Best Case Complexity

The best case is when all files fit in just one diskette. For this, the total number of comparisons is $N$.

## Average Case Complexity

On the average, files have different sizes. We now compute the complexity on the assumption that the probability distribution is uniform.

If $k$ is the number of diskettes, the number of comparisons is a sequence of monotonic decreasing numbers $\{ a_1, a_2, a_3, \ldots, a_k \}$ taken at random from the set $\{ 1, 2, \ldots, N\}$. Each of the numbers $a_j$, $j\in \{1, 2, \ldots, k\}$ has a probability $1/N$ of being chosen. Let $X$ be a random variable such that

$\displaystyle Pr(X=a_j) = \frac{1}{N} \text{ for } j=1,2,\ldots,k$

then the number of comparisons $C$ is equal to

$\displaystyle C = \sum_{i=1}^k X = kX$

The expected value of $C$ is given by

$E[C] = E[kX] = kE[X]$

However, the expected value of X is given by

$\displaystyle E[X] = \sum_{j=1}^N j\cdot Pr(X=j) = \frac{1}{N} \sum_{j=1}^N j = \frac{1}{N}\frac{N(N+1)}{2} = \frac{N+1}{2}$

Therefore,

$\displaystyle E[C] = k\frac{N+1}{2}$

What remains is to determine the average value of $k$, which is the number of diskettes. If $M=1.44$ is the maximum file size, the average file size is $M/2$. The average total file size is then $NM/2$. The average number of diskettes is equal to the average total size divided by size of diskette, that is

$k = \displaystyle \frac{NM}{2}\frac{1}{M} = \frac{N}{2}$

This means that

$\displaystyle E[C] = \frac{N}{2} \frac{N+1}{2} = O(N^2)$

which is the same as the worst case complexity.

There is another way to solve this problem using Integer Programming.

## Asynchronous Versus Synchronous: A Performance Perspective

I just had my coffee from my favorite coffee shop down the corner. I’m such a regular at that coffee shop that the baristas know me, so I’m able to sit down after I order and do something else while they make my coffee and serve it to me when it’s ready. If I order from other branches of this coffee shop, I have to wait for my coffee in a corner, unable to do anything productive. So while I was seated comfortably waiting for my coffee, a thought came to me: this is a great example of synchronous versus asynchronous service invocation.

A synchronous service invocation is just like buying your coffee and having to wait in the corner for the baristas to finish it before you can even sit down and do something useful. Asynchronous service invocation is getting seated and doing something else while waiting for your coffee to be served. My instinct tells me that asynchronous invocation is better than synchronous in terms of performance, especially when the service takes a long time to complete and the waiting time is better spent doing something else. But how can we show that this is the case?

We can model the synchronous/asynchronous invocation as a Markov Chain and compute a few metrics from our model. If you’re not familiar with this approach, you can refer to this article MODELING QUEUING SYSTEMS USING MARKOV CHAINS.

First, we model the asynchronous invocation. We can identify each state of our system by  2 slots each of which contains a number. The first slot is the number of responses waiting for the client to process and the second slot is the number of requests waiting for the server to process.  To simplify the modeling, we assume the following:

1. The maximum number of requests that a server can handle at any given time is 2. This means that a client can no longer send a request if the second number is equal to 2.
2. When the server responds, the first number increments by 1. Once a client receives a response, it will stop whatever it is doing and process that response. When the client finishes processing the response, the first number goes down to zero. As a consequence, the maximum value of slot number 1 is 1 at any given time.

With these assumptions, we can draw the Markov Diagram of the asynchronous system to come up with the below diagram:

Explanation of the Diagram

The system initially starts at 00, where there are no requests and no responses. It transitions to state 01 when the client sends a request, which increments the server’s pending requests to 1. From this state, the system can transition to either state 02 or state 10. State 02 occurs when the client sends another request without waiting for a response. State 10 occurs when the server processes the request and returns a response, thereby decrementing its pending items and incrementing the client’s number of responses to process.

At state 02, the client can no longer send requests since the maximum number of requests that the server can handle is 2. From this state, the system can only transition to state 11, where the server has processed one of the requests (thus decrementing the second slot from 2 to 1 and incrementing the first slot from 0 to 1).

State 11 can only transition to 01 since the client will stop whatever it is doing to process the single response it has received (thus decrementing the first slot from 1 to 0).

The numbers $a,b,c$ are the rates at which the transitions will occur. For example, the rate at which the system will transition from state 00 to state 01 is $b$.

In the remainder of this article, we assume that the client can request at the rate of 60 requests per unit time and can process the response at 60 per unit time. We also assume that the server can process requests at 30 per unit time. We therefore have the following:

$a=60, b=60, c=30$

Computing the Probabilities

At steady state, the flow going into each state is equal to the flow out of that state. For example, for state 00, the following is true:

$\displaystyle bP_{00} = aP_{10}$

Doing this for all states, we get the following balance equations:

$\begin{array}{rlr} \displaystyle bP_{00} &= aP_{10} & (1)\\ bP_{00} + a P_{11} &= bP_{01} + cP_{01} & (2)\\ bP_{01} &= cP_{02} & (3)\\ aP_{10} &= cP_{01} & (4)\\ aP_{11} &= cP_{02} & (5) \end{array}$

Since the sum of the probabilities should be equal to 1, we have our last equation:

$P_{00} + P_{01} + P_{02} + P_{10} + P_{11} = 1$

We have 6 equations in 5 unknowns, so one of the equations above is redundant. In fact, you can show that equation 2 is redundant: it follows from adding equations 1 and 5 and then substituting equations 3 and 4.

We can form the matrix equation of the system of equations above and solve for the probabilities:

$\begin{bmatrix} -b & 0 & 0 & a & 0 \\ 0 & b & -c & 0 & 0 \\ 0 & -c & 0 & a & 0 \\ 0 & 0 & -c & 0 & a \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 1 \end{bmatrix}$

Solving this matrix equation when a=60, b=60 and c=30, we can find the probabilities:

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \displaystyle \frac{1}{ab^2+abc+ac^2+b^2c+bc^2} \begin{bmatrix} ac^2\\ abc\\ ab^2\\ bc^2\\ b^2c \end{bmatrix} = \begin{bmatrix} 1/10 \\ 1/5 \\2/5\\ 1/10\\ 1/5 \end{bmatrix}$
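The same probabilities can be obtained numerically. Here is a minimal sketch using base R's `solve()`, assuming the rates $a=60$, $b=60$, $c=30$ from the text; the names `rate_a`, `rate_b`, and `rate_c` are illustrative.

```r
# Rates from the text: a = client processing, b = client requests, c = server.
rate_a <- 60; rate_b <- 60; rate_c <- 30

# Coefficient matrix: balance equations (1), (3), (4), (5) plus normalization.
A <- matrix(c(-rate_b,       0,       0, rate_a,      0,
                    0,  rate_b, -rate_c,      0,      0,
                    0, -rate_c,       0, rate_a,      0,
                    0,       0, -rate_c,      0, rate_a,
                    1,       1,       1,      1,      1),
            nrow = 5, byrow = TRUE)
rhs <- c(0, 0, 0, 0, 1)

# Solve A %*% P = rhs for the steady-state probabilities.
P <- solve(A, rhs)
names(P) <- c("P00", "P01", "P02", "P10", "P11")
print(P)   # 0.1 0.2 0.4 0.1 0.2
```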

Utilization and Throughput: Asynchronous Case

The client utilization is defined to be the probability that the client is busy. In the same way, the server utilization is the probability that the server is busy. Looking at the diagram, the client is busy sending requests at state 00 and state 01. It is busy processing responses at state 10 and 11.  On the other hand, the server is busy processing requests at state 01 and 02.

Therefore, the client utilization is  equal to

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{01} + P_{10} + P_{11}\\ &= 1/10 + 1/5 + 1/10 + 1/5\\ &= 3/5\\ &= 60\% \end{array}$

The server utilization is equal to

$\begin{array}{rl} U_{\text{server}} &= P_{01} + P_{02}\\ &= 1/5 + 2/5 \\ &= 3/5 \\ &= 60\% \end{array}$

The system throughput is the number of requests the client is able to submit per unit time and is equal to

$\begin{array}{rl} X &= 60P_{00} + 60 P_{01} \\ &= 60\cdot 1/10 + 60 \cdot 1/5 \\ &= 18 \end{array}$

Comparison with Synchronous Invocation

For the synchronous invocation, the client will submit a request at state 00. The server will then receive this request and process it immediately at state 01. The client will get the response and process it at state 10 and do the loop again at state 00. Please see the Markov diagram below describing this process.

We can find the probabilities of this Markov Chain by solving the balance equations

$\begin{bmatrix} b & 0 & -a \\ 0 & c & -a \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \begin{bmatrix} 0\\0\\1 \end{bmatrix}$

Solving for the probabilities, we get

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \displaystyle \frac{1}{ab + ac + bc} \begin{bmatrix} ac\\ ab\\ bc \end{bmatrix} = \begin{bmatrix} 1/4\\ 1/2\\ 1/4 \end{bmatrix}$

At state 00, the client is busy sending requests at the rate of 60 requests per unit time, therefore the system throughput is

$\begin{array}{rl} X &= 60 P_{00} \\ &= 60/4 \\ &= 15 \end{array}$

which is less than the Asynchronous case. In addition, the client utilization is

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{10}\\ &= 1/4 + 1/4\\ &= 1/2 \end{array}$

This is lower than the Asynchronous case where the client utilization is 60%.
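The synchronous numbers can be reproduced with a short R sketch, again assuming $a=60$, $b=60$, $c=30$; the variable names are illustrative.

```r
# Rates from the text: a = client processing, b = client requests, c = server.
rate_a <- 60; rate_b <- 60; rate_c <- 30

# Balance equations b*P00 = a*P10 and c*P01 = a*P10, plus normalization.
A <- matrix(c(rate_b,      0, -rate_a,
                   0, rate_c, -rate_a,
                   1,      1,       1),
            nrow = 3, byrow = TRUE)
P <- solve(A, c(0, 0, 1))
names(P) <- c("P00", "P01", "P10")
print(P)             # 0.25 0.50 0.25

# Throughput: requests submitted per unit time while in state 00.
throughput <- rate_b * P[["P00"]]
# Client utilization: the client is busy in states 00 and 10.
client_util <- P[["P00"]] + P[["P10"]]
print(throughput)    # 15
print(client_util)   # 0.5
```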

Therefore, the asynchronous service invocation is indeed better than the synchronous case in this model. It has a higher throughput (18 versus 15 requests per unit time) and a higher client utilization (60% versus 50%).

P.S. I used Mathics to solve the matrix equations. Mathics is an open source alternative to Mathematica. It features Mathematica-compatible syntax and functions so if you’re a user of Mathematica you can run basic commands in Mathics. If you’re a Mathics user, you can easily switch to Mathematica as well. Please visit http://mathics.github.io/ to know more about Mathics.

Now that we know the basics of the Birthday Problem, we can use this knowledge to understand the security of password hashing.

In the early days, passwords were stored on the server “as-is”. This means that if your username was juan and your password was Password123!, then that information was stored on the server like this:

juan,Password123!

Since the theft of a password file is hard to prevent, passwords are no longer stored “as-is” (also known as clear-text). Instead, the server applies an algorithm to the original password, which outputs a text called a hash. The algorithm is called a hash function. The hash is what’s put in the password file. A thief in possession of the password file will not be able to know the original password just by looking at it.

For example, the information above will now look like this:

juan,2c103f2c4ed1e59c0b4e2e01821770fa

where “2c103f2c4ed1e59c0b4e2e01821770fa” is the hash value of the password “Password123!“.

The hash function I’m using is called the MD5 hash function. Given a password, it will produce a hash value. The set of all hash values is not infinite. In fact, the number of possible hash values is $2^{128}$ for md5. Due to this restriction, the birthday paradox will apply.
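For readers who want to compute an MD5 hash themselves, here is a small sketch that relies only on `tools::md5sum()` from base R. That function hashes files, so the hypothetical helper `md5_of_string` first writes the password to a temporary file without a trailing newline. Whether the output matches the hash shown above depends on it having been computed over exactly the same bytes.

```r
# Sketch: MD5 of a string using only base R's 'tools' package.
# md5_of_string() is a hypothetical helper name.
md5_of_string <- function(s) {
  tmp <- tempfile()
  on.exit(unlink(tmp))            # remove the temporary file afterwards
  writeBin(charToRaw(s), tmp)     # raw bytes, no trailing newline
  unname(tools::md5sum(tmp))
}

h <- md5_of_string("Password123!")
print(h)
print(nchar(h))   # 32 hex characters = 128 bits
```

The 32 hexadecimal characters encode 128 bits, which is where the $2^{128}$ possible hash values come from.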

The birthday paradox tells us that given a hash function $f(x)$ with $N$ possible hash values, the probability that at least two out of $k$ passwords hash to the same value is given by:

$\displaystyle 1-\frac{N\times (N-1)\times (N-2)\times \ldots \times (N-k+1)}{N^k}$

Since the MD5 hash function has $N=2^{128}$ possible values, the probability that at least two of $k$ passwords hash to the same value is

$\displaystyle 1-\frac{2^{128}\times (2^{128}-1)\times (2^{128}-2)\times \ldots \times (2^{128}-k+1)}{(2^{128})^k}$

We want to compute $k$ so that this probability is at least 50%:

$\displaystyle 1-\frac{2^{128}\times (2^{128}-1)\times (2^{128}-2)\times \ldots \times (2^{128}-k+1)}{(2^{128})^k} \ge 0.5$

which is equivalent to

$\displaystyle \frac{2^{128}\times (2^{128}-1)\times (2^{128}-2)\times \ldots \times (2^{128}-k+1)}{(2^{128})^k} \le 0.5$

Computing for $k$ when N is large is hard so we need to approximate this. To that end, we need some tools to help us.

We can write the probability in the following way:

$\displaystyle 1-\frac{N}{N}\times\frac{N-1}{N}\times\frac{N-2}{N}\times\frac{N-3}{N}\times\ldots\times\frac{N-k+1}{N}$
$= \displaystyle 1-\frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

Since $N$ is large compared to $k$, the quantities

$\displaystyle \frac{1}{N}, \frac{2}{N}, \frac{3}{N}, \ldots, \frac{k-1}{N}$

are very small. Because of this, we can use the approximation

$e^{-x} \approx 1-x$

The above approximation comes from the Taylor expansion of $e^{-x}$:

$\displaystyle e^{-x} = 1 - x + \frac{x^2}{2!} - \frac{x^3}{3!} + \frac{x^4}{4!} - \ldots$

If $x$ is small, the higher order terms $x^2, x^3, x^4, \ldots$ are negligible. Using this approximation, we can write the probability that no two passwords hash to the same value as:

$\displaystyle \frac{N}{N}\times (1-\frac{1}{N})\times (1-\frac{2}{N})\times (1-\frac{3}{N}) \times\ldots\times (1-\frac{k-1}{N})$

$\displaystyle \approx e^{-\frac{1}{N}}\cdot e^{-\frac{2}{N}}\cdot e^{-\frac{3}{N}}\cdot \ldots\cdot e^{-\frac{k-1}{N}}$

$\displaystyle = e^{-\frac{1+2+3+4+\ldots + (k-1)}{N}}$

Since

$\displaystyle \sum_1^n j = 1+2+3+4+ \ldots + n = \frac{n(n+1)}{2}$

we have, with $n = k-1$,

$e^{-\frac{1+2+3+4+\ldots + (k-1)}{N}} = e^{-k\cdot (k-1)/2N }$
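To get a feel for how good the exponential approximation is, the sketch below compares the exact product with $e^{-k(k-1)/2N}$. It assumes the classic birthday setting $N=365$, $k=23$ purely as a small test case, since the exact product is impractical to evaluate for $N=2^{128}$.

```r
# Sketch: exact no-collision probability versus the exponential approximation.
N <- 365   # number of possible values (days in a year, as a small test case)
k <- 23    # number of items drawn

exact  <- prod(1 - (0:(k - 1)) / N)        # (N/N) * (1 - 1/N) * ... * (1 - (k-1)/N)
approx <- exp(-k * (k - 1) / (2 * N))      # e^{-k(k-1)/2N}

print(exact)                 # about 0.493
print(approx)                # about 0.500
print(abs(exact - approx))   # the gap is below 0.01
```

The gap shrinks further as $N$ grows relative to $k$, which is the regime that matters for hash functions.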

Computing for k

Let’s compute k so that

$\displaystyle e^{-k\cdot (k-1)/2N} < 0.5$

Taking the natural logarithms of both sides

$\displaystyle \ln e^{-k\cdot (k-1)/2N} < \ln 0.5$

$\displaystyle \frac{-k\cdot (k-1)}{2N} < \ln 0.5$

$\displaystyle k^2 - k + 2N\ln 0.5 > 0$

Using the quadratic formula and keeping the positive root, we can solve for k:

$\displaystyle k > \frac{-(-1) + \sqrt{(-1)^2 -4(1)(2N\ln 0.5)}}{2}$
$\displaystyle k > \frac{1 + \sqrt{1-8N\ln 0.5}}{2}$

When $N=2^{128}$, we have

$\displaystyle k > \frac{1 + 4.343876\times 10^{19}}{2} \approx 2.2\times 10^{19}$

This is about 22 quintillion. What this means is that when $k$ exceeds roughly $2.2\times 10^{19}$, there is already a 50% chance that 2 passwords hash to the same value. In fact, collisions for MD5 were already found in 2004.
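The threshold is easy to evaluate in R. The sketch below implements the positive root of the quadratic, $k > \frac{1+\sqrt{1-8N\ln(1-p)}}{2}$, for a target collision probability $p$; `collision_threshold` is a hypothetical helper name, and the $N=365$ call is included only as a sanity check against the classic birthday result.

```r
# Sketch: smallest k (approximately) giving collision probability p among N values.
# Derived from exp(-k(k-1)/(2N)) < 1 - p; collision_threshold() is a hypothetical name.
collision_threshold <- function(N, p = 0.5) {
  (1 + sqrt(1 - 8 * N * log(1 - p))) / 2
}

print(collision_threshold(2^128))          # about 2.17e19 for MD5-sized output
print(ceiling(collision_threshold(365)))   # 23, the classic birthday answer
```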

There are only 365 days in a year (excluding leap year). Given that there are about 7.4 billion people on earth, this means that there are approximately 20 million people with the same birthday on any given day. You just divide 7,400,000,000 by 365 and you get 20 million. Happy Birthday to all 20 million people celebrating their birthday today!

Suppose you’re in a crowd, on a bus, in a restaurant, or stadium. There is a big chance you might be standing next to a person with the same birthday as you.

In fact, you only need about 23 people to have a 50/50 chance of two people having the same birthday! This may sound unbelievable since there are 365 days in a year but you only need 23 people to have a 50% chance of 2 people with the same birthday. How come?

This is called the Birthday Paradox and is very important in digital security, especially password security.
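The 23-person claim can be checked directly. This sketch assumes 365 equally likely birthdays and computes one minus the probability that everyone's birthday is different; `birthday_collision` is a hypothetical helper name.

```r
# Sketch: probability that at least two of k people share a birthday,
# assuming N equally likely birthdays.
birthday_collision <- function(k, N = 365) {
  1 - prod((N - 0:(k - 1)) / N)
}

print(birthday_collision(22))   # about 0.476, still below 50%
print(birthday_collision(23))   # about 0.507, just over 50%
```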

Basic Counting

Probability is all about counting the possibilities. Let’s make it simple by using a dice as an example. We all know what a dice looks like.

When a balanced dice is thrown, it can land showing any one of its six sides. We refer to the result of throwing a dice as an outcome and we say that a dice has 6 possible outcomes. If a dice is balanced, every side is equally likely to show up. We define the probability of a face showing up as the number of times that face occurs in the possible outcomes divided by the total number of possible outcomes. For example, out of the 6 possible outcomes, the number “1” occurs only once. Since there are 6 possible outcomes, the probability of getting a 1 is, therefore:

$\displaystyle \text{Probability of getting a "1"} = 1/6$

Let’s add a second dice. To identify our two dice, let’s call one of them Dice A and the other Dice B. Let’s throw the dice together. When they land, dice A and dice B will show numbers. For this scenario, an outcome is now defined as the numbers that Dice A and Dice B show when they land. A possible outcome is Dice A shows a 1 and Dice B shows a 2. We can give this outcome a name and call it 1,2. We should remind ourselves that the first number is the result of Dice A and the second number is the result of Dice B. We can also refer to each outcome as a combination.

Here are the possible outcomes that the two dice will show:

If you count the number of combinations above, you’ll get 36. The reason it’s 36 is because dice A has 6 different outcomes and dice B has 6 different outcomes. Multiplying them together gives $6 \times 6=6^2 = 36$.

If you add a third dice, say dice C, the total number of combinations becomes:

$\displaystyle 6^3 = 216$.

In general, for N dice, the total number of combinations is

$\displaystyle 6^N$

How many combinations have at least 2 numbers the same?

Since there are only 2 numbers for each combination, this question is also the same as “How many combinations show the same numbers?”. If you look at the diagonal, these are the combinations that have the same number for Dice A and Dice B.

If you count them, you’ll get 6. Therefore, the probability of getting at least two equal numbers (in our 2-Dice system) is

6/36

How many combinations show different numbers?

If you count all combinations outside the diagonal, you’ll get 30. Therefore, the probability of getting two different numbers is

30/36

Notice that the probability of getting at least 2 numbers the same PLUS the probability of getting different numbers is equal to 1:

6/36 + 30/36 = 36/36 = 1
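The counts on and off the diagonal can be checked with a short R sketch. The variable names here are assumptions chosen for illustration.

```r
# All 36 outcomes of the 2-Dice system
faces <- 1:6
outcomes <- expand.grid(A = faces, B = faces)

# Outcomes on the diagonal (same number) and off the diagonal (different numbers)
same <- sum(outcomes$A == outcomes$B)
different <- sum(outcomes$A != outcomes$B)
total <- nrow(outcomes)

c(same = same, different = different, total = total)

# The two probabilities add up to 1
same / total + different / total
```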

Knowing One gives you the other

If we know the probability of getting different numbers (30/36), then we can compute the probability of getting at least 2 numbers the same simply by subtracting it from 1:

$\displaystyle \text{probability of getting at least 2 numbers same} = 1-30/36 = 1/6 = 0.167$

Avoid counting manually

When we counted the number of combinations which show different numbers, we counted it with our fingers. There is another way to count which is by doing it mentally. Since we are counting the number of ways that the 2-Dice system will show different numbers, we start by getting Dice A and asking how many different ways Dice A can land so that the number it shows is not equal to the number shown by Dice B. Since we have not yet thrown Dice B, then Dice A is allowed to show any number when it lands. This means there are 6 possible ways for Dice A to do this.

Number of ways Dice A can land = 6

Whatever number results in throwing Dice A, we cannot allow Dice B to have that number. This means that Dice B can only choose from 5 other numbers different from the one chosen by Dice A.

Number of ways Dice B can land = 5

If we multiply them, we get the number of combinations that Dice A and Dice B can land with different numbers:

6*5 = 30

This agrees with our manual counting.

At this point, pause and take note that the probability of getting at least 2 numbers the same for a 2-Dice system is 0.167. If we add more dice, this probability will increase. The question then is

How many dice do we need to throw so that the probability of getting 2 dice showing the same number is at least 50%?

Our 2-Dice example above shows that the probability of at least 2 dice showing the same number is 0.167, which is less than 50%. Let’s add a third dice and compute the probability.

How to compute the probability?

Let’s follow the pattern for the 2-Dice system. Since there are now 3 dice, the number of ways to get all numbers different is:

6*5*4

The total number of combinations of a 3-Dice system is

$\displaystyle 6^3$

Therefore, the probability of getting at least 2 dice with the same number is

$\displaystyle 1- \frac{6\times 5\times 4}{6^3} = 0.444$

This is still less than 50%.

Let’s now add a 4th dice and compute the probability using the same pattern:

$\displaystyle 1- \frac{6\times 5\times 4\times 3}{6^4} = 0.722$

This is greater than 50%! So the answer is that we need to throw 4 dice for the probability of getting at least 2 dice with the same number to be at least 50%.

The general formula for the probability for a k-Dice system is:

$\displaystyle 1- \frac{ 6\times 5\times \ldots \times (6-k+1)}{6^k}$
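The general formula translates almost directly into R. This is a sketch, with `p_same` and its `faces` argument being illustrative names rather than anything standard.

```r
# Probability that at least two of k dice show the same number.
p_same <- function(k, faces = 6) {
  # faces * (faces - 1) * ... * (faces - k + 1) ways to get all numbers different
  all_different <- prod(seq(faces, faces - k + 1)) / faces^k
  1 - all_different
}

# Probabilities for 2, 3 and 4 dice
round(sapply(2:4, p_same), 3)
# 0.167 0.444 0.722
```

The three values agree with the ones computed by hand above, and 4 is the first k for which the probability exceeds 50%.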

How does this relate to the Birthday Problem?

Now that we have the foundations, it’s easy to translate Dice to people and numbers to birthdays. In our dice example, there are 6 different numbers (faces) per dice. Translating this to birthdays, each person can have 365 possible birthdays since there are 365 days in a year (not including leap year).

This is the analogy:

Dice -> 6 possible faces
Person -> 365 possible birthdays

We want to compute how many random persons we need so that the probability of at least two persons having the same birthday is at least 50%. Let k be the number of random persons. Following the same pattern as the Dice example, the formula to compute the probability, given k persons, is:

$\displaystyle \text{Probability of at least 2 persons with the same birthday} = 1-\frac{365 \times 364 \times 363 \times \ldots \times (365-k+1)}{365^k}$

If we compute this probability for k=1 to k=30, we can construct the following table:

k  probability
1  0.000000000
2  0.002739726
3  0.008204166
4  0.016355912
5  0.027135574
6  0.040462484
7  0.056235703
8  0.074335292
9  0.094623834
10 0.116948178
11 0.141141378
12 0.167024789
13 0.194410275
14 0.223102512
15 0.252901320
16 0.283604005
17 0.315007665
18 0.346911418
19 0.379118526
20 0.411438384
21 0.443688335
22 0.475695308
23 0.507297234
24 0.538344258
25 0.568699704
26 0.598240820
27 0.626859282
28 0.654461472
29 0.680968537
30 0.706316243
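A table like this one can be reproduced with a few lines of R. This is a sketch, not necessarily the code used for the original table; `p_shared` and `days` are illustrative names.

```r
# Probability that at least two of k persons share a birthday.
p_shared <- function(k, days = 365) {
  # Multiply the ratios 365/365, 364/365, ... to avoid very large numbers
  1 - prod(seq(days, days - k + 1) / days)
}

prob <- sapply(1:30, p_shared)
data.frame(k = 1:30, probability = prob)

# Smallest number of persons with a probability of at least 50%
min(which(prob >= 0.5))
# 23
```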


Below is the graph of the same data, where we mark the number of persons at which the probability first reaches 50%. With 23 persons, there is already a better than 50% chance that at least 2 of them have the same birthday!

## When Average Is Not Enough: Thoughts on Designing for Capacity

Designing a system from scratch to handle a workload you don’t know is a challenge. If you put in too much hardware, you might be wasting money. If you put in too little, your users will complain about how slow the system is.

If you’re given only a rate, like 6000 hits/hour, you don’t know how these hits are distributed minute by minute or second by second. We can make a guess and say that there are about 100 hits per minute or 1.67 hits/sec. If hits came uniformly at that rate, then we could design a system that can handle 2 hits/sec and all users would be happy, since every request would be served quickly with no queueing. But we know it’s not going to happen. There will be some intervals where the number of hits is less than 2 and some where it is more than 2.

Theoretically, requests to our server come randomly. Let’s imagine 60 bins, one for each second in a minute. We also imagine that requests are balls we throw into the bins, 100 balls for the 100 hits in that minute. A ball is equally likely to land in any bin. It’s even possible that all balls land in a single bin!

After throwing the balls into bins, let’s see what we have.

As you can see, some bins have more than 2 balls (which is the average number of balls in a bin). Therefore if we design our system based on the average, 50% of our users will have a great experience while the other 50% will have a bad experience. Therefore we need to find how many requests per second our server needs to handle so that our users will have a good experience (without overspending).
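As a rough check of the 50% figure, here is a small simulation sketch. It assumes 100 requests per minute, 60 one-second bins and a system sized for 2 requests per second; the function name, the seed and the number of repetitions are arbitrary choices.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Share of requests that land in a second holding more than `capacity` requests
share_in_busy_seconds <- function(hits = 100, bins = 60, capacity = 2) {
  counts <- tabulate(sample(bins, hits, replace = TRUE), nbins = bins)
  sum(counts[counts > capacity]) / hits
}

# Repeat the one-minute experiment many times and average the share
shares <- replicate(2000, share_in_busy_seconds())
mean(shares)
```

Under these assumptions the average share comes out close to one half, which is in line with the estimate above.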

To determine how many requests per second we need to support, we need the probability of getting 3, 4, 5, 6 or more requests per second. We will compute the probability starting from 3 requests per second and increment by one until we get a low enough probability. If we design the system for a rate that has a very low probability, we are going to spend money on something that rarely occurs.

Computing the Probability Distribution

We can view the distribution of balls into bins in another way. Imagine labeling each ball with a number from 1 to 60. Each number has an equal chance to be picked. The meaning of this labeling is this: the number that was assigned to the ball is the bin (time bucket) it belongs to. After labeling all balls, what you have is a distribution of balls into bins.

Since each ball can be labeled in 60 different ways and there are 100 balls, the number of ways we can label 100 different balls is therefore

$\displaystyle 60^{100}$

Pick a number from 1-60, say number 1. Assume 2 balls out of 100 are labeled with number 1. In how many ways can you do this? Choose the first ball to label. There are 100 ways to choose the ball. Choose the second ball. Now there are 99 ways to choose the second ball. We therefore have 100 × 99 = 9900 ways to select 2 balls and label them 1. Since we don’t really care in what order we picked the balls, we divide 9900 by the number of possible arrangements of ball 1 and ball 2, which is 2! (where the exclamation mark stands for “factorial”). So far, the number of ways to label 2 balls with the same number is

$\displaystyle \frac{100 \times 99}{2!}$

Since these are the only balls with label 1, the third ball can be labeled anything except number 1. In that case, there are 59 ways to label ball 3. In the same way, there are 59 ways to label ball 4. Continuing this reasoning until ball 100, the total ways we can label 2 balls with number 1 and the rest with anything else is therefore:

$\displaystyle \frac{100 \times 99}{2!} \times 59^{98}$

Notice that the exponent of 59 is 98 since there are 98 balls starting from ball 3 to ball 100.

Therefore, the probability that exactly two balls land in bin 1 (or in any other given bin) is

$\displaystyle \frac{100 \times 99}{2!} \times \frac{59^{98}}{60^{100}} = 0.2648$
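The arithmetic can be checked in R. This is a minimal sketch that uses the built-in `choose()` for the 100 × 99 / 2! term; the variable names are illustrative.

```r
m <- 100  # balls (requests in one minute)
n <- 60   # bins (seconds)
k <- 2    # balls carrying the chosen label

# Ways to pick the k balls, times ways to label the remaining m - k balls
ways <- choose(m, k) * (n - 1)^(m - k)

# Total number of ways to label all m balls
total <- n^m

ways / total
```

Both numbers are large but still fit in R's double-precision range, so the ratio can be computed directly.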

We can also write this as

$\displaystyle \frac{100!}{2! \times 98!} \times \frac{(60-1)^{98}}{60^{100}} = \binom{100}{2} \frac{(60-1)^{98}}{60^{100}}$

In general, if m is the number of balls, n the number of bins and k the number of balls with the same label, then the probability that a given bin contains exactly k balls is given by

$\displaystyle \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}},$

where

$\displaystyle \binom{m}{k} = \frac{m!}{k!(m-k)!}$

is the binomial coefficient.

It turns out that this is a probability distribution since the sum of all probabilities from k=0 to k=m is equal to 1, that is,

$\displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} = 1$

To see this, recall from the Binomial Theorem that

$\displaystyle \big( x + y \big)^m = \sum_{k=0}^{m} \binom{m}{k} x^{m-k}y^k$

If we let x=n-1 and y=1, the sum of the probabilities becomes

$\displaystyle \begin{array}{ll} \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} &= \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}\cdot 1^k}{n^{m}}\\ &= \displaystyle\frac{(n-1+1)^m}{n^{m}}\\ &= \displaystyle\frac{n^m}{n^m}\\ &= \displaystyle 1 \end{array}$
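The same identity can be checked numerically for m=100 and n=60. This is only a sanity check of the algebra, not a proof, and `p` here is an illustrative helper.

```r
# Probability that a given bin holds exactly k of m balls thrown into n bins
p <- function(k, m = 100, n = 60) {
  choose(m, k) * (n - 1)^(m - k) / n^m
}

# Sum over every possible k, from 0 to m
total <- sum(sapply(0:100, p))
all.equal(total, 1)
```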

Here is a graph of this probability distribution.

Here’s the plot data:

k  probability
1  0.315663315854
2  0.264836171776
3  0.146632456689
4  0.060268424995
5  0.019612775592
6  0.005263315484
7  0.001197945897
8  0.000236035950
9  0.000040895118
10 0.000006307552

We can see that for k=9, the probability of it occurring is about 0.004%. We can call anything beyond that rare, and there is no need to spend money on it.
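One way to turn these probabilities into a design number is to look at the chance that a second receives more requests than the system can handle. The sketch below assumes a tolerance threshold that you would choose yourself; `capacity_for` and the thresholds used are hypothetical.

```r
# Probability that a given second receives exactly k of the 100 requests
p <- function(k, m = 100, n = 60) {
  choose(m, k) * (n - 1)^(m - k) / n^m
}

k <- 0:100
# Probability that a given second receives more than k requests
tail_prob <- 1 - cumsum(p(k))

# Smallest capacity whose overload probability is below the threshold
capacity_for <- function(threshold) {
  min(k[tail_prob < threshold])
}

capacity_for(0.01)
capacity_for(0.001)
```

A stricter threshold gives a larger capacity, so the threshold is where the trade-off between user experience and cost is made.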

Just For Fun

What’s the probability that a given bin is empty, that is, there are no balls in it?

Other Probability Distributions

Our computation above assumed that each request is equally likely to arrive in any second (a uniform distribution over the bins). However, there are other distributions that are more suitable for the arrival of requests. One of the most widely used is the Poisson distribution.
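For a quick comparison, R has both distributions built in. The sketch below assumes a Poisson rate equal to the average of 100/60 requests per second; it relies on the fact that the formula above is the binomial distribution with 100 trials and success probability 1/60.

```r
k <- 0:10

# Balls-into-bins probabilities (binomial with 100 trials, probability 1/60)
exact <- dbinom(k, size = 100, prob = 1 / 60)

# Poisson probabilities with the same average rate
pois <- dpois(k, lambda = 100 / 60)

round(data.frame(k, exact, poisson = pois), 6)
```

For this workload the two columns are close to each other.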

R Code

The R code to generate the simulation:

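par(mfrow = c(4, 4))  # 4 x 4 grid of plots

# Throw 100 balls (requests) into 60 bins (seconds) and plot the counts
f <- function() {
  x <- sample(1:60, 100, replace = TRUE)
  # nbins = 60 keeps trailing empty bins in the plot
  plot(tabulate(x, nbins = 60), type = "h", ylab = "tx", xlab = "secs")
}

# Run 16 independent simulations
for (i in 1:16) {
  f()
}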


The R code to generate the probability distribution:

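# Probability that a given bin holds exactly k of m balls thrown into n bins
p <- function(m, n, k) {
  choose(m, k) * (n - 1)^(m - k) / n^m
}

# Probabilities for k = 1 to 10 with 100 balls and 60 bins
tt <- sapply(1:10, function(k) p(100, 60, k))
plot(tt, type = "h", xlab = "Number of Balls", ylab = "Probability")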


Dedication

This post is dedicated to my friend Ernesto Adorio, a mathematician. He loves combinatorial mathematics.

Rest in peace my friend! I miss bouncing ideas off you.