Have You Seen A Half-Balloon?

Balloons come in all shapes and sizes but the shape the comes to mind when I hear the word balloon is that of a spherical or round shape. Given a spherical balloon, create an imaginary partition that divides the balloon into left and right hemispheres. This effectively places the air molecules into left and right. The number of molecules in the left is more or less the same as the number of molecules on the right. This is what we see in real life.

As we pump air into the balloon, a given air molecule can randomly go left or right of the partition. If all air molecules for some reason go left (or right), then we have a situation called a “half-balloon”.

So why aren’t we seeing half-balloons? It’s because that situation is highly improbable. For illustration purposes, imagine there are 6 air molecules. Each molecule has two equally likely choices, be on the left or on the right. Now imagine each air molecule has chosen it’s “side”. A possible combination or configuration (a term we will use moving forward) will be 3 molecules choose Left and 3 choose Right:

L  L  L  R  R  R


Where L stands for Left and R stands for Right.

In fact, the number of such configurations is equal to

$\displaystyle \underbrace{2\times 2\times \ldots \times 2}_{\text{6 times}} = 2^6 = 64$

We enumerate below the list of possible air molecule configurations:


# This is the code in R
# x=c("L","R")
# expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
# d1=expand.grid(m1=x,m2=x,m3=x,m4=x,m5=x,m6=x)
m1 m2 m3 m4 m5 m6
1   L  L  L  L  L  L
2   R  L  L  L  L  L
3   L  R  L  L  L  L
4   R  R  L  L  L  L
5   L  L  R  L  L  L
6   R  L  R  L  L  L
7   L  R  R  L  L  L
8   R  R  R  L  L  L
9   L  L  L  R  L  L
10  R  L  L  R  L  L
11  L  R  L  R  L  L
12  R  R  L  R  L  L
13  L  L  R  R  L  L
14  R  L  R  R  L  L
15  L  R  R  R  L  L
16  R  R  R  R  L  L
17  L  L  L  L  R  L
18  R  L  L  L  R  L
19  L  R  L  L  R  L
20  R  R  L  L  R  L
21  L  L  R  L  R  L
22  R  L  R  L  R  L
23  L  R  R  L  R  L
24  R  R  R  L  R  L
25  L  L  L  R  R  L
26  R  L  L  R  R  L
27  L  R  L  R  R  L
28  R  R  L  R  R  L
29  L  L  R  R  R  L
30  R  L  R  R  R  L
31  L  R  R  R  R  L
32  R  R  R  R  R  L
33  L  L  L  L  L  R
34  R  L  L  L  L  R
35  L  R  L  L  L  R
36  R  R  L  L  L  R
37  L  L  R  L  L  R
38  R  L  R  L  L  R
39  L  R  R  L  L  R
40  R  R  R  L  L  R
41  L  L  L  R  L  R
42  R  L  L  R  L  R
43  L  R  L  R  L  R
44  R  R  L  R  L  R
45  L  L  R  R  L  R
46  R  L  R  R  L  R
47  L  R  R  R  L  R
48  R  R  R  R  L  R
49  L  L  L  L  R  R
50  R  L  L  L  R  R
51  L  R  L  L  R  R
52  R  R  L  L  R  R
53  L  L  R  L  R  R
54  R  L  R  L  R  R
55  L  R  R  L  R  R
56  R  R  R  L  R  R
57  L  L  L  R  R  R
58  R  L  L  R  R  R
59  L  R  L  R  R  R
60  R  R  L  R  R  R
61  L  L  R  R  R  R
62  R  L  R  R  R  R
63  L  R  R  R  R  R
64  R  R  R  R  R  R


Looking at the data above, we can see that there are 6 configurations that contain only 1 “L”:

# This is the code in R
# cc=c()
# for(i in 1:64){
#  tmp=d1[i,]
#  cc=c(cc,sum(tmp == "L"))
# }
# d1$count = cc # d1[d1$count == 1, ]
m1 m2 m3 m4 m5 m6 count
32  R  R  R  R  R  L     1
48  R  R  R  R  L  R     1
56  R  R  R  L  R  R     1
60  R  R  L  R  R  R     1
62  R  L  R  R  R  R     1
63  L  R  R  R  R  R     1


This means that the probability of getting a configuration having one molecule on the left and 5 molecules on the right is:

$\displaystyle \text{Probability of 1 molecule Left and 5 molecules Right} = \frac{6}{2^6} = 0.093750$

If we continue summarizing our data so that we get the number of combinations containing 2, 3, 4, 5, and 6 “L”s, we can generate what is known as a probability distribution of air molecules on the left hemisphere:

# cc=c()
# for(i in 0:6){
#  cc=c(cc,length(attributes(d1[d1$count == i, ])$row.names))
# }
# data.frame(X=0:6,count=cc,probability=cc/2^6)
X count probability
1 0     1    0.015625
2 1     6    0.093750
3 2    15    0.234375
4 3    20    0.312500
5 4    15    0.234375
6 5     6    0.093750
7 6     1    0.015625


The graph of this probability distribution is shown below:

# Here is the code that generated the plot above
barplot(cc/2^6,names.arg=0:6,main="Probability Distribution \nNumber of Air Molecules on the Left Hemisphere")


The graph above tells us that the most probable configuration of balloon molecules is that half of them are on the right and the other half on the left as shown by the high probability of the value X = 3. It also tells us the probability of all molecules choosing the Left side equal to 0.015625. When the number of air molecules is very large, this probability will turn out to be extremely small, as we will show below.

The Mathematics

At this point, manually counting the number of rows will not be practical anymore for a large number of molecules. We need to use some mathematical formula to compute the combinations. We don’t really care which molecules chose left or right. We just care about the number of molecules on the left. Given N molecules, there are $2^N$ possible configurations of air molecules. Of these configurations, there are

$\displaystyle {N \choose m } = \frac{N!}{m! (N-m)!}$

combinations that have $m$ molecules on the left. Therefore, the probability of a configuration having $m$ molecules on the left is

$P(m) = \displaystyle \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}$

This is a probability density function since

$\displaystyle \sum_{m=0}^N P(m) = \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N} = 1$

To show this, we will use the Binomial Theorem

$\displaystyle \sum_{m=0}^N {N \choose m} x^m = (1+x)^N$

If we let $x=1$, the Binomial Theorem gives us

$\displaystyle \sum_{m=0}^N {N \choose m} = (1+1)^N = 2^N$

Therefore

$\begin{array}{rl} \displaystyle \sum_{m=0}^N P(m) &= \displaystyle \sum_{m=0}^N \frac{\displaystyle {N\choose m}}{\displaystyle 2^N}\\ &= \displaystyle \frac{1}{2^N} \sum_{m=0}^N {N\choose m}\\ &= \displaystyle \frac{1}{2^N} 2^N\\ &= 1 \end{array}$

A Mole of Air

One mole of air contains $6.022 \times 10^{23}$. This means the probability of that all molecules are on the left side of the balloon is

$\displaystyle \frac{1}{\displaystyle 2^{6.022 \times 10^{23}}} < 1/1024^{10^{22}} < 1/(10^3)^{10^{22}}=10^{-30,000,000,000,000,000,000,000}$

This is a very small number such that it contains 30 million trillion zeros to the right of the decimal point. When you write this zeros on a piece of paper with a thickness of 0.05 millimeters, you would need to stack it up to a height 74 times the distance of earth to pluto in kilometers!

Asynchronous Versus Synchronous: A Performance Perspective

I just had my coffee from my favorite coffee shop down the corner. I’m a regular at that coffee shop that the baristas know me and I’m able to sit down after I order and do something else while they make my coffee and serve to me when ready.  If I order from other branches of this coffee shop, I would have to wait for my coffee in a corner and not being able to do anything productive.  So while I was seated comfortable waiting for my coffee, a thought came to me. This is a great example of synchronous versus asynchronous service invocation.

A synchronous service invocation is just like buying your coffee and having to wait in the corner for the baristas to complete your coffee before you can even sit down and do something useful. Asynchronous service invocation is having to get seated and do something else while waiting for your coffee to be served.  My instinct tells me that asynchronous invocation is better than synchronous in terms of performance especially when it takes a long for the service to complete and the waiting time is much better used doing something else. But how can we show that this is the case?

We can model the synchronous/asynchronous invocation as a Markov Chain and compute a few metrics from our model. If you’re not familiar with this approach, you can refer to this article MODELING QUEUING SYSTEMS USING MARKOV CHAINS.

First, we model the asynchronous invocation. We can identify each state of our system by  2 slots each of which contains a number. The first slot is the number of responses waiting for the client to process and the second slot is the number of requests waiting for the server to process.  To simplify the modeling, we assume the following:

1. The maximum number of requests that a server can handle at any given time is 2. Which means that a client cannot anymore send a request if the second number is equal to 2.
2. When the server responds, the first number increments by 1. Once a client receives a response, it will stop whatever it is doing and process that response. When the client finishes processing the response, the first number goes down to zero. As a consequence, this assumption says that the maximum value of slot number 1 is 1 at any given time.

With these assumptions, we can draw the Markov Diagram of the asynchronous system to come up with the below diagram:

Explanation of the Diagram

The system initially starts at 00 where there are no requests and no responses. It will then transition to state 01 when the client will send a  request which increments the server’s pending requests to 1. From this state, the system can transition to either state 02 or state 10. State 02 occurs when the client sends another request without waiting for a response. State 10 is when the server is able to process the request and return a response, thereby decrementing it’s pending items and incrementing the client’s number of responses to process.

At state 02, the client cannot anymore send requests since the maximum requests that the server can handle is 2. At this state, it can only transition to state 11, where the server has processed one of the requests (thus decrementing the second slot from 2 to 1 and incrementing the first slot from 0 to 1).

State 11 can only transition to 01 since the client will stop whatever it is doing to process the single response it has received (thus decrementing the first slot from 1 to 0).

The numbers $a,b,c$ are the rates at which the transitions will occur. For example, the rate at which the system will transition from state 00 to state 01 is $b$.

In the remainder of this article, we assume that the client can request at the rate of 60 requests per unit time and can process the response at 60 per unit time. We also assume that the server can process requests at 30 per unit time. We therefore have the following:

$a=60, b=60, c=30$

Computing the Probabilities

At steady state, the flow going into each state is equal to the flow out of that state. For example, for state 00, the following is true:

$\displaystyle bP_{00} = aP_{10}$

Doing this for all states, we get the following balance equations:

$\begin{array}{rlr} \displaystyle bP_{00} &= aP_{10} & (1)\\ bP_{00} + a P_{11} &= bP_{01} + cP_{01} & (2)\\ bP_{01} &= cP_{02} & (3)\\ aP_{10} &= cP_{01} & (4)\\ aP_{11} &= cP_{02} & (5) \end{array}$

Since the sum of the probabilities should be equal to 1, we have our last equation:

$P_{00} + P_{01} + P_{02} + P_{10} + P_{11} = 1$

We have 6 equations in 5 unknowns, one of the equations above is actually redundant. In fact, you can show that equation 2 above is redundant.

We can form the matrix equation of the system of equations above and solve for the probabilities:

$\begin{bmatrix} -b & 0 & 0 & a & 0 \\ 0 & b & -c & 0 & 0 \\ 0 & -c & 0 & a & 0 \\ 0 & 0 & -c & 0 & a \\ 1 & 1 & 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \begin{bmatrix} 0\\ 0\\ 0\\ 0\\ 1 \end{bmatrix}$

Solving this matrix equation when a=60, b=60 and c=30, we can find the probabilities:

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{02}\\ P_{10}\\ P_{11} \end{bmatrix} = \displaystyle \frac{1}{ab^2+abc+ac^2+b^2c+bc^2} \begin{bmatrix} ac^2\\ abc\\ ab^2\\ bc^2\\ b^2c \end{bmatrix} = \begin{bmatrix} 1/10 \\ 1/5 \\2/5\\ 1/10\\ 1/5 \end{bmatrix}$

Utilization and Throughput: Asynchronous Case

The client utilization is defined to be the probability that the client is busy. In the same way, the server utilization is the probability that the server is busy. Looking at the diagram, the client is busy sending requests at state 00 and state 01. It is busy processing responses at state 10 and 11.  On the other hand, the server is busy processing requests at state 01 and 02.

Therefore, the client utilization is  equal to

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{01} + P_{10} + P_{11}\\ &= 1/10 + 1/5 + 1/10 + 1/5\\ &= 3/5\\ &= 60\% \end{array}$

The server utilization is equal to

$\begin{array}{rl} U_{\text{server}} &= P_{01} + P_{02}\\ &= 1/5 + 2/5 \\ &= 3/5 \\ &= 60\% \end{array}$

The system througput is the number of requests the client is able to submit and is equal to

$\begin{array}{rl} X &= 60P_{00} + 60 P_{01} \\ &= 60*1/10 + 60 * 1/5 \\ &= 18 \end{array}$

Comparison with Synchronous Invocation

For the synchronous invocation, the client will submit a request at state 00. The server will then receive this request and process it immediately at state 01. The client will get the response and process it at state 10 and do the loop again at state 00. Please see the Markov diagram below describing this process.

We can solve the probabitlies of this Markov Chain by solving the balance equation

$\begin{bmatrix} b & 0 & -a \\ 0 & c & -a \\ 1 & 1 & 1 \end{bmatrix} \begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \begin{bmatrix} 0\\0\\1 \end{bmatrix}$

Solving for the probabilities, we get

$\begin{bmatrix} P_{00}\\ P_{01}\\ P_{10} \end{bmatrix} = \displaystyle \frac{1}{ab + ac + bc} \begin{bmatrix} ac\\ ab\\ bc \end{bmatrix} = \begin{bmatrix} 1/4\\ 1/2\\ 1/4 \end{bmatrix}$

At state 00, the client is busy sending requests at the rate of 60 requests per unit time, therefore the system throughput is

$\begin{array}{rl} X &= 60 P_{00} \\ &= 60/4 \\ &= 15 \end{array}$

which is less than the Asynchronous case. In addition, the client utilization is

$\begin{array}{rl} U_{\text{client}} &= P_{00} + P_{10}\\ &= 1/4 + 1/4\\ &= 1/2 \end{array}$

This is lower than the Asynchronous case where the client utilization is 60%.

Therefore, the asynchronous service invocation is indeed better compared to the synchronous case. It has a much higher throughput and a much higher utilization compared to the synchronous case.

P.S. I used Mathics to solve the matrix equations. Mathics is an open source alternative to Mathematica. It features Mathematica-compatible syntax and functions so if you’re a user of Mathematica you can run basic commands in Mathics. If you’re a Mathics user, you can easily switch to Mathematica as well. Please visit http://mathics.github.io/ to know more about Mathics.

When Average Is Not Enough: Thoughts on Designing for Capacity

Designing a system from scratch to handle a workload you don’t know is a challenge. If you put to much hardware, you might be wasting money. You put little, then your users will complain of how slow the system is.

If you’re given only a rate, like 6000 hits/hour, you don’t know how these are distributed in a minute by minute or per second interval. We can make a guess and say that there are about 100 hits per minute or 1.67 hits/sec. If hits come uniformly at that rate, then we can design a system that can handle 2 hits/sec and all users will be happy since all requests will be served quickly and no queueing of requests. But we know it’s not going to happen. There will be some interval where the number of hits is less than 3 and some more than 3.

Theoretically, requests to our server come randomly. Let’s imagine 60 bins represented by seconds in one minute. We also imagine that requests are like balls we throw into the bins. Each bin is equally likely to be landed by a ball. It’s possible that all balls land on only one bin!

After throwing the balls into bins, let’s see what we have.

As you can see, some bins have more than 2 balls (which is the average number of balls in a bin). Therefore if we design our system based on the average, 50% of our users will have a great experience while the other 50% will have a bad experience. Therefore we need to find how many requests per second our server needs to handle so that our users will have a good experience (without overspending).

To determine how many requests per second we need to support, we need to get the probability of getting 4, 5, 6 or more request per second. We will compute the probability starting from 3 requests per second and increment by one until we can get a low enough probability. If we design the system for a rate that has a low probability, we are going to spend money for something that rarely occurs.

Computing the Probability Distribution

We can view the distribution of balls into bins in another way. Imagine labeling each ball with a number from 1 to 60. Each number has an equal chance to be picked. The meaning of this labeling is this: the number that was assigned to the ball is the bin (time bucket) it belongs to. After labeling all balls, what you have is a distribution of balls into bins.

Since each ball can be labeled in 60 different ways and there are 100 balls, the number of ways we can label 100 different balls is therefore

$\displaystyle 60^{100}$

Pick a number from 1-60. Say number 1. Assume 2 balls out of 100 are labeled with number 1. In how many ways can you do this ? Choose the first ball to label. There are 100 ways to choose the ball. Choose the second ball. Now there are 99 ways to choose the second ball. We therefore have 990 ways to select 2 balls and label them 1. Since we don’t really care in what order we picked the ball, we divide 990 with the number of possible arrangements of ball 1 and ball 2, which is 2! (where the exclamation mark stands for “factorial”). So far, the number of ways to label 2 balls with the same number is

$\displaystyle \frac{100 \times 99}{2!}$

Since these are the only balls with label 1, the third ball can be labeled anything except number 1. In that case, there are 59 ways to label ball 3. In the same way, there are 59 ways to label ball 4. Continuing this reasoning until ball 100, the total ways we can label 2 balls with number 1 and the rest with anything else is therefore:

$\displaystyle \frac{100 \times 99}{2!} \times 59^{98}$

Notice that the exponent of 59 is 98 since there are 98 balls starting from ball 3 to ball 100.

Therefore, the probability of having two balls in the same bin is

$\displaystyle \frac{100 \times 99}{2!} \times \frac{59^{98}}{60^{100}} = 0.2648$

We can also write this as

$\displaystyle \frac{100!}{2! \times 98!} \times \frac{(60-1)^{98}}{60^{100}} = \binom{100}{2} \frac{(60-1)^{98}}{60^{100}}$

In general, if m is the number of balls, n the number of bins and k the number of balls with the same label, then the probability of having k balls within the same bin is given by

$\displaystyle \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}}$

,

where

$\displaystyle \binom{m}{k} = \frac{m!}{k!(m-k)!}$

is the binomial coefficient.

It turns out that this is a probability distribution since the sum of all probabilities from k=0 to k=m is equal to 1. that is

$\displaystyle \sum_{k=0}^{n} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} = 1$

To see this, recall from the Binomial Theorem that

$\displaystyle \big( x + y \big)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k}y^k$

If we let x=n-1 and y=1, we can write the above equation as

$\displaystyle \begin{array}{ll} \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}}{n^{m}} &= \displaystyle \sum_{k=0}^{m} \binom{m}{k} \frac{(n-1)^{m-k}\cdot 1^k}{n^{m}}\\ &= \displaystyle\frac{(n-1+1)^m}{n^{m}}\\ &= \displaystyle\frac{n^m}{n^m}\\ &= \displaystyle 1 \end{array}$

Here is a graph of this probability distribution.

Here’s the plot data:

 probability 1 0.315663315854 2 0.264836171776 3 0.146632456689 4 0.060268424995 5 0.019612775592 6 0.005263315484 7 0.001197945897 8 0.000236035950 9 0.000040895118 10 0.000006307552 

We can see that for k=9, the probability of it occurring is .004%. Anything beyond that we can call rare and no need to spend money with.

Just For Fun

What’s the probability that a given bin is empty, that is, there are no balls in it?

Other Probability Distributions

Our computation above was based on a uniform probability distribution. However, there are other distributions that are more suitable for arrival of requests. One of the most widely used is called the Poisson Distribution where you can read from here.

R Code

The R code to generate the simulation:

par(mfrow=c(4,4))
f=function(){
x=c()
for(i in 1:100){
x=c(x,sample(seq(1,60),1,replace=T))
}
plot(tabulate(x),type="h", ylab="tx", xlab="secs")
}

for(i in 1:16){
f()
}


The R code to generate the probability distribution:

p=function(m,n,s){
prod(seq(m,m-s+1))/factorial(s)*(n-1)^(m-s)
}

tt=c()
for(i in 1:10){
tt=c(tt,p(100,60,i)/60^100)
}
plot(tt,type="h",xlab="Number of Balls",ylab="Probability")


Dedication

This post is dedicated to my friend Ernesto Adorio, a mathematician. He loves combinatorial mathematics.

Rest in peace my friend! I miss bouncing ideas with you.

Counting the Fruits of the Tree, Part 2

Algorithm Complexity

In the last post, we computed the probability distribution of the set of events that would occur when the first letter of the search term is pressed, given a set of 100K english words. Let’s review the algorithm we are trying to analyze:

input = []; //array of words
matches = [];
while(incremental_search){
for(i =0;i -1){
matches.push(input[i]);
}
}
input = matches;
matches = [];
}


When the first keystroke is pressed, we search through the entire array (comparing each element with the keystroke) and collecting all matching words. The matching words are then collected by the matches array. After the loop is finished, we set the input array to the contents of the matches array and we empty the matches array. We again search the new input array when the user presses the next keystroke, this time matching all words in the new input with the word fragment formed by the first and 2nd letter. We do this until the user is presented with a small set of words that matches his/her search term. We can see that the performance of this algorithm is dependent on the length of the input array for each iteration.

Suppose we take as our input a random subset of our list of words with size 1000. This subset of words should not have any word repeated. We want to know how our algorithm performs on the average for any such subset. To figure this out, we need to know how many words would match the input array when we press the first letter. Since this input array could be any subset of 1000 words from the original list of 109582 words, we want to know the average number of words that matches the first keystroke. Take for example, the letter a. What is the average number of words that will we would likely see when we press the letter a?

Random Variables

In a deterministic world, we can compute the behaviour of things using a function and we would get the same answer for the same input. But we know that the world is all but deterministic and that functions do not always assume the same value for the same input. We call such functions from a set $U$ to the set of real numbers whose value depend on a probability distribution a Random Variable. For example, when we toss a coin, it can either land heads or tails. If we assign a value 1 to heads and 0 to tails, then the coin toss can be thought of as a random variable whose domain is the set ${text{head},text{tail}}$ and whose range is the set ${1,0}$. It’s value is 1 with probability 1/2 and 0 with probability 1/2. If we toss the coin many times and count the number of heads that came up, we would find that 50% of the time, it came up heads. We also say that “on the average”, heads come up half of the time. The “on the average” is also known as the Expected Value of the random variable and is defined by the formula:

$\displaystyle E[X] = \sum_j j\cdot Pr(X = j)$

where $j$ is a value that the random variable assumes and $Pr(X=j)$ is the probability of the random variable $X$ in assuming the value $j$. The summation is over all values that X assumes. In our coin toss example, there are 2 possible values of the $X$ and these are 1 (with probability 1/2) and 0 (with probability 1/2). Therefore, the expected value of the coin toss random variable is

$\displaystyle E[X] = 1\cdot Pr(X=1) + 0\cdot Pr(X=0)$
$\displaystyle = 1\cdot \frac{1}{2} = \frac{1}{2}$

In the same way, given any word from our list, we can define a function that takes on a value of 1 if it contains the letter ‘a’ and 0 otherwise. From the previous post, we have computed the probability of getting an ‘a’ to be 0.0752. Let’s call this function $X_alpha$, then it is defined as;

$\displaystyle X_\alpha = \Big\{\begin{array}{ll} 1 & \text{if the word contains an a}\\ 0 & \text{otherwise} \end{array}$

In general, if $p$ is the probability of getting a letter ‘a’, then $1-p$ is the probability of getting something else. The expected value of this random variable can be computed using the formula:

$\displaystyle \begin{array}{ll} E[X_\alpha] &= 1 \cdot Pr(X_\alpha = 1) + 0 \cdot Pr(X_\alpha = 0)\\ &= 1 \cdot p + 0 \cdot (1-p)\\ &= p \end{array}$

Imagine we have a set of 1000 words, if for each word we apply this function, then we have an array of 1’s and 0’s. The total number of 1’s indicate the number of words that contain the letter ‘a’. In other words, the function

$\displaystyle X = \sum_\alpha^{1000} X_\alpha$

is a random variable that gives us the number of words that contain a letter ‘a’. We can compute the expected value of $X$ using the formula:

$\displaystyle E[X] = E[\sum_\alpha^{1000} X_\alpha]$
$\displaystyle = \sum_\alpha^{1000} E[X_\alpha]$
Since $E[X_\alpha] = p$, the above computation leads to

$\displaystyle E[X] = \sum_\alpha^{1000} p = 1000 p$

The result tells us that the expected number of words containing the letter ‘a’ is $1000p = 1000\cdot 0.0752$ or approximately 75 words. This is a good thing because in the second loop, we will only need to iterate on an input with an average length of 75 words. If we do the same computation for all letters, we find the expected number of words per letter for every thousand to be

   key Expected #Words Matching
1    '               0.02624589
2    a              75.22991402
3    b              22.05573578
4    c              43.26766611
5    d              39.77565011
6    e              99.04150001
7    f              14.90897924
8    g              31.76540371
9    h              25.38502724
10   i              80.46859417
11   j               2.27158200
12   k              10.22408743
13   l              54.74761950
14   m              30.32581651
15   n              67.64353879
16   o              59.99679800
17   p              30.62108280
18   q               2.24402381
19   r              74.48190608
20   s              80.93445876
21   t              65.87062875
22   u              37.54737384
23   v              12.34738014
24   w              10.11910386
25   x               3.54188320
26   y              20.14765939
27   z               5.01034088


If each letter is equally likely to be the first letter to be typed, then on the average we have 37 words matching the first letter, consequently reducing our 1000 length input to 37.

Since the figure we computed was the average, it means that the length of the next input array can be more than this average. In the next part of the series, we will derive some bounds on the length of the next input array. We will also compute the length of the next input array when we type the next letter of the search term.

Chances Are

I just came from a great weekend escapade, with my wife and two colleagues, in one of the beautiful regions of the Philippines. Bicol is home to Mayon Volcano, one of the seven perfect coned volcanoes on Earth. The scenery is so green and the air so refreshing. It was one of the best trips I ever made in the Philippines.

The trip was also full of coincidences. We met a colleague in one of the restaurants in Naga City where we had our dinner. In the same restaurant we also met one of our contacts in Ateneo de Naga. On the way home, we met another colleague at the airport. But the first coincidence happened on our way to Legazpi City where one of my companions (i’ll hide the name to protect the innocent) met a good friend in the airplane seated next to him. It was a pleasant surprise for my companion and he asked “What are the odds of that happening”? That’s a good question which got me thinking later that night.

To simplify the problem, let us not compute the odds of my companion and his friend to be in the same plane going to Bicol. Rather, let us compute the odds of him being seated next to his friend. The plane was an airbus A320 and about 180 seats. The seats are laid out in such a way that there is a middle isle and each row is composed of three seats. Let us draw a row of seats and assume we have persons A and B. How many ways can they be seated in that row in such a way that they sit next to each other? Below is a diagram on the ways two persons can be seated next to each other.

As you can see, there are four ways for them to be seated next to each other for any given row. In the first arrangement, we have A, B seated in that order. The third seat can be occupied by anyone. Having chosen the seats for A and B, how many choices do we have for the third person? Since there are 180 seats minus two, the third person has 178 other seats to choose from. Once person number 3 chooses one seat, the fourth person can then choose from the remaining 177 seats. Doing this for all other passengers, you will come up with 178! ways of assigning seating arrangements for the 180 passengers such that persons A and B are seated next to each other. Now, that is just for the case where A and B are in the first two seats. For the other three cases, you will see that it also takes 178! ways for each case. Since there are 4 cases per row and 60 rows, we have

$\displaystyle 4\cdot 60\cdot 178!$

ways to have two persons sit next to each other on a 180 seater plane. Since the total number of ways to assign passengers to 180 seats is 180!, then the probability of having two persons seat next to each other is

$\displaystyle \frac{4\cdot 60\cdot 178!}{180!}$
$\displaystyle = \frac{4\cdot 60\cdot 178!}{180\cdot 179\cdot 178!}$
$\displaystyle = \frac{4\cdot 60}{180\cdot 179}$
$\displaystyle = 0.00744879$

This is quite a low probability and it assumes that the seating arrangement is totally random. By totally random, I mean that my wife and I could be seated far apart from each other. In reality, this is not the case since many passengers come in groups and they prefer to be seated next to each other. If we take this into account, then the probability of my companion and his friend to be seated next to each other increases.