Counting the Fruits of the Tree, Part 2

Algorithm Complexity

In the last post, we computed the probability distribution of the set of events that would occur when the first letter of the search term is pressed, given a set of 100K english words. Let’s review the algorithm we are trying to analyze:

input = []; //array of words
matches = [];
while(incremental_search){
   for(i =0;i -1){
          matches.push(input[i]);
      }
   }
   input = matches;
   matches = [];
}

When the first keystroke is pressed, we search through the entire array (comparing each element with the keystroke) and collecting all matching words. The matching words are then collected by the matches array. After the loop is finished, we set the input array to the contents of the matches array and we empty the matches array. We again search the new input array when the user presses the next keystroke, this time matching all words in the new input with the word fragment formed by the first and 2nd letter. We do this until the user is presented with a small set of words that matches his/her search term. We can see that the performance of this algorithm is dependent on the length of the input array for each iteration.

Suppose we take as our input a random subset of our list of words with size 1000. This subset of words should not have any word repeated. We want to know how our algorithm performs on the average for any such subset. To figure this out, we need to know how many words would match the input array when we press the first letter. Since this input array could be any subset of 1000 words from the original list of 109582 words, we want to know the average number of words that matches the first keystroke. Take for example, the letter a. What is the average number of words that will we would likely see when we press the letter a?

Random Variables

In a deterministic world, we can compute the behaviour of things using a function and we would get the same answer for the same input. But we know that the world is all but deterministic and that functions do not always assume the same value for the same input. We call such functions from a set U to the set of real numbers whose value depend on a probability distribution a Random Variable. For example, when we toss a coin, it can either land heads or tails. If we assign a value 1 to heads and 0 to tails, then the coin toss can be thought of as a random variable whose domain is the set {text{head},text{tail}} and whose range is the set {1,0}. It’s value is 1 with probability 1/2 and 0 with probability 1/2. If we toss the coin many times and count the number of heads that came up, we would find that 50% of the time, it came up heads. We also say that “on the average”, heads come up half of the time. The “on the average” is also known as the Expected Value of the random variable and is defined by the formula:

\displaystyle E[X] = \sum_j j\cdot Pr(X = j)

where j is a value that the random variable assumes and Pr(X=j) is the probability of the random variable X in assuming the value j. The summation is over all values that X assumes. In our coin toss example, there are 2 possible values of the X and these are 1 (with probability 1/2) and 0 (with probability 1/2). Therefore, the expected value of the coin toss random variable is

\displaystyle E[X] = 1\cdot Pr(X=1) + 0\cdot Pr(X=0)
\displaystyle = 1\cdot \frac{1}{2} = \frac{1}{2}

In the same way, given any word from our list, we can define a function that takes on a value of 1 if it contains the letter ‘a’ and 0 otherwise. From the previous post, we have computed the probability of getting an ‘a’ to be 0.0752. Let’s call this function X_alpha, then it is defined as;

\displaystyle  X_\alpha = \Big\{\begin{array}{ll}  1 & \text{if the word contains an a}\\  0 & \text{otherwise}  \end{array}

In general, if p is the probability of getting a letter ‘a’, then 1-p is the probability of getting something else. The expected value of this random variable can be computed using the formula:

\displaystyle  \begin{array}{ll}  E[X_\alpha] &= 1 \cdot Pr(X_\alpha = 1) + 0 \cdot Pr(X_\alpha = 0)\\  &= 1 \cdot p + 0 \cdot (1-p)\\  &= p  \end{array}

Imagine we have a set of 1000 words, if for each word we apply this function, then we have an array of 1’s and 0’s. The total number of 1’s indicate the number of words that contain the letter ‘a’. In other words, the function

\displaystyle  X = \sum_\alpha^{1000} X_\alpha

is a random variable that gives us the number of words that contain a letter ‘a’. We can compute the expected value of X using the formula:

\displaystyle E[X] = E[\sum_\alpha^{1000} X_\alpha]
\displaystyle = \sum_\alpha^{1000} E[X_\alpha]
Since E[X_\alpha] = p, the above computation leads to

\displaystyle E[X] = \sum_\alpha^{1000} p = 1000 p

The result tells us that the expected number of words containing the letter ‘a’ is 1000p = 1000\cdot 0.0752 or approximately 75 words. This is a good thing because in the second loop, we will only need to iterate on an input with an average length of 75 words. If we do the same computation for all letters, we find the expected number of words per letter for every thousand to be

   key Expected #Words Matching
1    '               0.02624589
2    a              75.22991402
3    b              22.05573578
4    c              43.26766611
5    d              39.77565011
6    e              99.04150001
7    f              14.90897924
8    g              31.76540371
9    h              25.38502724
10   i              80.46859417
11   j               2.27158200
12   k              10.22408743
13   l              54.74761950
14   m              30.32581651
15   n              67.64353879
16   o              59.99679800
17   p              30.62108280
18   q               2.24402381
19   r              74.48190608
20   s              80.93445876
21   t              65.87062875
22   u              37.54737384
23   v              12.34738014
24   w              10.11910386
25   x               3.54188320
26   y              20.14765939
27   z               5.01034088

If each letter is equally likely to be the first letter to be typed, then on the average we have 37 words matching the first letter, consequently reducing our 1000 length input to 37.

Since the figure we computed was the average, it means that the length of the next input array can be more than this average. In the next part of the series, we will derive some bounds on the length of the next input array. We will also compute the length of the next input array when we type the next letter of the search term.

Advertisements

Published by

Bobby Corpus

Is an IT Architect.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s