## 3 Reasons Why Following Your Architecture is Good For You

I recently had a chat with one of my junior developers. He had been with our company for less than 3 months, and I wanted to know how his experience was so far. He came from a company where every developer was a heroic developer, given the freedom to choose the technology he sees fit for his software project. With our company, he had to follow certain ways of doing things, which was totally different from what he was used to. He said he was already getting the hang of it and that he now understands why we have that culture: the culture of having a well-defined architecture.

Having an architecture means that we put constraints on some of our technology and design choices in order to achieve the system properties that we value most. We have been reaping the benefits of this architecture, and each time it gets me thinking about the reasons why an architecture is a good thing. There are probably a thousand and one reasons, but I'll put forward 3 that have benefited my team immensely.

## Economies of Scale

Several months ago, we realised we needed to redesign the user experience in order to make our product more usable in the enterprise context. This was a good time to revisit parts of our architecture, but we only had one month to do it. We had to build 5 big modules, and giving one module to one developer seemed a daunting task. Not only that: since many functionalities were common across all modules, we were sure that a lot of code duplication would happen. Luckily, we were using a technology that allowed us to define reusable components. By encapsulating common functionalities as components, we were able to build components that do one specific thing and do it well. Instead of developing a module, developers now build components, which makes development simpler and progress tracking more fine-grained. Since components make up a module, it was evident right from the start that everyone was cooperating in building each module. And since each component is used in more than one module, this accelerated our development drastically, allowing us to meet our target.

### An illustration

The above diagram describes the mapping of 10 components to 5 modules. Each row represents a module. A green box indicates that the component (column) is part of that module. For example, module 1 is composed of components 1, 4, 8, 9, and 10. After building and integrating these components, we say that module 1 is done. The same is true for the other modules.

Now assume that in the first time period the developers finish module 1. No surprises here; the velocity of the development is as expected. They then proceed to build module 2. After building module 2, you can see that all the components of modules 3 and 4 have also been built and only need to be integrated. This means that by the second time period, you have already built 4 of the 5 modules. By the third time period, you only need to build the last two components of module 5. The graph below shows how development is accelerated by the component approach.

In fact, in this illustration, given any 5 modules each composed of 5 of the 10 components, there will always be a component that is shared by at least 3 modules: the 5 modules use 25 component slots between them, and spreading 25 slots over 10 distinct components forces some component to appear at least ⌈25/10⌉ = 3 times. The power of Economies of Scale!
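
This pigeonhole argument is easy to check by brute force. Here is a small sketch: module 1's assignment matches the illustration (components 1, 4, 8, 9, and 10), while the other four assignments are made up for demonstration purposes.

```javascript
// Each module is a set of 5 component ids drawn from components 1..10.
// Module 1 matches the illustration; the rest are hypothetical.
var modules = [
  [1, 4, 8, 9, 10],
  [2, 3, 4, 7, 9],
  [1, 2, 5, 6, 10],
  [3, 5, 7, 8, 10],
  [2, 4, 6, 9, 10]
];

// Count how many modules each component appears in.
function usageCounts(modules) {
  var counts = {};
  modules.forEach(function (module) {
    module.forEach(function (c) {
      counts[c] = (counts[c] || 0) + 1;
    });
  });
  return counts;
}

// 5 modules x 5 components = 25 slots spread over 10 distinct
// components, so some component must be used by at least 3 modules.
var counts = usageCounts(modules);
var maxShared = Math.max.apply(null, Object.keys(counts).map(function (c) {
  return counts[c];
}));
console.log("most-shared component appears in " + maxShared + " modules");
```

Whatever assignments you try, `maxShared` can never drop below 3.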

## Consistent approach

Our application captures events happening in the system. Creation, deletion, update, or reading of business objects triggers event handlers that do a multitude of things. For example, some of these events are captured by the application's Social module: when a user comments on or likes a certain update, that event is captured and sent to interested users. By following the architecture, any new module automatically benefits from the event infrastructure, and the developer need not worry about it.
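
A minimal sketch of such an event infrastructure might look like the following (the event names and handlers are hypothetical, not our actual API): business-object operations publish events, and any module can subscribe without the publisher knowing about it.

```javascript
// Hypothetical sketch of an event bus: modules subscribe to
// business-object events without the publisher knowing about them.
var handlers = {};

function on(event, handler) {
  (handlers[event] = handlers[event] || []).push(handler);
}

function emit(event, payload) {
  (handlers[event] || []).forEach(function (h) { h(payload); });
}

var notified = [];

// The Social module listens for comments and notifies interested users.
on("comment.created", function (e) {
  notified.push(e.author + " commented on " + e.object);
});

// A newly developed module hooks into the same infrastructure for free,
// e.g. an audit trail logging the same event.
on("comment.created", function (e) {
  notified.push("audit: " + e.object);
});

emit("comment.created", { author: "alice", object: "status-42" });
console.log(notified);
```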

Not only does the business object benefit from the Social module, it also benefits from other modules automatically, as shown in the diagram above.

## Better Maintainability and More Solid Code

The UNIX philosophy advocates writing programs that do one thing and do it well; programs that work together to solve complex tasks. By following the same philosophy in our architecture, we are able to minimise duplication of code. This means that when we identify a bug in a component, we are confident that it is the only place we need to fix, making our application more solid as time goes by. Everything using that particular component automatically enjoys the bug fix, as well as any new features of that component.

What I have described is architecture in the small, but the same benefits apply to architecture in the large. That is why architecture is so important, from the governments we run, to the cities we build, to the buildings we live in, down to the software we use.

## Counting the Fruits of the Tree, Part 2

### Algorithm Complexity

In the last post, we computed the probability distribution of the set of events that would occur when the first letter of the search term is pressed, given a set of 100K English words. Let’s review the algorithm we are trying to analyze:

var input = []; // array of words
var matches = [];
while (incremental_search) {
  for (var i = 0; i < input.length; i++) {
    if (input[i].indexOf(fragment) > -1) {
      matches.push(input[i]);
    }
  }
  input = matches;
  matches = [];
}


When the first keystroke is pressed, we search through the entire array, comparing each element with the keystroke and collecting all matching words into the matches array. After the loop finishes, we set the input array to the contents of the matches array and empty the matches array. When the user presses the next keystroke, we search the new input array again, this time matching all words against the fragment formed by the first and second letters. We do this until the user is presented with a small set of words that matches his/her search term. We can see that the performance of this algorithm depends on the length of the input array at each iteration.
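
Here is a self-contained, runnable version of the same idea (the word list and keystrokes are only sample inputs, borrowed from the earlier post in this series). Each keystroke extends the fragment, and each pass searches only the survivors of the previous pass:

```javascript
// Runnable sketch of the incremental search loop above.
// keystrokes simulates the user typing the search term one letter at a time.
function incrementalSearch(words, keystrokes) {
  var input = words.slice(); // current candidate set
  var fragment = "";
  keystrokes.forEach(function (key) {
    fragment += key;
    var matches = [];
    for (var i = 0; i < input.length; i++) {
      if (input[i].indexOf(fragment) > -1) {
        matches.push(input[i]);
      }
    }
    input = matches; // the next pass searches only the survivors
  });
  return input;
}

var words = ["far", "for", "force", "forced", "forces", "force field"];
console.log(incrementalSearch(words, ["f", "o", "r"]));
// everything except "far" contains the fragment "for"
```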

Suppose we take as our input a random subset of 1000 words from our list. This subset should not have any word repeated. We want to know how our algorithm performs on average for any such subset. To figure this out, we need to know how many words in the input array would match when we press the first letter. Since this input array could be any subset of 1000 words from the original list of 109582 words, we want to know the average number of words that match the first keystroke. Take, for example, the letter ‘a’. What is the average number of words we would likely see when we press the letter ‘a’?

### Random Variables

In a deterministic world, we can compute the behaviour of things using a function, and we would get the same answer for the same input. But we know that the world is anything but deterministic, and that functions do not always assume the same value for the same input. We call such a function from a set $U$ to the set of real numbers, whose value depends on a probability distribution, a Random Variable. For example, when we toss a coin, it can land either heads or tails. If we assign a value 1 to heads and 0 to tails, then the coin toss can be thought of as a random variable whose domain is the set $\{\text{head},\text{tail}\}$ and whose range is the set $\{1,0\}$. Its value is 1 with probability 1/2 and 0 with probability 1/2. If we toss the coin many times and count the number of heads that come up, we would find that 50% of the time it comes up heads. We also say that “on the average”, heads come up half of the time. This “on the average” is also known as the Expected Value of the random variable and is defined by the formula:

$\displaystyle E[X] = \sum_j j\cdot Pr(X = j)$

where $j$ is a value that the random variable assumes and $Pr(X=j)$ is the probability of the random variable $X$ assuming the value $j$. The summation is over all values that $X$ assumes. In our coin toss example, there are 2 possible values of $X$: 1 (with probability 1/2) and 0 (with probability 1/2). Therefore, the expected value of the coin toss random variable is

$\displaystyle E[X] = 1\cdot Pr(X=1) + 0\cdot Pr(X=0)$
$\displaystyle = 1\cdot \frac{1}{2} + 0\cdot \frac{1}{2} = \frac{1}{2}$

In the same way, given any word from our list, we can define a function that takes the value 1 if the word contains the letter ‘a’ and 0 otherwise. From the previous post, we have computed the probability of getting an ‘a’ to be 0.0752. Let’s call this function $X_\alpha$; it is defined as:

$\displaystyle X_\alpha = \Big\{\begin{array}{ll} 1 & \text{if the word contains an a}\\ 0 & \text{otherwise} \end{array}$

In general, if $p$ is the probability of getting a letter ‘a’, then $1-p$ is the probability of getting something else. The expected value of this random variable can be computed using the formula:

$\displaystyle \begin{array}{ll} E[X_\alpha] &= 1 \cdot Pr(X_\alpha = 1) + 0 \cdot Pr(X_\alpha = 0)\\ &= 1 \cdot p + 0 \cdot (1-p)\\ &= p \end{array}$

Imagine we have a set of 1000 words. If we apply this function to each word, we get an array of 1’s and 0’s, and the total number of 1’s indicates the number of words that contain the letter ‘a’. In other words, the function

$\displaystyle X = \sum_{\alpha=1}^{1000} X_\alpha$

is a random variable that gives us the number of words that contain a letter ‘a’. We can compute the expected value of $X$ using the formula:

$\displaystyle E[X] = E\Big[\sum_{\alpha=1}^{1000} X_\alpha\Big]$
$\displaystyle = \sum_{\alpha=1}^{1000} E[X_\alpha]$
by linearity of expectation. Since $E[X_\alpha] = p$, the above computation leads to

$\displaystyle E[X] = \sum_{\alpha=1}^{1000} p = 1000 p$

The result tells us that the expected number of words containing the letter ‘a’ is $1000p = 1000\cdot 0.0752$, or approximately 75 words. This is a good thing, because on the second iteration we only need to search an input with an average length of 75 words. If we do the same computation for all letters, we find the expected number of words per letter, for every thousand words, to be
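
The computation is trivial to mirror in code. A quick sketch, using the probability 0.0752 for ‘a’ computed in the earlier post:

```javascript
// Expected number of matching words out of n, when each word
// independently contains the letter with probability p:
// E[X] is the sum of n indicator expectations, i.e. n * p.
function expectedMatches(p, n) {
  return n * p;
}

var pA = 0.0752; // probability that a word contains the letter 'a'
console.log(expectedMatches(pA, 1000)); // roughly 75 words
```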

   key Expected #Words Matching
1    '               0.02624589
2    a              75.22991402
3    b              22.05573578
4    c              43.26766611
5    d              39.77565011
6    e              99.04150001
7    f              14.90897924
8    g              31.76540371
9    h              25.38502724
10   i              80.46859417
11   j               2.27158200
12   k              10.22408743
13   l              54.74761950
14   m              30.32581651
15   n              67.64353879
16   o              59.99679800
17   p              30.62108280
18   q               2.24402381
19   r              74.48190608
20   s              80.93445876
21   t              65.87062875
22   u              37.54737384
23   v              12.34738014
24   w              10.11910386
25   x               3.54188320
26   y              20.14765939
27   z               5.01034088


If each letter is equally likely to be the first letter typed, then on the average we have 37 words matching the first letter, consequently reducing our 1000-word input to 37.
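
The figure 37 follows directly from the fact that the per-key probabilities sum to 1: the 27 expected counts per thousand words must then sum to 1000, so their average is 1000/27. A quick check:

```javascript
// The per-key probabilities form a distribution, so the expected
// counts per 1000 words sum to 1000. Averaging over the 27 keys:
var keys = 27;
var averageMatches = 1000 / keys;
console.log(averageMatches.toFixed(2)); // prints "37.04"
```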

Since the figure we computed was an average, the length of the next input array can be more than this. In the next part of the series, we will derive some bounds on the length of the next input array, and compute its length when we type the next letter of the search term.

## Counting the Fruits of the Tree, Part 1

In the last post, we described a way to filter a set of words using a Trie. The other way to filter is via searching the array of matching substrings for each keystroke and returning all words that matched. Let $W$ be an array of words. If we type the letter “f”, we search through the entire array and return those words that contain the letter f. Here’s a sample of words containing the letter f.

"folios"          "graffiti"        "frowner"         "affirmativeness"
"confounds"       "nitrifies"       "thrifts"         "decaffeinated"
"undefiled"       "shiftily"        "finises"         "midwiferies"
"firepan"         "wafered"         "purifies"        "flatly"
"scarfing"        "preform"


Typing the letter “i” will further reduce this list to

"graffiti"        "affirmativeness" "nitrifies"       "undefiled"
"finises"         "firepan"         "purifies"        "scarfing"


Typing the letter “e” will reduce this list to

"nitrifies" "purifies"


### Analysis: Data Preparation

Let’s get a feel for how this algorithm performs on a set of English words published at http://www-01.sil.org/linguistics/wordlists/english/. The word list can be downloaded from this link. We’ll also use our favorite statistical tool, R, to do data manipulation and statistical computations.

x=read.csv("wordsEn.txt", header=F)


After loading, the data is now referenced via the variable $x$. The number of words loaded is 109582. We use the head command to take a peek at the data.

> head(x)
V1
1        a
2      aah
3    aahed
4   aahing
5     aahs
6 aardvark


For every letter in the english alphabet, we want to determine how many words in the list $W$ contain that letter. To determine this, we will split each word into its constituent letters and get the unique letters in each word.

xvec=as.vector(x[,1])
xsplit=strsplit(xvec, "")
> head(xsplit)
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "h"

[[3]]
[1] "a" "a" "h" "e" "d"

[[4]]
[1] "a" "a" "h" "i" "n" "g"

[[5]]
[1] "a" "a" "h" "s"

[[6]]
[1] "a" "a" "r" "d" "v" "a" "r" "k"


To get the unique letters we apply the unique function to all words in the array:

tmp=lapply(xsplit, FUN=unique)
> head(tmp)
[[1]]
[1] "a"

[[2]]
[1] "a" "h"

[[3]]
[1] "a" "h" "e" "d"

[[4]]
[1] "a" "h" "i" "n" "g"

[[5]]
[1] "a" "h" "s"

[[6]]
[1] "a" "r" "d" "v" "k"


Now that we have a list of unique letters of each word, we can count the number of words that contain each letter. We do this by concatenating the list of unique letters of each word into one big list and tabulating the occurrences of each letter.

tmp2=unlist(tmp)
> table(tmp2)
tmp2
'     a     b     c     d     e     f     g
20 57327 16807 32971 30310 75472 11361 24206
h     i     j     k     l     m     n     o
19344 61319  1731  7791 41719 23109 51546 45719
p     q     r     s     t     u     v     w
23334  1710 56757 61674 50195 28612  9409  7711
x     y     z
2699 15353  3818

# In terms of percentage, rounded to 2 decimal places...
> round(table(tmp2)/length(tmp2)*100,2)
tmp2
'    a    b    c    d    e    f    g    h    i
0.00 7.52 2.21 4.33 3.98 9.90 1.49 3.18 2.54 8.05
j    k    l    m    n    o    p    q    r    s
0.23 1.02 5.47 3.03 6.76 6.00 3.06 0.22 7.45 8.09
t    u    v    w    x    y    z
6.59 3.75 1.23 1.01 0.35 2.01 0.50


We can visualize the numbers above by plotting the frequency as a bar chart against the letters.

plot(table(tmp2)/length(tmp2), ylab="probability", xlab="letters")


The result shows us that the 5 most common letters in the English dictionary are “e”, “s”, “i”, “a”, and “r”.

> f=data.frame(round(table(tmp2)/length(tmp2)*100,2))
> colnames(f)=c("letter", "frequency")
> head(f[order(-f$frequency),], 5)
   letter frequency
6       e      9.90
20      s      8.09
10      i      8.05
2       a      7.52
19      r      7.45


### Recap

What we have done is create a frequency distribution of the occurrence of each letter in the English alphabet, based on a list of 100K words. Now that we have our data prepared, we will analyze the performance of the algorithm, given the input $W$, in the next post.

## Growing Words on Tries

Give me a place to stand and I will move the Earth.

Thus says Archimedes. Leverage. Brute force can only get you so far; true power is in leverage. But leverage does not come easily. It involves a lot of thinking and training. In this post, we are going to explore how to use leverage to solve a common programming problem.

Suppose I have an array of words or phrases (Strings, as used in programming) that I want to search. The search is incremental, in the sense that whenever I type the first few letters of the word, I am given an incremental search result. These first few letters can occur in any part of the string. I want to develop an algorithm to search the array of strings in an efficient manner.

Let’s say we have an array of words: “far”, “for”, “force”, “forced”, “forces”, “force field”. Suppose I want to search for the word “force”. What I do is type in sequence “f”, “o”, “r”,… When I type “f”, the algorithm will then search for all strings containing “f”.

When I type “o”, it will search for all strings containing “fo”. When I type “r”, it will search for all strings containing “for”, and so on.

The first thing that comes to mind is to match each string in the array against each substring we type. All those that match are collected and used as the new array in which we search for the new substring formed when the next letter is pressed. If there are a million strings in the array, we will have to search the entire array when the first letter is typed. Hopefully, we will get a small subset that helps us quickly find what we are looking for. This is a brute force approach, because it has to do a million comparisons just to come up with the first set of matches.

### Leverage

Nature has always been a source of innovation for science and engineering. We turn to the birds to learn how planes can fly. We turn to the fish to help our submarines dive. We turn to the ants to help us find the shortest path to food. For this problem, we turn to trees.

Imagine representing the array of words as leaves of a tree as shown in the diagram below.

In the diagram, we have 6 words, namely “far”, “for”, “force”, “forced”, “forces”, and “force field”, arranged in a tree structure. Each non-leaf node contains a letter, and the leaf nodes contain the words. Typing the first letter “f” already focuses on all words containing “f”, thereby preventing the entire array from being traversed. After forming the sequence “for”, we immediately focus on the words “force”, “forced”, “forces”, and “force field”. The more letters we type, the more focused we are on the item we are searching for. This is a very efficient search structure. Thanks to the trees for the inspiration.
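
A minimal prefix trie capturing the idea in the diagram might be sketched as follows (a simplified structure for illustration, which stores the candidate words at every node rather than only at the leaves): each typed letter descends one level, and the words below the current node are exactly the remaining candidates.

```javascript
// Minimal prefix-trie sketch: each node maps a letter to a child node
// and stores the words that pass through it.
function buildTrie(words) {
  var root = { children: {}, words: [] };
  words.forEach(function (word) {
    var node = root;
    for (var i = 0; i < word.length; i++) {
      var letter = word[i];
      if (!node.children[letter]) {
        node.children[letter] = { children: {}, words: [] };
      }
      node = node.children[letter];
      node.words.push(word); // every word below this prefix
    }
  });
  return root;
}

// Typing a prefix descends the trie one letter at a time; the words
// stored at the final node are the current candidates.
function search(root, prefix) {
  var node = root;
  for (var i = 0; i < prefix.length; i++) {
    node = node.children[prefix[i]];
    if (!node) return []; // no word has this prefix
  }
  return node.words;
}

var trie = buildTrie(["far", "for", "force", "forced", "forces", "force field"]);
console.log(search(trie, "for"));
// [ 'for', 'force', 'forced', 'forces', 'force field' ]
```

Note that descending the trie never touches the words under other branches, which is exactly the leverage over the brute force scan.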