# Counting the Fruits of the Tree, Part 1

In the last post, we described a way to filter a set of words using a Trie. The other way to filter is via searching the array of matching substrings for each keystroke and returning all words that matched. Let $W$ be an array of words. If we type the letter “f”, we search through the entire array and return those words that contain the letter f. Here’s a sample of words containing the letter f.

"folios"          "graffiti"        "frowner"         "affirmativeness"
"confounds"       "nitrifies"       "thrifts"         "decaffeinated"
"undefiled"       "shiftily"        "finises"         "midwiferies"
"firepan"         "wafered"         "purifies"        "flatly"
"scarfing"        "preform"


Typing the letter “i” will further reduce this list to

"graffiti"        "affirmativeness" "nitrifies"       "undefiled"
"finises"         "firepan"         "purifies"        "scarfing"


Typing the letter “e” will reduce this list to

"nitrifies" "purifies"


Analysis: Data Preparation

Let’s get a feel of how this algorithm performs on a set of english words published at site http://www-01.sil.org/linguistics/wordlists/english/. The word list can be download from this link. We’ll also use our favorite tool to do data manipulation and statistical computations.

x=read.csv("wordsEn.txt", header=F)


After loading, the data is now referenced via the variable $x$. The number of words loaded is 109582. We use the head command to take a peek at the data.

> head(x)
V1
1        a
2      aah
3    aahed
4   aahing
5     aahs
6 aardvark


For every letter in the english alphabet, we want to determine how many words in the list $W$ contain that letter. To determine this, we will split each word into its constituent letters and get the unique letters in each word.

xvec=as.vector(x[,1])
xsplit=strsplit(xvec, "")
[[1]]
[1] "a"

[[2]]
[1] "a" "a" "h"

[[3]]
[1] "a" "a" "h" "e" "d"

[[4]]
[1] "a" "a" "h" "i" "n" "g"

[[5]]
[1] "a" "a" "h" "s"

[[6]]
[1] "a" "a" "r" "d" "v" "a" "r" "k"


To get the unique letters we apply the unique function to all words in the array:

tmp=lapply(xsplit, FUN=unique)
[[1]]
[1] "a"

[[2]]
[1] "a" "h"

[[3]]
[1] "a" "h" "e" "d"

[[4]]
[1] "a" "h" "i" "n" "g"

[[5]]
[1] "a" "h" "s"

[[6]]
[1] "a" "r" "d" "v" "k"


Now that we have a list of unique letters of each word, we can count the number of words that contain each letter. We do this by concatenating the list of unique letters of each word into one big list and tabulating the occurrences of each letter.

tmp2=unlist(tmp)
> table(tmp2)
tmp2
'     a     b     c     d     e     f     g
20 57327 16807 32971 30310 75472 11361 24206
h     i     j     k     l     m     n     o
19344 61319  1731  7791 41719 23109 51546 45719
p     q     r     s     t     u     v     w
23334  1710 56757 61674 50195 28612  9409  7711
x     y     z
2699 15353  3818

# In terms of percentage, rounded to 2 decimal places...
> round(table(tmp2)/length(tmp2)*100,2)
tmp2
'    a    b    c    d    e    f    g    h    i
0.00 7.52 2.21 4.33 3.98 9.90 1.49 3.18 2.54 8.05
j    k    l    m    n    o    p    q    r    s
0.23 1.02 5.47 3.03 6.76 6.00 3.06 0.22 7.45 8.09
t    u    v    w    x    y    z
6.59 3.75 1.23 1.01 0.35 2.01 0.50


We can visualize the numbers above by plotting the frequency as a bar chart against the letters.

plot(table(tmp2)/length(tmp2), ylab="probability", xlab="letters")


The result shows us that the 5 most common letters in the english dictionary are “e”, “s”, “i”, “a”, and “r”.

> f=data.frame(f=round(table(tmp2)/length(tmp2)*100,2))
> attach(f)
> colnames(f)=c("letter", "frequency")
letter frequency
6       e      9.90
20      s      8.09
10      i      8.05
2       a      7.52
19      r      7.45


Recap

What we have done is create a frequency distribution of the occurrence each letter in the english alphabet based on a list of 100K words. Now that we have our data prepared, we will analyze the performance of the algorithm, given the input $W$, in the next post.