Good Friends Are Hard To …Compute !

Almost 2 years ago, I ran an algorithm to produce the maximal cliques of my facebook friends. My experiment was made possible because of the old facebook REST api. You can find the description of my experiment from this site.

Recently, when I had 351 friends, I tried to do the same experiment, now with a much larger size. To my disappointment, it took forever to compute the maximal cliques.

Why would a network of size 81 take only 315 seconds to compute while a network of size 351 takes all of your patience away? The answer is in the time complexity of the maximal clique algorithm.

The maximal clique algorithm* is as follows:

\text{ALLCLIQUES} (l)
\text{global } X, C_l \text{ }(l=0,1,...,n-1)
\text{comment: every } X \text { generated by the algorithm is a clique}
\text{if } l=0 \text { then output}([])
\text { else output } ([x_0,...,x_{l-1}])
\text {if } l = 0
\text { then } N_l \gets V
\text { else } N_l \gets A_{x_{l-1}} \cap N_{l-1}
\text {if } N_l=\emptyset
\text { then } \{x_0,...,x_{l-1}\} \text { is a maximal clique}
\text {if } l = 0
\text { then } C_l \gets V \text { else } C_l \gets A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}
\text {for each } x \in C_l
\text {do } \begin{cases} x_l \gets x \\ \text{ALLCLIQUES} (l + 1) \end{cases}

How this works

Before you call this algorithm, you first number your vertices from 0 to n-1. This is to allow the algorithm to sort the vertices according the number that was assigned to them.

This algorithm is called with the value of l=0. At this point, the algorithm will output the string “[]”, which stands for an empty clique. The set N_l is initialized to the entire vertex set V, the choice set C_l is also initialized to V. The sets A_v and B_v are precomputed before the algorithm starts and are defined as follows: (Here, E is defined as the set of vertices)

A_v = \{ u \in V: \{ u,v\} \in E\}
B_v = \{ u \in V: u > v \}

What this means in layman’s terms is that the set A_v is the set of all vertices connected to v and the set B_v is the set of all vertices whose number is greater than v.

Using the definition above, the choice set C_l is defined as:

\displaystyle C_l = A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}

The set N_l is defined as

\displaystyle N_l = N_{l-1}\cap A_{x_{l-1}}

When N_l = \emptyset, we say that the generated set of vertices, X=[x_0,...,x_{l-1}] is a maximal clique.

For example, suppose we have the following graph:

Running the algorithm over this graph would generate:

The “*” indicates that the generated vertices form a clique.

So how do we answer the question: “Why would a network of size 81 take only 315 seconds to compute while a network of size 351 a very long time”? We compute the running time of the algorithm.

Average Case Analysis

Let n be the set of nodes. The number of ways you can connect one node to another node is {{n}\choose{2}}, these correspond to the number of possible vertices which you can make a graph G. Given this set, the power set corresponds to the number of possible graphs of you can create using n nodes. The cardinality of this power set is \displaystyle 2^{{n}\choose{2}}. Let us call this power set G(n).

Let G \in G(n) be an arbitrary graph and define c(G) be the number of cliques in G. By the recursive nature of the algorithm, it produces a tree, called the state space tree, for which the number of nodes is equal to c(G). Since the algorithm is invoke recursively n times, the running time of this algorithms is O(nc(G)). It remains for us to compute c(G) in the average case.

Denote by \bar c(n) the average clique size, then

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} c(G)

Let’s dissect the above formula. Keeping in mind what you learned in elementary math, the average value is computed over the total number of graphs is G(n), which is equal to 2^{{n}\choose {2}}. The numerator of the formula is just the sum of the number of cliques of all graphs G \in G(n). That’s how we get the formula above. Although, this does not help us compute the average complexity on first blush. We have to express this explicitly. For example, how do we compute c(G), which is the number of cliques of graph G? A more algorithmic way of counting the number of cliques of G is to get all possible set of vertices W of G and for each set W determining if it is a clique or not. Then we just count the number of cliques to get the value of c(G).

Let W be a subset of the set of vertices V of the graph G. Define a function, \chi(W) which returns 1 if W is a clique and 0 if W is not a clique:

\chi (G,W) = \begin{cases} 1 & \text{if W is a clique},\\ 0 & \text{if W is not a clique}.\end{cases}.

With the definition above, we can then express c(G) mathematically as:

\displaystyle c(G) = \sum_{W \subseteq V} \chi(G,W)

Plugging this into the formula for \bar c (n), we get

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} \sum_{W \subseteq V} \chi(G,W)

By the summation property, we are able to switch the order of the summation:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} \sum _{G\in G(n)} \chi(G,W)

Let’s now focus on the inner summation. It is the summation of the value \chi(G,W) over all graph G. Remember that \chi(G,W) is 1 if W is a clique of G for a specific W and 0 if not. Imagine W to be a clique. Then all its edges are contained in G. How many are the edges of W then? It is the number of vertices in W, which is the cardinality of W, given by |W| taken 2 at a time, or 2^{{|W|} \choose {2}}. How many possible cliques can you form with W as a subset? To compute this, first note that the remaining edges of G, if we take away the edges formed by W is \displaystyle {{n}\choose {2}} - {{|W|}\choose {2}}. If we take the power set of this remaining edges union each subset of it with W, we will get all possible graphs with W as a clique. Therefore,

\displaystyle \sum _{G\in G(n)} \chi(G,W) = 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Plugging this into the expression for \bar c(n) will give us:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Let us now attack this expression. Let’s imagine W has k vertices, i.e., |W| = k. Then there are {{n}\choose {k}} subsets of V of size k. The summation can now be written as:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{k=0}^n {{n}\choose {k}} 2^{ {{n}\choose {2}} - {{k}\choose {2}}}

By canceling the factor {2^{{n}\choose{2}}}, we get:

\displaystyle \bar c(n) = \sum_{k=0}^n {{n}\choose {k}} 2^{ - {{k}\choose {2}}}

If we let t_k = {{n}\choose {k}} 2^{ - {{k}\choose {2}}}, we can simplify the expression of \bar c(n) to

\displaystyle \bar c(n) = \sum_{k=0}^n t_k.

Below is the graph of t_k versus k.

As you can see from the graph, there is a value l such that t_l \ge t_k for all k. This means that

\displaystyle \bar c(n) = \sum_{k=0}^n t_k \le \sum_{k=0}^n t_l = \underbrace{ t_l + \ldots + t_l }_{n + 1 \text{ times}}= (n + 1) t_l.

Therefore,

\displaystyle \bar c(n) \le (n+1) t_l.

It’s just a matter of getting the value of t_l. If we take the ratio of successive terms in the series, we get

\displaystyle \frac{t_k}{t_{k-1}} = \frac{n - k + 1}{k2^{k-1}}

The details of the computation can be found here.

Notice that past the point l, the value of the above ratio is less than 1, or

\displaystyle n < k-1 + k 2^{k-1}.

For the sake of simplicity, if we let k=log_2n, we have

\displaystyle n < log_2 n - 1 + log_2 n 2^{log_2 n -1 }
\displaystyle = log_2 n - 1 + \frac{nlog_2n}{2}

This inequality is true if the value of log_2 n \ge 2 because n < 2 - 1 + n = n + 1 . The implies that l \le log_2 n. Using this upper bound, we can compute the upper bound of t_l by the following process. First, observe that

\displaystyle t_l = {{n}\choose{l}}2^{-{{l}\choose {2}}} < {{n}\choose {l}}.

But

\displaystyle {{n}\choose {l}} = \frac{n!}{l! (n-l)!}= \frac{\overbrace{ n\dot (n-1)... (n-l +1)}^{l \text{ times}}}{l!} < \underbrace{n\cdot n ... n }_{l \text { times}}= n^l

Since we have shown above that l \le log_2 n,

\displaystyle t_l < n^l \le n^{log_2 n}

from which \displaystyle \bar c(n) \le (n+1)n^{log_2 n}. The average running time of the all-clique algorithm is therefore O(n^{log_2n+2}). For a graph with 81 vertices, the average number of operations is 8.250472e+15. Since it takes 315 seconds to compute this graph on my machine, my computer was able to compute 2.619197e+13 operations, on the average, in one second. For a graph with 351 vertices, the number of operations on the average is 4.092784e+26 which my computer can do in 1.562610e+13 seconds, or 495,500 years!

* For more information, read the book of Donald Kreher and Douglas Stinson entitled “Combinatorial Algorithms: Generation, Enumeration and Search”. You can find the algorithm and the analysis on page
112.

Advertisements

Published by

Bobby Corpus

Is an IT Architect.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s