Good Friends Are Hard To …Compute !

Almost 2 years ago, I ran an algorithm to produce the maximal cliques of my facebook friends. My experiment was made possible because of the old facebook REST api. You can find the description of my experiment from this site.

Recently, when I had 351 friends, I tried to do the same experiment, now with a much larger size. To my disappointment, it took forever to compute the maximal cliques.

Why would a network of size 81 take only 315 seconds to compute while a network of size 351 takes all of your patience away? The answer is in the time complexity of the maximal clique algorithm.

The maximal clique algorithm* is as follows:

\text{ALLCLIQUES} (l)
\text{global } X, C_l \text{ }(l=0,1,...,n-1)
\text{comment: every } X \text { generated by the algorithm is a clique}
\text{if } l=0 \text { then output}([])
\text { else output } ([x_0,...,x_{l-1}])
\text {if } l = 0
\text { then } N_l \gets V
\text { else } N_l \gets A_{x_{l-1}} \cap N_{l-1}
\text {if } N_l=\emptyset
\text { then } \{x_0,...,x_{l-1}\} \text { is a maximal clique}
\text {if } l = 0
\text { then } C_l \gets V \text { else } C_l \gets A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}
\text {for each } x \in C_l
\text {do } \begin{cases} x_l \gets x \\ \text{ALLCLIQUES} (l + 1) \end{cases}

How this works

Before you call this algorithm, you first number your vertices from 0 to n-1. This is to allow the algorithm to sort the vertices according the number that was assigned to them.

This algorithm is called with the value of l=0. At this point, the algorithm will output the string “[]”, which stands for an empty clique. The set N_l is initialized to the entire vertex set V, the choice set C_l is also initialized to V. The sets A_v and B_v are precomputed before the algorithm starts and are defined as follows: (Here, E is defined as the set of vertices)

A_v = \{ u \in V: \{ u,v\} \in E\}
B_v = \{ u \in V: u > v \}

What this means in layman’s terms is that the set A_v is the set of all vertices connected to v and the set B_v is the set of all vertices whose number is greater than v.

Using the definition above, the choice set C_l is defined as:

\displaystyle C_l = A_{x_{l-1}} \cap B_{x_{l-1}} \cap C_{l-1}

The set N_l is defined as

\displaystyle N_l = N_{l-1}\cap A_{x_{l-1}}

When N_l = \emptyset, we say that the generated set of vertices, X=[x_0,...,x_{l-1}] is a maximal clique.

For example, suppose we have the following graph:

Running the algorithm over this graph would generate:

The “*” indicates that the generated vertices form a clique.

So how do we answer the question: “Why would a network of size 81 take only 315 seconds to compute while a network of size 351 a very long time”? We compute the running time of the algorithm.

Average Case Analysis

Let n be the set of nodes. The number of ways you can connect one node to another node is {{n}\choose{2}}, these correspond to the number of possible vertices which you can make a graph G. Given this set, the power set corresponds to the number of possible graphs of you can create using n nodes. The cardinality of this power set is \displaystyle 2^{{n}\choose{2}}. Let us call this power set G(n).

Let G \in G(n) be an arbitrary graph and define c(G) be the number of cliques in G. By the recursive nature of the algorithm, it produces a tree, called the state space tree, for which the number of nodes is equal to c(G). Since the algorithm is invoke recursively n times, the running time of this algorithms is O(nc(G)). It remains for us to compute c(G) in the average case.

Denote by \bar c(n) the average clique size, then

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} c(G)

Let’s dissect the above formula. Keeping in mind what you learned in elementary math, the average value is computed over the total number of graphs is G(n), which is equal to 2^{{n}\choose {2}}. The numerator of the formula is just the sum of the number of cliques of all graphs G \in G(n). That’s how we get the formula above. Although, this does not help us compute the average complexity on first blush. We have to express this explicitly. For example, how do we compute c(G), which is the number of cliques of graph G? A more algorithmic way of counting the number of cliques of G is to get all possible set of vertices W of G and for each set W determining if it is a clique or not. Then we just count the number of cliques to get the value of c(G).

Let W be a subset of the set of vertices V of the graph G. Define a function, \chi(W) which returns 1 if W is a clique and 0 if W is not a clique:

\chi (G,W) = \begin{cases} 1 & \text{if W is a clique},\\ 0 & \text{if W is not a clique}.\end{cases}.

With the definition above, we can then express c(G) mathematically as:

\displaystyle c(G) = \sum_{W \subseteq V} \chi(G,W)

Plugging this into the formula for \bar c (n), we get

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum _{G\in G(n)} \sum_{W \subseteq V} \chi(G,W)

By the summation property, we are able to switch the order of the summation:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} \sum _{G\in G(n)} \chi(G,W)

Let’s now focus on the inner summation. It is the summation of the value \chi(G,W) over all graph G. Remember that \chi(G,W) is 1 if W is a clique of G for a specific W and 0 if not. Imagine W to be a clique. Then all its edges are contained in G. How many are the edges of W then? It is the number of vertices in W, which is the cardinality of W, given by |W| taken 2 at a time, or 2^{{|W|} \choose {2}}. How many possible cliques can you form with W as a subset? To compute this, first note that the remaining edges of G, if we take away the edges formed by W is \displaystyle {{n}\choose {2}} - {{|W|}\choose {2}}. If we take the power set of this remaining edges union each subset of it with W, we will get all possible graphs with W as a clique. Therefore,

\displaystyle \sum _{G\in G(n)} \chi(G,W) = 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Plugging this into the expression for \bar c(n) will give us:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{W \subseteq V} 2^{ {{n}\choose {2}} - {{|W|}\choose {2}}}

Let us now attack this expression. Let’s imagine W has k vertices, i.e., |W| = k. Then there are {{n}\choose {k}} subsets of V of size k. The summation can now be written as:

\displaystyle \bar c(n) = \frac{1}{2^{{n}\choose{2}}} \sum_{k=0}^n {{n}\choose {k}} 2^{ {{n}\choose {2}} - {{k}\choose {2}}}

By canceling the factor {2^{{n}\choose{2}}}, we get:

\displaystyle \bar c(n) = \sum_{k=0}^n {{n}\choose {k}} 2^{ - {{k}\choose {2}}}

If we let t_k = {{n}\choose {k}} 2^{ - {{k}\choose {2}}}, we can simplify the expression of \bar c(n) to

\displaystyle \bar c(n) = \sum_{k=0}^n t_k.

Below is the graph of t_k versus k.

As you can see from the graph, there is a value l such that t_l \ge t_k for all k. This means that

\displaystyle \bar c(n) = \sum_{k=0}^n t_k \le \sum_{k=0}^n t_l = \underbrace{ t_l + \ldots + t_l }_{n + 1 \text{ times}}= (n + 1) t_l.

Therefore,

\displaystyle \bar c(n) \le (n+1) t_l.

It’s just a matter of getting the value of t_l. If we take the ratio of successive terms in the series, we get

\displaystyle \frac{t_k}{t_{k-1}} = \frac{n - k + 1}{k2^{k-1}}

The details of the computation can be found here.

Notice that past the point l, the value of the above ratio is less than 1, or

\displaystyle n < k-1 + k 2^{k-1}.

For the sake of simplicity, if we let k=log_2n, we have

\displaystyle n < log_2 n - 1 + log_2 n 2^{log_2 n -1 }
\displaystyle = log_2 n - 1 + \frac{nlog_2n}{2}

This inequality is true if the value of log_2 n \ge 2 because n < 2 - 1 + n = n + 1 . The implies that l \le log_2 n. Using this upper bound, we can compute the upper bound of t_l by the following process. First, observe that

\displaystyle t_l = {{n}\choose{l}}2^{-{{l}\choose {2}}} < {{n}\choose {l}}.

But

\displaystyle {{n}\choose {l}} = \frac{n!}{l! (n-l)!}= \frac{\overbrace{ n\dot (n-1)... (n-l +1)}^{l \text{ times}}}{l!} < \underbrace{n\cdot n ... n }_{l \text { times}}= n^l

Since we have shown above that l \le log_2 n,

\displaystyle t_l < n^l \le n^{log_2 n}

from which \displaystyle \bar c(n) \le (n+1)n^{log_2 n}. The average running time of the all-clique algorithm is therefore O(n^{log_2n+2}). For a graph with 81 vertices, the average number of operations is 8.250472e+15. Since it takes 315 seconds to compute this graph on my machine, my computer was able to compute 2.619197e+13 operations, on the average, in one second. For a graph with 351 vertices, the number of operations on the average is 4.092784e+26 which my computer can do in 1.562610e+13 seconds, or 495,500 years!

* For more information, read the book of Donald Kreher and Douglas Stinson entitled “Combinatorial Algorithms: Generation, Enumeration and Search”. You can find the algorithm and the analysis on page
112.

Finding the cliques of my Facebook friends

The availability of the facebook graph api made the studying of social interactions much easier. I have seen some social experiments using the data from facebook. For example, you can verify the small world effect or what is known as the six degrees of separation.

I did my own experiments also using the facebook api. What I did was to compute the cliques of my friends. I did this experiment last August of 2009 and this is the only time I’m able to document it. During that time, I was only connected to 82 friends.

What is a CLIQUE? A clique is a group of people who are mutual friends with each other. In the facebook setting, if 5 people are mutual friends, they form a clique. In this experiment, I computed the maximal cliques contained in my set of friends. A maximal clique is the maximum number of friends who are mutually connected with each other. Why study cliques? Cliques allow you to determine the interest of a person just by studying the group he/she belongs to. As the saying goes, birds of the same feather flock together.

To gather my data, I used the facebook api to get a list of my friends, and for each friend I queried the facebook function friends.areFriends by passing to this function a list of facebook ids corresponding to my friends. The output is an array of zeroes and ones. A 1 indicates that two people are mutual friends and a 0 indicate that they are not.

Once I gathered my data, I put it in a format that can be read by a program to compute the cliques. Here is a sample of the input file.

 0 1
 0 2
 0 4
 0 7
 0 10
 0 13
 0 14
 0 15
 1 2
 1 3
 1 7
 1 8
 1 9
 1 10
 1 11
 1 12
 1 14
 1 15
 1 19
 1 22
 1 25
  2 3
 2 4
 2 7
 2 8
 2 10
 2 12
 2 13
 2 14
 2 16

Let me explain the format. The numbers represent a person id. Each row is composed of two numbers. For example, the row “0 1” can be read as person number 0 is a friend of person 1. The row containing “2 14” can be read as person number 2 is a friend of person 14. I did this for all my friends.

The algorithm I applied to generate the maximal cliques is taken from the book of Donald Kreher and Douglas Stinson entitled “Combinatorial Algorithms: Generation, Enumeration and Search”. The algorithm can be found on page 112 of the book. Below is the java program I wrote to implement the algorithm.

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.util.Comparator;
import java.util.Date;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Set;
import java.util.StringTokenizer;
import java.util.TimerTask;
import java.util.TreeSet;

import com.apple.mrj.macos.carbon.Timer;

public class AllClique {

	long startTime = System.currentTimeMillis();

	public void allClique(Set NLminusOne, Set CLminusOne, int x , String v){
		if( null == NLminusOne ){
			System.out.println("[]");
		} else {
			v += x + " ";
			//printVector();
			//System.out.println(v);
		}

		Set NL= null;
		if(null == NLminusOne){

			 NL= (Set) nodeSet.clone();
		} else {
			Node n = nodeMap.get(x);
			if(null !=n){

				NL = (Set) n.connections.clone();
				intersect( NL , NLminusOne);
			}
		}

		if(null == NL || NL.size() == 0){
			//System.out.println("*");
			v+="*";
			System.err.println(v);
		}

		System.out.println(v);

		Set CL = null;
		if( null == CLminusOne ){
			CL = (Set) nodeSet.clone();
		} else {
			Node n= nodeMap.get(new Integer(x));
			if(null != n){
				CL = (Set) n.connections.clone();
				Set BLminusOne = getBL(x);
				intersect(CL, BLminusOne, CLminusOne);
			}
		}

		if(null != CL){

			if(v.length()> 0){
				char c =  v.charAt(v.length() -1);
				if(c == '*'){
					v = v.substring(0,v.length()-1);
				}
			}

			for(Iterator iter= CL.iterator(); iter.hasNext();){
				allClique(NL, CL, iter.next(), v);

			}
		}
	}

	public Set getBL(int l){
		int max = nodeSet.size();
		TreeSet set = new TreeSet();
		for(int i=l+1;i
			set.add(new Integer(i));
		}

		return set;
	}

	public void printVector(int l){
		String str="[";
		for(int i=0;i
			str+= i;
		}
		str+= "]";

		System.out.println(str);
	}

	public void printVector(Integer[] X){
		String str = "[";
		for(Integer i: X){
			str += i + " ";
		}
		str += "]";

		System.out.println(str);
	}

	HashMap nodeMap = new HashMap();
	TreeSet nodeSet = new TreeSet();

	public boolean intersect(Set a, Set b, Set c){
		boolean bool = a.retainAll(b);
		if(bool){
			bool = a.retainAll(c);
		}

		return bool;
	}

	public boolean intersect(Set a, Set b){
		boolean bool = a.retainAll(b);

		return bool;
	}

	public void readInput(File f) throws Exception {
		BufferedReader r = null;
		String line = null;
		try {
			r = new BufferedReader(new FileReader(f));
			while(null != (line = r.readLine())){
				if(line.startsWith("#")) continue;
				StringTokenizer st = new StringTokenizer(line," ");
				String sFrom = st.nextToken();
				String sTo = st.nextToken();
				createConnections(sFrom.trim(),sTo.trim());

			}
		} finally {
			try {
				if(null != r){
					r.close();
				}
			}catch (Exception e) {
				// TODO: handle exception
				e.printStackTrace();
			}
		}

		//add to the nodeSet
		nodeSet.addAll(nodeMap.keySet());

	}

	public void createConnections(String sFrom, String sTo){

		System.out.println(sFrom + "--" + sTo);
		Node n1 = nodeMap.get(new Integer(sFrom));
		if(null == n1){
			n1 = new Node();
			n1.id = new Integer(sFrom);
			n1.connections = new TreeSet();
			nodeMap.put(n1.id, n1);

		}

		Node n2 = nodeMap.get(new Integer(sTo));
		if(null == n2){
			n2 = new Node();
			n2.id = new Integer(sTo);
			n2.connections = new TreeSet();
			nodeMap.put(n2.id, n2);
		}

		n1.connections.add(n2.id);
		n2.connections.add(n1.id);
	}

	class Node {
		Integer id;
		TreeSet connections;
	}

	public static void main(String[] args) throws Exception {
		final AllClique app = new AllClique();
		app.readInput(new File(args[0]));
		for(Iterator iter = app.nodeSet.iterator();iter.hasNext();){
			System.out.println("=> " + iter.next());
		}

		app.allClique(null, null, 0, "");

	}
}

Running program with my data, I was able to generate the cliques. To visualize the result, a graph drawing application called graphviz came in very handy. Here is the result of the clique computation:

Click on the graphic so that you can zoom on it.

The Power of the Command Line

The command line separates the amateurs from the experts. The command line lies at the heart of the Unix computing power.

The beauty of command line is that you are able to do relatively complicated tasks in just one line. Take for example renaming of doc files to txt files. An ordinary user will issue the command

bobby-corpuss-imac-5:~ bobby$ mv file1.doc file1.txt

for every file the current directory. If the directory turns out to contain so many files, then the poor user will spend so much time trying to do a non-value added work.

A better use of one’s time is to study how the command line works and be able to do the job in just one line of code:

bobby-corpuss-imac-5:~ bobby$ /bin/ls -1 *.doc |sed 's/(.*).doc/mv &amp; 1.txt/g'|bash

To see how this works, let us dissect the command which is composed of 3 commands linked together using the pipe symbol “|”. The pipe allows you to feed the output of one command to the input stream of the next command. The 3 commands used are:

  • /bin/ls
  • sed
  • bash

The first command will list down the files in the current directory and present them in one column. For example,

bobby-corpuss-imac-5:tmp bobby$ /bin/ls -1 *.doc
file1.doc
file2.doc
file3.doc
file4.doc
file5.doc

The next command is quite tricky so let’s conquer it little by little. The command name is sed which is short for Stream Editor. It takes a regular expression that matches any string ending with a doc. The cryptic looking expression (.*) is nothing to worry about. The “” (backslash) you see that precedes the “(” and “)” is an escape character and is used to indicate to the shell that the next character should not be interpreted. The “.*” is a regular expression composed of “.” and “*”. The “.” tells “sed” to match any character. The “*” modifies the behaviour of “.” by telling “sed” to match a sequence of any character zero or more times. If you remove the escape character, the regular expression reads (.*). The parenthesis tells “sed” that whatever string of characters it matches in .* should be saved for later use. The final part of the regular expression says that the matched expression should contain the string doc.

The last paragraph is really a mouthful. If you are new to this, you might need to read it again, and again. That is just the first part of the “sed” input. Sed commands typically accept inputs of the following format:

sed 's/regex/replacement/g'

where regex is a regular expression to match, as discussed above and replacement is the text which can make use of the saved text matched in the regex. The “g” tells sed to apply the substitution for all matches. Let’s now try to dissect the replacement part of the sed input.

The replacement looks like mv & 1.txt. This still looks complicated but we’re almost done. The symbol & stands for the entire string matching the regular expression. The symbol 1 stands for the sequence of characters that matched (.*). Therefore, an input string file1.doc will give the following values to the replacement:

  • & = file1.doc
  • 1 = file1
  • mv & 1.txt = mv file1.doc file1.txt

If we now try to string the first two commands together, we get:

bobby-corpuss-imac-5:tmp bobby$ /bin/ls -1 *.doc |sed 's/(.*).doc/mv &amp; 1.txt/g'
mv file1.doc file1.txt
mv file2.doc file2.txt
mv file3.doc file3.txt
mv file4.doc file4.txt
mv file5.doc file5.txt

If you look closely at the output above, you can see that we are in effect building a command. For example, the first line says mv file1.doc file1.txt which is a command to move file1.doc to file2.doc. The only thing we need to do now is to execute this command, which is conveniently done by piping the output to bash. Executing this command in the shell gives us the following output.

bobby-corpuss-imac-5:tmp bobby$ ls *.doc
file1.doc	file2.doc	file3.doc	file4.doc	file5.doc
bobby-corpuss-imac-5:tmp bobby$ /bin/ls -1 *.doc |sed 's/(.*).doc/mv &amp; 1.txt/g'|bash
bobby-corpuss-imac-5:tmp bobby$ ls *.docls: *.doc: No such file or directory
bobby-corpuss-imac-5:tmp bobby$ ls *.txt
file1.txt	file2.txt	file3.txt	file4.txt	file5.txt

Are you ready to tap the power of the command line?

New Blog Location

I migrated my blog from a hosted linux server which I maintain into wordpress.com hosting service. Times are changing. I used to have more time before to maintain my site, upgrade my linux applications to the latest security patches, configure my own DNS server, email server, http server, what have you… No I just don’t have the luxury of time. It’s good that we have this hosting providers doing the job for you for free!!!

I thought that migrating my old wordpress blog from the old site to the new site would be a challenge. I was wrong. Thank you wordpress for making my migration painless.

I just followed the instructions from here to export and import my blog:

http://en.support.wordpress.com/import/

and it worked like a charm!