Modelling and Mining of Networked Information Spaces

Project Title: Clustering Citation Graphs
Participants: Biao Chen, Dr. Jeannette Janssen, Dr. Evangelos Milios
Project Description: Data clustering is an important problem in information retrieval and knowledge management. Clustering is the procedure of assigning objects into small groups according to their contents or link patterns.

Unlike the traditional content-based document clustering methodology, our approach is based on graph theory and uses only the connectivity of the citation graph, ignoring the content.We first build an ``affinity graph'' from the given citation graph using an algorithm first formulated for web graph compression. The affinity graph captures the optimal way of coding graph nodes based on the coding of their out-neighbors. Then we produce a Minimum Rooted Spanning Tree using an algorithm to find optimal branchings. Each branch in the tree is interpreted as a cluster of papers on a specific topic.

We used both artificial and real citation graphs to test the algorithms. We did various experiments to tune the algorithm to produce fine clusters. We evaluated our output clusters using the labelled graphs and looking into papers' author lists for unlabelled graphs to ensure the quality of our experiment. We found that a key parameter for the algorithm is the encoding bit-per-node, the number of bits needed to represent a single node in a graph; we call it the B value. We noticed that the ceiling of log(n) is a good approximation for B in general, but the best B values vary from graph to graph.