GRAPHLETS AND GRAPH EVOLUTION:

Methods and Models for the Web Graph and other "Small World" Networks


ABSTRACTS
September 27

Methods and Models for the Web Graph
Jeannette Janssen

The World Wide Web can be seen as a graph whose nodes are the web pages and whose edges are the hyperlinks. It is an enormous graph, with over one and a half billion nodes. This graph has received a lot of attention lately, motivated mainly by the quest for better search engines. Mathematical methods and models are emerging, and more are needed, to understand the structure of the Web and extract information from it.

I will start by discussing some of the characteristics of the web graph, as shown by various studies. Next I will show how mathematical methods based on eigenvectors have led to a vast improvement in the performance of search engines. I will review the attempts that have been made, with limited success thus far, to automatically locate 'web communities', i.e. web pages that have a common topic and are highly interconnected. Finally, I will discuss two recent attempts to define a stochastic model for the web graph. These lead to a new concept of dynamic random graphs, based on growth by copying with error.
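As a rough illustration of how an eigenvector computation can rank pages, here is a minimal power-iteration sketch in the spirit of PageRank. It is not necessarily one of the methods discussed in the talk; the damping factor, tolerance, and toy link structure are illustrative assumptions.

    import numpy as np

    def pagerank(adj, damping=0.85, tol=1e-10, max_iter=100):
        """Power iteration for the principal eigenvector of a PageRank-style matrix.

        adj[i, j] = 1 if page i links to page j.
        """
        n = adj.shape[0]
        out_deg = adj.sum(axis=1)
        # Row-stochastic transition matrix; dangling pages jump uniformly.
        transition = np.where(out_deg[:, None] > 0,
                              adj / np.maximum(out_deg[:, None], 1),
                              1.0 / n)
        rank = np.full(n, 1.0 / n)
        for _ in range(max_iter):
            new_rank = damping * (transition.T @ rank) + (1 - damping) / n
            if np.abs(new_rank - rank).sum() < tol:
                return new_rank
            rank = new_rank
        return rank

    # Toy link structure: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
    adj = np.array([[0, 1, 1],
                    [0, 0, 1],
                    [1, 0, 0]], dtype=float)
    print(pagerank(adj))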

Links:



October 4 and 11

An introduction to the Haar Wavelet Family
Jason Brown

The parents of the Haar Family are a rather simple yet constant father function and a more complex mother. The son functions resemble their fathers, while the daughters take after their mother. A visit with this family serves as an introduction to wavelets and how linear algebra can be used to compress data.
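As a small illustration of the averaging-and-differencing idea behind the Haar transform, here is a sketch of one transform level applied to a toy signal, with small detail coefficients zeroed out to compress it. The signal and threshold are made up for the example.

    import numpy as np

    def haar_step(signal):
        """One level of the Haar transform: pairwise averages and differences."""
        pairs = signal.reshape(-1, 2)
        averages = pairs.mean(axis=1)               # scaling ("father"-side) coefficients
        details = (pairs[:, 0] - pairs[:, 1]) / 2   # wavelet ("mother"-side) coefficients
        return averages, details

    def haar_compress(signal, threshold):
        """Transform, zero out small detail coefficients, then reconstruct."""
        averages, details = haar_step(np.asarray(signal, dtype=float))
        details = np.where(np.abs(details) < threshold, 0.0, details)
        reconstructed = np.empty(2 * len(averages))
        reconstructed[0::2] = averages + details
        reconstructed[1::2] = averages - details
        return reconstructed

    signal = [4.0, 6.0, 10.0, 12.0, 8.0, 8.0, 2.0, 0.0]
    # With threshold 1.5 the +-1 details are dropped, giving a coarser but
    # much more compressible approximation of the original signal.
    print(haar_compress(signal, threshold=1.5))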

Reference:



October 18 and 25

Similarity and Structure in Networked Information Spaces
Evangelos Milios

Large networked information spaces, such as the World Wide Web and the Citation Graph, are becoming amenable to representation and analysis with desktop computing resources. Search engines have been indexing the WWW for some time now, an exercise that requires substantial computing resources, which are increasing exponentially, just as the WWW itself is. We are interested in problems involving networked information spaces that can be solved with constant computing resources, independent of the size of the space, either with or without the help of search engines (indexing and back-link tracing). We will discuss approaches to three problems related to networked information spaces.

1. Characterization of the Citation Graph for Computer Science. We use various techniques from graph theory to study the properties of the citation graph, constructed from ResearchIndex (Citeseer). Results so far have shown that the citation graph is highly connected, and that such connectivity does not depend on particular hub or authority papers.

2. A similarity measure for computer science papers based on the similarity of their neighbourhoods in the citation graph, defined either in terms of the size of a minimum cut separating the two papers or in terms of the overlap in authority papers between the two neighbourhoods (see the sketch after this list). The method has been observed to uncover unexpected relationships.

3. Construction of a highly connected context graph for focused crawling on the WWW.
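A minimal sketch of the minimum-cut variant of the similarity measure in item 2, assuming unit-capacity edges in an undirected citation graph; the toy citation data and the use of networkx are illustrative, not part of the actual study.

    import networkx as nx

    def min_cut_similarity(citations, paper_a, paper_b):
        """Size of a minimum edge cut separating two papers in the citation graph.

        `citations` is a list of (citing, cited) pairs.  Edges are treated as
        undirected with unit capacity, so the cut size equals the local edge
        connectivity between the two papers; a larger value suggests a more
        densely interconnected shared neighbourhood.
        """
        graph = nx.Graph(citations)
        if paper_a not in graph or paper_b not in graph:
            return 0
        return nx.edge_connectivity(graph, paper_a, paper_b)

    # Hypothetical toy citation data, not drawn from ResearchIndex.
    citations = [("p1", "p2"), ("p1", "p3"), ("p2", "p3"),
                 ("p3", "p4"), ("p4", "p5"), ("p2", "p4")]
    print(min_cut_similarity(citations, "p1", "p4"))  # 2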

The work presented is joint research with Jeannette Janssen and Nathalie Japkowicz.

Links:



November 22

Models for the web graph
Nauzer Kalyaniwalla

Several models describing the growth and characterising the large-scale structure of the web graph have appeared recently. Two models, one based on preferential attachment and the other based on copying with error, will be discussed and compared.
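To make the two kinds of models concrete, here is a rough sketch of each growth rule: degree-proportional attachment and link copying with a chance of error. The parameters, seed graphs, and other details are illustrative assumptions, not the specific models compared in the talk.

    import random

    def preferential_attachment(n, m=2, seed=0):
        """Each new node attaches to m existing nodes chosen with probability
        proportional to their current degree."""
        rng = random.Random(seed)
        edges = [(0, 1), (1, 2), (0, 2)]            # small seed clique
        pool = [u for e in edges for u in e]        # degree-weighted sampling pool
        for new_node in range(3, n):
            chosen = set()
            while len(chosen) < m:
                chosen.add(rng.choice(pool))
            for old_node in chosen:
                edges.append((new_node, old_node))
                pool.extend([new_node, old_node])
        return edges

    def copying_model(n, out_deg=2, copy_error=0.2, seed=0):
        """Each new node copies the out-links of a random prototype node,
        replacing each copied link by a uniform random target with
        probability copy_error."""
        rng = random.Random(seed)
        out_links = {0: [0] * out_deg}              # seed node with self-links
        for new_node in range(1, n):
            prototype = rng.randrange(new_node)
            links = [rng.randrange(new_node) if rng.random() < copy_error else target
                     for target in out_links[prototype]]
            out_links[new_node] = links
        return out_links

    print(len(preferential_attachment(100, m=2)))   # number of edges grown
    print(copying_model(10))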

Links:



November 22

Compressibility of graphs
Jeannette Janssen

In order to find local structures in a large graph we first need a good definition of structure. One way of defining structure is as the opposite of randomness. For sequences, a measure of their randomness is closely linked to the maximum compressibility of the sequence. Could it be possible to extend this concept to graphs?

I will discuss a recent compression method proposed by Adler and Mitzenmacher. The algorithm is based on the premise that, in a graph with local structure, the neighbourhood of one node probably looks similar to that of another node. This premise also lies at the heart of the stochastic web graph model proposed by Kumar et al. I will spend some time discussing this model as well.
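As an illustration of the premise only (not the Adler-Mitzenmacher algorithm itself), here is a sketch that encodes each node's neighbour list as a difference against the previous node's list; when nearby nodes really do have similar neighbourhoods, the differences are short and the representation is small. The toy adjacency lists are made up.

    def diff_encode(adjacency):
        """Encode each node's neighbour list as additions and removals relative
        to the previous node's list."""
        encoded, previous = {}, set()
        for node in sorted(adjacency):
            current = set(adjacency[node])
            encoded[node] = {"added": sorted(current - previous),
                             "removed": sorted(previous - current)}
            previous = current
        return encoded

    def diff_decode(encoded):
        """Recover the original neighbour lists from the difference encoding."""
        adjacency, previous = {}, set()
        for node in sorted(encoded):
            current = (previous - set(encoded[node]["removed"])) | set(encoded[node]["added"])
            adjacency[node] = sorted(current)
            previous = current
        return adjacency

    # Toy adjacency lists with highly overlapping neighbourhoods.
    adjacency = {0: [1, 2, 3, 4], 1: [1, 2, 3, 5], 2: [2, 3, 5, 6]}
    encoded = diff_encode(adjacency)
    print(encoded)
    assert diff_decode(encoded) == {n: sorted(v) for n, v in adjacency.items()}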

Links:



January 17, 2002

Focused web crawling using word graphs
Adam Nickerson

The exponentially increasing amount of content on the Web makes finding documents of interest especially difficult. Focused crawlers have been proposed as a way of selectively following hyperlinks that are most likely to lead to relevant documents.

Several techniques for focused crawling have been developed, including methods that examine the content of a page to determine whether its links are relevant and methods that take into consideration the link structure of the web and the context in which a page appears. The current methods do not, however, take into account the changes in content that occur from a page to its children.

Consider a method where the changes in content from a web page A to one of its children B are tracked using a graph. The graph consists of nodes that are each labeled with a word, and a directed edge (u, v) between two nodes whenever word u occurs on page A and word v occurs on page B. The edge is weighted with the observed probability of that transition occurring.
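A minimal sketch of how such a word graph might be assembled from observed (parent page, child page) text pairs. The naive whitespace tokenisation and the probability estimate (fraction of observed pairs in which the transition occurs) are simplifying assumptions, and the sample pages are invented.

    from collections import defaultdict
    from itertools import product

    def build_word_graph(page_pairs):
        """Build a weighted, directed word graph from (parent_text, child_text) pairs.

        An edge (u, v) is weighted with the fraction of observed parent-child
        pairs in which word u appears on the parent and word v on the child.
        """
        counts = defaultdict(int)
        for parent_text, child_text in page_pairs:
            parent_words = set(parent_text.lower().split())
            child_words = set(child_text.lower().split())
            for u, v in product(parent_words, child_words):
                counts[(u, v)] += 1
        return {edge: count / len(page_pairs) for edge, count in counts.items()}

    # Invented page contents; real pages would need proper tokenisation.
    pairs = [("graph theory seminar", "web graph models"),
             ("graph models", "preferential attachment models")]
    word_graph = build_word_graph(pairs)
    print(word_graph[("graph", "models")])   # 1.0: seen in both observed pairs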

Potential uses of this graph in focused crawling will be discussed.

Links:



February 7, 2002

Algorithms for finding dense components in a graph
Richard Hoshino

In this talk, we will study the problem of finding highly connected subgraphs of some large graph G. To do this, we will define f(G), the density of a graph, and develop algorithms that will determine the exact value of this density. We will then test our algorithms on various graphs, using Maple.

We will show that the problem of finding f(G) is equivalent to solving a linear program. Thus, a large part of the talk will discuss ideas and concepts in linear programming, and explain how it relates to the problem of determining the exact value of f(G).
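Assuming the density f(G) in question is the maximum of |E(S)|/|S| over subgraphs S (the densest-subgraph value), the standard exact linear program for it, usually attributed to Charikar, looks as follows. This sketch uses SciPy rather than Maple and illustrates the LP connection; it is not the talk's own formulation or code.

    import numpy as np
    from scipy.optimize import linprog

    def max_density(edges):
        """Maximum subgraph density max_S |E(S)| / |S| via a linear program.

        Variables: x_e for each edge e and y_v for each vertex v.
        Maximise sum_e x_e subject to x_e <= y_u and x_e <= y_v for each
        edge e = (u, v), sum_v y_v <= 1, and all variables non-negative.
        """
        vertices = sorted({v for e in edges for v in e})
        index = {v: i for i, v in enumerate(vertices)}
        m, n = len(edges), len(vertices)

        c = np.concatenate([-np.ones(m), np.zeros(n)])   # minimise -sum x_e
        A_ub = np.zeros((2 * m + 1, m + n))
        for i, (u, v) in enumerate(edges):
            A_ub[2 * i, i] = 1.0
            A_ub[2 * i, m + index[u]] = -1.0             # x_e <= y_u
            A_ub[2 * i + 1, i] = 1.0
            A_ub[2 * i + 1, m + index[v]] = -1.0         # x_e <= y_v
        A_ub[-1, m:] = 1.0                               # sum_v y_v <= 1
        b_ub = np.concatenate([np.zeros(2 * m), [1.0]])

        result = linprog(c, A_ub=A_ub, b_ub=b_ub)        # variables default to >= 0
        return -result.fun

    # K4 plus a pendant vertex: the K4 alone has density 6/4 = 1.5.
    edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
    print(round(max_density(edges), 3))                  # 1.5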

We are interested in measuring the density of a graph so that we can quantify this notion of being "highly connected". This is especially useful when we apply the concept to analyze sparse graphs such as the Web Graph and the Citation Graph.

In this talk, we will only examine undirected graphs. However, the same ideas and techniques can be used to find the density of any directed graph. For more information, please see the links.

Links:



April, 2002

Graph partitioning and data clustering
Biao Chen

Text clustering is a well-studied problem. Previous work is mostly based on machine learning algorithms, which often have many parameters and are not easy to understand. This talk presents two novel algorithms, based on a bipartite graph model, that can be applied to both text clustering and graph partitioning. Both make good use of eigenvectors and yield satisfactory results.
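As one illustration of how eigenvectors (here, singular vectors of a normalised term-document matrix) can partition a bipartite graph, here is a two-way spectral co-clustering sketch. It is a generic spectral method, not necessarily either of the algorithms presented in the talk, and the toy matrix is invented.

    import numpy as np

    def bipartite_spectral_split(A):
        """Two-way co-clustering of a term-document matrix via the second
        singular vectors of the normalised bipartite adjacency matrix.

        A[i, j] is the weight of term i in document j.  Terms and documents
        are assigned to cluster 0 or 1 by the sign of their coordinate in the
        rescaled second singular vectors (a simple one-dimensional embedding).
        """
        d1 = np.sqrt(A.sum(axis=1))            # term degrees
        d2 = np.sqrt(A.sum(axis=0))            # document degrees
        normalised = A / d1[:, None] / d2[None, :]
        U, _, Vt = np.linalg.svd(normalised, full_matrices=False)
        term_coords = U[:, 1] / d1             # second left singular vector, rescaled
        doc_coords = Vt[1, :] / d2             # second right singular vector, rescaled
        return (term_coords > 0).astype(int), (doc_coords > 0).astype(int)

    # Toy term-document matrix with two loosely connected topics.
    A = np.array([[3.0, 2.0, 0.0, 1.0],
                  [2.0, 3.0, 1.0, 0.0],
                  [0.0, 1.0, 3.0, 2.0],
                  [1.0, 0.0, 2.0, 3.0]])
    terms, docs = bipartite_spectral_split(A)
    print(terms, docs)   # terms/documents {0, 1} land in one cluster, {2, 3} in the other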

Links:
