GRAPHLETS AND GRAPH EVOLUTION:
Methods and Models for the Web Graph and other "Small World" Networks
ABSTRACTS
September 27, 2001
Methods and Models for the Web Graph
Jeannette Janssen
The World Wide Web can be seen as a graph whose nodes are the web pages
and whose edges are the hyperlinks. It is an enormous graph, with over one and a
half billion nodes. This graph has received a lot of attention lately,
mainly motivated by the quest for better search engines. Mathematical
methods and models are emerging, and more are needed, in order to
understand the structure of the Web and extract information from it.
I will start by discussing some of the characteristics of the web graph,
as shown by various studies. Next I will show how mathematical methods
based on eigenvectors have led to a vast improvement in the performance
of search engines. I will review the attempts that have been made, with
limited success thus far, to automatically locate 'web communities', i.e.
web pages that have a common topic and are highly interconnected. Finally,
I will discuss two recent attempts to define a stochastic model for the
web graph. These lead to a new concept of dynamic random graphs, based on
growth by copying with error.
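As a rough illustration of the kind of eigenvector-based ranking involved, here is a minimal PageRank-style power iteration on a toy link graph; this is a sketch only, not the algorithm of any particular search engine, and the damping factor and toy graph are arbitrary choices:

    # Minimal PageRank-style power iteration on a toy link graph.
    # Illustrative sketch only, not the ranking method of any actual search engine.

    def pagerank(links, damping=0.85, iterations=50):
        """links: dict mapping each page to the list of pages it links to."""
        pages = list(links)
        n = len(pages)
        rank = {p: 1.0 / n for p in pages}
        for _ in range(iterations):
            new_rank = {p: (1.0 - damping) / n for p in pages}
            for p, outlinks in links.items():
                if outlinks:
                    share = damping * rank[p] / len(outlinks)
                    for q in outlinks:
                        new_rank[q] += share
                else:  # dangling page: spread its rank uniformly
                    for q in pages:
                        new_rank[q] += damping * rank[p] / n
            rank = new_rank
        return rank

    links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    print(pagerank(links))  # page "c" collects the most rank in this toy graph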
October 4 and 11, 2001
An introduction to the Haar Wavelet Family
Jason Brown
The parents of the Haar Family are a rather simple yet constant
father function and a more complex mother. The son functions resemble
their father, while the daughters take after their mother. A visit with
this family serves as an introduction to wavelets and how linear algebra
can be used to compress data.
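As a taste of the linear algebra involved, here is a minimal sketch of one level of a Haar transform (pairwise averages and differences) together with a crude thresholding step; the normalisation and the threshold value are arbitrary choices made for illustration:

    # One level of the Haar transform: pairwise averages ("father"/scaling part)
    # and pairwise differences ("mother"/wavelet part), plus crude thresholding.
    # Illustrative sketch only; normalisation conventions vary.

    def haar_step(signal):
        averages = [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
        details  = [(signal[i] - signal[i + 1]) / 2 for i in range(0, len(signal), 2)]
        return averages, details

    def compress(signal, threshold=0.5):
        averages, details = haar_step(signal)
        details = [d if abs(d) > threshold else 0 for d in details]  # drop small details
        return averages, details

    def reconstruct(averages, details):
        signal = []
        for a, d in zip(averages, details):
            signal.extend([a + d, a - d])
        return signal

    data = [4, 6, 10, 12, 8, 7, 5, 5]
    # The small detail in the third pair is dropped, so the reconstruction is approximate.
    print(reconstruct(*compress(data)))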
Reference:
- E. Aboufadel and S. Schlicker, Discovering Wavelets, Wiley, 1999. ISBN 0-471-33193-7.
October 18 and 25, 2001
Similarity and Structure in Networked Information Spaces
Evangelos Milios
Large networked information spaces, such as the World Wide Web and the
Citation Graph, are becoming amenable to representation and analysis
with desktop computing resources. Search engines have been indexing
the WWW for some time now, an exercise that requires substantial
computing resources, which are increasing exponentially, just as the
WWW itself is. We are interested in problems involving networked
information spaces that can be solved with constant computing
resources, independent of the size of the space, either with or
without the help of search engines (indexing and back-link
tracing). We will discuss approaches to three problems related to
networked information spaces.
1. Characterization of the Citation Graph for Computer Science. We
use various techniques from graph theory to study the properties of
the citation graph, constructed from ResearchIndex (Citeseer). Results
so far have shown that the citation graph is highly connected, and
that such connectivity does not depend on particular hub or authority
papers.
2. A similarity measure for computer science papers based on the
similarity of their neighbourhoods in the citation graph, defined
either in terms of the size of a minimum cut separating the two
papers, or in terms of the overlap in authority papers between the two
neighbourhoods; a rough sketch of the minimum-cut variant follows this
list. The method has been seen to uncover unexpected relations.
3. Construction of a highly connected context graph for focused
crawling on the WWW.
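A minimal illustration of the minimum-cut variant of such a measure, using networkx on a toy undirected citation graph; the toy graph, and the use of the raw cut size as a relatedness score, are assumptions made purely for illustration:

    # Sketch: score the relatedness of two papers by the size of a minimum edge cut
    # separating them in an undirected citation graph -- the more edges one must
    # remove to disconnect the papers, the more strongly their neighbourhoods are
    # tied together. Illustration only; not the exact measure or data studied above.
    import networkx as nx

    def mincut_similarity(citation_graph, paper_a, paper_b):
        return len(nx.minimum_edge_cut(citation_graph, paper_a, paper_b))

    G = nx.Graph()
    G.add_edges_from([("p1", "p2"), ("p1", "p3"), ("p2", "p3"),
                      ("p3", "p4"), ("p4", "p5")])

    print(mincut_similarity(G, "p1", "p3"))  # densely tied pair -> larger cut
    print(mincut_similarity(G, "p1", "p5"))  # loosely tied pair -> smaller cut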
The work presented is joint research with Jeannette Janssen and
Nathalie Japkowicz.
November 22, 2001
Models for the web graph
Nauzer Kalyaniwalla
Several models describing the growth and characterising the large-scale
structure of the web graph have appeared recently. Two models, one
based on preferential attachment and the other based on copying with
error, will be discussed and compared.
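As a rough illustration of the first family of models, here is a minimal preferential-attachment growth process in which each new node links to an existing node chosen with probability proportional to its current degree; this is a sketch of the general idea only, and the models discussed in the talk differ in their details:

    # Minimal preferential-attachment growth: each new node attaches to an existing
    # node chosen with probability proportional to its current degree. A sketch of
    # the model family only.
    import random

    def preferential_attachment(n_nodes, seed_edges=(0, 1)):
        edges = [tuple(seed_edges)]
        degree = {seed_edges[0]: 1, seed_edges[1]: 1}
        for new in range(2, n_nodes):
            # Picking a random endpoint of a uniformly random edge selects a node
            # with probability proportional to its degree.
            target = random.choice(random.choice(edges))
            edges.append((new, target))
            degree[new] = 1
            degree[target] += 1
        return edges, degree

    edges, degree = preferential_attachment(1000)
    # Compare the largest ("hub") degree with the median degree.
    print(max(degree.values()), sorted(degree.values())[len(degree) // 2])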
November 22, 2001
Compressibility of graphs
Jeannette Janssen
In order to find local structures in a large graph we first need a good
definition of structure. One way of defining structure is as the opposite
of randomness. For sequences, a measure of their randomness is closely
linked to the maximum compressibility of the sequence. Could it be
possible to extend this concept to graphs?
I will discuss a recent compression method proposed by Adler and
Mitzenmacher. The algorithm is based on the premise that, in a graph with
local structure, the neighbourhood of one node probably looks similar to
that of another node. This premise also lies at the heart of the
stochastic web graph model proposed by Kumar et al. I will spend some time
discussing this model as well.
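A toy illustration of that premise (not the Adler and Mitzenmacher algorithm itself): if one node's neighbourhood closely resembles another's, its adjacency list can be stored as a short list of differences against the other node's list.

    # Toy illustration of the compression premise: if node v's neighbourhood
    # resembles node u's, store v's adjacency list as "copy u's list, then add and
    # remove a few neighbours". A sketch of the idea only, not the actual algorithm.

    def encode_against_reference(adj, node, reference):
        added   = sorted(adj[node] - adj[reference])
        removed = sorted(adj[reference] - adj[node])
        return {"ref": reference, "add": added, "remove": removed}

    def decode(adj, record):
        return (adj[record["ref"]] | set(record["add"])) - set(record["remove"])

    adj = {
        "u": {1, 2, 3, 4, 5},
        "v": {1, 2, 3, 4, 6},   # nearly the same neighbourhood as u
    }
    record = encode_against_reference(adj, "v", "u")
    print(record)                            # a small diff instead of the full list
    print(decode(adj, record) == adj["v"])   # True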
January 17, 2002
Focused web crawling using word graphs
Adam Nickerson
The exponentially increasing amount of content on the Web makes
finding documents of interest especially difficult. Focused crawlers
have been proposed as a way of selectively following hyperlinks that
are most likely to lead to relevant documents.
Several techniques for focused crawling have been developed, including
methods that examine the content of a page to determine whether its
links are relevant and those that take into consideration the link
structure of the web and the context in which a page appears. The
current methods do not, however, take into account the changes in
content that occur from a page to its children.
Consider a method where the changes in content from a web page A to
one of its children B are tracked using a graph. Each node of the graph
is labelled with a word, and there is a directed edge from node (word) u
to node (word) v if word u occurs on page A and word v occurs on
page B. The edge is weighted with the observed probability of that
situation occurring.
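A minimal sketch of how such a word graph might be built from observed (parent page, child page) word sets, with each edge weight estimated as a conditional frequency; the data structures and the particular probability estimate are assumptions made here for illustration:

    # Sketch: build the word graph from observed (parent page, child page) word sets.
    # An edge u -> v records how often word v appeared on a child page given that
    # word u appeared on the parent. The probability estimate used in the actual
    # work may differ.
    from collections import defaultdict

    def build_word_graph(page_pairs):
        counts = defaultdict(lambda: defaultdict(int))  # counts[u][v]
        occurrences = defaultdict(int)                  # how often u appeared on a parent
        for parent_words, child_words in page_pairs:
            for u in set(parent_words):
                occurrences[u] += 1
                for v in set(child_words):
                    counts[u][v] += 1
        return {u: {v: c / occurrences[u] for v, c in targets.items()}
                for u, targets in counts.items()}

    pairs = [({"graph", "web"}, {"crawler", "graph"}),
             ({"web", "search"}, {"crawler", "ranking"})]
    print(build_word_graph(pairs)["web"])  # e.g. {'crawler': 1.0, 'graph': 0.5, 'ranking': 0.5}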
Potential uses of this graph in focused crawling will be discussed.
February 7, 2002
Algorithms for finding dense components in a graph
Richard Hoshino
In this talk, we will study the problem of finding highly connected
subgraphs of some large graph G. To do this, we will define f(G), the
density of a graph, and develop algorithms that will determine the
exact value of this density. We will then test our algorithms on various
graphs, using Maple.
We will show that the problem of finding f(G) is equivalent to solving a
linear program. Thus, a large part of the talk will discuss ideas and
concepts in linear programming, and explain how it relates to the problem
of determining the exact value of f(G).
We are interested in measuring the density of a graph so that we can
quantify this notion of being "highly connected". This is especially useful
when we apply this concept to analyze sparse graphs such as the Web Graph
and the Citation Graph.
In this talk, we will only examine undirected graphs. However, the same
ideas and techniques can be used to find the density of any directed
graph. For more information, please see the links.
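The exact value of f(G) will come from the linear program discussed in the talk; as a point of contrast, here is a sketch of the greedy peeling heuristic from the Charikar paper linked below, which repeatedly removes a minimum-degree vertex and remembers the densest subgraph seen along the way. Density is taken here to be |E(S)|/|S|, an assumption about the definition of f(G) used in the talk; the greedy gives a 2-approximation for this measure, not the exact value.

    # Greedy peeling in the spirit of the Charikar paper linked below: repeatedly
    # remove a vertex of minimum degree and keep the densest subgraph seen so far.
    # Density here is |E(S)| / |S|; this gives an approximation, not the exact LP value.

    def densest_subgraph_greedy(adj):
        adj = {v: set(nbrs) for v, nbrs in adj.items()}
        edges = sum(len(nbrs) for nbrs in adj.values()) // 2
        best_density, best_vertices = 0.0, set(adj)
        while adj:
            density = edges / len(adj)
            if density > best_density:
                best_density, best_vertices = density, set(adj)
            v = min(adj, key=lambda u: len(adj[u]))   # minimum-degree vertex
            edges -= len(adj[v])
            for u in adj[v]:
                adj[u].discard(v)
            del adj[v]
        return best_density, best_vertices

    # A 4-clique {1, 2, 3, 4} with a pendant vertex 5 attached to vertex 4.
    graph = {1: {2, 3, 4}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {1, 2, 3, 5}, 5: {4}}
    print(densest_subgraph_greedy(graph))  # (1.5, {1, 2, 3, 4}) -- the 4-clique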
Links:
- Presentation slides (PDF, PostScript).
- M. Charikar, "Greedy approximation algorithms for finding dense
  components in a graph", APPROX 2000.
- Maple worksheet containing the algorithms presented in the lecture.
April 2002
Graph partitioning and data clustering
Biao Chen
Text clustering is a well-studied problem. Previous work is mostly
based on machine learning algorithms, which often have many parameters
and are not easy to understand. This talk presents two novel
algorithms, based on a bipartite graph model, that can be applied to
both text clustering and graph partitioning. Both make good use of
eigenvectors and yield satisfactory results.
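A minimal sketch of the bipartite spectral idea: normalise the document-word matrix and split documents and words by the sign of its second singular vectors. This only illustrates how eigenvectors enter the picture, and is not necessarily either of the two algorithms presented in the talk; the toy matrix is an arbitrary example in which every document and every word occurs at least once.

    # Sketch of spectral bipartitioning on a bipartite document-word graph:
    # normalise the document-word matrix by the square roots of its row and column
    # sums, take the second left/right singular vectors, and split documents and
    # words by sign. Illustration of the eigenvector idea only.
    import numpy as np

    def spectral_copartition(A):
        """A: documents x words count matrix (every row and column nonzero)."""
        d1 = np.diag(1.0 / np.sqrt(A.sum(axis=1)))   # document weights
        d2 = np.diag(1.0 / np.sqrt(A.sum(axis=0)))   # word weights
        An = d1 @ A @ d2
        U, _, Vt = np.linalg.svd(An)
        doc_side  = U[:, 1] >= 0    # second left singular vector
        word_side = Vt[1, :] >= 0   # second right singular vector
        return doc_side, word_side

    A = np.array([[3, 2, 0, 1],    # docs 0, 1 mostly use words 0, 1
                  [2, 3, 1, 0],
                  [0, 1, 3, 2],    # docs 2, 3 mostly use words 2, 3
                  [1, 0, 2, 3]], dtype=float)
    docs, words = spectral_copartition(A)
    # Documents {0, 1} group with words {0, 1}, and documents {2, 3} with
    # words {2, 3}; the True/False labels may be swapped between runs.
    print(docs, words)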