Modelling and Mining of Networked Information Spaces

Project Title: Focused Crawling By Learning The User's Topic-Specific Browsing Patterns
Participants: Hongyu Liu, Dr. Jeannette Janssen, Dr. Evangelos Milios
Project Description: Focused crawling has gained much attention in recent years because of the rapid growth of the Web. It is widely used to build domain-specific Web search portals and online personalized search tools. A focused crawler traverses the Web efficiently to gather documents on a specific topic. To do so, it must use information gleaned from previously crawled pages to estimate the relevance of a newly seen URL. Good performance therefore depends on powerful modeling of context as well as of current observations.
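
To make the idea of prioritizing newly seen URLs concrete, the following is a minimal sketch of a best-first crawl frontier. It is not the project's implementation: the names focused_crawl, get_links, estimate_relevance, TOY_WEB and TOY_SCORES are all hypothetical, the relevance scores are made up, and a small in-memory graph stands in for the Web so that the example runs without network access.

```python
import heapq

def focused_crawl(seeds, get_links, estimate_relevance, max_pages):
    """Visit up to max_pages pages, always expanding the URL with the highest estimate."""
    # heapq is a min-heap, so scores are negated to pop the best URL first.
    frontier = [(-1.0, url) for url in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    visited = []
    while frontier and len(visited) < max_pages:
        _, url = heapq.heappop(frontier)
        visited.append(url)
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                # The estimate may use the already crawled parent page as context.
                score = estimate_relevance(url, link)
                heapq.heappush(frontier, (-score, link))
    return visited

# Hypothetical toy graph and relevance scores, for illustration only.
TOY_WEB = {"seed": ["a", "b"], "a": ["target"], "b": ["c"], "target": [], "c": []}
TOY_SCORES = {"a": 0.8, "b": 0.2, "target": 0.9, "c": 0.1}

order = focused_crawl(
    ["seed"],
    lambda url: TOY_WEB.get(url, []),
    lambda parent, link: TOY_SCORES[link],
    max_pages=3,
)
print(order)
```

In this sketch the relevance estimate is a fixed lookup table; in a real focused crawler it would be produced by a learned model.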

We model the crawling process as an underlying Markov chain of hidden states, each defined by the number of hops separating a page from a target, from which the actual topics of the documents are observed. When a new document is seen, the task is to predict how many hops it is from a target. Within this framework, three probabilistic graphical models are applied, each with a different focus: (1) with Hidden Markov Models (HMMs), we focus on semantic content analysis learned from the user's browsing behavior on specific topics; (2) with Maximum Entropy Markov Models (MEMMs), we investigate the impact of multiple overlapping features; and (3) with Conditional Random Fields (CRFs), we further handle long-range interactions directly along the sequences. The goal is that, with a powerful composite criterion over important features, relevant paths can be identified effectively. Our research aims to take advantage of both the content and the link information embedded in the Web graph, and to provide fast navigation and access to online information.
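
As an illustration of the HMM variant, the sketch below applies standard forward filtering to a toy model in which hidden state i means "i hops away from a target" and each observation is the index of a topic cluster. The three-state layout, the two clusters (0 for on-topic, 1 for off-topic), and every probability are assumptions chosen for illustration; they are not parameters learned in the project, and the names forward_filter, INITIAL, TRANSITION and EMISSION are hypothetical.

```python
def normalize(values):
    """Scale a list of non-negative weights so that it sums to 1."""
    total = sum(values)
    return [v / total for v in values]

def forward_filter(initial, transition, emission, observations):
    """Return P(hidden state | observations so far) for the last page on a path."""
    n = len(initial)
    # Weight the prior by the likelihood of the first observation.
    belief = normalize([initial[s] * emission[s][observations[0]] for s in range(n)])
    for obs in observations[1:]:
        # Predict: propagate the belief one step along the Markov chain.
        predicted = [sum(belief[p] * transition[p][s] for p in range(n)) for s in range(n)]
        # Update: reweight by the likelihood of the observed topic cluster.
        belief = normalize([predicted[s] * emission[s][obs] for s in range(n)])
    return belief

# Made-up parameters: states are 0, 1 and 2 hops from a target.
INITIAL = [0.2, 0.3, 0.5]
TRANSITION = [
    [0.5, 0.3, 0.2],  # from a target page
    [0.6, 0.3, 0.1],  # from 1 hop away
    [0.1, 0.6, 0.3],  # from 2 hops away
]
EMISSION = [
    [0.9, 0.1],  # target pages mostly show the on-topic cluster
    [0.5, 0.5],
    [0.1, 0.9],  # distant pages mostly show the off-topic cluster
]

on_topic_path = forward_filter(INITIAL, TRANSITION, EMISSION, [1, 0, 0])
off_topic_path = forward_filter(INITIAL, TRANSITION, EMISSION, [1, 1, 1])
print(on_topic_path, off_topic_path)
```

A crawler could use the probability assigned to the low-hop states as the priority of the links found on the last page of the path. MEMMs and CRFs would replace the emission table with feature-based conditional models, which this sketch does not attempt.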