Modelling and Mining of Networked Information Spaces

Project Title: Focused crawling based on word graphs
Participants: Hongyu Li, Adam Nickerson, Dr. Jeannette Janssen, Dr. Evangelos Milios
Project Description: Focused crawlers have been proposed as a response to the exponentially increasing size and multimedia content of the Web, which make exhaustive Web indexing problematic in the long run. A focused crawler takes the user's query (or problem description) and searches the Web for relevant pages by following hyperlinks, starting from locations the user provides. Judging which hyperlinks to pursue purely on the basis of the user's query may not be a good idea, because it fails to capture the knowledge that certain types of web pages tend to be "close" to pages of interest. For example, a university site may contain papers on neural networks without this being evident from the text of its root web page; knowing that universities contain computer science departments, which may employ professors doing research in neural networks, makes the root page of a university a good place from which to explore.
Diligenti et al. proposed the notion of a context graph (VLDB 2000), in which target documents sit at the centre and are surrounded by layers of pages at increasing link distance; the layers (nodes) of the context graph are "learned" by training classifiers to recognize the pages that belong to them.
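As a rough illustration only (not the implementation of Diligenti et al.), a context graph can be represented as layers of example pages keyed by link distance from the target documents, with a text classifier trained to assign a new page to a layer. The use of TF-IDF features and a single multi-class naive Bayes classifier (rather than one classifier per layer) is an assumption made for this sketch:

    # Sketch: context graph as layers of example page texts, plus a classifier
    # that estimates which layer (link distance from a target) a new page is in.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def train_layer_classifier(layers):
        """layers: dict mapping layer index -> list of page texts
        (layer 0 = target documents, layer k = pages k links upstream).
        Returns (vectorizer, classifier) predicting the layer of a new page."""
        texts, labels = [], []
        for layer, pages in layers.items():
            texts.extend(pages)
            labels.extend([layer] * len(pages))
        vectorizer = TfidfVectorizer(stop_words="english")
        X = vectorizer.fit_transform(texts)
        clf = MultinomialNB().fit(X, labels)
        return vectorizer, clf

    def predict_layer(vectorizer, clf, page_text):
        """Estimate how many links away a page likely is from a target document."""
        return int(clf.predict(vectorizer.transform([page_text]))[0])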

In this project, we will build a system that observes a user's browsing and allows them to record documents of interest that they encounter. The system will also be able to observe many users simultaneously, and can therefore combine data about individual preferences while taking advantage of the knowledge of an entire class of users' preferences. In addition to recording the link structure of the users' web browsing, we will create a weighted digraph whose nodes are words found in the web pages and whose edges represent links between pages that contained those words. After such a graph has been built by observing users' browsing habits, their target documents will be analyzed to determine the statistically significant words; the corresponding nodes in the graph will be marked and assigned a score depending on the significance of each word. When the focused crawler takes over to look for new documents, it faces the following question: "is it worth following the links on this page or not?" To determine the answer, we consult the word graph.
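A minimal Python sketch of the word-graph idea, under stated assumptions: word significance in the target documents is approximated by raw frequency (a placeholder for a proper statistical test), and the crawl decision is a simple weighted sum of edge weights; neither choice is part of the project specification.

    # Sketch of a weighted word digraph built from observed browsing, with a
    # placeholder scoring rule for deciding whether a page's links are promising.
    from collections import defaultdict, Counter

    class WordGraph:
        def __init__(self):
            # weight[s][t]: strength of the edge from word s on a source page
            # to word t on a page reached by following a link from that page
            self.weight = defaultdict(Counter)
            # scores attached to words found significant in target documents
            self.word_score = {}

        def observe_link(self, source_words, target_words):
            """Record one browsed hyperlink: words on the source page gain
            weighted edges to words on the page the user followed the link to."""
            for s in set(source_words):
                for t in set(target_words):
                    self.weight[s][t] += 1

        def score_targets(self, target_documents, top_k=20):
            """Mark the significant words of the recorded target documents
            (approximated here by raw frequency) with a score."""
            counts = Counter(w for doc in target_documents for w in doc)
            for word, count in counts.most_common(top_k):
                self.word_score[word] = count

        def page_promise(self, page_words):
            """Estimate whether the links on a page are worth following by
            summing edge weights from the page's words to scored target words."""
            return sum(self.weight[w][t] * score
                       for w in set(page_words)
                       for t, score in self.word_score.items())

In this sketch the crawler would follow the links on a fetched page only when page_promise exceeds some threshold, which could be tuned on the recorded browsing data.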