MoMiNIS Seminar Series at Dalhousie University

Location: Room 430, Goldberg Computer Science Building, 6050 University Avenue (Building E600 on Studley Campus Map), Halifax, Nova Scotia, Canada

Time: Tuesday 11:30-13:00 (Thursday 11:30-13:00 or other times are also possible if necessary).

The MoMiNIS Seminar Series provides a research-oriented forum for prominent researchers to present their current research on the Modelling and Mining of Networked Information Spaces. The seminars are meant to appeal to a broad audience, and present both theoretical results in graph theory, machine learning and text mining, and their practical ramifications in areas such as Web mining, social network analysis, network management and security, and digital libraries.

Seminar Co-ordinators: Jeannette Janssen, Evangelos Milios, Nur Zincir-Heywood

June 21, 2012
Ricardo Baeza-Yates
Yahoo! Research, Barcelona
Search Engines and Social Media
jj, eem

June 23 or 24, 2012
Nelly Litvak
University of Twente
Extremal properties of Web graphs

June 23 or 24, 2012
Jennifer Chayes
Microsoft Research Cambridge, Mass
A weak distributional limit for preferential attachment graphs (tentative)

TBA
Mourad Debbabi
Concordia University
Network security protocols

Nov. 21, 2008
Allan Borodin
Univ. of Toronto
Personalized Search, Community Extraction in Blog sites
jj, eem

Mar. 3, 2009
George Forman
HP Labs
What Are You Talking About? Topic Recognition via Machine Learning Text Classification and Quantification
nzh, eem

Mar. 11, 2009
Ellen Zegura
Georgia Tech
NetMark: Selecting a Benchmark of Network Topologies

Aug. 12, 2009, 2.30pm
Natasa Przulj
Dept. of Computer Science, UC Irvine
From Network Topology to Biological Function and Disease

Aug. 13, 2009
Russell Greiner
Dept. of Computer Science, U. of Alberta
Budgeted Learning of Probabilistic Classifiers

Dec. 10, 2009
Stan Matwin
School of Information Technology and Engineering, U. of Ottawa
Privacy and Data Mining: New Developments and Challenges

Feb. 18, 2010
Hugh Chipman
Acadia Univ.
Mixed Membership Stochastic Blockmodels for multi-recipient transactions on a network (joint work with M. Mahdi Shafiei)
jj, eem

Mar. 30, 2010, 2.30pm
Aaron Clauset
Santa Fe Institute
The trouble with community detection in complex networks

Sep. 17, 2010
Edo Airoldi
Harvard University
Modeling approaches for analyzing complex networks

March 24, 2011
Bernie Hogan
Oxford Internet Institute, University of Oxford
Capture of online networks
ag, jj

May 3, 2011
Frank Tompa
University of Waterloo
Finding implicit lists and tables in web pages
eem, nzh

May 10, 2011
Laks V.S. Lakshmanan
University of British Columbia
Musings on Next Generation Recommender Systems

May 19, 2011
Julita Vassileva
University of Saskatchewan
Sharing experience in Social Computing, Persuasion and Science Outreach
nzh, eem

May 19, 2011
Jian Pei
Simon Fraser University
Query Friendly Compression and Analysis of Social Networks Using Multi-Position Linearization

March 15, 2012
Denilson Barbosa
University of Alberta
Towards Summarizing and Making Sense of the Blogosphere

Seminars will be announced by e-mail 1-2 weeks in advance.
To subscribe to the mailing list, please send an email to cs-seminars AT with your name, position, email address and home page.

To suggest speakers or topics, or to volunteer for a seminar, please contact one of the co-ordinators.

We gratefully acknowledge MPrime (formerly MITACS), NSERC and the Faculty of Computer Science, Dalhousie University, for their financial and logistical support of the seminar series.

Last updated Saturday, 17-Mar-2012 20:20


Title: WHAT ARE YOU TALKING ABOUT? Text classification and quantification via machine learning
Speaker: Dr. George Forman
Hewlett Packard Labs, Port Orchard, WA

Date: Tuesday, March 3, 2009
Time: 11:30 a.m.
Location: Jacob Slonim Conference Room (430), 6050 University Ave., Halifax

In theory, practice is the same as theory, but in practice it is not. In the process of applying proven text classification methods from the
research literature to business-driven problems at Hewlett-Packard, I have encountered substantial failures and gaps. Investigating the
failures in detail has repeatedly led to new discoveries and perspectives for research that are simply not afforded by the academic
benchmark datasets and problem formulations. In this talk, I will describe two interesting applications of supervised machine learning
that we have deployed inside Hewlett-Packard, as well as the challenging, fundamental research opportunities they have led to.

Short Biography:
George Forman is a senior research scientist at Hewlett-Packard Labs. His research interests stem from practical issues that arise in the
application of machine learning to industrial problems, e.g. feature selection, robustness, small training sets, and novel problem
formulations, such as quantification. His Ph.D. in Computer Science & Engineering is from the University of Washington, Seattle, 1996.

Title: NetMark: Selecting a Benchmark of Network Topologies
Speaker: Dr. Ellen Zegura, Georgia Tech

Wednesday, March 11, 9.30am, Colloquium room, Chase building

Prof. Zegura's research work concerns the development of wide-area (Internet) networking services and, more recently, mobile wireless networking. Wide-area services are utilized by applications that are distributed across multiple administrative domains (e.g., web, file sharing, multi-media distribution). Her focus is on services implemented both at the network layer, as part of network infrastructure, and at the application layer. In the context of mobile wireless networking, she is interested in challenged environments where traditional ad-hoc and infrastructure-based networking approaches fail. These environments have been termed Disruption Tolerant Networks.

Ellen W. Zegura received the B.S. degree in Computer Science (1987), the B.S. degree in Electrical Engineering (1987), the M.S. degree in
Computer Science (1990) and the D.Sc. in Computer Science (1993) all from Washington University, St. Louis, Missouri. Since 1993, she has been on the faculty in the College of Computing at Georgia Tech. She was an Assistant Dean in charge of Space and Facilities Planning from Fall 2000 to January 2003. She served as Interim Dean of the College for six months in 2002. Since February 2003, she has been an Associate Dean, with responsibilities ranging from Research and Graduate Programs to Space and Facilities Planning. She has spent five years as the user representative in the planning of the Klaus Advanced Computing Technologies Building, scheduled to open in Fall 2006. Starting in August 2005, she has chaired the Computing Science and Systems Division of the College of Computing. She is the proud mom of two girls, Carmen (born in August 1998) and Bethany (born in May 2001).

Title: From Network Topology to Biological Function and Disease
Speaker: Natasa Przulj, Dept. of Computer Science, UC Irvine

Date: August 12, 2.30pm

We discuss our new tools that are advancing network analysis towards a theoretical understanding of the structure of biological networks. Analogous to tools for analyzing and comparing genetic sequences, we are developing new tools that decipher large network data sets, with the goal of improving biological understanding and contributing to development of new therapeutics. We demonstrate that local node similarity corresponds to similarity in biological function and involvement in disease. We introduce a systematic highly constraining measure of a network's local structure and demonstrate that protein-protein interaction (PPI) networks are better modeled by geometric graphs than by any previous model. The geometric model is further corroborated by demonstrating that PPI networks can explicitly be embedded into a low-dimensional geometric space. We also present a new network alignment algorithm.

Dr. Przulj is an Assistant Professor in the Department of Computer Science, UC Irvine. She is also a member of the UCI Cancer Center, the UCI Center for Complex Biological Systems (CCBS), UCI's program in Mathematical, Computational and Systems Biology (MCSB), and UCI’s Institute for Genomics and Bioinformatics (IGB). She received an NSF CAREER award for 2007-2011. She is on the Editorial Review Board of the International Journal of Knowledge Discovery in Bioinformatics (IJKDB). Dr. Przulj's research involves applications of graph theory, mathematical modeling, and computational techniques to solving large-scale problems in computational and systems biology. She is interested in computational and theoretical solutions to practical problems in many areas of systems biology, planar cell polarity, proteomics, cancer informatics, and chemo-informatics.

Title: Budgeted Learning of Probabilistic Classifiers
Speaker: Russell Greiner, Dept. of Computer Science, U. of Alberta

Researchers often use clinical trials to collect the data needed to evaluate some hypothesis, or produce a classifier. During this process, they have to pay the cost of performing each test. Many studies will run a comprehensive battery of tests on each subject, for as many subjects as their budget will allow, i.e., "round robin" (RR). We consider a more general model, where the researcher can sequentially decide which single test to perform on which specific individual, again subject to spending only the available funds. Our goal here is to use these funds most effectively, to collect the data that allows us to learn the most accurate classifier.
We first explore the simplified "coins version" of this task. After observing that this is NP-hard, we consider a range of heuristic algorithms, both standard and novel, and observe that our "biased robin" approach is both efficient and much more effective than most other approaches, including the standard RR approach. We then apply these ideas to learning a naive Bayes classifier, and see similar behavior. Finally, we consider the most realistic model, where both the researcher gathering data to build the classifier, and the user (e.g., physician) applying this classifier to an instance (patient) must pay for the features used; e.g., the researcher has $10,000 to acquire the feature values needed to produce an optimal $30/patient classifier. Again, we see that our novel approaches are almost always much more effective than the standard RR model.
This is joint work with Aloak Kapoor, Dan Lizotte and Omid Madani.

After earning a PhD from Stanford, Russ Greiner worked in both academic and industrial research before settling at the University of Alberta, where he is now a Professor in Computing Science and the founding Scientific Director of the Alberta Ingenuity Centre for Machine Learning, which won the ASTech Award for "Outstanding Leadership in Technology" in 2006. He has been Program Chair for the 2004 "Int'l Conf. on Machine Learning", Conference Chair for 2006 "Int'l Conf. on Machine Learning", Editor-in-Chief for "Computational Intelligence", and is serving on the editorial boards of a number of other journals. He was elected a Fellow of the AAAI (Association for the Advancement of Artificial Intelligence) in 2007, and was awarded a McCalla Professorship in 2005-06 and a Killam Professorship in 2007. He has published over 100 refereed papers and patents, most in the areas of machine learning and knowledge representation. The main foci of his current work are (1) bioinformatics and medical informatics; (2) learning effective probabilistic models and (3) formal foundations of learnability.

Title: Privacy and Data Mining: New Developments and Challenges
Speaker: Stan Matwin, University of Ottawa

There is little doubt that data mining technologies create new challenges in the area of data privacy. In this talk, we will review some of the new developments in Privacy-preserving Data Mining. In particular, we will discuss techniques in which data mining results can reveal personal data, and how this can be prevented. We will look at the practically interesting situations where data to be mined is distributed among several parties. We will mention new applications in which mining
spatio-temporal data can lead to identification of personal information. We will argue that methods that effectively protect personal data, while at the same time preserve the quality of the data from the data analysis perspective, are some of the principal new challenges before the field.

Title: Mixed-Membership Stochastic Block-Models for Transactional Data
Speaker: Hugh Chipman, Acadia University
Time: Thursday, February 18, 2.30pm

Transactional network data arise in many fields. Although social network models have been applied to transactional data, these models typically assume binary relations between pairs of nodes. We develop a latent mixed membership model capable of modelling richer forms of transactional data. Estimation and inference are accomplished via a variational EM algorithm. Simulations indicate that the learning algorithm can recover the correct generative model. We further present results on a subset of the Enron email dataset. This is joint work with Mahdi Shafiei.

About the speaker: Dr. Hugh Chipman is a Canada Research Chair in Mathematical Modelling at Acadia University, and the director of the Acadia Centre for Mathematical Modelling and Computations. His research focuses on statistical models for extracting information from large and complex datasets. He completed his PhD studies at the University of Waterloo in 1994, and held a faculty position at the University of Chicago before moving to Acadia. In 2009, he was awarded the CRM-SSC Prize for his outstanding contributions to the application of Bayesian statistical inference for data analysis.

Title: The trouble with community detection in complex networks
Speaker: Aaron Clauset, Santa Fe Institute
Date: Tuesday, March 30, 2.30pm

Abstract: Although widely used in practice, the performance of the popular network clustering technique called "modularity
maximization" is not well understood when applied to networks with unknown modular structure. In this talk, I'll show that precisely in
the case we want it to perform the best---that is, on modular networks---the modularity function Q exhibits extreme degeneracies,
in which the global maximum is hidden among an exponential number of high-modularity solutions. Further, these degenerate solutions can
be structurally very dissimilar, suggesting that any particular high-modularity partition, or statistical summary of its structure,
should not be taken as representative of the other degenerate solutions. These results partly explain why so many heuristics do
well at finding high-modularity partitions and why different heuristics can disagree on the modular composition of the same network.
I'll conclude with some forward-looking thoughts about the general problem of identifying network modules from connectivity data alone,
and the likelihood of circumventing this degeneracy problem.

Title: Modeling approaches for analyzing complex networks
Speaker: Edo Airoldi, Department of Statistics, FAS Center for Systems Biology, Faculty of Arts & Sciences, Harvard University
Date: Friday Sept 17, 2010, 9:30 a.m.

Abstract: Networks are ubiquitous in science and have become a focal point for discussion in everyday life. Formal statistical models for the
analysis of network data have emerged as a major topic of interest in diverse areas of study, and most of these involve a collection of
measurements on pairs of objects. Probability models on graphs date back to 1959. Along with empirical studies in social psychology and sociology
from the 1960s, these early works generated an active “network community” and a substantial literature in the 1970s. This effort moved
into the statistical literature in the late 1970s and 1980s, and the past decade has seen a burgeoning network literature in statistical
physics and computer science. The growth of the World Wide Web and the emergence of online “networking communities” such as Facebook and
LinkedIn, and a host of more specialized professional network communities has intensified interest in the study of networks and
network data. In this talk, I will review a few ideas that are central to this burgeoning literature, placing emphasis on modeling approaches
available for data analysis, and review some of the recent work that is going on in my group.

Speaker Bio: In December 2006, Dr. Airoldi received a Ph.D. from Carnegie Mellon, working on statistical machine learning and the
analysis of complex systems with Stephen Fienberg and Kathleen Carley. His dissertation introduced statistical and computational elements of graph theory that support data analysis of complex systems and their evolution. Until December 2008, he was a postdoctoral fellow in the Lewis-Sigler Institute for Integrative Genomics of Princeton University, working with Olga Troyanskaya, David Botstein, and James Broach. He developed mechanistic models to gain computational insights into aspects of molecular and cellular biology that are not directly observable with experimental probes. Since then, he has been working closely with biologists in the areas of cellular differentiation, cellular development and cancer.

Title: Facebook as a data capture site: Techniques, Traps, Terms and Conditions

This talk will give an overview of the sorts of social network data that are accessible through the Facebook API and some of the issues that come with downloading and processing this data. In the first part of the talk, I review several pieces of software that allow for the download and capture of social networks, including NodeXL, NetVizz, NameGenWeb, iGraph and Pajek. I walk through different routines and cover efficiency through FQL queries. The talk will also walk through three recent examples of privacy leaks with the Facebook data (The "Taste, Ties and Time" data set, Pete Warden's open profiles data and the Oxford 100 schools data set) and how privacy issues inhibited their full use. I tie this to the evolving developer terms of use on Facebook, as well as some of the other emergent API issues (such as Twitter's recent decision to no longer whitelist accounts). My intention is to end the talk by reinforcing the importance of
careful and minimal data collection efforts rather than a cavalier approach indifferent to the risks of real world data. I also wish to make an appeal to technical fields whose ethics procedures tend to be inadequate for this sort of semi-private and sensitive data.

Slides -- Slides from March 25 seminar in the Social Media Lab

Bernie Hogan is a Research Fellow at the Oxford Internet Institute. He specializes in novel methods for online data capture and analysis,
especially via social media. Recent work has focused on the capture and analysis of Facebook networks, particularly through his application
NameGenWeb, which downloads a social network for visualization in network programs such as NodeXL. Past work included an online audit
study of racism on Craigslist, pen and paper methods for visualizing social networks, the analysis of profile photos and techniques for
online surveys of spouses and partners. Bernie received his doctorate from the University of Toronto in 2009 under Barry
Wellman. His thesis won the Dordick award for Best Dissertation from the Communication and Technology section of the International
Communication Association.

Speaker's contact info:
Dr Bernie Hogan
Research Fellow, Oxford Internet Institute
University of Oxford

Title: Towards Summarizing and Making Sense of the Blogosphere

Speaker: Prof. Denilson Barbosa
Department of Computing Science, Univ. of Alberta

Date: Thursday March 15, 2012

Time: 2:40 p.m.

Location: Jacob Slonim Conference Room (430), Computer Science
Dalhousie University
6050 University Avenue, Halifax

The extraction of structured information from text is a fast improving subfield of Natural Language Processing which has been re-invigorated with the ever-increasing availability of user-generated textual content online. One environment which stands out as a source of invaluable information is the blogosphere, the network of social media sites in which individuals express and discuss opinions, facts, events, and ideas pertaining to their own lives, their community, profession, or society at large. Indeed, the automatic extraction of reliable information from the blogosphere promises a viable approach for discovering very rich social data: the issues that engage society in thousands of collective and parallel conversations online. Considerable attention has been given to the problem of automatically extracting and studying the social dynamics among the participants (i.e., authors) in shared environments like the blogosphere. In that line of work, the goal is to understand how the network of humans conversing in the blogosphere is formed, evolves over time, and influences others in their own opinions. Our goal, on the other hand, is to extract the network of entities, facts, ideas and opinions expressed in social media sites, as well as the relationships among them. Such structured data can be organized as one or more information networks, which in turn are powerful metaphors for the study and visualization of various kinds of complex systems. In this talk, I will cover the basic NLP tools that are necessary for automatically extracting information networks from social media text, relying to a large extent on the experiences gathered on our ongoing SONEX project.

Speaker Bio:
I am an Associate Professor at the University of Alberta, which I joined in 2008. I completed my Ph.D. at the University of Toronto in 2005, working on XML data management, and held an academic position at the University of Calgary between 2005 and 2008. I am interested in databases on their own merit, and also in the application of database and information retrieval principles to the management of linked data. I am a member of the NSERC Strategic Network on Business Intelligence, where I work on information extraction, and the Canadian Writing Research Collaboratory, where I work on text mining, data management for prosopography, and document engineering.

Host: Evangelos Milios