Statistics Colloquia, 2006

Department of Mathematics and Statistics, Dalhousie University

Statistics Colloquium Chair: Hong Gu


Unless otherwise indicated, the following location and time apply to all colloquium talks.

Location: Colloquium Room (Chase Building, Room 319)

Time: Thursday 3:30 to 4:30pm.


Date: Feb. 2, 3:30 to 4:30pm
Speaker: Christopher Field, Dalhousie University, Halifax, Canada
Title: Wandering and Outlying Birds: A Statistical Analysis


Wandering North American birds show up in England on a fairly regular basis and are avidly chased and recorded by birders. This talk tries to predict the abundance of these vagrants from North American migration counts, migratory behavior and physiological characteristics of each species using statistical models. In the analysis we are interested in species which are over-represented as vagrants and, perhaps more importantly for the "twitchers", the species which haven't shown up but for which the predicted values are large. The talk also examines the role of robustness in this analysis, given that the vagrants are already outliers whose behavior deviates from that of the bulk of their species. This is joint work with Ian McLaren and Krista Collins.

Date : Apr. 6, 2006. Thur. 3:30 to 4:30
Speaker: Ian D. Jonsen, Ph.D.
Department of Biology, Dalhousie University, Halifax, Nova Scotia, Canada
Title: Robust state-space modeling of Leatherback turtle movement


Biological and statistical complexity are features common to most ecological data that hinder ecologists' ability to extract meaningful patterns. Using the largest Atlantic satellite telemetry dataset for the critically endangered leatherback turtle, I show how movement behaviours can be inferred from complex, noisy data using robust hierarchical Bayes state-space methods. These methods deal with several features of the data such as estimation errors that are non-Gaussian and vary in time, observations that occur irregularly in time, and complexity in the underlying behavioural processes. I show how we are using these methods to infer behaviour and navigation ability of leatherbacks during their impressive annual migrations from northern foraging areas around Nova Scotia to tropical nesting beaches.


Date : Sep. 21, 2006. Thur. 3:30 to 4:30
Speaker: Dennis Pilkey
Director, Community Counts
Nova Scotia Department of Finance
1723 Hollis Street
P.O. Box 187
Halifax, NS B3J 2N3
Bus: (902) 424-6816
Fax: (902) 424-0635

Presentation Overview:

Nova Scotia Community Counts is a statistical infrastructure that provides government, business, community decision-makers and citizens with easy and timely access to quality, comprehensive statistics presented in an intuitive and informative manner. It provides information to develop an understanding of and make assessments and comparisons about the health, safety and security, and quality of life in Nova Scotia communities and regions. Find it at:

This presentation provides a socio-economic overview of Nova Scotia communities using graphs and thematic maps from the Nova Scotia Community Counts system as well as a brief introduction to the Community Counts system. The session includes commentary on implications of community level data for planning purposes. It will also highlight areas of research and collaboration currently underway. The purpose of the session is to explore mutual opportunities for research and use of the Community Counts system.

Dennis Pilkey - Professional Profile

As of June 2004, Dennis Pilkey was seconded to work full-time on the development of the Nova Scotia Community Counts system. For the previous seven years, Dennis was Director of the Nova Scotia Statistics Agency within the Nova Scotia Department of Finance.  In this role, he was responsible for the provision of statistical information and related services to all provincial government departments and agencies. He was the Provincial Focal Point for Statistics Canada and represented the Province on several Federal-Provincial-Territorial statistics committees.

Dennis was a member of the External Advisory Panel for Treasury Board of Canada’s Annual Report to Parliament–Canada’s Performance 2003. He was a member of the steering committee for the Quality of Life Indicators Project developed by the Canadian Policy Research Networks (CPRN). Over the last several years, Dennis has cultivated key strategic partnerships in development of the Community Counts system, including the Office of Health Promotion, the Newfoundland and Labrador Statistics Agency, Antigonish Area Partnership, ACOA, District Health Authorities and Community Health Boards, and several Nova Scotia universities.


Date : Sep. 28, 2006. Thur. 3:30 to 4:30
Speaker: Dr. Huaichun Wang, Dept. of Math. & Stat.,
Dalhousie University
Title: Modeling covarion process of protein evolution


The covarion hypothesis of molecular evolution proposes that selective pressures on an amino acid or nucleotide site change through time, thus causing changes of evolutionary rate along the edges of a phylogenetic tree. Several kinds of Markov models for the covarion process have been proposed. One, proposed by Huelsenbeck (2002), has two substitution rate classes: the substitution process at a site can switch between a single variable rate, drawn from a discrete gamma distribution, and a zero invariable rate. A second model, suggested by Galtier (2001), assumes rate switches among an arbitrary number of rate classes but switching to and from the invariable rate class is not allowed. The latter model allows for some sites that do not participate in the rate switching process. Here we propose a general covarion model that combines features of both models, allowing evolutionary rates not only to switch between variable and invariable classes, but also to switch among different rates when they are in a variable state. We have implemented all three covarion models in a maximum likelihood framework for amino acid sequences and tested them on 23 protein data sets. We found significant likelihood increases for all data sets for the three models, compared to a model that does not allow site-specific rate switches along the tree. Furthermore, we found that the general model fit the data better than the simpler covarion models in the majority of the cases, highlighting the complexity in modeling the covarion process. The general covarion model can be used for comparing tree topologies, molecular dating studies and the investigation of protein adaptation.


Date : Oct. 5, 2006. Thur. 3:30 to 4:30
Speaker: Michael Dowd, Dept. of Math. & Stat.,
Dalhousie University
Title: Statistical Estimation for Nonlinear Stochastic Dynamical Systems


This talk will consider state estimation and prediction for nonlinear models of the type used in the marine environmental sciences. These models are based on differential equations and characterized by complex dynamical behaviour. Observations on such systems are diverse including time series, spatial imagery and spatial/temporal data. State space models are used to combine the nonlinear models and non-Gaussian observations via Bayesian principles. Monte Carlo approaches for both filtering and smoothing are outlined and applied. A prototypical marine ecosystem (biogeochemical) model is used for illustration throughout.
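As a rough illustration of the Monte Carlo filtering idea mentioned in the abstract, the following is a minimal sketch of a bootstrap particle filter. The one-dimensional logistic growth model, the Gaussian process and observation noise, and all parameter values are assumptions chosen for brevity; they are not the marine ecosystem model or the observation model used in the talk.

```python
import math
import random


def bootstrap_filter(observations, n_particles=500, proc_sd=0.1, obs_sd=0.2, seed=0):
    """Return filtered state means for a toy nonlinear state-space model.

    Hypothetical model (not from the talk):
        x_t = x_{t-1} + 0.3 * x_{t-1} * (1 - x_{t-1}) + process noise
        y_t = x_t + observation noise
    """
    rng = random.Random(seed)
    # Initial particle cloud drawn from an assumed prior on the state.
    particles = [rng.uniform(0.1, 1.0) for _ in range(n_particles)]
    means = []
    for y in observations:
        # 1. Propagate every particle through the nonlinear dynamics.
        particles = [
            x + 0.3 * x * (1.0 - x) + rng.gauss(0.0, proc_sd) for x in particles
        ]
        # 2. Weight each particle by the likelihood of the observation.
        weights = [math.exp(-0.5 * ((y - x) / obs_sd) ** 2) for x in particles]
        total = sum(weights)
        if total == 0.0:
            # Degenerate case: fall back to equal weights.
            weights = [1.0] * n_particles
            total = float(n_particles)
        # 3. Record the weighted mean as the filtered state estimate.
        means.append(sum(w * x for w, x in zip(weights, particles)) / total)
        # 4. Resample in proportion to the weights.
        particles = rng.choices(particles, weights=weights, k=n_particles)
    return means
```

Calling `bootstrap_filter([0.5, 0.6, 0.7])` returns one filtered mean per observation. A non-Gaussian observation model would only change the weighting step.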


Date : Oct. 19, 2006. Thur. 3:30 to 4:30 
Speaker: Prof. Tessema Astatkie, Professor of Statistics, Department of Engineering, Nova Scotia Agricultural College

Title: Application of two-level un-replicated factorial designs in agricultural field experiments


Un-replicated 2^k factorial designs are common in industrial experiments, but not in agricultural field experiments. The main reasons are large experimental errors due to weather and soil related factors, and unwillingness of agronomists to consider un-replicated designs. In this study, an un-replicated 2^5 factorial design was used to determine the effect of nitrogen, phosphorus, potassium, compost, and seaweed extract on dry matter yield. The Normal Probability Plot of effect estimates and the Lenth method were used in the first phase of the analysis, and in the second phase, ANOVA was completed by either projecting the design to a replicated factorial with fewer factors or reducing the model by moving insignificant high-order interaction effects to the error. The study revealed up to four-factor significant interaction effects despite the unusually dry weather during the study years.
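For readers unfamiliar with the first-phase analysis, the sketch below shows how effect estimates can be computed from an un-replicated 2^k design and how Lenth's pseudo standard error (PSE) is obtained from them. The function names and the run ordering are assumptions made for illustration; this is not the study's code or data.

```python
from itertools import combinations, product
from statistics import median


def factorial_effects(k, y):
    """Effect estimates for a full 2^k design.

    `y` holds one response per run, ordered like
    itertools.product((-1, 1), repeat=k) (an assumed run order).
    Keys of the result are tuples of factor indices, e.g. (0,) or (1, 2).
    """
    runs = list(product((-1, 1), repeat=k))
    effects = {}
    for order in range(1, k + 1):
        for term in combinations(range(k), order):
            contrast = 0.0
            for levels, response in zip(runs, y):
                sign = 1
                for i in term:
                    sign *= levels[i]
                contrast += sign * response
            # Effect = mean at the +1 level minus mean at the -1 level.
            effects[term] = contrast / (len(runs) / 2)
    return effects


def lenth_pse(effects):
    """Lenth's pseudo standard error for a list of effect estimates."""
    abs_effects = [abs(e) for e in effects]
    s0 = 1.5 * median(abs_effects)
    # Trim effects that look active before re-estimating the scale.
    trimmed = [a for a in abs_effects if a < 2.5 * s0]
    if not trimmed:
        return 0.0
    return 1.5 * median(trimmed)
```

Effects whose absolute value is large relative to the PSE are candidates for being active. Lenth's margin of error multiplies the PSE by a t quantile with m/3 degrees of freedom, where m is the number of effects; that step is left out here to keep the sketch free of dependencies.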

Date : Oct. 26, 2006. Thur. 3:30 to 4:30 
Speaker: Paul Sheridan

Title: On Issues of Singularity for Confidence Regions and Hypothesis Tests for Topologies Using Generalized Least Squares

Abstract: Recently, Susko described a computationally inexpensive way to construct confidence regions (CR) for topologies using a generalized least squares (GLS) test statistic, with a chi-square distribution, which applies to maximum likelihood (ML) distances. Software implementations for nucleotide and protein data, called glsdna and glsprot respectively, were also provided by Susko. The accuracy of both the GLS test statistic and the sample average approximations used for the variances and covariances of the ML distances is asymptotic in the number of sites; however, in practice usable sequences may be only hundreds of characters long. It is untested just how GLS will perform under these conditions.

In this thesis, a simulation study is undertaken to gauge the consequences of these asymptotic limitations. To this end, 4- and 7-taxon trees were used to simulate nucleotide sequence data for each of the lengths 50, 100, 250, 500, 1000, 5000, and 10000. For each tree used, and each sequence length, on the order of 10000 CRs were generated, and the coverage probability of the true tree, size of each CR, estimated ML distances, and estimated sample average variances-covariances were recorded. It was found that the coverage probabilities agreed with what is expected asymptotically for sequence lengths 1000 and higher. For smaller sample sizes the coverage probabilities were generally found to be higher than the 0.95 value. It was anticipated that, for small sample sizes, the coverage probabilities would attain the expected 0.95 value, if the true covariances were used to compute the GLS test statistic. Surprisingly, the coverage probabilities were drastically underestimated. The underlying cause can be attributed to a tendency for the ML distances to be overestimated for small sequence lengths together with what we found to be an exponential increase in variance with distance between taxa.

The second part of this thesis is directed toward fixing a serious limitation of the GLS software. Namely, computation of the GLS test statistic requires the estimated covariance matrix of the ML distances to be invertible. If singularity does occur, then the test statistic cannot be computed and the programs will crash. In molecular evolution models, the covariance matrix is a function of the substitution model and the underlying tree but it is not generally known what types of trees and models cause singular covariance matrices. In this thesis, we show that singular covariance matrices arise if and only if some distance is exactly 0 or equivalently when a pair of taxa have identical sequences with probability 1. However, in practice the covariance matrix must be estimated and the underlying causes of singularity are more complex. A necessary condition for singularity in the estimated covariance matrix is given, as well as two sufficient conditions, which are: 1) the number of distinct nucleotide patterns at a site is less than the number of pairs of taxa, and 2) a special type of linear dependence is constructed in the rows of the estimated covariance matrix.

Finally, two alternatives to using the glsdna and glsprot routines are introduced which allow for the construction of a CR even when the covariance matrix is singular. First, the routines glsdna_eig and glsprot_eig, as described by Susko, use an eigenvalue cutoff approach. The causes of singularity described in this thesis led to an alternate approach which uses a distance cutoff, or in other words, groups of taxa which are closely related are combined together before computing a CR. This approach is implemented as glsdna_dist and glsprot_dist. These different approaches were compared via a simulation on two 8-taxon trees using nucleotide sequence data. Briefly, the results show that for small samples the glsdna_dist routine gives better coverage probabilities and far smaller CR sizes than those obtained by using glsdna_eig, while for longer sequence lengths the routines exhibit similar performance.


Date : Nov. 2, 2006. Thur. 3:30 to 4:30 
Speaker: Prof. Yonggan Zhao, Associate Professor of Finance, Canada Research Chair (tier 2) in Risk Management, Faculty of Management, 6100 University Avenue, Suite 2010
Title: Weak Interest Rate Parity and Currency Portfolio Diversification

Abstract: This paper presents a dynamic model of optimal currency returns with a hidden Markov regime switching process. We postulate a weak form of interest rate parity that the hedged risk premiums on currency investments are identical within each regime across all currencies. Both the in-sample and the out-of-sample data during January 2002 - March 2005 strongly support this hypothesis. Observing past asset returns, investors infer the prevailing regime of the economy and determine the most likely future direction to facilitate portfolio decisions. Using standard mean variance analysis, we find that an optimal portfolio resembles the Federal Exchange Rate Index which characterizes the strength of the U.S. dollar against major world currencies. The similarity provides a strong implication that our three-regime switching model is appropriate for modeling the hedged returns in excess of the U.S. risk-free interest rate. To investigate the impact of the equity market performance on changes of exchange rates, we include the S&P 500 index return as an exogenous factor for parameter estimation.


Date : Nov. 9, 2006. Thur. 3:30 to 4:30
Speaker: Christian Blouin, Computational Biology and Bioinformatics, Faculty of Computer Science, Dept. of Biochemistry and Molecular Biology, Dalhousie University

Title: Protein phylogeny using atomic coordinates

Abstract: It is possible to model evolution at the sequence level using a process of character substitution. The dissimilarity amongst homologous characters between two sequences is related to the evolutionary distance. However, this relationship does not hold for the 3D structures that are encoded by these sequences. Evolution in 3D appears to be much more sporadic, although such a process is inconsistent with our current knowledge in protein science. This work presents a method to reconstruct the evolution of protein structures using atomic coordinates. Phylogenies are resolved using the Neighbor-Joining algorithm on a matrix of Qh structural similarity scores. The resulting phylogenies are used to infer the location and nature of insertion events. From these results, it appears that complex and novel protein structures can emerge via a gradual process, although this suggestion is biochemically counter-intuitive. The general application of structural (3D) data to phylogeny will be discussed.


Date : Nov. 16, 2006. Thur. 3:30 to 4:30 
Speaker: Prof. Robert Beiko, Faculty of Computer Science, Dalhousie University. Director of Bioinformatics, Genome Atlantic
Title: Evolutionary Themes in the Microbial World

Abstract: Microbes cover the globe, from pole to pole and from ocean trench to upper atmosphere. To survive in such a wide range of environments, they must be adaptable to new challenges and opportunities brought about by changing environmental conditions, food sources, and competitors. The phenomenal rate of genome sequencing allows us to ask many questions about bacterial diversity and function, but our models and methods are still primitive and often fail to bridge the gap between statistics and biology. In particular, models that assess diversity without considering evolutionary background can yield stellar p-values that are completely meaningless.

My presentation will provide a brief overview of microbial diversity and evolution, and touch on three important themes that address different evolutionary phenomena, with widely divergent levels of statistical maturity: the role that gene sharing plays in adaptation, the effect of changing mutational biases, and the relationship that exists between geography, evolution and microbial diversity.

Date : Nov. 30, 2006. Thur. 3:30 to 4:30 
Speaker: Prof. Guoqi Qian, Department of Mathematics and Statistics, The University of Melbourne, VIC 3010 Australia.
Title: On Time Series Model Selection Involving Very Many Candidate ARMA Models

Abstract: We study how to perform model selection for time series data where millions of candidate ARMA models may be eligible for selection. We propose a feasible computing method based on the Gibbs sampler. By this method model selection is performed through a random sample generation algorithm, and given a model of fixed dimension the parameter estimation is done through the maximum likelihood method. Our method takes into account several computing difficulties encountered in estimating ARMA models. The method is found to have probability of 1 in the limit in selecting the best candidate model under some regularity conditions. We then propose several empirical rules to implement our computing method for applications. Finally, a simulation study and an example on modelling China's Consumer Price Index (CPI) data are presented for the purpose of illustration and verification.


