## Statistics Colloquia, 2005

Date : Jan. 27, 3:30 to 4:30pm

Speaker: Prof. Chris Field,

Dept. of Math. & Stat.

Dalhousie University

Topic : Variable Selection both Practical and Conceptual

Abstract:

I will first look at issues of variable selection as they arise in longitudinal data analysis. In ongoing work with Eva Cantoni, Joanna Flemming and Elvezio Ronchetti, we have developed variable selection procedures based on estimated prediction error. Our computations use GEE estimates, but other settings are possible. To compute the prediction error, we use the ideas of cross-validation, where the size of the prediction sample grows as the number of experimental units increases. To handle cases where the number of variables is large, we use MCMC ideas, as developed by Guoqi Qian and me, to move through the model space. The procedure not only gives an estimate of the best model but also gives all models whose prediction error is within one standard error of that of the chosen model, in the spirit of bagging. I will also briefly discuss some unifying ideas of model selection using the Kullback-Leibler information. Our aim is to have a procedure that works when the true model lies outside the class of models within which we are doing the selection. This is based on ongoing work with Guoqi Qian.

Time : Feb. 3, 3:30 to 4:30pm

Speaker: Prof. Ying Zhang

Dept. of Math. & Stat.

Topic : Computer Algebra Derivation of the Bias of Commonly-Used Time Series Estimators

Abstract:

There are three commonly used linear estimators in time series analysis: the least-squares, Burg and Yule-Walker estimators. The Burg and Yule-Walker estimators both use the Durbin-Levinson recursion for fitting the AR(p) model. The Burg algorithm estimates the partial autocorrelation by minimizing a sum of squares of observed forward and backward prediction errors, while the Yule-Walker algorithm calculates it recursively using the estimated autocovariances up to lag p. The Burg estimator can be considered as the least-squares estimator constrained by the Durbin-Levinson recursion.

Using computer algebra, the biases to O(1/n) of these three estimators in AR(1) and AR(2) models are derived. For the Burg estimator, the bias to O(1/n) is shown to be equal to that of the least-squares estimator in both the known and unknown mean cases. Previous researchers have only been able to obtain simulation results for the bias of the Burg estimator because the problem is intractable without computer algebra. We demonstrate that the Burg estimator is preferable to the least-squares or Yule-Walker estimator.
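For readers unfamiliar with the three estimators, the following is a minimal sketch for the simplest case only: a zero-mean AR(1) series with known mean. It uses the standard textbook forms of the estimators, not the computer algebra derivation described in the talk. The function names, the simulation settings and the Monte Carlo bias check are illustrative assumptions.

```python
import random

def ar1_estimates(x):
    """Return (least_squares, yule_walker, burg) estimates of the AR(1)
    coefficient for a series x that is assumed to have known mean zero."""
    n = len(x)
    cross = sum(x[t] * x[t - 1] for t in range(1, n))
    # Least squares: regress x[t] on x[t-1], t = 1..n-1.
    ls = cross / sum(v * v for v in x[:-1])
    # Yule-Walker: lag-1 sample autocovariance over lag-0 sample autocovariance.
    yw = cross / sum(v * v for v in x)
    # Burg: minimize the sum of squared forward and backward prediction errors.
    burg = 2.0 * cross / sum(x[t] ** 2 + x[t - 1] ** 2 for t in range(1, n))
    return ls, yw, burg

def simulate_ar1(phi, n, rng, burn_in=200):
    """Simulate n observations from x[t] = phi * x[t-1] + e[t], e[t] ~ N(0, 1)."""
    series, prev = [], 0.0
    for t in range(n + burn_in):
        prev = phi * prev + rng.gauss(0.0, 1.0)
        if t >= burn_in:
            series.append(prev)
    return series

def monte_carlo_bias(phi, n, reps, seed=0):
    """Average (estimate - phi) over reps simulated series, per estimator."""
    rng = random.Random(seed)
    totals = [0.0, 0.0, 0.0]
    for _ in range(reps):
        for i, est in enumerate(ar1_estimates(simulate_ar1(phi, n, rng))):
            totals[i] += est - phi
    return [total / reps for total in totals]

if __name__ == "__main__":
    names = ("least squares", "Yule-Walker", "Burg")
    for name, bias in zip(names, monte_carlo_bias(phi=0.5, n=50, reps=2000)):
        print(f"{name}: simulated bias {bias:+.4f}")
```

A simulation like this can only approximate the bias; the point of the talk is that the O(1/n) terms can be obtained exactly by symbolic computation.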

Time: Feb. 10, 3:30 to 4:30pm

Speaker: Dr. Huaichun Wang

Dept. of Math. & Stat.

Dalhousie University

Topic : The effects of DNA compositional bias on genome evolution

Abstract:

The genomic G+C content of prokaryotes varies from approximately 25% to 75% among genomes, but the variation is relatively small within a genome. In contrast, vertebrate genomes show their greatest variation in G+C content within a genome rather than between genomes. There has been a long-standing controversy concerning the causes of these inter- and intra-specific variations. Are they caused by natural selection or, conversely, are they selectively neutral? In this study, we investigated the source of nucleotide compositional variation (nucleotide bias) and the consequences of the bias for protein sequence and genome evolution. Thermal adaptation is a prime example for studying the effect of natural selection. We found that both the GC content and the length of ribosomal RNA genes show positive correlations with optimal growth temperature in prokaryotes, and these correlations are not due to phylogenetic history. The correlations are concentrated almost entirely within the stem regions of the rRNA. The rRNA loops, however, show very constant base composition regardless of differences in temperature optima or genomic GC content.

To analyze the consequences of nucleotide bias in eukaryotic genomes, we studied homologous genes and their encoded proteins in two flowering plants, Oryza sativa (rice) and Arabidopsis thaliana. Rice genes of high G+C content encode proteins with a high frequency of amino acids encoded by G+C-rich codons, i.e., glycine, alanine, arginine and proline. Rice genes of low G+C content and Arabidopsis genes encode proteins with a high frequency of amino acids encoded by A+T-rich codons, i.e., phenylalanine, tyrosine, methionine, isoleucine, asparagine and lysine. The synonymous codon usage in the rice genome is primarily dictated by the G+C content of the genes, rather than by translational selection. These results provide persuasive evidence that nucleotide bias is a cause, rather than a consequence, of protein evolution and that it affects codon usage and protein composition in a predictable way.

Time: Feb. 21, 3:30 to 4:30pm

Speaker: Prof. Mu Zhu,

Department of Statistics and Actuarial Science

University of Waterloo

Topic: A Computationally Efficient Approach for Statistical Detection

Abstract: We propose a computationally efficient method for detecting items belonging to a rare class from a large database. Our method consists of two steps. In the first step, we estimate the density function of the rare class alone with an adaptive bandwidth kernel density estimator. The adaptive choice of the bandwidth is inspired by the ancient Chinese board game known today as Go. In the second step, we adjust this density locally depending on the density of the background class nearby. We show that the amount of adjustment needed in the second step is approximately equal to the adaptive bandwidths we choose in the first step, which gives us additional computational savings. We name the resulting method LAGO for "locally adjusted Go-kernel density estimator." We then apply LAGO to a real data set in drug discovery and compare its performance with a number of existing and popular methods.

This is joint work with Wanhua Su and Hugh A. Chipman.

Time: Mar. 3, 3:30 to 4:30pm

Speaker: Prof. Kuan Xu,

Department of Economics

Dalhousie University

Topic: U-Statistics and Its Asymptotic Results for some Inequality and Poverty Measures

Abstract:

U-statistics, as a single framework for a cluster of statistics, are well suited to some income inequality and poverty measures. For example, the variance, Gini index or coefficient, poverty rate, poverty gap ratios, Foster-Greer-Thorbecke index, Sen index, and modified Sen index are good candidates because their sample counterparts can be viewed either as U-statistics themselves or as functions of simpler U-statistics. This paper applies the existing theory for U-statistics to these income inequality and poverty measures so that a general framework of statistical inference can be established. In particular, with this framework, the asymptotic distributions of the sample Sen and modified Sen indices can be established.

The paper is available from the author upon request.
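As a small illustration of the idea that such measures are U-statistics or functions of them, the sketch below computes the sample variance, the poverty rate and the Gini index from a generic U-statistic routine. It uses the standard kernels (for example, |x - y| for the Gini mean difference). The function names are hypothetical and the code is not taken from the paper.

```python
from itertools import combinations

def u_statistic(sample, kernel, order):
    """Average a symmetric kernel over all subsets of the given size."""
    subsets = list(combinations(sample, order))
    return sum(kernel(*subset) for subset in subsets) / len(subsets)

def sample_variance(incomes):
    # Order-2 U-statistic with kernel (x - y)^2 / 2 (the unbiased variance).
    return u_statistic(incomes, lambda x, y: (x - y) ** 2 / 2.0, 2)

def poverty_rate(incomes, poverty_line):
    # Order-1 U-statistic: share of incomes below the poverty line.
    return u_statistic(incomes, lambda x: 1.0 if x < poverty_line else 0.0, 1)

def gini_index(incomes):
    # A function of two U-statistics: the Gini mean difference (kernel |x - y|)
    # divided by twice the sample mean (kernel x).
    mean = u_statistic(incomes, lambda x: x, 1)
    mean_difference = u_statistic(incomes, lambda x, y: abs(x - y), 2)
    return mean_difference / (2.0 * mean)

if __name__ == "__main__":
    incomes = [12.0, 18.0, 25.0, 40.0, 105.0]  # made-up illustrative data
    print(sample_variance(incomes), poverty_rate(incomes, 20.0), gini_index(incomes))
```

Enumerating all pairs costs O(n^2) and is only meant to make the kernel structure visible; it is not how one would compute these measures on large survey data.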

Time: Mar. 10, 3:30 to 4:30pm

Speaker: Prof. Ransom A. Myers (RAM),

Killam Chair of Ocean Studies

Department of Biology

Dalhousie University

Topic: How to count the fish in the sea: Statistical Problems in the Census of Marine Life

Abstract: (not available)

Time: Thur., Mar. 24, 3:30 to 4:30

Speaker: J.P. Bielawski

Department of Biology & Dept. of Math. & Stat.

Title: Phylogenetic methods for detecting molecular adaptation

Abstract: Molecules evolve; the genes encoding them undergo mutation, and the evolutionary fate of these variants is determined by random genetic drift, purifying selection or positive (Darwinian) selection. The ability to study this process only began in the late 1970s, when techniques to measure genetic variation were beginning to be developed. About the same time, a debate began about the relative importance of neutral drift and positive selection to the process of molecular evolution. Ever since, there has been considerable interest in documenting well-established cases of molecular adaptation. Despite a spectacular increase in the amount of available nucleotide sequence data over the last three decades, the number of such well-established cases is still relatively small. This is largely due to the difficulty of developing powerful statistical tests for adaptive molecular evolution. In this talk I will provide an overview of some recent developments in the statistical methods, and provide an example illustrating a cross-disciplinary approach to studying molecular adaptation that integrates these statistical methods with the experimental methods of molecular biology.

Date : Mar. 31, 3:30 to 4:30pm

Speaker: Matthew Spencer

Dept. Mathematics and Statistics and Molecular Biology and Biochemistry

Dalhousie University

Topic : Markov models for coral reef communities: discrete or continuous time?

Abstract:

Communities of sessile organisms (organisms that stay in one place) such as corals and trees are often modelled using Markov chains. The state space is the set of species in the community (of which only one can occur at a given point in space and time), plus empty space. Ecologists usually formulate these models in discrete time. Using a discrete-time model when the underlying dynamics are continuous can have important consequences:

i. The matrix of transition probabilities in discrete time is often thought to give a direct representation of interactions between species. We estimate discrete and continuous time Markov models for data on a coral reef community by maximum likelihood, and show that the discrete-time matrix contains nonzero elements that probably do not represent direct replacements.

ii. The sensitivity of the stationary distribution to perturbations is important in conservation. We will usually obtain the wrong sensitivity if we use a discrete-time model when the underlying dynamics are in continuous time.

This is joint work with Ed Susko.
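Point (i) above can be illustrated with a toy calculation: if the continuous-time rate matrix Q forbids a direct replacement, the discrete-time transition matrix P = exp(Q) over one sampling interval can still have a nonzero entry there, because several events may occur between observations. The three states and all rates below are made-up assumptions, not estimates from the coral reef data, and the matrix exponential is a plain truncated Taylor series that is adequate only for small matrices like this one.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_exp(q, terms=40):
    """Approximate exp(q) by the truncated Taylor series sum of q^k / k!."""
    n = len(q)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[value / k for value in row] for row in mat_mul(term, q)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

# Hypothetical states: 0 = coral A, 1 = coral B, 2 = empty space.
# Rates per sampling interval; rows sum to zero. Neither coral can replace
# the other directly (zero rates), only via empty space.
Q = [
    [-0.5, 0.0, 0.5],
    [0.0, -0.3, 0.3],
    [0.4, 0.6, -1.0],
]

# Transition probabilities over one sampling interval.
P = mat_exp(Q)

if __name__ == "__main__":
    print("rate A -> B:", Q[0][1])
    print("one-interval probability A -> B:", round(P[0][1], 4))
```

Here the A to B entry of P is positive even though the corresponding rate is zero, so reading P as a map of direct replacements would be misleading in this toy example.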

Date : Apr. 7, 3:30 to 4:30pm

Speaker: Connie Stewart

Dept. of Mathematics and Statistics

Dalhousie University

Topic : Inference on the Diet of Predators Using Fatty Acid Signatures

Abstract:

Methods of accurately estimating the diet of predators are of great ecological importance, yet prior to the work of Iverson et al. (2004) the methods used were often unsatisfactory.

Iverson et al. (2004) proposed estimating the diet by matching fatty acid signatures of predators to those of their prey. Given the potential species in a predator's diet, they were able to use statistical methods to estimate the proportion of each species in the diet. To date, only point estimates of the diets of predators have been studied.

Our focus has been interval estimation of the diet of predators. Because the data are compositional and often contain zeros, special techniques are required. In this talk, both parametric and nonparametric confidence interval methods will be discussed. Some results from our simulation study on coverage probabilities and interval lengths will be presented. Our recommended confidence interval method will then be applied to some real-life seabird data.

Time: Thur. Aug. 4, 2005

Location: Dunn 302

Speaker: Eva Cantoni

Department of Econometrics, University of Geneva

Joint work with J. Mills Flemming, C. Field and E. Ronchetti

TITLE: Variable selection techniques for marginal longitudinal models

ABSTRACT:

Variable selection is an essential part of any statistical analysis. Although it is often perceived as an extensive search for a single best model, it should be viewed as a technique which facilitates the identification of a few good models. This implies that variable selection criteria which allow direct comparison of models should be preferred to stepwise procedures based on significance testing. In this talk we focus on marginal longitudinal models. We first present a generalized version of Mallows's Cp (RGCp) for these models. It is a flexible technique that can also include weighting schemes to address additional issues, such as robustness, heteroscedasticity and missingness. This technique (among others) requires the computation of the target criterion on all possible models, which becomes infeasible for large-scale studies with a large number of covariates. In the second part of the talk, we will present an MCMC random search procedure that allows us to sample the model space sensibly, thereby avoiding the need to visit all candidate models. We will illustrate the benefits of our approaches on both real and simulated data.

Time: Aug. 11, 2005

Place: Dunn 302

Speaker: Olivier Renaud (Methodology and Data Analysis, Section of Psychology, University of Geneva)

Olivier.Renaud@pse.unige.ch

http://www.unige.ch/~renaud/

TITLE: Simultaneous tests in the Time-Frequency Plane for Electroencephalogram signals

ABSTRACT:

After measuring an Electroencephalogram (EEG) on several subjects in several different conditions at several occurrences, it is common practice to average the signals over the occurrences (Event-Related Potentials, ERP) and to analyze some features of these signals. Depending on the problem, one may want to do either a frequency analysis (Fourier) or a time analysis (peak detection, latency comparison and so on). In this talk, we show how to test simultaneously at all time points the differences between conditions using the Hadwiger or Euler characteristic. We also show how to generalize this procedure to test on the time-frequency/time-scale plane of the continuous wavelet transform or kernel smoother. We will also see how to incorporate the model that is hypothesized for the time dependence of the ERPs. These techniques are illustrated with an experiment in psychology where subjects have to detect targets in an image.

Time: 3:30 to 4:30, Oct. 23, 2005

Speaker: Michele Millar (Dept. of Math. & Stat., Dalhousie University)

Title: Breeding Value Estimation and Biodiversity Considerations in Forest Genetics

Abstract:

In any breeding program the goal is to identify parents who will produce
offspring with desired characteristics to improve the next generation.
Rigorous selection of parents, which can improve a trait in the
short term, has to be balanced in the long term against loss of genetic
diversity, since a possible increase in inbreeding could lead to consequences
such as reduced resistance to pests and diseases or reduced adaptive potential.

For our forest tree improvement program the trait of immediate
interest is height: we have the heights of offspring from half-sib
families, and information on survival. Our statistical analysis
provides estimates of the genetic worth (often referred to as breeding
values or combining abilities) of parents based on the observations of
their offspring. These estimates can then be used to select the
superior parents for breeding purposes.

We develop a new model for height which includes parameters to account
for any spatial correlation due to micro-environment differences
affecting growth, and to allow for differing within-family
genetic variances. The maximum likelihood estimators for this model are then
shown to be consistent, efficient, and asymptotically normal.

Since the likelihood for our model is intractable, we provide a procedure
to approximate the maximum likelihood estimates using weighted linear
mixed effects procedures to model the within-family variances, along with
pseudo-likelihood and indirect estimation to account for the correlation.

To take into account the presence of influential observations, due to
contamination or possible measurement error, we provide a robust version
of our method. We also show how to incorporate estimates of family survival
rates and measurements of genetic diversity into the selection process.
Finally, we use bootstrap methods to provide inference on the potential
gain from selection and for the selection process itself.

Time: Oct. 6, 2005

Speaker: Arnold Mitnitski (Faculties of Medicine & Computer Science, Dalhousie University)

Title: How many variables should be included in multivariable models?

Abstract:

It is commonly believed that there exists a relatively small number of important variables (risk factors) and that their careful selection is the key to successful performance of multivariable models for outcome prediction. Unfortunately, however, risk factors typically interact with each other in a complicated and generally unknown way, and therefore are often eliminated from predictive models. Using a sampling technique, we investigated how the number of variables included in a model affects the accuracy of outcome prediction. We analyzed 4 epidemiological and clinical cohorts: a Canadian clinical examination cohort of elderly people (n=2305, 1007 died within 4 years, 70 variables), a community-dwelling cohort (n=5586, 40 self-reported variables), a Swedish birth cohort aged 70 at baseline (n=973, 412 died within 10 years, 130 variables), and survivors of acute myocardial infarction from a provincial registry (n=4432, 974 died within 1 year, 37 variables). Several multivariable techniques (e.g., logistic regression, a least squares linear discriminant function, an artificial neural network) were employed to calculate the risk of death in training samples. The performance of the model was assessed in independent testing samples by using the areas under the receiver operating characteristic curves (AUCs). We demonstrated that the AUC increased monotonically with the number of variables and saturated only when the number of variables was more than 5-15, depending on the dataset and the model used, with AUCs ranging from 0.70 to 0.85. In each simulation, artificial neural networks slightly but consistently outperformed the other techniques; this difference, however, diminished with the number of variables included in the analyses. Variations in predictive accuracy between techniques within datasets were substantially lower than the variation between datasets.
It appears that the nature of the data, more than the type of technique, limits the accuracy of prediction in multivariable models.

Date : Nov. 3, 3:30 to 4:30pm

Speaker: Prof. David Hamilton

Title: Inferences for a standardized composite measure of linkage disequilibrium

Abstract:

Linkage disequilibrium is the term used by geneticists to denote association between alleles (genetic variants) at two genetic loci or between alleles at one locus and disease status. In many situations the four alleles at two loci are known, but their haplotype structure (which alleles are on which chromosome) is not. In such cases a composite measure of linkage disequilibrium is used. In this talk I will show how this measure can be standardized so that it takes values between -1 and +1, like a correlation coefficient. This allows comparisons between populations. An approximate variance for the standardized measure is obtained, and its use in statistical inference is investigated. No knowledge of genetics is required; I'll explain the concepts as needed.

Date : Nov. 17, 3:30 to 4:30pm

Speaker: Prof. Jonathan Taylor, Stanford University

Title: Tail strength of a dataset

Abstract:

We propose an overall measure of significance for a set of hypothesis tests. The tail strength is a simple function of the p-values computed for each of the tests. This measure is useful, for example, in assessing the overall univariate strength of a large set of features in microarray and other genomic and biomedical studies. It also has a simple relationship to the false discovery rate of the collection of tests, as well as to a weighted area under an ROC curve. We derive the asymptotic distribution of the tail strength measure, and illustrate its use on a number of real datasets.

This is joint work with Rob Tibshirani.
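The abstract does not state the formula. The sketch below assumes the form TS = (1/m) Σ_k (1 − p_(k) (m+1)/k), where p_(1) ≤ ... ≤ p_(m) are the ordered p-values; this should be checked against the authors' paper before being relied on. Under that assumption, p-values sitting exactly at their expected positions under the global null, k/(m+1), give a tail strength of zero, and smaller p-values push it towards one. The function name is hypothetical.

```python
def tail_strength(p_values):
    """Tail strength of a collection of p-values, under the ASSUMED definition
    TS = (1/m) * sum_k (1 - p_(k) * (m + 1) / k) over the ordered p-values."""
    m = len(p_values)
    if m == 0:
        raise ValueError("need at least one p-value")
    ordered = sorted(p_values)
    # Each term compares the k-th smallest p-value with k / (m + 1), its
    # expected value when all null hypotheses hold.
    return sum(1.0 - p * (m + 1) / k for k, p in enumerate(ordered, start=1)) / m

if __name__ == "__main__":
    null_like = [k / 10 for k in range(1, 10)]    # p-values at their null expectations
    strong = [0.001, 0.002, 0.01, 0.02, 0.3]      # made-up p-values, several small
    print(tail_strength(null_like), tail_strength(strong))
```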