The Second Workshop on Monte Carlo Methods

Complete Program Schedule

Friday August 27, 2004

8:45-9:00 Jun Liu
Opening Remark
9:00-9:45 Keynote Don Rubin
9:45-10:30 Keynote Warp Bridge Sampling for Likelihood and Bayesian Computation
Xiao-Li Meng
Harvard University

Bridge sampling, a general formulation of the acceptance ratio method in physics (Bennett 1976) for computing free-energy difference, is an effective Monte Carlo method for computing normalizing constants of probability models. The method was originally proposed for cases where the probability models have overlapping support. Voter (1985) proposed the idea of shifting physical systems before applying the acceptance ratio method to calculate free-energy differences between systems that are highly separated in a configuration space. Meng and Schilling (2002) recently proposed to extend Voter's idea further by applying more general transformations, including stochastic transformations resulting from mixing over transformation groups, to the underlying variables before performing bridge sampling. Such methods were termed warp bridge sampling to highlight the fact that in addition to location shifting (i.e., centering) one can further reduce the difference/distance between two densities by warping their shapes without changing the normalizing constants. Real data based empirical studies using the full information item factor model and a nonlinear mixed model are provided to demonstrate the potentially dramatic gains in Monte Carlo efficiency by going beyond centering and by using efficient bridge sampling estimators. The proposed general method is also applicable to a couple of recent proposals for computing marginal likelihoods and Bayes factors (e.g., Chib and Jeliazkov 2001) because these methods turn out to be covered by the general bridge sampling framework.
Break
Parallel Sessions
Session A: Science Center Hall A
Session B: Science Center Hall D
10:50-11:30 Session A Integrated Statistical Modeling of Gene Expression Data
Ning Sun, Hongyu Zhao
Yale University

Recent advances in large-scale RNA expression measurements, DNA-protein interactions, and the availability of genome sequences from many organisms have opened the opportunity for massively parallel biological data acquisition and integrated understanding of the genetic networks underlying complex biological phenotypes. Many existing statistical procedures have been proposed to analyze a single data type, e.g. clustering algorithms for microarray data and motif finding methods for sequence data. However, different data sources offer different perspectives on the same underlying system, and they can be combined to increase our chance of uncovering underlying biological mechanisms. In this talk, we will describe our attempts to develop a statistical framework to integrate diverse genomics and proteomics information to dissect transcriptional regulatory networks. The developed methods will be illustrated through their applications in the reconstruction of trascription networks during yeast cell cycle.

Chair: Jun S. Liu
Session B Equi-energy Sampler: From Statistical Inference to Statistical Mechanics
Samuel Kou (with Qing Zhou and Wing Wong)
Harvard University

We introduce a new sampling algorithm, the equi-energy sampler, for efficient statistical sampling and estimation. Complementary to the widely used temperature-domain methods, the equi-energy sampler, utilizing the temperature-energy duality, targets the energy directly. The focus on the energy function not only facilitates efficient sampling, but also provides a powerful means for statistical estimation, for example, the calculation of the density of states and microcanonical averages in statistical mechanics. The equi-energy sampler is applied to a variety of problems, including exponential regression in statistics, motif sampling in computational biology, and protein folding in biophysics.

Chair: Yves Atchade
11:30-12:10 Session A Simulating protein structures by sequenctial Monte Carlo
Jie Liang
University of Illinois, Chicago

(Joint work with Rong Chen and Jinfeng Zhang) Sequential Monte Carlo provides an important method for studying the structures of protein molecules, as it is essential to sample rare events such as compact self-avoiding walks efficiently. Here we describe how sequential Monte Carlo can help to assess the fundamental roles of chirality and inflexibility of side chains of amino acid residues as observed in nature for protein folding. We also discuss how sequential Monte Carlo can be used to generate off-lattice models of protein-like conformations under appropriately constrained growth function. We give examples of simulated conformations for sequences of different length that are in good agreement with experimental structures of native proteins.

Chair: Jun S. Liu
Session B Variable Selection via Sequential Method
Rong Chen
University of Illinois, Chicago

(with Junni Zhang, Lin Ming, Jun Liu) Variable selection procedure in regression and generlized linear models has been studied extensively in the statistical literature. The best subset selection is commonly used to check every possible subset of variables and find the optimal one according to a given model selection criterion. The suboptimal forward, backward and stepwise selection procedures were developed in the early days due to the lack of computational power to carry out the best subset selection procudure. These procedures remain to be useful today, due to the large number of variables encountered in modern data mining applicatons. In this talk, we formulate the variable selection procedure as a dynamic system and propose deterministic and stochastic sequential optimization algorithms for searching the best set of variables according to certain model selection criterion. Stochastic optimization is carried out with Sequential Monte Carlo with look-ahead.

Chair: Yves Atchade
Lunch (Science Center First Floor)
1:30-2:10 Session A MC-INVERSE
Raoul LePage
Michigan State University

Bayes' method may be viewed as a particular type of linear inverse problem in which some integrals (prior and transition probabilities) are specified while others (posterior or marginal) are sought. It is in the nature of the Bayesian setup that the specified integrals be free of contradiction and determine the sought integrals. A heralded feature holds that given enough transition information the posterior quantities will be relatively insensitive to the choice of prior. Perhaps good methods of solving linear inverse problems, especially if they may be applied to incompletely specified or slightly contradiction-infected Bayes' models, might have a role. This talk sets forth (1) a nice suite of properties for MC-INVERSE, a monte carlo driven solution of linear inverse problems with (2) some illustrations of performance and limitations. Expect crisp detail for that part of the talk which stands as a monte carlo solution of linear inverse problems. Some few small steps toward exploring overlaps with (3) the objectives of Bayesian analysis will be speculative in nature. Part of the message is that inverse problem methods are technically a superset for Bayesian analysis and will have their own monte carlo methods as well.

Chair: Jie Liang
Session B Stochastic search and sparsity in high-dimensional regression and graphical models
Chris Hans, Adrian Dobra and Mike West
Duke University

Problems of model search in regression and multivariate analysis with very large numbers of candidate models raise challenges of specification and computation that we are a long way from solving. Model/prior assumptions that encourage (or enforce) sparsity - in regression structures or in multvariate relationships - are desirable, if not necessary, in order that currently known model search methods - stochastic or deterministic - scale to even modest dimensions. Our work with large-scale regressions and graphical models provides some examples of how coherent Bayesian models can be developed and applied in problems in quite high dimensions, underpinned by this emphasis â^È^Ò through sparsity inducing priors â^È^Ò on sparse structure consistent with both general scientific parsimony as well as the substantive context of motivating applications in areas such as exploratory gene expression analysis. Questions of model search are addressed through distributed computational approaches including shotgun stochastic search (SSS) for rapid identification and evaluation of many, many models. This talk will review and discuss aspects of this work, open questions and challenges, with examples in regression and graphical models.

Chair: Shane T. Jensen
2:10-2:50 Session A Random walks on treespace and extensions
Susan Holmes
Stanford University

We can code phylogenetic trees and hierarchical clustering trees using matching representation, this enables for the definition of a random walk which in joint work with Persi Diaconis, we found to have an eigenvalue structure we could actually analyze. This has applications to analyses of MCMC methods on hierarchical clusters and phylogenetic trees in molecular biological data analysis.

Chair: Jie Liang
Session B Quasi-Monte Carlo for Markov Chain Monte Carlo
Art B. Owen
Stanford University

There has been considerable progress recently in developing improvements to Monte Carlo methods. Much of the progress has been in quasi-Monte Carlo (QMC) methods, and Markov chain Monte Carlo (MCMC). The intersection between these two methods is conspicuously small. Few articles make deep use of both methods and not many researchers are active in both areas. This talk describe some proposals for the intersection. This is joint work with Seth D. Tribble

Chair: Shane T. Jensen
2:50-3:30 Session A Estimating the Number of Competing Terminals in an IEEE 802.11 Wireless Network
Xiaodong Wang
Columbia University

We treat the problem of estimating the number of competing terminals in an IEEE 802.11 wireless network, based on the Bayesian Monte Carlo signal processing techniques. It is assumed that the number of users in the network evolves according to a Markov chain with unknown transition probability matrix and the estimation is based on the observed collision probabilities in the shared wireless channel. Both off-line and online estimators are developed. The off-line estimator makes use of the Gibbs sampler, a Markov chain Monte Carlo (MCMC) procedure for jointly estimating the system parameters and temporal sequence of the number of users in the observation window. The online estimator, on the other hand, is based on the sequential Monte Carlo (SMC) technique for filtering nonlinear dynamic systems. In particular, an SMC algorithm is developed to jointly track the system parameters and the number of users sharing the channel. A deterministic variant of the SMC estimator is also developed, which is simpler to implement in practice and offers superior performance. Computer simulation results are provided to demonstrate the excellent performance of the proposed online and off-line estimators. It is seen that these Monte Carlo-based estimators significantly outperform the extended Kalman filter (EKF)-based estimator.

Chair: Jie Liang
Session B An Improved Merge-Split Sampler for Conjugate Dirichlet Process Mixture Models
David B. Dahl
University of Wisconsin

The Gibbs sampler is the standard Markov chain Monte Carlo sampler for drawing samples from the posterior distribution of conjugate Dirichlet process mixture models. Researchers have noticed the Gibbs sampler's tendency to get stuck in local modes and, thus, poorly explore the posterior distribution. Jain and Neal (2004) proposed a merge-split sampler in which a naive random split is made more realistic by a series of restricted Gibbs scans, where the number of restricted Gibbs scans is a tuning parameter that must be supplied by the user. In this talk, an alternative merge-split sampler is introduced. The new sampler, called the sequentially-allocated merge-split (SAMS), proposes splits by sequentially allocating indices to one of two split components using allocation probabilities that are conditional on previously allocated data. The proposal mechanism is fast, attempts to update many indices at a time, and makes good proposals, yielding a computationally efficient algorithm. No tuning parameter needs to be chosen. While the conditional allocation of indices is similar to sequential importance sampling, the output from the SAMS sampler has the correct stationary distribution due to the use of the Metropolis-Hastings ratio. The computational efficiency of the SAMS sampler is compared to the Gibbs sampler and the sampler of Jain and Neal (2004) with various values for their tuning parameter. Comparisons are made in terms of autocorrelation times for four univariate summaries of the Markov chains taken at fixed time intervals. In four examples involving different models and datasets, the SAMS merge-split sampler generally performs substantially better --- in some cases, two to five times faster --- than existing methods.

Chair: Shane T. Jensen
Coffee Break
3:50-4:30 Session A Gene regulatory networks and blind deconvolution
Chiara Sabatti
University of California, Los Angeles

Gene regulatory networks and blind deconvolution The problem of reconstructing gene regulatory networks in simple systems as E. Coli bears some resemblance to a blind deconvolution one: the concentrations of active forms of regulatory proteins are unobserved sources and the transcript levels of genes observed outputs. In general, it is not known which source influences which output; genome sequence data can be used to obtain information on the presence of binding sites in the region upstream genes, providing some indication on the set of genes controlled by a given regulatory protein. The control strength of a regulatory protein on target genes varies across genes, and across chemical conditions of the cell. It is of interest to reconstruct both the concentration of active forms of the regulatory proteins, and their control strength, in a series of conditions on the basis of measurements of gene transcripts levels, obtained with microarrays. We describe a Bayesian model for this system and illustrate its estimation via MCMC.

Chair: Lei Shen
Session B Bayesian Inference and Computation for the Cox Regression Model with Missing Covariates
Ming-Hui Chen
University of Connecticut

Missing covariate data in the Cox model is a fundamentally important practical problem in biomedical research. In this talk, I will present necessary and sufficient conditions for posterior propriety of the regression coefficients, $beta$, in Cox's partial likelihood, which can be obtained through a gamma process prior for the cumulative baseline hazard and a uniform improper prior for $beta$. The main focus of my talk will be on how to carry out a very challenging Bayesian computation that arises from this interesting problem. The novel Bayesian computational scheme we have developed is based on the introduction of several latent variables and the use of the collapsed Gibbs technique of Liu (1994). A real dataset is presented to illustrate the proposed methodology. This is a joint work with Joseph G. Ibrahim and Qi-Man Shao.

Chair: Rong Chen
4:30-5:10 Session A An Integrated Approach for Protein Function Prediction
Fengzhu Sun, Ting Chen, Minghua Deng, Quansong Ruan
University of Southern California

Assigning functions to novel proteins is one of the most important problems in the post-genomic era. Several approaches have been applied to this problem, including the analysis of gene expression patterns, phylogenetic profiles, protein fusions, and protein-protein interactions. We developed a novel approach that employs the theory of Markov random fields to infer a protein's functions using an integrated approach combining various sources of biological data including protein-protein physical interactions, genetic interactions, gene expressions, and domain structure. For each function of interest and protein, we predict the probability that the protein has such a function using Bayesian approaches. Unlike other available approaches for protein annotation in which a protein has or does not have a function of interest, we give a probability for having the function. This probability indicates how confident we are about the prediction. The model is flexible in that other protein pairwise relationship information and features of individual proteins can be easily incorporated. Two features distinguish the integrated approach from other available methods for protein function prediction. One is that the integrated approach uses all available sources of information with different weights for different sources of data. It is a global approach that takes the whole network into consideration. The second feature is that the posterior probability that a protein has the function of interest is assigned. The posterior probability indicates how confident we are about assigning the function to the protein. We apply our integrated approach to predict functions of yeast proteins based upon MIPS protein function classifications and upon the interaction networks based on MIPS physical and genetic interactions, gene expression profiles, Tandem Affinity Purification (TAP) protein complex data, protein domain information as well as correlations and protein functions.

Chair: Lei Shen
Session B Bayesian Variable Selection and Regularization for Audio Signal Analysis
Patrick J. Wolfe
Cambridge University

This talk will describe novel statistical methods for the analysis of non-stationary audio waveforms. Bayesian models for time-frequency coefficients are postulated based on the idea of a Gabor regression, in which a signal is represented as a superposition of translated, modulated versions of a window function exhibiting good time-frequency concentration. As a necessary consequence, the resultant set of potential predictors is in general overcomplete--constituting a frame rather than a basis--and hence such models require careful regularization through appropriate choices of variable selection schemes and prior distributions. Prior specifications are tailored to typical audio time series, and stochastic computation via Markov chain Monte Carlo methods is employed to obtain point estimates of the model parameters.

Chair: Rong Chen

Saturday August 28, 2004

9:00-9:45 Keynote Symmetry Analysis of Markov Chains
Persi Diaconis
Stanford University

Markov chains sometimes have some symmetry and it is natural to ask what mathematics has to offer regarding their analysis. I will review classical tools and some new work with Steve Boyd, Lin Xiao and Arun Ram on what's called orbit theory. This is illustrated on some fastest mixing Markov chains.
Break
Parallel Sessions
Session A: Science Center Hall A
Session B: Science Center Hall D
10:10-10:50 Session A Generalizing Swendsen-Wang to arbitrary posterior probabilities
Adrian Barbu, Song Chun Zhu
University of California, Los Angeles

The Swendsen-Wang (1987) algorithm was well celebrated in statistical mechanics for simulating (sampling) Ising/Potts models. It can achieve fast mixing even near critical temperature. But it was unfortunately limited to simple Ising/Potts models and doesn't make use of data information. In this talk, we generalize SW to arbitrary posterior probabilities and apply it to Bayesian optimization tasks in vision, such as image and motion segmentation, and stereo matching. We observe 100 times speedup over the classic Gibbs sampler, and thus can segment images in the speed of less than 1 minute with near global optimal solution. The problem is posed as graph partition: given a number of image elements as graph nodes, (pixels, atomic regions, edges etc.) we construct an adjacency graph G. The goal is to partition G into a unknown number of subgraphs each being a region, an object, or a curve, specified by a few families of probabilistic models. The key issue is to design smart Markov chain jumps in the space of all possible partitions. The Markov chain moves are driven by heuristics computed as discriminative models (edge weights) in the graph G. This is based on recent work by a Ph.D. student Adrian Barbu.

Chair: Cristian Castillo-Davis
Session B Conditional Inference on Tables with Structural Zeros
Yuguo Chen
Duke University

We describe a sequential importance sampling procedure for analyzing two-way zero-one or contingency tables with fixed marginal sums and a given set of structural zeros. Our method produces Monte Carlo samples that are remarkably close to the uniform distribution, enabling one to approximate closely the null distributions of various test statistics about these tables.

Chair: Xiaodong Wang
10:50-11:30 Session A High density admixture mapping to find genes for complex disease.
Nick Patterson, David Reich
Broad Institute

(Joint work with N Hattangadi, MW Smith, SJ OBrien, PD Jager, MJ Daly, D Altshuler, JR Oksenberg, SL Hauser and DA Hafler) Admixture mapping is a powerful way, in principle, to carry out whole-genome scans for disease genes. The method at least in favorable circumstances is both efficient and statistically powerful. The key idea of admixture mapping is that near a disease gene, patient populations descended from the recent mixing of two or more ethnic groups should have an increased probability of inheriting the alleles from the group with greater disease susceptibility. Since gene flow occurred recently (in African and Hispanic Americans in the past 20 generations), recombination has not had much time to act and linkage disequilibrium should extend many centimorgans. We sketch the biology, describe the quite complex statistical model, and highlight some issues that arise in the statistical inference.

Chair: Cristian Castillo-Davis
Session B Statistics Inference Based On Non-Smooth Estimating Functions
L. Tian, J. Liu, Y. Zhao, and L.J. Wei
Harvard University

When the estimating function for a vector of parameters is not smooth, it is often rather difficult, if not impossible, to obtain a consistent estimate by solving the corresponding estimating equation using the standard numerical techniques. In this article, we propose a simple inference procedure via the the Monte Carlo methods such as importance sampling and MCMC algorithms, which provide a consistent root to the estimating equation and also an approximation to its distribution without solving any equations or involving non-parametric function estimates.

Chair: Xiaodong Wang
11:30-12:10 Session A Improving Asymptotic Variance of MCMC Estimators: Non-reversible Chains are Better
Radford M. Neal
University of Toronto

I show how any reversible Markov chain on a finite state space that is irreducible, and hence suitable for estimating expectations with respect to its invariant distribution, can be used to construct a non-reversible Markov chain on a related state space that can also be used to estimate these expectations, with asymptotic variance at least as small as that using the reversible chain (typically smaller). The non-reversible chain achieves this improvement by avoiding (to the extent possible) transitions that backtrack to the state from which the chain just came. The proof that this modification cannot increase the asymptotic variance of an MCMC estimator uses a new technique that can also be used to prove Peskun's (1973) theorem that modifying a reversible chain to reduce the probability of staying in the same state cannot increase asymptotic variance. A non-reversible chain that avoids backtracking will often take little or no more computation time per transition than the original reversible chain, and can sometime produce a large reduction in asymptotic variance, though for other chains the improvement is slight. In addition to being of some practical interest, this construction demonstrates that non-reversible chains have a fundamental advantage over reversible chains for MCMC estimation. Research into better MCMC methods may therefore best be focused on non-reversible chains.

Chair: Cristian Castillo-Davis
Session B A One-Pass Sequential Monte Carlo Method for Bayesian Analysis of Massive Datasets
Suhrid Balakrishnan and David Madigan
Rutgers University

Standard Markov chain Monte Carlo (MCMC) methods generally require a complete scan of the dataset for each iteration. This essentially rules out MCMC as a feasible algorithm for Bayesian analysis of massive data. Ridgeway & Madigan (2002) and Chopin (2002) recently presented importance sampling algorithms that combined simulations from a posterior distribution conditioned on a smal portion of the dataset with a reweighting of those simulations to condition on the remainder of the dataset. While these algorithms drastically reduce the number of data accesses as compared to traditional MCMC, they still require substantially more than a single pass over the dataset. In this presentation, we present ``1PFS,'' an efficient, one-pass algorithm. The algorithm employs a simple modification of the standard particle filtering algorithm that replaces the MCMC based ``rejuvenation" step with a more efficient ``shrinkage" kernel smoothing based step. To show proof-of-concept and to enable a direct comparison, we demonstrate 1PFS on the same examples presented in Ridgeway & Madigan, namely a mixture model for Markov chains and Bayesian logistic regression.

Chair: Xiaodong Wang
Lunch (Science Center First Floor)
1:30-2:10 Session A Stochastic Volatility with Leverage: Fast Likelihood Inference
Siddhartha Chib
Washington University

cite{KimShephardChib(98)} provided a Bayesian analysis of stochastic volatility models based on a very fast and reliable Markov chain Monte Carlo (MCMC) algorithm. Their method, which was for models without a leverage effect, has been extensively used in the financial economics literature and more recently in macroeconometrics. In this paper we show how the KSC approach can be extended to models with the leverage effect, which is known to be important in certain applications. Several illustrative examples are provided.

Chair: Steve Qin
Session B Population Monte Carlo
Christian P. Robert
University of Paris, Dauphine

While MCMC methods have come to dominate the Bayesian computational area, there are still difficulties in devising adaptive methods that are satisfactory both from a theoretical and a methodological point of view. We reanalyze in this talk the Population Monte Carlo method that encompasses much more adaptive and local importance sampling schemes than thought previously. By introducing a temporal dimension to the selection of the importance function, an adaptive perspective can be achieved at little cost, with the same theoretical justification as the standard importance sampling technology. We have shown that the PMC scheme is a viable alternative to MCMC schemes in missing data settings, among others for the stochastic volatility model. We will present in addition recent results on the convergence properties of the most standard PMC schemes. (This is joint work with O. Cappe, R. Douc, A. Guillin, J.M. Marin, and E. Moulines.)

Chair: Ping Ma
2:10-2:50 Session A Contour Monte Carlo with Applications in Protein Structure Optimization
Faming Liang
Texas A&M University

The contour Monte Carlo (CMC) algorithm has been proposed recently by the speaker as a general Monte Carlo and stochastic optimization algorithm. CMC possesses a number of features that the conventional Monte Carlo algorithms do not have. First, it can overcome any barrier of the energy landscape, so it is an excellent tool for stochastic optimization. Under regularity conditions, it can be shown that it converges to the global optimum geometrically. This is superior to other existing stochastic optimization algorithms, such as simulated annealing and genetic algorithms. Second, it provides a new method for Monte Carlo integration. It estimates integrals based on stochastic optimization instead of the average or the weighted average of samples. Third, its self-adjusting nature makes the simulation converge very fast. The convergence can be checked on-line in a single run, in contrast to the multiple-run checking required by the conventional Monte Carlo algorithms. As a numerical example to demonstrate its performance, CMC is applied to protein folding simulations for off-lattice AB models. For all sequences, the algorithm has renewed the putative ground energy values of the two-dimensional model and set the putative ground energy values of the three-dimensional model.

Chair: Steve Qin
Session B Variance Reduction in MCMC
Antonietta Mira
University of Insubria

We will present various techniques that can be used in MCMC simulation to reduce the asymtotic variance of MCMC estimators and thus increase the efficiency of the sampler

Chair: Ping Ma
2:50-3:30 Session A Phylogeny from Genome Arrangements: A Bayesian Approach
Bret Larget
University of Wisconsin

Phylogenies are tree diagrams that show evolutionary relationships based on analysis of genetic data. Most phylogenies are estimated from aligned DNA sequences. Bayesian methods of phylogenetic inference were first introduced in the mid-1990s and are now among the most popular: the 2001 paper describing the Bayesian phylogenetic software MrBayes by Huelsenbeck and Ronquist was the most cited computer science paper in any area in a recent year. The increasing availability of genome arrangement data offers the opportunity to extend Bayesian methods of phylogenetic inference to this type of data. Our most recent work in this area is implemented in the software BADGER (Bayesian Analysis to Describe Genomic Evolution by Rearrangement). In this talk I will describe our model for genome rearrangement as well as some details of our Markov chain Monte Carlo approach to sample from the posterior distribution of phylogenetic trees. I will demonstrate the method by summarizing the results of an analysis of all known complete mitochondrial genomes in animals to infer relationships among animal phyla.

Chair: Steve Qin
Session B ONLINE PARAMETER ESTIMATION IN GENERAL STATE SPACE MODELS
Christophe Andrieu
University of Bristol

We present several original methods to perform online parameter estimation in stationary nonlinear non-Gaussian state-space models. Several methods based on Sequential Monte Carlo (SMC) have recently been proposed but all the algorithms we are aware of suffer from a degeneracy problem; i.e., there is accumulation of errors over time and the algorithms can diverge. The methodology we have proposed does not suffer from this problem. The first application of our methodology relies on online Expectation-Maximization-type algorithms that maximize the average so-called split-data likelihood. Implementing these methods typically requires the use of SMC. The second application of this approach is a computationally cheap method which does not have to perform state estimation and maximizes using a stochastic gradient algorithm a suitable contrast function.

Chair: Ping Ma
Coffee Break
3:50-4:30 Session A State-Space Models and Sequential Monte Carlo Methods with Applications to Biomedical Dynamic Systems
Hulin Wu
University of Rochester

Modern biotechnologies have provided a good opportunity for us to deeply understand biological systems at cellular level. Many biomedical systems can be described by a state-space model which is very popular in automatic control, signal processing, econometrics, and time series models. I will present several application examples in infectious diseases and propose the state-space models and sequential Monte Carlo methods for model identification, filtering and forecasting. In particular, the methods for dealing with time-varying parameters in the state space models and models for longitudinal dynamic systems will be discussed.

Chair: Mayetri Gupta
Session B Decayed MCMC Filtering
Bhaskara Marthi, Hanna Pasula, Yuval Peres, Stuart Russell
University of California, Berkeley

We present em{Decayed MCMC}, a Monte Carlo algorithm for filtering in state space models. The algorithm is based on Gibbs sampling, and resamples state variables according to a decay distribution that favours more recent time steps. We introduce a notion of marginal mixing with respect to the current time step distribution and extend standard Markov chain techniques to this setting. The main result is that, given an inverse-polynomial decay, the resulting Markov chain has the right stationary distribution, and the marginal mixing time is independent of the history length. This implies that Decayed MCMC is a nondivergent filter with constant update time.

Chair: Xiaodong Wang
4:30-5:10 Session A Monte-Carlo based unsupervised and semi-supervised analyses of gene expression data
Gaddy Getz
Broad Institute



Chair: Mayetri Gupta
Session B Sequential Monte Carlo Samplers
Arnaud Doucet
Cambridge University

In this talk, we present a general methodology to sample from a sequence of probability distributions defined on a common space; ie in cases where one traditionally uses MCMC . We propose to approximate these distributions by a large set of random samples which evolves over time using simple sampling and resampling mechanisms. This methodology yields a whole set of principled algorithms to make parallel Markov chain Monte Carlo runs interact and it also allows us to derive new algorithms to perform global optimization or to solve sequential Bayesian estimation problems. This talk is illustrated by several complex examples arising in Bayesian inference.

Chair: Xiaodong Wang

Friday August 27, 2004 6:00-11:00 Poster session (Terrace Room, Sheraton Commander Hotel)

List of Posters:

A Bayesian Clustering Model for Transcription Factor Binding Motifs
Shane T. Jensen and Jun S. Liu
Department of Statistics, The Wharton School, University of Pennsylvania

Genes are often regulated in living cells by proteins called transcription factors that bind directly to short segments of DNA in close proximity to certain target genes. These short segments have a conserved appearance, which is called a motif. We propose a Bayesian hierarchical clustering model for the common structure between a set of discovered motifs. This clustering model utilizes a Dirichlet process prior distribution and is implemented using a Gibbs sampling strategy. We apply our model to a dataset of 116 motifs from several species and discuss several methods for analyzing the clustering results. A uniform clustering prior is also considered and is compared to the Dirichlet process prior. Our clustering model and MCMC implementation is general enough to be appropriate and useful in a variety of other statistical settings.
A Conditional Auto-Regression Model for Protein Function Prediction
Quansong Ruan, Minghua Deng, Ting Chen and Fengzhu Sun
Department of Mathematics, USC

Protein function prediction remains to be one of the most important problems in functional genomics. Several methods for protein function prediction based on Markov random fields have been developed. In the models, functions are generally considered one at a time. It is known that proteins may have multiple functions and these protein functions show correlations. We developed a new spatial model for protein function prediction by jointly considering protein interactions as well as protein function correlations. We compared the performance of three different approaches based on protein interactions, function correlations, and both interaction and function correlations, respectively. We show that the new approach significantly outperforms other existing approaches.
A MCMC Approach to U Statistics based Estimation and Inference
Victor Chernozhukov and Han Hong (MIT and Duke)
Department of Economics, MIT

Many semiparametric estimation methods are based on optimizing a non-smooth U-statistics based objective function or solving corresponding moment equation. Examples include rank estimation methods and semiparametric panel data fixed effect models. These estimators are often very difficult to compute using conventional optimization tools both because the objective functions are not smooth and because the optimum in any given finite sample is usually not unique. While these estimators are usually root-n consistent and asymptotically normal, statistical inferences using these estimators are also difficult using conventional asymptotic methods or resampling methods, because the asymptotic distribution often involves nonparametric functionals and the computational burden of these estimators makes resampling methods essentially prohibitive. We develop a classical estimation and inference method through the use of Markov Chain Monte Carlo, a tool used for the computation of Bayesian posterior distribution. The MCMC estimators we develop are as efficient as those based on optimizing the objective function or solving the estimating equations, and are far easier to compute. The associated MCMC inference methods we develop are not only as precise than the ones based on asymptotic distribution or resampling methods, but are also computationally more efficient than these convential inference tools.
A Non-Linear Stochastic Inflation Approach for Actuarial Science-2:(GMM Varience-Covarience Matrix)
CHITRO MAJUMDAR
P-Analytics

Since Wilkie (1986) introduced the first investment model of stochastic inflation which emarges the keen attribution in actuarial development and investment models for applications by the actuaries. Two empirical features of monthly inflation rates are dynamic dependence on the level of the series and seasonal fluctuations. We propose a constructive model scenario, termed multiplicative seasonal self-existing threshold autoregressive (SEASETAR), which enables both features simultaneously. Moreover we adopt the multiple time-series modeling approach suggested by Tiao & Box (1981) to construct a stochastic investment model for price inflation, share dividends, share dividend yields. In our core asymmetry-model we can generate a special case of a general non-multiplicative SETAR model. The usefulness of multiplicative SEASETAR models is demonstrated by analyzing five data series of monthly inflatrion rates. One of these series corresponds to a country with hyperinflation episodes. To get a better understanding of the basic features underlying the fitted SEASETAR models we also perform a dynamic analysis. The feature of my conclusion segment is the area of focusing the relation of their individual tests and methodology which embarks the further MCMC simulation exercise and result with some relevant statistical background and I wish to show the proposed conjecture of the low-tests power asymmetry which would enable the GMM varience-covarience matrix from a type of co-linearity that greatly attenuates the preciseness of the parameter estimates by Smith (1999, Journal of Econometrics 93) and Belsley (1991).
A Sequential Marginal Likelihood Online Change Detector: Sequential Monte Carlo Approach
T. Matsumoto
Waseda Univerisity (currently on leave at Cambridge University)

Given a sequential data from unknown target system, this study attempts to perform online change detection by (i) using a parameter/hyperparameter dynamics driven by the available data; (ii) examining time dependency of sequential marginal likelihood. (iii) implementing the scheme via Sequential Monte Carlo (Particle Filter). Two (numerical) experimental results are reported. One is a nonlinear regression, while another is a nonlinear dynamical system.
A split-merge sampler for discovering classes in relational data
Charles Kemp, Thomas L Griffiths, Joshua B Tenenbaum
MIT

Dirichlet process mixture models partition a set of objects into classes given a list of attributes for each object. We discuss the direct equivalent of the Dirichlet Process mixture model for relational data. Our approach is an extension of the stochastic blockmodel (Snijders & Nowicki, 1997), which assumes that relations are conditionally independent given the class of every object. Previous versions of this model have assumed that the number of classes is specified in advance, but our model discovers the number of classes using a prior over partitions induced by the Dirichlet process. We fit the model using a split-merge sampler (Jain & Neal, 2000), and present some applications to social network data.
An MCMC approximation to the Posterior of a Gaussian Process Prior for Density Estimation.
Surya Tapas Tokdar
Department of Statistics, Purdue University

Gaussian process priors for Bayesian non-parametric density estimation were introduced and studied by Lenk (1988, 1991). Posterior consistency of these priors has been established in Tokdar and Ghosh (2004). However implementing such priors poses a hard problem as it is very difficult to generate sample densities from the posterior or to calculate the Bayes estimate. In this paper we investigate a novel approximation scheme where the original process is replaced by an interpolated version of it and an MCMC proceeds by generating the value of the process at the interpolating nodes. We prove that this approximation converges to the true posterior as the number of nodes increases. A variety of simulation studies are also presented. These studies indicate that approximate posterior is stable even at a moderate level of the number of nodes.
Ascent-Recovering Monte Carlo EM
Wolfgang Jank
University of Maryland

The EM algorithm is a popular tool in statistics and many other fields. One of the reasons for EM's popularity are its convenient statistical and numerical properties. Arguably, one of the most famous of these properties is EM's likelihood-ascent property. That is, EM guarantees an increase in the likelihood function between pairs of successive parameter updates. Unfortunately, not all of these properties are inherited automatically by its stochastic version, the Monte Carlo EM (MCEM) algorithm. Indeed, while MCEM does not converge with a fixed Monte Carlo sample size, it also does not inherit EM's ascent property. In this work we propose a new MCEM formulation, the Ascent MCEM algorithm. That is, borrowing ideas from its deterministic origin, this new formulation recovers the likelihood-ascent property, at least with high probability. We show that while Ascent MCEM allows for a convenient implementations of many different sampling schemes, it can also result in a better allocation of Monte Carlo resources than competing methods. Furthermore, it admits standard stopping rules and it can reduce the total simulation effort significantly over existing approaches.
Bayesian CART Model
Yuhong Wu, Haakon Tjelmeland & Mike West
Duke University

We describe our recent work in modelling and simulation-based analysis of Bayesian classification and regression tree models. First, we describe novel classes of prior distributions that allow flexibility as well as numerical convenience. This includes what we refer to as the "pinball prior" for generating flexible families of tree structures, priors over data "splitting thresholds" that define branches within given trees, and the use of conjugate priors within tree leaves. Secondly, and of main focus in this presentation, we describe MCMC approaches to model analysis and posterior simulation that includes some of the now standard Metropolis-Hastings proposals for moving around in the high-dimensional and complex space of trees. A critical focus is a novel MC strategy that we refer to as the "restructure move", and that allows major "jumps" in tree space while aiming to generate "relevant" candidates within the MCMC framework. Some illustration of the approach and the utility of the restructure move is given in an analysis of binary survival data from a breast cancer genomics study at Duke.
Bayesian Clustering for Time Course Microarray Data
Wenxuan Zhong
Purdue University

A flexible model-based procedure for clustering time course gene expression data is proposed. The technique is useful for curve data especially with missing value. Our method not only provide the cluster assignment but also the predicted expression representer curve of each cluster. The Drosophila melanogaster life cycle data was analyzed to illustrate the methodology.
BAYESIAN FRAILTY MODELS BASED ON BOX-COX TRANSFORMED HAZARDS
Guosheng Yin and Joseph Ibrahim
Department of Biostatistics, M. D. Anderson Cancer Center, The University of Texas

Due to natural or artificial clustering, multivariate failure time data often arise in biomedical research. To account for the intracluster correlation, we propose a class of frailty models by imposing the Box-Cox transformation on the hazard functions. This class of models generalizes the relationships between the baseline hazard and the hazard functions, which includes the proportional and the additive hazards frailty models as two special cases. Since hazards cannot be negative, complex multidimensional nonlinear parameter constraints must be imposed in the model formulation. To facilitate a tractable computational algorithm, the joint priors are constructed through a conditional-marginal specification. We propose a Markov chain Monte Carlo (MCMC) computational scheme for sampling from the posterior distribution of the parameters. We derive an MCMC approximation for the conditional predictive ordinate to assess the frailty model fitting adequacy, and illustrate the proposed method with a real dataset.
Computational Strategies for Mixture Models
Subharup Guha
Harvard School of Public Health

Many of the simulation techniques currently used to generate MCMC draws for mixture models produce chains that slowly explore the posterior distribution. We discuss some general strategies for improving the efficiency of slow-mixing algorithms. The techniques are general enough that they can be implemented in a wide variety of problems. We illustrate the substantial improvements achieved using an example from the literature.
Discovery of regulatory motif modules by hierarchical mixture modeling
Qing Zhou and Wing H. Wong
Dept. Stat, Harvard University

The regulatory information for a eukaryotic gene is encoded in cis-regulatory modules. The binding sites for a set of interacting transcription factors have the tendency to co-localize to the same modules. Current de novo motif discovery methods do not take advantage of this knowledge. We propose a hierarchical mixture approach to model the cis-regulatory module structure. Based on the model, a new de novo motif-module discovery algorithm, CisModule, is developed for the Bayesian inference of module locations and within-module motif sites. Dynamic programming-like recursions are developed to reduce the computational complexity from exponential to linear in sequence length. Using both simulated and real datasets, we demonstrate that CisModule is not only accurate in predicting modules, but also more sensitive in detecting motif patterns and binding sites compared with standard motif discovery methods.
Generalized Gamma Networks
Paola Sebastiani and Marco F Ramoni
Boston University

Current methods to learn Bayesian networks from continuous data rest on the assumptions that the observations are generated from normal distributions and that the mean of each child node is a linear function of the parent nodes. Motivated by the need of modeling gene expression data measured with microarrays --in which both linearity and normality appear to be restrictive assumptions -- we introduce a new class of Bayesian networks called Generalized Gamma Networks (GGN). GGNs are characterized by the fact that the nodes are continuous variables and, conditional on the configuration of the parent nodes, each child node follows a Gamma distribution with a mean that is possibly a non-linear function of its parent nodes. We describe a method for structural learning and parameter estimation of GGNs, and a stochastic algorithm to propagate evidence through them. Gene expression data from a microarray experiment are used to illustrate the proposed formalism.
Independent Particle Filters
Ming Lin, Junni L. Zhang, Qiansheng Cheng and Rong Chen
Peking University

Sequential Monte Carlo methods, especially the particle filter (PF) and its various modifications, have been used effectively in dealing with stochastic dynamic systems. The standard PF samples the current state through the underlying state dynamics, then uses current observation to evaluate the samples' importance weight. However, in many applications the current observation provides significant information about the current state while the state dynamics is weak. Sampling using the current observation in this case often produces efficient samples. In this work, we formulate the framework for a new variant of the particle filter, the independent particle filter (IPF). It generates exchangeable samples of the current state from a sampling distribution that is conditionally independent of the previous states, a special case of which uses only the current observation. Each sample can then be matched with multiple samples of the previous states for evaluating the importance weight. We present some theoretical results showing that this strategy improves efficiency of estimation as well as reduces resampling frequency in many situations. We also discuss some extensions of the IPF, so that the method can be applied to more general problems.
MCMC and SMC for Dynamic Data-Driven Event Reconstruction of Atmospheric Releases
G. Johannesson
Lawrence Livermore National Laboratory

For atmospheric releases, event reconstruction answers the critical questions: "How much material was released?", "When?", "Where?" and "What are the potential consequnces?". Current methods rely on first responders or analysts to estimate source characteristics, which are then used as input to preditive models to analyze the impacts of the release. Inaccurate characterization of unknown source parameters can lead to gross errors during a crises as a result of a harmfull release. A group of scientists at Lawrence Livermore National Laboratory is developing a Bayesian model framework that couples together observed sensor data with computationally demanding predictive atmospheric dispersion models. Posterior inference are conducted using MCMC and SMC, taking advantage of large-scale (parallel) computing resources. A project overview will be presented along with early prototype results. (This is a joint work with G. Sugiyama, W. Hanley, B. Kosovic, S. Larsen, G. Loosmore, J. Lundquist, A. Mirin, J. Nitao, R. Serban, K. Dyer, R. Wheeler.)
Monte Carlo Method for Structural Health Monitoring in Sensor Network
Jong-ho Baek
UCLA

Providing quick answers over continuous data streams is a crucial requirement for many application environment, for example, sensor network - different sensors continuously collects data around environment. Potential application of sensor networks includes habitat sensing, seismic monitoring, and contaminant transport monitoring. Above all, sensor networks will play an important role in the deployment of next-generation health monitoring systems for civil infrastructures, which involves the observation of the system over a period of time, the extraction of features from these measurements, and the analysis of these features to determine the current state of the system. There are two primary challenges for structural health monitoring in sensor networks. First, intractable amount of information collected from each node should be processed quickly and approximated in real-time, so that a structure can be identified efficiently. In addition, each node in the network senses a different phenomenon that we need to integrate over uncertain estimates of the unknown states given the measurement from sensors. In this proposal, we consider the problem of identifying characteristics of the sensor data streams using Monte Carlo method, in particular, particle filtering and Expectation Maximization. Results from the study of the responses of a real building subjected to earthquakes show great promise of our proposed method.
Monte Carlo techniques in a hierarchichal model approach to protein structure and design
A. K. Setty and J. Liang
Department of Bioengineering, University of Illinois at Chicago

Monte Carlo techniques find applications in many biophysical and bioinformatics studies such as protein dynamics and protein design. In the context of bioinformatics, studying a hierarchy of models beginning with minimal models and gradually increasing the complexity can give insight into protein structure and yield criteria for discriminating between protein-like and non protein-like structures. From a set of structures generated by exact enumeration or sequential Monte Carlo, criteria for choosing those with protein-like behavior has implications in protein design as well as for discrimination. Criteria that can be considered include thermodynamical properties and folding pathways. A Monte Carlo approach is envisaged to examine and score possible pathways from unfolded (high energy) to folded (low energy) states. Progress in this direction will be presented. The stochastic nature of Monte Carlo is also particularly well suited for protein dynamical studies which are fundamentally probabalistic in nature. Conformational dynamics studies and native state fluctuations in a variety of proteins using an unbiased kinetic Monte Carlo approach can yield results [1,2] in good accord with experiment. Comprehensive information regarding the ensemble of pathways of conformational transitions also becomes available on a rapid timescale. [1] D. M. Zuckerman, J. Phys. Chem. B., in press. [2] A. K. Setty and D. M. Zuckerman, Proc. Mat. Res. Soc., V3.5.13, 2004.
Near native structures of proteins from sequential Monte Carlo
Hsiaomei Lu, Ming Lin, Jinfeng Zhang, Rong Chen, Jie Liang
University of Illinois at Chicago

Proteins are the working molecules of cell, whose sequences determine their structures, which provide the basis for their biological functions. Here we study the ensemble of near native structure (NNS) of proteins, namely, conformations that are very similar to the native structure obtained experimentally. NNS are important in computational modeling of protein structures: they can be used to assess the effectiveness of protein structure prediction method by examining how efficiently NNS can be generated from protein sequences. They can also be used as positive training examples in deriving optimized potential functions for discriminating native proteins from decoys, a critical task in structure prediction. In addition, protein experiences dynamics fluctuations inside cells, and its biological properties are better described by the set of NNS instead of a single structural snapshot as captured by crystallography. In this study, we develop methods to generate and characterize the properties of NNS ensembles. This is a challenging task that has not been addressed before, as the occurrence of NNS is extremely rare in the large conformational space accessible to any protein models. We developed discrete state models with low complexity and high accuracy for modeling protein structures. With sequential Monte Carlo (SMC) method, we are able to obtain adequate samples of NNS to characterize the entire set of NNS and estimate several important properties of NNS, such as the size of NNS, the probability of randomly sampling one structure from NNS, the average native contact content, and the average native secondary structure content. The accuracy of SMC is assessed by comparison with results from exhaustive enumeration of small proteins. The effects of several techniques of SMC, such as look-ahead, resampling, and variable temperatures are studied and optimal conditions are determined. We also show that how this method is generally applicable to discrete state models and how we can quantitatively evaluate the effectiveness of various models for studying protein structures.
On Real-Parameter Evolutionary Monte Carlo Algorithm
Gopika R. Goswami and Jun S. Liu
Harvard University

Real-parameter Evolutionary Monte Carlo algorithm (henceforth referred to as EMC) has been proposed as an effective method not only for sampling from multi-modal distributions but also for stochastic optimization. The authors (Liang and Wong, 2001) have shown by various well-chosen examples that it works better than parallel tempering, it has a certain ability to learn from the past and it improves mixing by sampling along a temperature ladder. We introduce four new moves which equip the algorithm with more learning capability and while improving the quality of the samples (as measured by total auto-correlation time) produced from the target distribution. Secondly, we present a new easy to implement strategy for determining the temperature range and construction of the temperature ladder. We illustrate the above techniques through various examples. Lastly, we generalize a result originally proved in (Gilks, Roberts, George, 1994) about the validity of the conditional (or line) sampling step in the context of the snooker algorithm which is a type of move involved in the EMC.
Partitioning and Antithetic Variables
Francois Perron
Dept of math, University of Montreal

We are interesting in approximating $E[g(X)]$ by simulation. We assume that $F^{-1}$, the inverse of the cdf, is available and $g$ is monotone. The approximation is given by $$hat F=sum_{i=1}^n g(Y_i).$$ In our construction, the distribution of each of the $Y_i$ is identical to the distribution of $X$ but $mbox{Cov}(Y_i,Y_j)<0$ for all $1leq i < j leq n$. Our approach has a direct application to the resampling schemes. We give numerical examples.
Residue replacement rates at functional site of proteins: Assessing surface similarity for function prediction
Yan Yuan Tseng and Jie Liang
University of Illinois at Chicag

Protein is the working molecule of the cell that carries out numerous functions. Understanding the structural basis of the evolution of protein function is an important problem in biology. Residues on protein functional surface experience selection pressure that is different from residues in folding core. We use a continuous time Markov process to model the evolution of functional surface. We develop a Markov chain Monte Carlo method to estimate the posterior mean values of the substitution rates of amino acid residues. This technique is especially effective in extracting evolutionary information from the phylogeny of sequences homologous to a protein structure, all of which may be of unknown functions. We show how scoring matrices of residue similarity specific to a functional site can then be generated to identify similar binding surfaces on other protein structures of different evolutionary distances, and how such information can be useful for inferring biological roles of protein structures with unknown functions.
State Dependent Stochastic Volatility Models
Zhaohui Liu, Nalini Ravishanker
University of Connecticut

We discuss stochastic volatility models with hidden Markov and semi-Markov regime switches. In particular, the models analyzed include the univariate stochastic volatility model for financial return time series, and a bivariate model for the returns and transaction volumes. In addition to the AR structure of the volatility process, it is assumed that the unobserved volatility follows either a Markov regime switching process, or a semi-Markov regime switching process with the duration time at each state assuming different distributions. Under this framework, the mean and variance as well as duration times of the volatility at different states could vary. These assumptions afford a richer model which is expected to better describe the underlying volatility process due to its influence from different economic forces. Statistical inference, including out-of-sample prediction for such models using the sampling based Bayesian approach will be described. Model selection will be described and illustrated.
Statistical Approaches for Detecting Transcription Regulatory Modules
Mayetri Gupta
University of North Carolina at Chapel Hill

Using statistical methods to understand gene regulation is one of the major scientific challenges in the post-genome era. DNA sequence motifs represent binding sites for certain proteins, called transcription factors, in the process of gene regulation, and their identification can shed light on interactions within the gene regulatory network. In eukaryotic genomes, motif detection is a challenging problem as motifs tend to be sparse, weakly conserved, and occur in multi-pattern clusters called regulatory modules. This paper describes some of our recent modeling and algorithmic efforts in the discovery of regulatory modules in noncoding DNA sequences. First, we introduce a Bayesian hidden Markov model framework for sequences containing binding site clusters, the Markovian dependence being for both the motif types and motif site occurrences. We then develop a parameter estimation methodology using dynamic programming-like recursions. Simultaneously, a fast state-space model selection algorithm is formulated, based on evolutionary Monte Carlo (Liang and Wong, 2000) that enables us to detect likely clusters comprising the module. The performance of this methodology is illustrated by applications to bacterial, fly and human genomes.
Thin Blue Noise Sampling and Its Application to Antialiasing
Xiaohu Zhang
Statistics Department, Stanford University

A sampling pattern is called "blue noise" samples if its Fourier transform nearly vanishes at low frequencies and is nearly uniform at high frequencies. "Poisson disks" samples have this blue noise characteristics. Blue noise samples are in wide use for antialiasing in computer image synthesis. Dart throwing and its derivative algorithms have been used to generate blue noise samples but suffers from its expensive computation cost. Lloyd's relaxation is a much faster alternative which iterates between computing Voronoi diagram and moving points to centroids of their Voronoi cells. But it suffers from the convergence to a hexagonal grid after a moderate number of iterations thus losing the blue noise property. Thin blue noise sampling is an algorithm similar to the Lloyd's relaxation in spirit. Instead of the ordinary Voronoi diagram, we use a special kind of weighted Voronoi diagram called the power diagram. This weighting circumvents the convergence to a hexagonal grid. Computation of the power diagram is of about the same complexity as ordinary Voronoi diagram and it has polygonal cells thus allows fast computation of centroids. Thin blue noise sampling is also applicable in 3d or higher dimensional Euclidean space, and also on spheres. Spherical Fourier analysis is carried out to evaluate the blueness of sampling patterns generated by various algorithms.
Uniform scale mixture representations
Steve Qin, Paul Damien, Stephen Walker
University of Michigan

The uniform scale representation originally proposed by Feller (1970) is extended in this paper. Several families of univariate and multivariate densities are examined under this new representation. Some Bayesian econometric models in Zellner (1971) are revisited using the scale mixture of uniform representation, leading to straightforward implementation via Markov chain Monte Carlo methods.

back to home