Bayesian Model Selection for High-dimensional High-throughput Data


Book Description

Bayesian methods are often criticized on the grounds of subjectivity. Furthermore, misspecified priors can have a deleterious effect on Bayesian inference. Noting that model selection is effectively a test of many hypotheses, Dr. Valen E. Johnson sought to eliminate the need for prior specification by computing Bayes factors from frequentist test statistics. In his pioneering work published in 2005, Dr. Johnson proposed using so-called local priors for computing Bayes factors from test statistics. Dr. Johnson and Dr. Jianhua Hu used these Bayes factors for model selection in a linear model setting. In independent work, Dr. Johnson and another colleague, Dr. David Rossell, investigated two families of non-local priors for testing the regression parameter in a linear model setting. These non-local priors enable greater separation between the null and alternative hypotheses. In this dissertation, I extend model selection based on Bayes factors and use non-local priors to define Bayes factors based on test statistics. With these priors, I have been able to reduce the problem of prior specification to setting a single scaling parameter. That scaling parameter can be set easily, for example, on the basis of the frequentist operating characteristics of the corresponding Bayes factors. Furthermore, the loss of information incurred by basing a Bayes factor on a test statistic is minimal. Along with Dr. Johnson and Dr. Hu, I used the Bayes factor based on the likelihood ratio statistic to develop a method for clustering gene expression data. This method has performed well on both simulated examples and real datasets. An outline of that work is also included in this dissertation. Further, I extend the clustering model to a subclass of decomposable graphical models, which is more appropriate for genotype data sets such as single-nucleotide polymorphism (SNP) data. Efficient FORTRAN programming has enabled me to apply the methodology to hundreds of nodes. For problems with more challenging probability landscapes, I propose a modification of the Markov chain Monte Carlo algorithm to extract information about the important network structures in the data. This modified algorithm performs well in inferring complex network structures. I use this method to develop a disease prediction model based on SNP data. My method performs well in cross-validation studies.
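As a point of reference, here is a minimal sketch of the construction (standard notation, not a quotation from the dissertation): if t is a test statistic with sampling density f(t | theta), the Bayes factor in favor of the alternative based on t is

    BF_{10}(t) = \frac{\int f(t \mid \theta)\, \pi_1(\theta)\, d\theta}{f(t \mid \theta_0)},

where \pi_1 is the prior under the alternative. A non-local prior is one with \pi_1(\theta_0) = 0, so prior mass is pushed away from the null value, which is the source of the greater separation between hypotheses referred to above.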




High-Dimensional Data Analysis in Cancer Research


Book Description

Multivariate analysis is a mainstay of statistical tools in the analysis of biomedical data. It is concerned with relating data matrices of n rows by p columns, with rows representing samples (or patients) and columns representing attributes of those samples, to response variables such as patient outcomes. Classically, the sample size n is much larger than p, the number of variables, and the properties of statistical models have mostly been studied under the assumption of fixed p and n tending to infinity. Advances in biological sciences and technologies have revolutionized the investigation of cancer, and biomedical data collection has become more automated and more extensive. We are now in an era in which p is a large fraction of n, or even much larger than n. Take proteomics as an example. Although proteomic techniques have been researched and developed for many decades to identify proteins or peptides uniquely associated with a given disease state, until recently this has mostly been a laborious process, carried out one protein at a time. The advent of high-throughput, proteome-wide technologies such as liquid chromatography-tandem mass spectrometry makes it possible to generate proteomic signatures that facilitate the rapid development of new strategies for proteomics-based detection of disease. This poses new challenges and calls for scalable solutions to the analysis of such high-dimensional data. In this volume, we present systematic and analytical approaches and strategies from both biostatistics and bioinformatics for the analysis of correlated and high-dimensional data.




Bayesian Variable Selection for High-dimensional Data with an Ordinal Response


Book Description

Health outcome and disease status measurements frequently appear on an ordinal scale; that is, the outcome is categorical but has an inherent ordering. Many previous studies have shown associations between gene expression and disease status. Identification of important genes may be useful for developing novel diagnostic and prognostic tools to predict or classify the stage of disease. Gene expression data are usually high-dimensional, meaning that the number of genes is much greater than the sample size, or number of patients. We describe some existing frequentist methods for high-dimensional data with an ordinal response. Following Tibshirani (1996), who described the LASSO estimate as the Bayesian posterior mode when the regression parameters have independent Laplace priors, we propose a new approach for high-dimensional data with an ordinal response that is rooted in the Bayesian paradigm. We show through simulation studies that our proposed Bayesian approach outperforms the existing frequentist methods. We then compare the performance of the frequentist and Bayesian approaches using hepatocellular carcinoma studies.
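A worked equation makes the Tibshirani (1996) connection concrete (standard notation, not specific to this book): with Gaussian errors of variance \sigma^2 and independent Laplace priors p(\beta_j) \propto \exp(-\tau |\beta_j|), the negative log-posterior is, up to constants, proportional to

    \|y - X\beta\|_2^2 + 2\sigma^2 \tau \sum_j |\beta_j|,

so the posterior mode coincides with the LASSO estimate with penalty parameter \lambda = 2\sigma^2\tau.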




High-dimensional Variable Selection for Genomics Data, from Both Frequentist and Bayesian Perspectives


Book Description

Variable selection is one of the most popular tools for analyzing high-dimensional genomic data. It has been developed to accommodate complex data structures and to yield structured, sparse identification of important genomic features. We focus on the network and interaction structures that commonly exist in genomic data, and develop novel variable selection methods from both frequentist and Bayesian perspectives. Network-based regularization has achieved success in variable selection for high-dimensional cancer genomic data due to its ability to incorporate the correlations among genomic features. However, as survival times usually follow skewed distributions and are contaminated by outliers, network-constrained regularization that does not take robustness into account leads to false identification of network structure and biased estimation of patients' survival. In the first project, we develop a novel robust network-based variable selection method under the accelerated failure time (AFT) model. Extensive simulation studies show the advantage of the proposed method over the alternatives, and promising findings are made in two case studies of lung cancer datasets with high-dimensional gene expression measurements. Gene-environment (G×E) interactions are important for elucidating disease etiology beyond the main genetic and environmental effects. In the second project, a novel and powerful semi-parametric Bayesian variable selection model is proposed to investigate linear and nonlinear G×E interactions simultaneously. It can further conduct structural identification by distinguishing nonlinear interactions from the main-effects-only case within the Bayesian framework. The proposed method conducts Bayesian variable selection more efficiently and accurately than the alternatives, and simulations show that it outperforms competing alternatives in terms of both identification and prediction. In the case study, the proposed Bayesian method leads to the identification of effects with important implications in a high-throughput profiling study with high-dimensional SNP data. In the last project, a robust Bayesian variable selection method is developed for G×E interaction studies. The proposed robust Bayesian method can effectively accommodate heavy-tailed errors and outliers in the response variable while conducting variable selection that accounts for structural sparsity. Spike-and-slab priors are placed at both the individual and group levels to identify the sparse main and interaction effects. Extensive simulation studies and analyses of both diabetes data with SNP measurements from the Nurses' Health Study and TCGA melanoma data with gene expression measurements demonstrate the superior performance of the proposed method over multiple competing alternatives. To facilitate reproducible research and fast computation, we have developed open-source R packages for each project, which provide highly efficient C++ implementations of all the proposed and alternative approaches. The R packages regnet and spinBayes, associated with the first and second projects respectively, are available on CRAN. For the third project, the R package robin is available from GitHub and will be submitted to CRAN soon.
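A minimal sketch of the kind of spike-and-slab prior referred to above (a generic formulation, not the authors' exact specification): each coefficient \beta_j receives the mixture

    \beta_j \mid \gamma_j \sim \gamma_j\, N(0, \tau^2) + (1 - \gamma_j)\, \delta_0, \qquad \gamma_j \sim \mathrm{Bernoulli}(\pi),

where \delta_0 is a point mass at zero and \gamma_j indicates inclusion. For a G×E interaction, a second indicator shared by the whole group of interaction terms yields selection at the group level, and the posterior inclusion probabilities of the indicators drive the variable selection.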




Statistical Analysis for High-Dimensional Data


Book Description

This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on future research directions, the contributions will benefit graduate students and researchers in computational biology, statistics and the machine learning community.




Bayesian Variable Selection for High Dimensional Data Analysis


Book Description

In the practice of statistical modeling, it is often desirable to have an accurate predictive model. Modern data sets usually have a large number of predictors, so parsimony is an especially important issue. Best-subset selection is a conventional method of variable selection, but because of the large number of variables, the relatively small sample size, and severe collinearity among the variables, standard statistical methods for selecting relevant variables often face difficulties. Bayesian stochastic search variable selection has achieved considerable empirical success in a variety of applications. This book therefore proposes a modified Bayesian stochastic search variable selection approach for variable selection and two-class or multi-class classification based on a (multinomial) probit regression model. We demonstrate the performance of the approach on many real datasets. The results show that our approach selects smaller numbers of relevant variables and obtains competitive classification accuracy.
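For orientation, here is a minimal sketch of the latent-variable probit formulation on which such approaches are typically built (generic, in the spirit of Albert and Chib (1993) and George and McCulloch (1993), not necessarily this book's exact model): for binary outcomes,

    z_i = x_i^\top \beta + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \quad y_i = 1\{z_i > 0\},
    \beta_j \mid \gamma_j \sim \gamma_j\, N(0, c\, v_0) + (1 - \gamma_j)\, N(0, v_0), \quad \gamma_j \sim \mathrm{Bernoulli}(\pi),

with c large and v_0 small, so that \gamma_j = 1 marks an active variable. A Gibbs sampler alternates between the latent z_i, the coefficients \beta, and the indicators \gamma, and variables with high posterior inclusion probability are retained; the multinomial probit case introduces one latent utility per class.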







Bayesian Nonparametric Clusterings in Relational and High-dimensional Settings with Applications in Bioinformatics


Book Description

Recent advances in high-throughput methodologies offer researchers the ability to understand complex systems via high-dimensional and multi-relational data. One example is the realm of molecular biology, where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high-dimensional and multi-relational data allows for unprecedentedly detailed analysis, but also presents challenges in accounting for all the variability. High-dimensional data often have a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high-dimensional and multi-relational data, we developed three feature selection and cross-clustering methods: 1) the infinite relational model with feature selection (FIRM), which incorporates the rich information of multi-relational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to the Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) a randomized approximation (RBHCC) based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM to categorizing mRNA and microRNA, in which the model uses latent structures to encode the expression patterns and the gene ontology annotations. We also apply FIRM to recover the categories of ligands and proteins and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interactions, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (both in terms of cluster membership and missing value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.
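For context, a minimal sketch of the Dirichlet process mixture allocation rule that underlies CDPM and its approximations (the standard Chinese restaurant process form, not the authors' exact algorithm): conditional on the other assignments, observation i joins existing cluster k or opens a new cluster with probabilities

    P(c_i = k \mid c_{-i}) \propto n_k, \qquad P(c_i = \text{new} \mid c_{-i}) \propto \alpha,

where n_k is the current size of cluster k and \alpha is the concentration parameter, so the number of clusters need not be fixed a priori. Cross-clustering applies such a partition separately within each view of the data, which is what allows a different clustering structure per underlying relationship.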




Variable Selection for High-dimensional Data with Error Control


Book Description

Many high-throughput genomic applications involve a large set of covariates, and it is crucial to discover which variables are truly associated with the response. It is often desirable for researchers to select variables that are genuinely relevant and reproducible in follow-up studies. Effectively controlling the false discovery rate (FDR) increases the reproducibility of the discoveries and has been a major challenge in variable selection research, especially for high-dimensional data. Existing error-control approaches include augmentation approaches, such as model-X knockoffs, which use artificial variables as benchmarks for decision making. We introduce another augmentation-based selection framework extended from a Bayesian screening approach called reference distribution variable selection. Ordinal responses, which were not previously considered in this area, were used to compare different variable selection approaches. We constructed various importance measures that fit into the selection frameworks, using either L1-penalized regression or machine learning techniques, and compared these measures in terms of FDR and power using simulated data. Moreover, we applied these selection methods to high-throughput methylation data for identifying features associated with the progression from normal liver tissue to hepatocellular carcinoma, to further compare and contrast their performance. Having established the effectiveness of FDR control for model-X knockoffs, we turned our attention to another important data type: survival data with long-term survivors. Medical breakthroughs in recent years have led to cures for many diseases, resulting in increased observations of long-term survivors. The mixture cure model (MCM) is a type of survival model that is often used when a cured fraction exists. Unfortunately, few variable selection methods currently exist for MCMs when there are more predictors than samples. To fill this gap, we developed penalized MCMs for high-dimensional datasets that allow for the identification of prognostic factors associated with cure status and/or survival. Both parametric models and semi-parametric proportional hazards models were considered for the survival component. For penalized parametric MCMs, we demonstrated how estimation proceeds using two different iterative algorithms, generalized monotone incremental forward stagewise (GMIFS) and Expectation-Maximization (E-M). For semi-parametric MCMs, where multiple types of penalty functions were considered, the coordinate descent algorithm was combined with E-M for optimization. The model-X knockoffs method was combined with these algorithms to allow for FDR control in variable selection. Through extensive simulation studies, our penalized MCMs have been shown to outperform alternative methods on multiple metrics and to achieve high statistical power while controlling the FDR. In two acute myeloid leukemia (AML) applications with gene expression data, our proposed approaches identified important genes associated with potential cure or time to relapse, which may help inform treatment decisions for AML patients.
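As a worked equation for the mixture cure model mentioned above (the standard two-component form, not the authors' exact parameterization): the population survival function is

    S_{pop}(t \mid x, z) = \pi(z) + \bigl(1 - \pi(z)\bigr)\, S_u(t \mid x),

where \pi(z) is the probability of being cured, typically modeled by logistic regression on incidence covariates z, and S_u(t \mid x) is the survival function of the uncured subjects, modeled parametrically or through a proportional hazards specification. Penalties, and knockoff copies of the covariates, can then be applied to the coefficients in either or both components to select prognostic factors while controlling the FDR.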




Asymptotic Statistics


Book Description

This book is an introduction to the field of asymptotic statistics. The treatment is both practical and mathematically rigorous. In addition to most of the standard topics of an asymptotics course, including likelihood inference, M-estimation, the theory of asymptotic efficiency, U-statistics, and rank procedures, the book also presents recent research topics such as semiparametric models, the bootstrap, and empirical processes and their applications. The topics are organized around the central idea of approximation by limit experiments, which gives the book one of its unifying themes. This entails mainly the local approximation of the classical i.i.d. setup with smooth parameters by location experiments involving a single, normally distributed observation. Thus, even the standard subjects of asymptotic statistics are presented in a novel way. Suitable as a graduate or Master's level statistics text, this book will also give researchers an overview of the latest research in asymptotic statistics.
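The "location experiments involving a single, normally distributed observation" refers to local asymptotic normality; as a one-line reminder (standard statement, not a quotation from the book), for a smooth parametric model the local log-likelihood ratio satisfies

    \log \frac{dP^n_{\theta + h/\sqrt{n}}}{dP^n_{\theta}} = h^\top \Delta_{n,\theta} - \tfrac{1}{2}\, h^\top I_\theta\, h + o_{P_\theta}(1), \qquad \Delta_{n,\theta} \rightsquigarrow N(0, I_\theta),

so that, locally, the model behaves like observing a single draw from N(h, I_\theta^{-1}); this is the approximation by limit experiments around which the book is organized.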