Bayesian Variable Selection for High Dimensional Data Analysis


Book Description

In the practice of statistical modeling, it is often desirable to have an accurate predictive model. Modern data sets usually have a large number of predictors, so parsimony is an especially important issue. Best-subset selection is a conventional method of variable selection. Because the number of variables is large relative to the sample size, and the variables are often severely collinear, standard statistical methods for selecting relevant variables face difficulties. Bayesian stochastic search variable selection has enjoyed considerable empirical success in a variety of applications. This book therefore proposes a modified Bayesian stochastic search variable selection approach for variable selection and two-class or multi-class classification based on a (multinomial) probit regression model. We demonstrate the performance of the approach on several real data sets. The results show that our approach selects smaller numbers of relevant variables while achieving competitive classification accuracy.
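The probit model that underlies this kind of stochastic search is usually handled through the latent-variable representation of Albert and Chib (1993). A minimal two-class statement, in our own notation rather than the book's, is:

```latex
% Latent-variable (data-augmentation) form of the two-class probit model:
\[
z_i = x_i^{\top} \beta + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, 1),
\qquad y_i = \mathbb{1}\{ z_i > 0 \},
\]
% so that P(y_i = 1 | x_i) = Phi(x_i^T beta). Given beta, each z_i has a
% truncated-normal full conditional; given z, the model is linear-Gaussian.
% This is what makes Gibbs-based stochastic search over variable subsets
% tractable in the probit setting.
```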




Bayesian Variable Selection for High-dimensional Data with an Ordinal Response


Book Description

Health outcome and disease status measurements frequently appear on an ordinal scale; that is, the outcome is categorical but has an inherent ordering. Many previous studies have shown associations between gene expression and disease status. Identification of important genes may be useful for developing novel diagnostic and prognostic tools to predict or classify stage of disease. Gene expression data is usually high-dimensional, meaning that the number of genes is much greater than the sample size, or number of patients. We describe some existing frequentist methods for high-dimensional data with an ordinal response. Following Tibshirani (1996), who described the LASSO estimate as the Bayesian posterior mode when the regression parameters have independent Laplace priors, we propose a new approach for high-dimensional data with an ordinal response that is rooted in the Bayesian paradigm. We show through simulation studies that our proposed Bayesian approach outperforms the existing frequentist methods. We then compare the performance of the frequentist and Bayesian approaches using hepatocellular carcinoma studies.
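The Tibshirani (1996) connection invoked here can be stated in one line. With Gaussian errors and independent Laplace priors (standard notation, assumed rather than quoted from the book):

```latex
% With y | beta ~ N(X beta, sigma^2 I) and independent Laplace priors
% pi(beta_j) = (gamma/2) exp(-gamma |beta_j|), the negative log-posterior is
\[
-\log p(\beta \mid y)
  = \frac{1}{2\sigma^{2}} \lVert y - X\beta \rVert_{2}^{2}
    + \gamma \sum_{j=1}^{p} \lvert \beta_j \rvert + \text{const},
\]
% so the posterior mode coincides with the LASSO estimate at penalty
% lambda = 2 sigma^2 gamma in the usual ||y - X beta||^2 + lambda sum|beta_j| form.
```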




Bayesian Solutions to High-dimensional Data Challenges Using Hybrid Search


Book Description

In the era of Big Data, variable selection with high-dimensional data has drawn increasing attention. With a large number of predictors, model fitting and prediction become a major challenge. In this dissertation, we propose three different yet interconnected methodologies, covering theory, computation, and real applications for various scenarios of regression analysis. The primary goal of this dissertation is to develop powerful Bayesian solutions to high-dimensional data challenges using a new variable selection strategy called hybrid search. To effectively reduce computational costs in high-dimensional data analysis, we propose novel computational strategies that can quickly evaluate a large number of marginal likelihoods simultaneously within a single computation. In Chapter 1, we discuss the background and current challenges in high-dimensional variable selection and motivate our study. In Chapter 2, we introduce a new Bayesian method for best subset selection in the context of linear regression. The proposed method rapidly finds the best subset via a hybrid search algorithm that combines deterministic local search and stochastic global search. In Chapter 3, we extend the approach of Chapter 2 to the multivariate linear regression model, which analyzes the relationship between multiple response variables and a common set of predictors. In Chapter 4, we propose a general Bayesian method to perform high-dimensional variable selection for various data types, such as binary, count, continuous, and time-to-event (survival) data. Using Bayesian approximation techniques, we develop a general computing strategy that enables us to assess the marginal likelihoods of many candidate models within a single computation. In addition, to accelerate convergence, we employ a hybrid search algorithm that can quickly explore the model space and accurately locate the global maximum of the marginal posterior probabilities.
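To make the hybrid-search idea concrete, here is a minimal sketch in Python. It uses BIC as a cheap stand-in for the log marginal likelihood and a random restart as the stochastic global move; this illustrates the general strategy and is not the dissertation's algorithm:

```python
import numpy as np

def bic(X, y, subset):
    """BIC of the OLS fit on `subset` (lower is better); a cheap stand-in
    for the negative log marginal likelihood of that candidate model."""
    n = len(y)
    if subset:
        Xs = X[:, sorted(subset)]
        resid = y - Xs @ np.linalg.lstsq(Xs, y, rcond=None)[0]
    else:
        resid = y
    return n * np.log(resid @ resid / n) + len(subset) * np.log(n)

def hybrid_search(X, y, n_iters=200, seed=0):
    """Alternate deterministic local moves (toggle one predictor) with a
    stochastic global jump whenever no local move improves the score."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    current = set()
    best, best_score = set(), bic(X, y, current)
    for _ in range(n_iters):
        # Deterministic local search: score every add/drop-one neighbor.
        neighbors = [current ^ {j} for j in range(p)]
        scores = [bic(X, y, s) for s in neighbors]
        j = int(np.argmin(scores))
        if scores[j] < bic(X, y, current):
            current = neighbors[j]  # accept the best improving move
        else:
            # Stochastic global search: restart from a random small subset.
            k = int(rng.integers(1, min(5, p) + 1))
            current = set(rng.choice(p, size=k, replace=False).tolist())
        score = bic(X, y, current)
        if score < best_score:
            best, best_score = set(current), score
    return sorted(best), best_score

# Toy usage: recover the 3 active predictors among 50.
rng = np.random.default_rng(1)
X = rng.standard_normal((100, 50))
beta = np.zeros(50)
beta[[3, 17, 41]] = [2.0, -1.5, 1.0]
y = X @ beta + rng.standard_normal(100)
print(hybrid_search(X, y))
```

The point of "evaluating many marginal likelihoods within a single computation" is that all add/drop-one neighbors can be scored together in one vectorized pass; the list comprehension above is the naive, unvectorized version of that step.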




Statistical Analysis for High-Dimensional Data


Book Description

This book features research contributions from The Abel Symposium on Statistical Analysis for High Dimensional Data, held in Nyvågar, Lofoten, Norway, in May 2014. The focus of the symposium was on statistical and machine learning methodologies specifically developed for inference in “big data” situations, with particular reference to genomic applications. The contributors, who are among the most prominent researchers on the theory of statistics for high dimensional inference, present new theories and methods, as well as challenging applications and computational solutions. Specific themes include, among others, variable selection and screening, penalised regression, sparsity, thresholding, low dimensional structures, computational challenges, non-convex situations, learning graphical models, sparse covariance and precision matrices, semi- and non-parametric formulations, multiple testing, classification, factor models, clustering, and preselection. Highlighting cutting-edge research and casting light on future research directions, the contributions will benefit graduate students and researchers in computational biology, statistics and the machine learning community.




Handbook of Bayesian Variable Selection


Book Description

Bayesian variable selection has experienced substantial developments over the past 30 years with the proliferation of large data sets. Identifying relevant variables to include in a model allows simpler interpretation, avoids overfitting and multicollinearity, and can provide insights into the mechanisms underlying an observed phenomenon. Variable selection is especially important when the number of potential predictors is substantially larger than the sample size and sparsity can reasonably be assumed. The Handbook of Bayesian Variable Selection provides a comprehensive review of theoretical, methodological and computational aspects of Bayesian methods for variable selection. The topics covered include spike-and-slab priors, continuous shrinkage priors, Bayes factors, Bayesian model averaging, partitioning methods, as well as variable selection in decision trees and edge selection in graphical models. The handbook targets graduate students and established researchers who seek to understand the latest developments in the field. It also provides a valuable reference for all interested in applying existing methods and/or pursuing methodological extensions.

Features:
- Provides a comprehensive review of methods and applications of Bayesian variable selection.
- Divided into four parts: Spike-and-Slab Priors; Continuous Shrinkage Priors; Extensions to Various Modeling; and Other Approaches to Bayesian Variable Selection.
- Covers theoretical and methodological aspects, as well as worked-out examples with R code provided in the online supplement.
- Includes contributions by experts in the field.
- Supported by a website with code, data, and other supplementary material.







Bayesian Variable Selection with Spike-and-slab Priors


Book Description

A major focus of intensive methodological research in recent times has been knowledge extraction from the high-dimensional datasets made available by advances in research technologies. Coupled with the growing popularity of Bayesian methods in statistical analysis, a range of new techniques has evolved that allow innovative model-building and inference in high-dimensional settings, an important one among these being Bayesian variable selection (BVS). The broad goal of this thesis is to explore different BVS methods and demonstrate their application in high-dimensional psychological data analysis. In particular, the focus is on a class of sparsity-enforcing priors called 'spike-and-slab' priors, which are mixture priors on regression coefficients with density functions that are peaked at zero (the 'spike') and also place substantial probability mass on a wide range of non-zero values (the 'slab'). It is demonstrated that BVS with spike-and-slab priors achieved a reasonable degree of dimensionality reduction when applied to a psychiatric dataset in a logistic regression setup. BVS performance was also compared to that of the LASSO (least absolute shrinkage and selection operator), a popular machine-learning technique, as reported in Ahn et al. (2016). The findings indicate that BVS with a spike-and-slab prior provides a competitive alternative to machine-learning methods, with the additional advantages of ease of interpretation and the potential to handle more complex models. In conclusion, this thesis adds a new cutting-edge technique to the lab's tool-shed and helps introduce Bayesian variable selection to researchers in Cognitive Psychology, where it remains relatively unexplored as a dimensionality-reduction tool.
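The spike-and-slab mixture described above is commonly written as follows; this is the point-mass variant, and continuous-spike formulations replace the point mass with a narrow normal:

```latex
% Point-mass spike-and-slab prior on each regression coefficient:
\[
\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, \delta_0(\beta_j)
  + \gamma_j\, \mathcal{N}(0, \tau^{2}),
\qquad \gamma_j \sim \mathrm{Bernoulli}(\theta),
\]
% where delta_0 is a point mass at zero (the "spike") and the wide normal is
% the "slab". The posterior inclusion probability P(gamma_j = 1 | y) measures
% the evidence that predictor j belongs in the model; variables with low
% inclusion probability are dropped, which is the dimensionality reduction
% the thesis reports.
```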




High-dimensional Data Analysis


Book Description

Over the last few years, significant developments have been taking place in high-dimensional data analysis, driven primarily by a wide range of applications in many fields such as genomics and signal processing. In particular, substantial advances have been made in the areas of feature selection, covariance estimation, classification, and regression. This book examines important issues arising from high-dimensional data analysis and explores key ideas for statistical inference and prediction. It is structured around topics on multiple hypothesis testing, feature selection, regression, and classification.




Bayesian Variable Selection and Functional Data Analysis


Book Description

High-dimensional statistics is one of the most studied topics in the field of statistics. Among the most interesting problems to arise in the last 15 years is variable selection, or subset selection. Variable selection is a powerful statistical tool that can be exploited in functional data analysis. In the first part of this thesis, we implement a Bayesian variable selection method for automatic knot selection. We propose a spike-and-slab prior on the knots and formulate a conjugate stochastic search variable selection for significant knots. The computation is substantially faster than existing knot selection methods, as we use Metropolis-Hastings algorithms and a Gibbs sampler for estimation. This work focuses on a single nonlinear covariate, modeled with regression splines.

In the next stage, we study Bayesian variable selection in additive models with high-dimensional predictors. The selection of nonlinear functions in such models has become highly important in recent research, and Bayesian selection methods have advantages over contemporary frequentist ones. Chapter 2 examines Bayesian sparse group lasso theory based on spike-and-slab priors to determine its applicability for variable selection and function estimation in nonparametric additive models.

The primary objective of Chapter 3 is to build a classification method using longitudinal volumetric magnetic resonance imaging (MRI) data from five regions of interest (ROIs). A functional data analysis method is used to handle the longitudinal measurement of the ROIs, and the functional coefficients are later used in the classification models. We propose a Pólya-gamma augmentation method to classify normal controls and diseased patients based on functional MRI measurements. We obtain fast posterior sampling by avoiding the slow and complicated Metropolis-Hastings algorithm. Our main motivation is to determine the important ROIs that have the highest separating power for classifying our dichotomous response. We compare the sensitivity, specificity, and accuracy of classification based on single ROIs and on various combinations of them, obtaining a sensitivity of over 85% and a specificity of around 90% for most combinations.

Next, we turn to Bayesian classification and selection methodology. The main goal of Chapter 4 is to employ longitudinal trajectories of a large number of sub-regional brain volumetric MRI measurements as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. Direct sampling of the regression coefficients from the Bayesian logistic model is difficult because of its complicated likelihood function. In high-dimensional scenarios, the selection of predictors is paramount, introduced through spike-and-slab priors, non-local priors, or horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, Bayesian estimation provides proper estimates of the model parameters, which are also useful for inference. Another advantage of working with logistic regression is that it yields the log odds of relative risk for AD compared to normal controls based on the selected longitudinal predictors, rather than simply classifying patients from cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.

Finally, in Chapter 5 we present a new approach to Bayesian model-based clustering for spatiotemporal data. A simple linear mixed model (LME) derived from a functional model is used to model spatiotemporal cerebral white-matter data extracted from healthy aging individuals. The LME provides prior information for the spatial covariance structure and brain segmentation based on white-matter intensity. This motivates us to build stochastic model-based clustering to group voxels according to their longitudinal and location information. A cluster-specific random effect induces correlation among repeated measures. The problem of finding partitions is handled by imposing a prior structure on cluster partitions in order to derive a stochastic objective function.
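The Pólya-gamma device referenced above rests on the integral identity of Polson, Scott, and Windle (2013). A minimal statement for the logistic likelihood, in our notation rather than the thesis's:

```latex
% Polya-gamma augmentation identity (Polson, Scott & Windle, 2013),
% with p(omega) the PG(b, 0) density and kappa = a - b/2:
\[
\frac{(e^{\psi})^{a}}{(1 + e^{\psi})^{b}}
  = 2^{-b} e^{\kappa \psi} \int_{0}^{\infty} e^{-\omega \psi^{2}/2}\, p(\omega)\, d\omega .
\]
% For logistic regression take psi_i = x_i^T beta, a = y_i, b = 1. Conditional
% on omega_i ~ PG(1, x_i^T beta), the likelihood in beta is Gaussian, so with a
% normal prior beta ~ N(0, B) it has the normal full conditional
\[
\beta \mid \omega, y \sim \mathcal{N}(m_{\omega}, V_{\omega}),
\qquad
V_{\omega} = \left( X^{\top} \Omega X + B^{-1} \right)^{-1},
\qquad
m_{\omega} = V_{\omega} X^{\top} \kappa,
\]
% where Omega = diag(omega_1, ..., omega_n) and kappa_i = y_i - 1/2. This is
% why a pure Gibbs sampler suffices and Metropolis-Hastings can be avoided.
```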




Bayesian Model Selection Consistency for High-dimensional Regression


Book Description

Bayesian model selection has enjoyed considerable prominence in high-dimensional variable selection in recent years. Despite its popularity, the asymptotic theory of high-dimensional variable selection has not been fully explored. In this study, we aim to identify prior conditions for Bayesian model selection consistency under high-dimensional regression settings. In a Bayesian framework, posterior model probabilities can be used to quantify the importance of models given the observed data. Hence, our focus is on the asymptotic behavior of posterior model probabilities when the number of potential predictors grows with the sample size. This dissertation contains the following three projects. In the first project, we investigate the asymptotic behavior of posterior model probabilities under Zellner's g-prior, one of the most popular choices for model selection in Bayesian linear regression. We establish a simple and intuitive condition on the g-prior under which the posterior model distribution concentrates at the true model as the sample size increases, even if the number of predictors grows much faster than the sample size. Simulation results indicate that satisfying our condition is essential for the success of Bayesian high-dimensional variable selection under the g-prior. In the second project, we extend our framework to a general class of priors. The most pressing challenge in this generalization is that the marginal likelihood cannot be expressed in closed form. To address this problem, we develop a general form of Laplace approximation under a high-dimensional setting. As a result, we establish general sufficient conditions for high-dimensional Bayesian model selection consistency. Our simulation study and real data analysis demonstrate that the proposed condition allows us to identify the true data-generating model consistently. In the last project, we extend our framework to Bayesian generalized linear regression models. The distinctive feature of our proposed framework is that we do not impose any specific form of data distribution. In this project, we develop a general condition under which the true model tends to maximize the marginal likelihood even when the number of predictors increases faster than the sample size. Our condition provides useful guidelines for the specification of priors, including hyperparameter selection. Our simulation study demonstrates the validity of the proposed condition for Bayesian model selection consistency with non-Gaussian data.
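For reference, Zellner's g-prior discussed in the first project takes the following standard form in a linear submodel; the specific conditions the dissertation establishes are not reproduced here:

```latex
% Zellner's g-prior on the coefficients of submodel M_gamma with design X_gamma:
\[
\beta_{\gamma} \mid \sigma^{2}, g
  \sim \mathcal{N}\!\left( 0,\; g\,\sigma^{2} \left( X_{\gamma}^{\top} X_{\gamma} \right)^{-1} \right).
\]
% Posterior model probabilities weigh the marginal likelihood against the
% model prior:
\[
p(M_{\gamma} \mid y)
  = \frac{ p(y \mid M_{\gamma})\, p(M_{\gamma}) }
         { \sum_{\gamma'} p(y \mid M_{\gamma'})\, p(M_{\gamma'}) } .
\]
% Model selection consistency means p(M_{gamma_0} | y) -> 1 in probability as
% n -> infinity, where M_{gamma_0} indexes the true model, even when the
% number of candidate predictors grows with n.
```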