A Two-stage Bayesian Variable Selection Method with the Extension of Lasso for Geo-referenced Count Data


Book Description

Due to the complex nature of geo-referenced data, multicollinearity among risk factors is a commonly encountered issue in public health spatial studies; it inflates the variance of the regression estimates and thus lowers parameter estimation accuracy. To address this issue, we propose a two-stage variable selection method that extends the least absolute shrinkage and selection operator (Lasso) to the Bayesian spatial setting in order to investigate the impact of risk factors on health outcomes. Specifically, in stage I we perform variable selection using Bayesian Lasso and several other variable selection approaches. Then, in stage II, we perform model selection using only the variables selected in stage I and compare the methods again. To evaluate the performance of the two-stage variable selection methods, we conduct a simulation study with different distributions for the risk factors, using geo-referenced count data as the outcome and Michigan as the research region. We consider cases in which all candidate risk factors are independently normally distributed or follow a multivariate normal distribution with different correlation levels. Two other Bayesian variable selection methods, the binary indicator approach and the combination of binary indicator and Lasso, are considered and compared as alternatives. The simulation results indicate that the proposed two-stage Bayesian Lasso variable selection method performs best in both the independent and the dependent cases considered. Compared with the one-stage approach and the two alternative methods, the two-stage Bayesian Lasso approach provides the highest estimation accuracy in all scenarios considered.
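As a rough illustration of the setting, a hierarchical model consistent with this description (the Poisson link, the CAR spatial prior, and the hyperparameter names below are assumptions made for exposition, not necessarily the exact specification used in the book) could be written as

$$y_i \mid \mu_i \sim \mathrm{Poisson}(\mu_i), \qquad \log \mu_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \phi_i,$$
$$\beta_j \mid \lambda \sim \mathrm{Laplace}(0, 1/\lambda), \quad j = 1, \dots, p, \qquad \boldsymbol{\phi} \sim \mathrm{CAR}(\tau^2),$$

where the Laplace (Bayesian Lasso) prior shrinks unimportant coefficients toward zero in stage I, and stage II refits the spatial model using only the covariates retained from stage I.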




Bayesian Variable Selection Using Lasso


Book Description

This thesis proposes to combine the Kuo and Mallick approach (1998) and the Bayesian Lasso approach (2008) by introducing a Laplace distribution as the conditional prior of the regression parameters given the indicator variables. Gibbs sampling is used to sample from the joint posterior distribution. We compare these two new methods to existing Bayesian variable selection methods such as those of Kuo and Mallick, George and McCulloch, and Park and Casella, and provide an overall qualitative assessment of mixing efficiency and separation. We also apply the proposed methodology to an air pollution dataset with the goal of identifying the main factors controlling the pollutant concentration.
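A sketch of the combined model, written in the standard Kuo and Mallick parameterization where coefficients enter the likelihood as $\gamma_j \beta_j$ (the exact parameterization and hyperpriors in the thesis may differ), is

$$\mathbf{y} \mid \boldsymbol{\beta}, \boldsymbol{\gamma}, \sigma^2 \sim N\!\Big(\textstyle\sum_{j} \gamma_j \beta_j \mathbf{x}_j,\; \sigma^2 I\Big), \qquad \gamma_j \sim \mathrm{Bernoulli}(\pi_j),$$
$$\beta_j \mid \gamma_j, \lambda \sim \mathrm{Laplace}(0, 1/\lambda),$$

so the Laplace prior of Park and Casella (2008) takes the place of the Gaussian conditional prior typically used with the Kuo and Mallick (1998) indicators, while the full conditionals remain amenable to Gibbs sampling.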




A Bayesian Variable Selection Method with Applications to Spatial Data


Book Description

This thesis first describes the general idea behind Bayesian inference, various sampling methods based on Bayes' theorem, and many examples. Then a Bayesian approach to model selection, called Stochastic Search Variable Selection (SSVS), is discussed. It was originally proposed by George and McCulloch (1993). In a normal regression model where the number of covariates is large, only a small subset tends to be significant. This Bayesian procedure specifies a mixture prior for each unknown regression coefficient; this form of mixture prior was originally proposed by Geweke (1996). The mixture prior is updated as data become available to generate a posterior distribution that assigns higher posterior probabilities to coefficients that are significant in explaining the response. A spatial modeling method is also described in this thesis. Prior distributions for all unknown parameters and latent variables are specified. Simulation studies under different models are implemented to test the efficiency of SSVS. A real dataset, taken from a small region of the Cape Floristic Region in South Africa, is used to analyze the plant distribution in that region. The original multi-category response is transformed into a presence/absence (binary) response for simpler analysis. First, SSVS is used on this dataset to select the subset of significant covariates. Then a spatial model is fitted using the chosen covariates and, post-estimation, predictive maps of the posterior probabilities of presence and absence are obtained for the study region. Posterior estimates of the true regression coefficients are also provided, along with a map of the spatial random effects.
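For reference, the mixture prior at the heart of SSVS takes the familiar George and McCulloch (1993) form (the hyperparameter names here follow the standard presentation and are not taken from the thesis):

$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\, N(0, \tau_j^2) + \gamma_j\, N(0, c_j^2 \tau_j^2), \qquad \gamma_j \sim \mathrm{Bernoulli}(p_j),$$

with $\tau_j$ chosen small, so that $\gamma_j = 0$ effectively sets $\beta_j$ to zero, and $c_j$ large, so that $\gamma_j = 1$ lets $\beta_j$ be estimated freely; the posterior distribution of $\boldsymbol{\gamma}$ then assigns higher probability to subsets of covariates that explain the response.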




Bayesian Variable Selection for High Dimensional Data Analysis


Book Description

In the practice of statistical modeling, it is often desirable to have an accurate predictive model. Modern data sets usually have a large number of predictors, so parsimony is an especially important issue. Best-subset selection is a conventional method of variable selection. Due to the large number of variables relative to the sample size and severe collinearity among the variables, standard statistical methods for selecting relevant variables often face difficulties. Bayesian stochastic search variable selection has gained much empirical success in a variety of applications. This book therefore proposes a modified Bayesian stochastic search variable selection approach for variable selection and two-class/multi-class classification based on a (multinomial) probit regression model. We demonstrate the performance of the approach on several real data sets. The results show that our approach selects smaller numbers of relevant variables and achieves competitive classification accuracy.
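As a minimal sketch of the binary case (the multinomial extension and the book's specific modifications are omitted; this is the standard latent-variable probit formulation with a spike-and-slab prior, stated as an assumed baseline):

$$z_i = \mathbf{x}_i^{\top}\boldsymbol{\beta} + \varepsilon_i, \quad \varepsilon_i \sim N(0, 1), \qquad y_i = \mathbf{1}(z_i > 0),$$
$$\beta_j \mid \gamma_j \sim (1 - \gamma_j)\,\delta_0 + \gamma_j\, N(0, c^2), \qquad \gamma_j \sim \mathrm{Bernoulli}(p),$$

where Gibbs sampling cycles through the latent $z_i$, the coefficients $\beta_j$, and the indicators $\gamma_j$, and the selected variables are those with high posterior inclusion probability.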







Bayesian Variable Selection for Non-Gaussian Data Using Global-Local Shrinkage Priors and the Multivariate Logit-Beta Distribution


Book Description

Variable selection has become an important and growing problem in Bayesian analysis. The literature on Bayesian variable selection methods tends to address a single response type, most typically a continuous response type, where the data are assumed to be Gaussian/symmetric. In this dissertation, we develop a novel global-local shrinkage prior for non-symmetric settings and multiple response-type settings by combining the perspectives of global-local shrinkage and the conjugate multivariate distribution.

In Chapter 2, we focus on the problem of variable selection when the data are possibly non-symmetric and continuous-valued. We propose modeling the continuous-valued data and the coefficient vector with the multivariate logit-beta (MLB) distribution. To perform variable selection in a Bayesian context we make use of global-local shrinkage priors to enforce sparsity. Specifically, such priors can be defined as a Gaussian scale mixture with a global shrinkage parameter and a local shrinkage parameter for each regression coefficient. We provide a technical discussion illustrating that our use of the multivariate logit-beta distribution under a Pólya-Gamma augmentation scheme has an explicit connection to a well-known global-local shrinkage method (i.e., the horseshoe prior) and extends it to possibly non-symmetric data. Moreover, our method can be implemented using an efficient block Gibbs sampler. Evidence of improvements in mean squared error and variable selection over the standard implementation of the horseshoe prior in skewed data settings is provided in simulated and real data examples.

In Chapter 3, we direct our attention to the canonical variable selection problem in multiple response-type settings, where the observed dataset consists of multiple response types (e.g., continuous, count-valued, Bernoulli trials, etc.). We propose the same global-local shrinkage prior as in Chapter 2, but for multiple response-type datasets. The implementation of our Bayesian variable selection method for such data types is straightforward because the multivariate logit-beta prior is conjugate for several members of the natural exponential family of distributions, which leads to the binomial/beta and negative binomial/beta hierarchical models. Our proposed model allows not only the estimation and selection of independent regression coefficients, but also of regression coefficients shared across response types, which can be used to explicitly model dependence in spatial and time-series settings. An efficient block Gibbs sampler is developed and is found to be effective in obtaining accurate estimates and variable selection results in simulation studies and in an analysis of public health and financial costs from natural disasters in the U.S.
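For orientation, the Gaussian global-local construction referenced above, in its horseshoe form, is

$$\beta_j \mid \lambda_j, \tau \sim N(0, \lambda_j^2 \tau^2), \qquad \lambda_j \sim \mathrm{C}^{+}(0, 1), \qquad \tau \sim \mathrm{C}^{+}(0, 1),$$

where $\tau$ is the global shrinkage parameter and $\lambda_j$ the local one. The dissertation's contribution replaces the Gaussian/symmetric layer with the multivariate logit-beta distribution under a Pólya-Gamma augmentation; that extension is not reproduced in this sketch.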




BIVAS


Book Description

In this thesis, we consider a Bayesian bi-level variable selection problem in high-dimensional regression. In many practical situations, it is natural to assign group membership to each predictor; for example, genetic variants can be grouped at the gene level, and a covariate shared across different tasks naturally forms a group. Thus, it is of interest to select important groups as well as important members within those groups. Existing methods based on Markov chain Monte Carlo (MCMC) are often computationally intensive and do not scale to large data sets. To address this problem, we consider variational inference for bi-level variable selection (BIVAS). In contrast to the commonly used mean-field approximation, we propose a hierarchical factorization to approximate the posterior distribution, exploiting the structure of bi-level variable selection. Moreover, we develop a computationally efficient and fully parallelizable algorithm based on this variational approximation. We further extend the developed method to data sets from multi-task learning. Comprehensive numerical results from both simulation studies and real data analysis demonstrate the advantages of BIVAS over existing methods in variable selection, parameter estimation and computational efficiency. The BIVAS software, with support for parallelization, is implemented in the R package `bivas', available at https://github.com/mxcai/bivas.
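A schematic of a bi-level spike-and-slab structure of the kind described here (the notation and hyperparameters below are assumptions for illustration; see the `bivas' package documentation for the precise model) is

$$\beta_{jk} = \eta_j\, \gamma_{jk}\, b_{jk}, \qquad b_{jk} \sim N(0, \sigma_\beta^2), \qquad \eta_j \sim \mathrm{Bernoulli}(\pi), \qquad \gamma_{jk} \sim \mathrm{Bernoulli}(\alpha),$$

where $\eta_j$ indicates whether group $j$ is active and $\gamma_{jk}$ whether member $k$ of group $j$ is active. The hierarchical variational factorization described above keeps each coefficient coupled with its own indicators rather than treating all latent variables as independent, which respects the bi-level structure while remaining parallelizable across groups.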




Bayesian Variable Selection


Book Description




Understanding and Assessment of a Two-component G-prior in Variable Selection


Book Description

We present a Bayesian variable selection method based on an extension of Zellner's g-prior in linear models. More specifically, we propose a two-component G-prior in which a tuning parameter, calibrated by use of pseudo variables, is introduced to adjust the distance between the two components. We assess the impact of the tuning parameter b, which governs the distance between important and unimportant variables, on variable selection by controlling the Bayesian false selection rate with respect to unimportant variables through the created pseudo variables. We show that implementing the proposed prior in variable selection is more efficient than using Zellner's g-prior.
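For context, the standard Zellner g-prior that the method extends is

$$\boldsymbol{\beta} \mid \sigma^2 \sim N\!\big(\mathbf{0},\; g\, \sigma^2 (X^{\top} X)^{-1}\big),$$

and, as described above, the two-component version splits the single g into two components, one associated with variables treated as important and one with those treated as unimportant, with the tuning parameter b controlling the separation between them; b is calibrated so that pseudo (known-irrelevant) variables are selected at a controlled Bayesian false selection rate. The exact form of the two-component covariance is not reproduced in this sketch.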