Bayesian Nonparametric Clusterings in Relational and High-dimensional Settings with Applications in Bioinformatics


Book Description

Recent advances in high-throughput methodologies offer researchers the ability to understand complex systems via high-dimensional and multi-relational data. One example is the realm of molecular biology, where disparate data (such as gene sequence, gene expression, and interaction information) are available for various snapshots of biological systems. This type of high-dimensional and multi-relational data allows for unprecedentedly detailed analysis, but also presents challenges in accounting for all the variability. High-dimensional data often have a multitude of underlying relationships, each represented by a separate clustering structure, where the number of structures is typically unknown a priori. To address the challenges faced by traditional clustering methods on high-dimensional and multi-relational data, we developed three feature selection and cross-clustering methods: 1) the infinite relational model with feature selection (FIRM), which incorporates the rich information of multi-relational data; 2) Bayesian Hierarchical Cross-Clustering (BHCC), a deterministic approximation to the Cross Dirichlet Process mixture (CDPM) and to cross-clustering; and 3) a randomized approximation (RBHCC), based on a truncated hierarchy. An extension of BHCC, Bayesian Congruence Measuring (BCM), is proposed to measure incongruence between genes and to identify sets of congruent loci with identical evolutionary histories. We adapt our BHCC algorithm to the inference of BCM, where the intended structure of each view (congruent loci) represents consistent evolutionary processes. We consider an application of FIRM to categorizing mRNA and microRNA. The model uses latent structures to encode the expression pattern and the gene ontology annotations. 
We also apply FIRM to recover the categories of ligands and proteins, and to predict unknown drug-target interactions, where the latent categorization structure encodes drug-target interactions, chemical compound similarity, and amino acid sequence similarity. BHCC and RBHCC are shown to have improved predictive performance (in terms of both cluster membership and missing-value prediction) compared to traditional clustering methods. Our results suggest that these novel approaches to integrating multi-relational information have a promising future in the biological sciences, where incorporating data related to varying features is often regarded as a daunting task.




Nonparametric Bayesian Models for Unsupervised Learning


Book Description

Unsupervised learning is an important topic in machine learning. In particular, clustering is an unsupervised learning problem that arises in a variety of applications for data analysis and mining. Unfortunately, clustering is an ill-posed problem and, as such, a challenging one: no ground truth is available that can be used to validate clustering results. Two issues arise as a consequence. The first is that various clustering algorithms embed their own biases, resulting from different optimization criteria; as a result, each algorithm may discover different patterns in a given dataset. The second issue concerns the setting of parameters: in clustering, parameter settings control the characterization of individual clusters and the total number of clusters in the data. Clustering ensembles have been proposed to address the issue of the different biases induced by various algorithms. Clustering ensembles combine different clustering results, and can provide solutions that are robust against spurious elements in the data. Although clustering ensembles provide a significant advance, they do not satisfactorily address the model selection and parameter tuning problems. Bayesian approaches have been applied to clustering to address the parameter tuning and model selection issues. Bayesian methods provide a principled way to address these problems by assuming prior distributions on model parameters. Prior distributions assign low probabilities to parameter values that are unlikely; they therefore serve as regularizers for model parameters, and can help avoid over-fitting. In addition, the marginal likelihood is used by Bayesian approaches as the criterion for model selection. Although Bayesian methods provide a principled way to perform parameter tuning and model selection, the key question "How many clusters?" is still open. This is a fundamental question for model selection. 
A special kind of Bayesian method, the nonparametric Bayesian approach, has been proposed to address this important model selection issue. Unlike parametric Bayesian models, for which the number of parameters is finite and fixed, nonparametric Bayesian models allow the number of parameters to grow with the number of observations. After observing the data, nonparametric Bayesian models fit the data with finite-dimensional parameters. An additional issue with clustering is high dimensionality. High-dimensional data pose a difficult challenge to the clustering process. A common scenario with high-dimensional data is that clusters may exist in different subspaces comprised of different combinations of features (dimensions). In other words, data points in a cluster may be similar to each other along a subset of dimensions, but not in all dimensions. Subspace clustering techniques, a.k.a. co-clustering or bi-clustering, have been proposed to address the dimensionality issue (here, I use the term co-clustering). Like clustering, co-clustering suffers from an ill-posed nature and the lack of ground truth to validate the results. Although attempts have been made in the literature to address individually the major issues related to clustering, no previous work has addressed them jointly. In my dissertation I propose a unified framework that addresses all three issues at the same time. I designed a nonparametric Bayesian clustering ensemble (NBCE) approach, which assumes that multiple observed clustering results are generated from an unknown consensus clustering. The underlying distribution is assumed to be a mixture distribution with a nonparametric Bayesian prior, i.e., a Dirichlet Process. The number of mixture components, a.k.a. the number of consensus clusters, is learned automatically. By combining the ensemble methodology and nonparametric Bayesian modeling, NBCE addresses both the ill-posed nature and the parameter setting/model selection issues of clustering. 
Furthermore, NBCE outperforms individual clustering methods, since it can escape local optima by combining multiple clustering results. I also designed a nonparametric Bayesian co-clustering ensemble (NBCCE) technique. NBCCE inherits the advantages of NBCE and, in addition, is effective with high-dimensional data. As such, NBCCE provides a unified framework to address all three aforementioned issues. NBCCE assumes that multiple observed co-clustering results are generated from an unknown consensus co-clustering. The underlying distribution is assumed to be a mixture with a nonparametric Bayesian prior. I developed two models to generate co-clusters in terms of row- and column-clusters. In one case, row- and column-clusters are assumed to be independent, and NBCCE places two independent Dirichlet Process priors on the hidden consensus co-clustering, one for rows and one for columns. The second model captures the dependence between row- and column-clusters by assuming a Mondrian Process prior on the hidden consensus co-clustering. Combined with Mondrian priors, NBCCE provides more flexibility to fit the data. I have performed extensive evaluation on relational data and protein-molecule interaction data. The empirical evaluation demonstrates the effectiveness of NBCE and NBCCE and their advantages over traditional clustering and co-clustering methods.
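The Dirichlet Process prior used here is often explained through its sequential "Chinese restaurant process" view, in which the number of clusters is not fixed in advance but grows with the data. A minimal sketch, assuming only the standard CRP construction (function and parameter names are illustrative, not from the dissertation):

```python
import random

def crp_partition(n, alpha, seed=0):
    """Draw a random partition of n items from a Chinese restaurant
    process with concentration alpha (the sequential view of a
    Dirichlet Process mixture's cluster assignments)."""
    rng = random.Random(seed)
    sizes = []       # sizes[k] = number of items in cluster k so far
    assignment = []  # cluster index for each item
    for i in range(n):
        # Item i joins existing cluster k with prob sizes[k]/(i+alpha)
        # and starts a new cluster with prob alpha/(i+alpha).
        r = rng.uniform(0, i + alpha)
        cum = 0.0
        for k, s in enumerate(sizes):
            cum += s
            if r < cum:
                sizes[k] += 1
                assignment.append(k)
                break
        else:
            assignment.append(len(sizes))
            sizes.append(1)
    return assignment

part = crp_partition(100, alpha=1.0)
print(len(set(part)))  # number of clusters, learned rather than fixed a priori
```

Larger `alpha` yields more clusters on average; the expected number grows only logarithmically with n, which is why such priors regularize model complexity.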




Unsupervised Group Discovery and Link Prediction in Relational Datasets


Book Description

Clustering represents one of the most common statistical procedures and a standard tool for pattern discovery and dimension reduction. Most often the objects to be clustered are described by a set of measurements or observables, e.g., the coordinates of vectors or the attributes of people. In many cases, however, the available observations appear in the form of links or connections (e.g., communication or transaction networks). This data contains valuable information that can in general be exploited in order to discover groups and better understand the structure of the dataset. Since in most real-world datasets several of these links are missing, it is also useful to develop procedures that can predict those unobserved connections. In this report we address the problem of unsupervised group discovery in relational datasets. A fundamental issue in all clustering problems is that the actual number of clusters is unknown a priori. In most cases this is addressed by running the model several times, assuming a different number of clusters each time, and selecting the value that provides the best fit based on some criterion (e.g., the Bayes factor in the case of Bayesian techniques). It is easily understood that it would be preferable to develop techniques in which the number of clusters is essentially learned from the data along with the rest of the model parameters. For that purpose, we adopt a nonparametric Bayesian framework, which provides a very flexible modeling environment in which the size of the model, i.e., the number of clusters, can adapt to the available data and readily accommodate outliers. The latter is particularly important since several groups of interest might consist of a small number of members and would most likely be smeared out by traditional modeling techniques. 
Finally, the proposed framework combines all the advantages of standard Bayesian techniques, such as integration of prior knowledge in a principled manner, seamless accommodation of missing data, and quantification of confidence in the output. In the first section of this report, we review the Infinite Relational Model (IRM), which serves as the basis for further developments. The IRM assumes that each object belongs to a single group. In subsequent sections we discuss two mixed-membership models, i.e., models which can account for the fact that an object can belong to several groups simultaneously, in which case we are also interested in the degree of membership. For that purpose it is perhaps more natural to speak in terms of identities rather than groups. In particular, we assume that each object has an unknown identity which can consist of one or more components. The terms groups and identities are therefore considered equivalent in subsequent sections. A sub-section is also devoted to variational techniques, which have the potential of accelerating the inference process. Finally, we discuss possible extensions to dynamic settings in which the available data includes timestamps and the goal is to find how group sizes and group memberships evolve in time. Even though the majority of the presentation is restricted to objects of a single type (domain) and pairwise, binary links of a single type, it is shown that the proposed framework can be extended to links of various types between several domains.
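The IRM's generative story can be sketched compactly: objects are assigned to groups by a Chinese restaurant process, each ordered pair of groups draws a Beta-distributed link probability, and binary links are Bernoulli draws from those probabilities. The following toy sketch assumes the standard IRM formulation; all names and hyperparameter values are illustrative:

```python
import random

def irm_generate(n, alpha=1.0, a=0.5, b=0.5, seed=0):
    """Generate a binary relation under the Infinite Relational Model:
    a CRP(alpha) assigns objects to groups, each ordered group pair
    gets a Beta(a, b) link probability, and links are Bernoulli draws."""
    rng = random.Random(seed)
    z = []       # group assignment per object
    sizes = []   # current group sizes
    for i in range(n):
        r = rng.uniform(0, i + alpha)
        cum = 0.0
        for k, s in enumerate(sizes):
            cum += s
            if r < cum:
                sizes[k] += 1
                z.append(k)
                break
        else:
            z.append(len(sizes))
            sizes.append(1)
    K = len(sizes)
    # eta[k][l] = probability of a link from an object in group k to one in l
    eta = [[rng.betavariate(a, b) for _ in range(K)] for _ in range(K)]
    links = [[1 if rng.random() < eta[z[i]][z[j]] else 0 for j in range(n)]
             for i in range(n)]
    return z, links
```

Inference inverts this story: given the observed link matrix, posterior sampling recovers the group assignments z, and missing entries are predicted from the inferred eta, which is how the model supports link prediction.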




Relational Data Mining


Book Description

As the first book devoted to relational data mining, this coherently written multi-author monograph provides a thorough introduction and systematic overview of the area. The first part introduces the reader to the basics and principles of classical knowledge discovery in databases and inductive logic programming; subsequent chapters by leading experts assess the techniques in relational data mining in a principled and comprehensive way; finally, three chapters deal with advanced applications in various fields and refer the reader to resources for relational data mining. This book will become a valuable source of reference for R&D professionals active in relational data mining. Students as well as IT professionals and ambitious practitioners interested in learning about relational data mining will appreciate the book as a useful text and a gentle introduction to this exciting new field.




Bayesian Methods for Nonlinear Classification and Regression


Book Description

Regression analysis of real data unfortunately rarely yields linear or other simple relationships (parametric models). This book helps you understand and master more complex, nonparametric models as well. The strengths and weaknesses of each individual model are demonstrated through application to standard datasets. Widely used nonparametric models are brought into a coherent probabilistic framework by means of Bayesian methods.




Generalized Linear Models


Book Description

This volume describes how to conceptualize, perform, and critique traditional generalized linear models (GLMs) from a Bayesian perspective, and how to use modern computational methods to summarize inferences using simulation. Introducing dynamic modeling for GLMs and containing over 1000 references and equations, Generalized Linear Models considers parametric and semiparametric approaches to overdispersed GLMs and presents methods for analyzing correlated binary data using latent variables. It also proposes a semiparametric method to model link functions for binary response data, and identifies areas of important future research and new applications of GLMs.
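As one concrete instance of the classical GLM machinery this volume builds on, a logistic regression with a single predictor can be fit by iteratively reweighted least squares, i.e., Newton's method on the Bernoulli log-likelihood with a logit link. A minimal sketch, not taken from the book (names are illustrative):

```python
import math

def logistic_irls(x, y, iters=25):
    """Fit the logistic GLM y ~ Bernoulli(sigmoid(b0 + b1*x)) by
    iteratively reweighted least squares (Newton's method)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iters):
        # Accumulate gradient g and Hessian h of the log-likelihood
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))  # fitted probability
            w = p * (1.0 - p)                            # IRLS weight
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system h * delta = g
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (-h01 * g0 + h00 * g1) / det
    return b0, b1

b0, b1 = logistic_irls([0, 1, 2, 3, 4, 5], [0, 0, 1, 0, 1, 1])
```

On small non-separable data this converges in a handful of iterations; on perfectly separable data the maximum likelihood estimate does not exist and the coefficients diverge, which is one practical motivation for the Bayesian treatment with proper priors.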




Cluster Analysis


Book Description




Bayesian Nonparametrics


Book Description

Bayesian nonparametrics works, both theoretically and computationally. The theory provides highly flexible models whose complexity grows appropriately with the amount of data. Computational issues, though challenging, are no longer intractable. All that is needed is an entry point: this intelligent book is the perfect guide to what can seem a forbidding landscape. Tutorial chapters by Ghosal, Lijoi and Prünster, Teh and Jordan, and Dunson advance from theory, to basic models and hierarchical modeling, to applications and implementation, particularly in computer science and biostatistics. These are complemented by companion chapters by the editors and Griffin and Quintana, providing additional models, examining computational issues, identifying future growth areas, and giving links to related topics. This coherent text gives ready access both to underlying principles and to state-of-the-art practice. Specific examples are drawn from information retrieval, NLP, machine vision, computational biology, biostatistics, and bioinformatics.




Bayesian Nonparametric Data Analysis


Book Description

This book reviews nonparametric Bayesian methods and models that have proven useful in the context of data analysis. Rather than providing an encyclopedic review of probability models, the book’s structure follows a data analysis perspective. As such, the chapters are organized by traditional data analysis problems. In selecting specific nonparametric models, simpler and more traditional models are favored over specialized ones. The discussed methods are illustrated with a wealth of examples, including applications ranging from stylized examples to case studies from recent literature. The book also includes an extensive discussion of computational methods and details on their implementation. R code for many examples is included in online software pages.




Finite Mixture and Markov Switching Models


Book Description

The past decade has seen powerful new computational tools for modeling which combine a Bayesian approach with recent Monte Carlo simulation techniques based on Markov chains. This book is the first to offer a systematic presentation of the Bayesian perspective on finite mixture modelling. The book is designed to show how finite mixture and Markov switching models are formulated, what structures they imply on the data, what their potential uses are, and how they are estimated. Presenting its concepts informally without sacrificing mathematical correctness, it will serve a wide readership, including statisticians as well as biologists, economists, engineers, and financial and market researchers.
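The estimation of finite mixtures treated in the book is classically illustrated by the EM algorithm for a two-component Gaussian mixture: the E-step computes each component's responsibility for each point, and the M-step re-estimates weights, means, and variances from those responsibilities. A minimal illustrative sketch, not the book's code:

```python
import math

def em_gmm2(data, iters=50):
    """EM for a two-component univariate Gaussian mixture."""
    mu = [min(data), max(data)]   # crude but deterministic initialization
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: responsibility-weighted updates of all parameters
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return pi, mu, var

pi, mu, var = em_gmm2([0.1, -0.2, 0.3, 0.0, 5.1, 4.8, 5.2, 5.0])
```

EM converges only to a local optimum and, as the book discusses, the likelihood is invariant to relabeling the components; the Bayesian MCMC treatment has to confront this label-switching explicitly.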