Book Description
This study develops new clustering algorithms for analyzing microarray gene expression data. Cluster analysis makes it possible to infer the function of a gene by referring to genes of known function in the same cluster. In microarray data, the expression profiles of thousands of genes are observed across different experimental conditions. Because of complex experimental designs, observations from different experimental conditions may be correlated. To account for correlations among experimental conditions and correlations among genes, new clustering algorithms were developed based on Bayesian infinite mixture models within a Bayesian data analysis framework. The correlations are taken into account by specifying accurate variance-covariance matrices in the statistical model definitions. As a result, when correlations are present, the new algorithms represent the observed data more precisely and therefore produce more stable and reproducible clustering results. The mathematical and computational procedures were developed and implemented in computer programs. A Gibbs sampler was used to estimate the posterior distribution of clusters. The posterior pairwise probability (PPP) that two genes co-cluster is obtained from the estimated distribution of the classification variables. Treating the PPPs as pairwise similarity measures, clusters are then formed using traditional hierarchical clustering algorithms. The new algorithms and existing clustering algorithms were applied to simulated data and to real-world data to compare their performance. When non-zero correlations exist, the new algorithms generally obtained more accurate and stable clustering results than the existing algorithms.
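The last step described above, turning Gibbs draws into PPPs and then into clusters, can be illustrated with a minimal Python sketch. It is not the book's implementation: the function names are hypothetical, the Gibbs sampler itself is not shown (its retained draws of the classification variables are assumed to be available as an integer array), and both the use of 1 - PPP as the distance and the choice of average linkage are assumptions, since the description does not say which hierarchical method was used.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform


def posterior_pairwise_probabilities(draws):
    """Estimate the PPP matrix from Gibbs draws (hypothetical helper).

    draws: (n_draws, n_genes) integer array; draws[s, g] is the cluster
    label assigned to gene g in retained draw s. Label values need not be
    consistent across draws, only equality within a draw matters.
    """
    draws = np.asarray(draws)
    n_draws, n_genes = draws.shape
    ppp = np.zeros((n_genes, n_genes))
    for z in draws:
        # Indicator matrix: 1 where genes i and j share a cluster in this draw.
        ppp += z[:, None] == z[None, :]
    return ppp / n_draws


def clusters_from_ppp(ppp, n_clusters, method="average"):
    """Cut a hierarchical tree built on the distance 1 - PPP (assumed choice)."""
    dist = 1.0 - ppp
    np.fill_diagonal(dist, 0.0)
    tree = linkage(squareform(dist, checks=False), method=method)
    return fcluster(tree, t=n_clusters, criterion="maxclust")


# Toy input: 4 draws for 4 genes (made-up labels, for illustration only).
draws = np.array([
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [2, 2, 5, 5],
])
ppp = posterior_pairwise_probabilities(draws)
groups = clusters_from_ppp(ppp, n_clusters=2)
```

Because only equality of labels within a draw is used, the PPP matrix is unaffected by the relabeling of clusters from one draw to the next, which is one reason to summarize the sampler output this way before forming final clusters.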