Haplotype-based Statistical Inference for Case-control Genetic Association Studies with Complex Sampling


Book Description

With advances in human genome research, it is now believed that many complex diseases are triggered by the interplay of genetic susceptibilities and environmental exposures. The population-based case-control study (PBCCS) is widely used to investigate the role of genetic variants and environmental exposures in the etiology of complex diseases. There are numerous ways to implement the selection of cases and controls. In its simplest form, a simple random sampling (SRS) design is used to choose cases and controls from the diseased and disease-free populations, respectively. Though SRS is easy to conduct and the relevant statistical methodologies are well developed, more sophisticated complex sampling designs (such as stratified, clustered, and multistage sampling) are needed for the selection of cases and/or controls for a number of reasons. First, complex sampling is more time- and cost-efficient than SRS. Second, a representative sample can be chosen through complex sampling, and thus biased selection of cases and/or controls can be avoided. As a result, complex sampling is increasingly used in large-scale population-based case-control or cross-sectional genetic association studies. The analysis of complex sampling data, however, requires special attention for the following reasons. First, varying selection probabilities, as well as adjustments for nonresponse and incomplete coverage of the population at risk, result in a different population weight for each individual. Second, a multistage clustered sampling design induces non-negligible intra-cluster correlation. It is well recognized that invalid inferences can be drawn if these two complications are ignored. There is very limited literature on PBCCS with complex sampling. Therefore, there is a need to develop statistical methods that properly address the complications induced by complex sampling in genetic association studies.
In this dissertation, we propose a series of innovative statistical methods for genetic association studies that account for various sampling designs. Robust variance estimators have been developed using the Taylor linearization technique to incorporate differential weighting and clustering effects. Monte Carlo simulation studies are used to study the properties of the proposed estimators under various sampling designs. The application of the proposed methods is also illustrated using the U.S. Kidney Cancer Study (USKCS), which is one of the largest PBCCS with genomic data available so far.
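
To illustrate the general idea of Taylor linearization under differential weighting and clustering, here is a minimal Python sketch for the simplest case: a weighted (Hajek) mean with a cluster-robust standard error. It is not the dissertation's estimator. The function name, the single-stratum setting, and the with-replacement approximation for primary sampling units are all assumptions made for illustration.

```python
from collections import defaultdict
from math import sqrt


def weighted_mean_linearized(y, w, cluster):
    """Weighted (Hajek) mean with a Taylor-linearized, cluster-robust SE.

    Assumes a single stratum and treats clusters as sampled with
    replacement (a common approximation); hypothetical helper.
    """
    total_w = sum(w)
    mean = sum(wi * yi for wi, yi in zip(w, y)) / total_w

    # Linearized contribution of each observation to the ratio estimator,
    # accumulated into cluster totals so intra-cluster correlation is kept.
    z_by_cluster = defaultdict(float)
    for yi, wi, ci in zip(y, w, cluster):
        z_by_cluster[ci] += wi * (yi - mean) / total_w

    n_clusters = len(z_by_cluster)
    if n_clusters < 2:
        raise ValueError("need at least two clusters to estimate variance")

    # The cluster totals sum to zero, so no centering term is needed.
    variance = n_clusters / (n_clusters - 1) * sum(
        z * z for z in z_by_cluster.values()
    )
    return mean, sqrt(variance)
```

With equal weights and every observation in its own cluster, this reduces to the familiar sample variance divided by n, which is a quick sanity check on the linearization.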




Approximate Likelihood Inference for Haplotype Risks in Case-control Studies of a Rare Disease


Book Description

The standard design for studying risk factors for rare diseases is the case-control design. Genetic association case-control studies often include haplotypes as risk factors. Haplotypes are not always observed, though observable single-locus genotypes contain partial haplotype information. Missing haplotypes lead to analysis of data with missing covariates. Maximum likelihood (ML) inference is then based on solving a set of weighted score equations. However, the weights cannot be calculated exactly. We describe three methods that approximate ML by approximating the weights: i) naive application of prospective ML (PML), which ignores the case-control sampling design, ii) an estimating equations (EE) approach, and iii) a hybrid approach that is based on PML but uses improved weights suggested by EE. We investigate the statistical properties of the three methods by simulation. In our simulations, the hybrid approach gave more accurate estimates of statistical interactions than PML and more accurate standard errors than EE.
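
The reason haplotypes behave like missing covariates can be seen by enumerating the haplotype pairs that are compatible with an unphased genotype. The following Python sketch is an illustration only. It assumes biallelic loci coded as counts (0, 1 or 2) of one allele, and the function name is hypothetical. It does not implement any of the three weighting methods described above.

```python
from itertools import product


def compatible_haplotype_pairs(genotype):
    """List unordered haplotype pairs consistent with an unphased genotype.

    genotype: sequence of allele counts (0, 1 or 2) at biallelic loci.
    Hypothetical helper for illustration.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype codes must be 0, 1 or 2")

    het = [i for i, g in enumerate(genotype) if g == 1]
    pairs = set()
    for phase in product((0, 1), repeat=len(het)):
        # Homozygous loci are fixed: 0 -> allele 0 on both, 2 -> allele 1.
        h1 = [g // 2 for g in genotype]
        h2 = list(h1)
        # Heterozygous loci: one haplotype carries 0, the other carries 1.
        for locus, allele in zip(het, phase):
            h1[locus] = allele
            h2[locus] = 1 - allele
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return sorted(pairs)


# A double heterozygote is ambiguous: two phasings are possible.
print(compatible_haplotype_pairs((1, 1)))
```

A genotype with k heterozygous loci is compatible with 2^(k-1) unordered pairs when k is at least 1, so the phase is known with certainty only when at most one locus is heterozygous.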




Statistical Methods in Genetic Association


Book Description

Association studies offer great promise in dissecting the genetic basis of human complex diseases. The rapid expansion of genomic information and cost-effective genotyping technologies have enabled us to systematically interrogate the role of human genetic variation in common diseases by genome-wide association (GWA) mapping. However, the scale and complexity of such studies raise significant challenges in study design and data analysis. In this dissertation, we investigated several statistical problems that are relevant to population-based association studies and the fine-scale mapping of genetic variants that influence susceptibility to complex diseases. First, we developed a variance-based effect size estimator for the locus-specific genetic effect. Compared with traditional measures, the proposed estimator is less sensitive to the risk allele frequency and the population prevalence of the disease. We demonstrated that the sample size required to obtain an accurate estimate of a moderate genetic effect would be considerably large, and that the sample size increases exponentially with increased demand for precision. We next compared the power of different association test statistics. We observed that genotype-based single-locus tests are generally more powerful than multi-locus or haplotype-based statistics, especially for risk alleles far from additive, and that the power of genotype-based tests can be uniformly improved by applying an order restriction on genotypic risks. Finally, we tested different GWA strategies and explored the factors that may influence the power of GWA studies by extensive simulations using empirical genotype data from the HapMap ENCODE Project. Our results indicate that current commercial genome-wide typing products are capable of capturing most of the common risk variants; however, their power in detecting rare risk variants or variants within recombination hot spots is not satisfactory.
We also showed that the properties of the risk variants (e.g. allele frequency, local recombination rate, and functional category) have significant impacts on the power of GWA. The results generated from this comprehensive exercise would be helpful for developing efficient GWA studies.




Analysis of Complex Disease Association Studies


Book Description

According to the National Institutes of Health, a genome-wide association study is defined as any study of genetic variation across the entire human genome that is designed to identify genetic associations with observable traits (such as blood pressure or weight), or the presence or absence of a disease or condition. Whole genome information, when combined with clinical and other phenotype data, offers the potential for increased understanding of basic biological processes affecting human health, improvement in the prediction of disease and patient care, and ultimately the realization of the promise of personalized medicine. In addition, rapid advances in understanding the patterns of human genetic variation and maturing high-throughput, cost-effective methods for genotyping are providing powerful research tools for identifying genetic variants that contribute to health and disease. This burgeoning science merges the principles of statistics and genetics to make sense of the vast amounts of information available with the mapping of genomes. In order to make the most of the information available, statistical tools must be tailored and translated for the analytical issues that are unique to large-scale association studies. Analysis of Complex Disease Association Studies will provide researchers who have advanced biological knowledge and are entering the field of genome-wide association studies with the groundwork to apply statistical analysis tools appropriately and effectively. With the use of consistent examples throughout the work, chapters will provide readers with best practice for getting started (design), analyzing, and interpreting data according to their research interests. Frequently used tests will be highlighted, and a critical analysis of the advantages and disadvantages of each, complemented by case studies, will provide readers with the information they need to make the right choice for their research.
Additional tools, including links to analysis tools, tutorials, and references, will be available electronically to ensure the latest information is available. Key features: easy access to key information, including advantages and disadvantages of tests for particular applications, identification of databases, languages and their capabilities, data management risks, and frequently used tests; an extensive list of references, including links to tutorial websites; and case studies with tips and tricks.




Sample Size Estimation and Type I Error Correction in Genetic Association Studies


Book Description

Background: Statistics is a key component of bioinformatics, which provides crucial insight into biological processes, such as testing genetic association with the risk of complex human diseases and variation of drug response. A lack of statistical power due to small sample size in genetic association studies increases the probability of type II error, and the determination of the correct sample size for these studies is influenced by various biological parameters. Additionally, multiple hypothesis testing, which is common in genetic association studies, leads to type I error inflation. Objective and Methods: This study focused on statistical properties that are important in genetic association studies: 1) testing effects of biological factors on sample size estimation by regression analysis; 2) developing a two-stage Bonferroni type I error correction procedure using linkage disequilibrium (LD) to define independent haplotype blocks; and 3) adjusting alpha levels in sample size estimation based on LD structure among genetic markers in different racial groups. Results: The first study showed that a recessive genetic model requires the largest sample size; the most significant factors for sample size estimation were minor allele frequency under the recessive genetic model, and genetic effect size under dominant and additive genetic models. The two-stage adjusted Bonferroni correction was less conservative than the standard Bonferroni correction, but less liberal than FDR. Sample sizes estimated using an adjusted alpha level based on LD structure could be reduced by 14% to 24% depending upon racial group, compared with the standard Bonferroni adjustment for alpha level. Conclusion and implication: Genetic inheritance model, effect size, and allele frequency significantly impact sample size estimation. The results can be applied to genetic marker selection, sample size estimation, and statistical power prediction. 
The two-stage adjusted Bonferroni type I error correction procedure improves statistical power, and introduces a simple way to control for type I error in genetic association studies. Using LD structure across the tested DNA region to adjust the alpha value for sample size estimation by race can reduce the required total sample sizes, improve statistical power, and lead to cost-effective outcomes. Keywords: Genetic association study; Sample size estimation; Statistical power; Genetic effect; Genetic inheritance model; Linkage disequilibrium; Type I error inflation; Bonferroni type I error correction; Haplotype block; FDR.
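
As a rough illustration of why adjusting the alpha level for LD structure reduces the required sample size, the Python sketch below compares a per-marker Bonferroni threshold with a per-block threshold and feeds each into a textbook two-proportion sample size formula. This is not the two-stage procedure described above. The function names, the marker and block counts, and the allele frequencies are all hypothetical values chosen for illustration.

```python
from math import ceil
from statistics import NormalDist


def bonferroni_alpha(alpha, n_tests):
    """Per-test significance level under a standard Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def n_per_group(p_case, p_control, alpha, power=0.8):
    """Approximate sample size per group for a two-sided test comparing
    two proportions (normal approximation, unpooled variance)."""
    if p_case == p_control:
        raise ValueError("proportions must differ")
    z = NormalDist().inv_cdf
    z_alpha = z(1 - alpha / 2)
    z_beta = z(power)
    variance = p_case * (1 - p_case) + p_control * (1 - p_control)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_case - p_control) ** 2)


# Hypothetical scenario: 100 genotyped markers falling into 40 LD blocks.
alpha_markers = bonferroni_alpha(0.05, 100)
alpha_blocks = bonferroni_alpha(0.05, 40)

# Hypothetical allele frequencies in cases and controls.
n_markers = n_per_group(0.30, 0.25, alpha_markers)
n_blocks = n_per_group(0.30, 0.25, alpha_blocks)
print(n_markers, n_blocks)
```

Because the block-based threshold is less stringent, the second sample size is smaller than the first. The size of the reduction depends entirely on the assumed counts and frequencies, so it should not be compared with the 14% to 24% figures reported above.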




Detecting Rare Haplotype-environment Interaction Under Uncertainty of Gene-environment Independence Assumption with an Extension to Complex Sampling Data


Book Description

Genome-wide association studies have identified thousands of common variants associated with common diseases; however, these variants explain only a small proportion of the disease heritability, raising the question of how to find "missing heritability." Two critical factors in the quest for missing heritability are believed to be rare variants and gene-environment interactions (GXE). Recently, a method called Logistic Bayesian Lasso (LBL) was proposed for detecting GXE where G is a rare haplotype variant (rHTV). It is a powerful method for detecting rHTVs and their interactions. However, it is computationally intensive and assumes G-E independence, which may not hold in some situations. At the same time, complex sampling designs such as stratified random sampling are becoming increasingly popular for case-control studies, for example, the US kidney cancer study (KCS). There is currently no rHTV association method that can accommodate such a complex sampling design. First, we propose an improved version of LBL, which is computationally faster and can accommodate multiple covariates. Simulation studies show that it is equivalent to the original version in terms of accuracy of estimates and inference. We apply this improved version to a lung cancer dataset and find an rHTV with a protective effect for current smokers. Next, we propose an extension that allows for G-E dependence and show that, unlike the earlier version, it controls type I error rates in the presence of G-E dependence. However, the extension has reduced power when G-E independence holds. Therefore, we unify the two models by employing a reversible jump Markov chain Monte Carlo method. Our simulations show that the unified approach performs well under both G-E independence and dependence. We analyze a lung cancer dataset and find several significant interactions, including one between a specific rHTV and smoking. Finally, we adapt LBL to accommodate complex sampling.
We show that it performs well when data are collected using stratified random sampling with matching between cases and controls while the original LBL method leads to inflated type I error rates. We then analyze the KCS data and find a significant interaction between current smoking and a specific rHTV in the N-acetyltransferase 2 gene.




Genetic Dissection of Complex Traits


Book Description

The field of genetics is rapidly evolving, and new medical breakthroughs are occurring as a result of advances in knowledge of genetics. This series continually publishes important reviews of the broadest interest to geneticists and their colleagues in affiliated disciplines. This volume includes five sections on the latest advances in complex traits; methods for testing with ethical, legal, and social implications; and hot topics including discussions on a systems biology approach to drug discovery, using comparative genomics for detecting human disease genes, computationally intensive challenges, and more.




A Statistical Approach to Genetic Epidemiology


Book Description

After studying statistics and mathematics at the University of Munich and obtaining his doctoral degree from the University of Dortmund, Andreas Ziegler received the Johann-Peter-Süssmilch-Medal of the German Association for Medical Informatics, Biometry and Epidemiology for his post-doctoral work on “Model Free Linkage Analysis of Quantitative Traits” in 1999. In 2004, he was one of the recipients of the Fritz-Linder-Forum-Award from the German Association for Surgery.




Analysis of Genetic Association Studies


Book Description

Analysis of Genetic Association Studies is both a graduate level textbook in statistical genetics and genetic epidemiology, and a reference book for the analysis of genetic association studies. Students, researchers, and professionals will find the topics introduced in Analysis of Genetic Association Studies particularly relevant. The book is applicable to the study of statistics, biostatistics, genetics and genetic epidemiology. In addition to providing derivations, the book uses real examples and simulations to illustrate step-by-step applications. Introductory chapters on probability and genetic epidemiology terminology provide the reader with necessary background knowledge. The organization of this work allows for both casual reference and close study.




Genetic Analysis of Complex Disease


Book Description

An up-to-date and complete treatment of the strategies, designs and analysis methods for studying complex genetic disease in human beings. In the newly revised Third Edition of Genetic Analysis of Complex Diseases, a team of distinguished geneticists delivers a comprehensive introduction to the most relevant strategies, designs and methods of analysis for the study of complex genetic disease in humans. The book focuses on concepts and designs, thereby offering readers a broad understanding of common problems and solutions in the field based on successful applications in the design and execution of genetic studies. This edited volume contains contributions from some of the leading voices in the area and presents new chapters on high-throughput genomic sequencing, copy-number variant analysis and epigenetic studies. It provides clear and easily referenced overviews of the considerations involved in genetic analysis of complex human genetic disease, including sampling, design, data collection, linkage and association studies and social, legal and ethical issues. Genetic Analysis of Complex Diseases also provides: a thorough introduction to study design for the identification of genes in complex traits; comprehensive explorations of basic concepts in genetics, disease phenotype definition and the determination of the genetic components of disease; practical discussions of modern bioinformatics tools for analysis of genetic data; reflections on responsible conduct of research in genetic studies, as well as linkage analysis and data management; and a new expanded chapter on complex genetic interactions. This latest edition of Genetic Analysis of Complex Diseases is a must-read resource for molecular biologists, human geneticists, genetic epidemiologists and pharmaceutical researchers. It is also invaluable for graduate students taking courses in statistical genetics or genetic epidemiology.