Computational Methods for Analysis of Single Molecule Sequencing Data


Book Description

Next-generation sequencing (NGS) technologies paved the way to a significant increase in the number of sequenced genomes, both prokaryotic and eukaryotic. This increase provided an opportunity for considerable advancement in genomics and precision medicine. Although NGS technologies have proven their power in many applications such as de novo genome assembly and variation discovery, computational analysis of the data they generate is still far from perfect. The main limitation of NGS technologies is their short read length relative to the lengths of common genomic repeats. Today, newer sequencing technologies (known as single-molecule sequencing or SMS) such as Pacific Biosciences and Oxford Nanopore are producing significantly longer reads, making it theoretically possible to overcome the difficulties imposed by repeat regions. For instance, for the first time, a complete human chromosome was fully assembled using ultra-long reads generated by Oxford Nanopore. Unfortunately, long reads generated by SMS technologies are characterized by a high error rate, which prevents their direct use in many of the standard downstream analysis pipelines and poses new computational challenges. This motivates the development of new computational tools specifically designed for SMS long reads. In this thesis, we present three computational methods that are tailored for SMS long reads. First, we present lordFAST, a fast and sensitive tool for mapping noisy long reads to a reference genome. Mapping sequenced reads to their potential genomic origin is the first fundamental step for many computational biology tasks. As an example, we show that lordFAST can be successfully employed in structural variation discovery. Next, we present the second tool, CoLoRMap, which tackles the high level of base-level errors in SMS long reads by providing a means to correct them using a complementary set of NGS short reads. 
This integrative use of SMS and NGS data is known as a hybrid technique. Finally, we introduce HASLR, an ultra-fast hybrid assembler that uses reads generated by both technologies to efficiently generate accurate genome assemblies. We demonstrate that, among the tested assemblers, HASLR is not only the fastest but also the one with the lowest number of misassemblies on all samples. Furthermore, the contiguity and accuracy of the generated assemblies are on par with those of the other tools on most samples.




Computational Methods for the Analysis of Genomic Data and Biological Processes


Book Description

In recent decades, new technologies have made remarkable progress in helping to understand biological systems. Rapid advances in genomic profiling techniques such as microarrays or high-performance sequencing have brought new opportunities and challenges in the fields of computational biology and bioinformatics. Such genetic sequencing techniques allow large amounts of data to be produced, whose analysis and cross-integration could provide a complete view of organisms. As a result, it is necessary to develop new techniques and algorithms that carry out an analysis of these data with reliability and efficiency. This Special Issue collected the latest advances in the field of computational methods for the analysis of gene expression data, and, in particular, the modeling of biological processes. Here we present eleven works selected to be published in this Special Issue due to their interest, quality, and originality.




Computational Methods for Next Generation Sequencing Data Analysis


Book Description

Introduces readers to core algorithmic techniques for next-generation sequencing (NGS) data analysis and discusses a wide range of computational techniques and applications. This book provides an in-depth survey of some of the recent developments in NGS and discusses mathematical and computational challenges in various application areas of NGS technologies. The 18 chapters featured in this book have been authored by bioinformatics experts and represent the latest work in leading labs actively contributing to the fast-growing field of NGS. The book is divided into four parts. Part I focuses on computing and experimental infrastructure for NGS analysis, including chapters on cloud computing, modular pipelines for metabolic pathway reconstruction, pooling strategies for massive viral sequencing, and high-fidelity sequencing protocols. Part II concentrates on analysis of DNA sequencing data, covering the classic scaffolding problem, detection of genomic variants, including insertions and deletions, and analysis of DNA methylation sequencing data. Part III is devoted to analysis of RNA-seq data. This part discusses algorithms and compares software tools for transcriptome assembly along with methods for detection of alternative splicing and tools for transcriptome quantification and differential expression analysis. Part IV explores computational tools for NGS applications in microbiomics, including a discussion on error correction of NGS reads from viral populations, methods for viral quasispecies reconstruction, and a survey of state-of-the-art methods and future trends in microbiome analysis. 
Computational Methods for Next Generation Sequencing Data Analysis reviews computational techniques such as new combinatorial optimization methods, data structures, high performance computing, machine learning, and inference algorithms; discusses the mathematical and computational challenges in NGS technologies; and covers NGS error correction, de novo genome and transcriptome assembly, variant detection from NGS reads, and more. This text is a reference for biomedical professionals interested in expanding their knowledge of computational techniques for NGS data analysis. The book is also useful for graduate and post-graduate students in bioinformatics.




Novel Computational Methods for Improving Functional Analysis for Long Noisy Reads


Book Description

Single-molecule, real-time (SMRT) sequencing developed by Pacific Biosciences (PacBio) and nanopore sequencing developed by Oxford Nanopore Technologies (Nanopore) produce longer reads than second-generation sequencing technologies such as Illumina. The increased read length enables PacBio sequencing to close gaps in genome assembly, reveal structural variations, and characterize intra-species variations. It also holds the promise of deciphering the community structure of complex microbial communities, because long reads help metagenomic assembly. However, compared with data produced by popular short-read sequencing technologies (such as Illumina), PacBio and Nanopore data have a higher sequencing error rate and lower coverage. Therefore, new algorithms are needed to take full advantage of third-generation sequencing technologies. For example, during an alignment-based homology search, insertion or deletion errors in genes cause frameshifts, which may lead to marginal alignment scores and short alignments. In this case, it is hard to distinguish correct alignments from random alignments, and the ambiguity leads to errors in structural and functional annotation. Existing frameshift correction tools are designed for data with a much lower error rate and are not optimized for PacBio data. As an increasing number of groups are using SMRT sequencing, there is an urgent need for dedicated homology search tools for PacBio and Nanopore data. Another example is overlap detection. For both PacBio reads and Nanopore reads, there is still a need to improve the sensitivity of detecting small overlaps or overlaps with high error rates. Addressing this need will enable better assembly of metagenomic data produced by third-generation sequencing technologies. In this work, we discuss possible methods for homology search and overlap detection on third-generation sequencing data. 
For overlap detection, we designed and implemented an overlap detection program named GroupK. GroupK uses groups of short k-mer hits that satisfy statistically derived distance constraints to increase the sensitivity of small overlap detection. For the homology search, we designed and implemented a profile homology search tool named Frame-Pro, based on the profile hidden Markov model (pHMM) and a consensus sequence finding method. However, Frame-Pro still relies on multiple sequence alignment, so we implemented DeepFrame, a deep learning model that predicts the corresponding protein function for third-generation sequencing reads. In experiments on simulated reads of protein-coding sequences and real reads from the human genome, our model outperforms pHMM-based methods and the deep learning-based method. Our model can also reject unrelated DNA reads, and it achieves higher recall with precision comparable to that of the state-of-the-art method.
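
The idea of grouping short k-mer hits under a distance constraint can be illustrated with a minimal Python sketch. This is not GroupK's actual algorithm: the function names are hypothetical, and the fixed `max_diag_shift` threshold is an assumed stand-in for the statistically derived constraints mentioned above. A hit is a k-mer shared by two reads. Hits from a true overlap fall on nearly the same diagonal (read A position minus read B position), while insertions and deletions shift that diagonal only slightly.

```python
from collections import defaultdict


def kmer_positions(seq, k):
    """Map each k-mer in seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def grouped_hits(read_a, read_b, k=5, max_diag_shift=3, min_group=3):
    """Return the largest group of shared k-mer hits lying near one diagonal.

    max_diag_shift and min_group are illustrative thresholds (assumptions),
    not values taken from GroupK.
    """
    index_b = kmer_positions(read_b, k)
    hits = []
    for i in range(len(read_a) - k + 1):
        for j in index_b.get(read_a[i:i + k], []):
            hits.append((i - j, i, j))  # (diagonal, position in A, position in B)
    hits.sort()

    # Slide a window over hits sorted by diagonal, keeping the largest window
    # whose diagonals differ by at most max_diag_shift.
    best = []
    start = 0
    for end in range(len(hits)):
        while hits[end][0] - hits[start][0] > max_diag_shift:
            start += 1
        if end - start + 1 > len(best):
            best = hits[start:end + 1]

    if len(best) < min_group:
        return []
    return [(i, j) for _, i, j in best]
```

For two reads where the suffix of one matches the prefix of the other, all returned pairs share the same diagonal, and that diagonal is the offset of the overlap.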







High Performance Computational Methods for Biological Sequence Analysis


Book Description

High Performance Computational Methods for Biological Sequence Analysis presents biological sequence analysis using an interdisciplinary approach that integrates biological, mathematical and computational concepts. These concepts are presented so that computer scientists and biomedical scientists can obtain the necessary background for developing better algorithms and applying parallel computational methods. This book will enable both groups to develop the depth of knowledge needed to work in this interdisciplinary field. This work focuses on high performance computational approaches that are used to perform computationally intensive biological sequence analysis tasks: pairwise sequence comparison, multiple sequence alignment, and sequence similarity searching in large databases. These computational methods are becoming increasingly important to the molecular biology community allowing researchers to explore the increasingly large amounts of sequence data generated by the Human Genome Project and other related biological projects. The approaches presented by the authors are state-of-the-art and show how to reduce analysis times significantly, sometimes from days to minutes. High Performance Computational Methods for Biological Sequence Analysis is tremendously important to biomedical science students and researchers who are interested in applying sequence analyses to their studies, and to computational science students and researchers who are interested in applying new computational approaches to biological sequence analyses.
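
As a point of reference for the pairwise sequence comparison task named above, the following is a minimal Python sketch of the Smith-Waterman local alignment score with a linear gap penalty. It is not taken from the book, and the scoring parameters are assumed values chosen for illustration. Each row of the dynamic programming table depends only on the previous row, and cells on the same anti-diagonal are independent of one another, which is the property that parallel implementations typically exploit.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty).

    The scoring values are illustrative assumptions, not a recommended scheme.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best
```

Keeping only two rows reduces memory from quadratic to linear in the sequence length, at the cost of not being able to trace back the alignment itself.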




Biological Sequence Analysis


Book Description

Probabilistic models are becoming increasingly important in analysing the huge amount of data being produced by large-scale DNA-sequencing efforts such as the Human Genome Project. For example, hidden Markov models are used for analysing biological sequences, linguistic-grammar-based probabilistic models for identifying RNA secondary structure, and probabilistic evolutionary models for inferring phylogenies of sequences from different organisms. This book gives a unified, up-to-date and self-contained account, with a Bayesian slant, of such methods, and more generally of probabilistic methods of sequence analysis. Written by an interdisciplinary team of authors, it aims to be accessible to molecular biologists, computer scientists, and mathematicians with no formal knowledge of the other fields, and at the same time to present the state of the art in this new and highly important field.




A Computational Approach for Diagnostic Long-read Genome Sequencing


Book Description

Our understanding of the human genome has greatly expanded since the completion of the Human Genome Project. Many large-scale landmark studies have since looked at the role genetic alterations play in the predisposition to disease and identified countless disease-causing mutations. While most genomics-based research has been made possible through the commoditization of massively parallel next-generation sequencing, recent advances in sequencing technologies have allowed long-read single-molecule sequencing to further characterize and identify genetic alterations that were previously challenging to detect through conventional sequencing. In this research, we have used accurate long-read sequencing from Pacific Biosciences to study cancer and non-cancer samples alike to identify and characterize disease-associated genetic alterations. The work has involved the development of computational methods for streamlining analysis of such data to provide high-confidence structural variant calls. The analysis pipeline and tools have been used to accurately identify causative mutations in pediatric cancer cases, to discover an internal tandem duplication in the HOXD13 gene that caused syndactyly in two unrelated families, and to expand the role that activating FGFR1 mutations may play in closed spinal dysraphism.




Revealing Translational and Fundamental Insights Via Computational Analysis of Single-cell Sequencing Data


Book Description

Single-cell sequencing has emerged as a powerful tool for dissecting cellular heterogeneity and providing cell type-specific biological insights. Single-cell sequencing technologies have rapidly proliferated over the last decade, leading to an explosion of data generated from such experiments. However, the size and complexity of single-cell sequencing data pose several challenges for computational analysis, including the need for sophisticated statistical methods to distinguish biologically meaningful signals from noise, the integration of single-cell sequencing data with other types of biological information, and the development of scalable and reproducible computational pipelines. In this dissertation, I present two distinct projects analyzing single-cell sequencing data. The first is of an analytical nature and tackles a translational question. In this project, I built computational pipelines for processing and analyzing single-nucleus RNA- and ATAC-sequencing datasets generated from the amygdalae of genetically diverse heterogeneous stock rats, which were subjected to a behavioral protocol for studying addiction-like behaviors following cocaine self-administration. In doing so, I provide a standard reference for analyzing such data and reveal cell type-specific insights into the molecular underpinnings of cocaine addiction. The second project is oriented towards methods development and seeks to understand the fundamental biological question of transcriptional regulation. Here, I developed a statistical framework for simulating and modeling data from single-cell CRISPR regulatory screens and used it to perform a genome-wide interrogation of epistatic-like interactions between enhancer pairs. I found that multiple enhancers act together in a multiplicative fashion with little evidence for interactive effects between them. 
This work revealed novel insights into the collective behavior of multiple regulatory elements and provides a tool that can be applied to future datasets generated from such experiments. This dissertation exemplifies how computational methods can be applied in different contexts to extract meaning from a variety of single-cell sequencing modalities. By tackling both a translational and a fundamental biological question, I have showcased the breadth of what can be revealed by studying single-cell sequencing data and the computational methods necessary to extract this information.
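
To make the phrase "multiplicative fashion" concrete, here is a minimal Python sketch. It is not the dissertation's statistical framework, and the function name and expression values are hypothetical. Under a purely multiplicative model, the fold change in expression when two enhancers are perturbed together equals the product of the two single-perturbation fold changes, so the log ratio of observed to expected is zero. A nonzero log ratio would indicate an interactive (epistatic-like) effect.

```python
import math


def interaction_log_ratio(expr_ctrl, expr_a, expr_b, expr_ab):
    """Log of the observed double-perturbation fold change over the
    fold change expected if the two enhancers act multiplicatively.

    All inputs are mean expression levels (hypothetical units):
    control, enhancer A perturbed, enhancer B perturbed, both perturbed.
    """
    fold_a = expr_a / expr_ctrl
    fold_b = expr_b / expr_ctrl
    fold_ab = expr_ab / expr_ctrl
    expected_ab = fold_a * fold_b  # multiplicative expectation
    return math.log(fold_ab / expected_ab)
```

With a control level of 100, single perturbations at 50 and 80, and a double perturbation at 40, the observed fold change (0.4) matches the product 0.5 × 0.8, and the log ratio is zero. A double perturbation at 20 would give a negative log ratio, meaning the combined effect is stronger than the multiplicative model predicts.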




Single Molecule and Single Cell Sequencing


Book Description

This book presents an overview of recent technologies in single molecule and single cell sequencing. These sequencing technologies are revolutionizing genomic studies and the understanding of complex biological systems. The PacBio sequencer has enabled extremely long-read sequencing, and the MinION sequencer has made sequencing possible in developing countries. New developments and technologies are constantly emerging, which will further expand sequencing applications. In parallel, single cell sequencing technologies are rapidly becoming a popular platform. This volume presents not only an updated overview of these technologies, but also of the related developments in bioinformatics. Without powerful bioinformatics software, an area where rapid progress is taking place, these new technologies will not realize their full potential. All the contributors to this volume have been involved in the development of these technologies and software and have also made significant progress on their applications. This book is intended to be of interest to a wide audience ranging from genome researchers to basic molecular biologists and clinicians.