Large-scale Genome Sequence Processing


Book Description

Efficient computer programs have made it possible to elucidate and analyze large-scale genomic sequences. Fundamental tasks, such as the assembly of numerous whole-genome shotgun fragments, the alignment of complementary DNA sequences with a long genome, and the design of gene-specific primers or oligomers, require efficient algorithms and state-of-the-art implementation techniques. This textbook emphasizes basic software implementation techniques for processing large-scale genome sequences and provides executable sample programs.




The Burrows-Wheeler Transform


Book Description

The Burrows-Wheeler Transform is one of the best lossless compression methods available. It is an intriguing — even puzzling — approach to squeezing redundancy out of data, it has an interesting history, and it has applications well beyond its original purpose as a compression method. It is a relatively late addition to the compression canon, and hence our motivation to write this book, looking at the method in detail, bringing together the threads that led to its discovery and development, and speculating on what future ideas might grow out of it. The book is aimed at a wide audience, ranging from those interested in learning a little more than the short descriptions of the BWT given in standard texts, through to those whose research is building on what we know about compression and pattern matching. The first few chapters are a careful description suitable for readers with an elementary computer science background (and these chapters have been used in undergraduate courses), but later chapters collect a wide range of detailed developments, some of which are built on advanced concepts from a range of computer science topics (for example, some of the advanced material has been used in a graduate computer science course in string algorithms). Some of the later explanations require some mathematical sophistication, but most should be accessible to those with a broad background in computer science.
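To make the transform concrete, here is a minimal illustrative sketch in Python (not taken from the book): it builds the BWT naively by sorting all rotations of the input, which is far slower than the suffix-array-based constructions the book covers, but it shows why the output tends to compress well.

```python
def bwt(text, terminator="$"):
    """Naive Burrows-Wheeler Transform: append a unique terminator,
    sort all rotations of the string, and take the last column.
    O(n^2 log n) time, for illustration only."""
    s = text + terminator
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # "annb$aa": identical symbols cluster together, aiding compression
```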




Algorithms on Strings, Trees, and Sequences


Book Description

String algorithms are a traditional area of study in computer science. In recent years their importance has grown dramatically with the huge increase of electronically stored text and of molecular sequence data (DNA or protein sequences) produced by various genome projects. This book is a general text on computer algorithms for string processing. In addition to pure computer science, the book contains extensive discussions on biological problems that are cast as string problems, and on methods developed to solve them. It emphasises the fundamental ideas and techniques central to today's applications. New approaches to this complex material simplify methods that up to now have been for the specialist alone. With over 400 exercises to reinforce the material and develop additional topics, the book is suitable as a text for graduate or advanced undergraduate students in computer science, computational biology, or bio-informatics. Its discussion of current algorithms and techniques also makes it a reference for professionals.




Next Generation Sequencing


Book Description

Next generation sequencing (NGS) has surpassed the traditional Sanger sequencing method to become the main choice for large-scale, genome-wide sequencing studies with ultra-high-throughput production and a huge reduction in costs. The NGS technologies have had enormous impact on the studies of structural and functional genomics in all the life sciences. In this book, Next Generation Sequencing Advances, Applications and Challenges, the sixteen chapters written by experts cover various aspects of NGS including genomics, transcriptomics and methylomics, the sequencing platforms, and the bioinformatics challenges in processing and analysing huge amounts of sequencing data. Following an overview of the evolution of NGS in the brave new world of omics, the book examines the advances and challenges of NGS applications in basic and applied research on microorganisms, agricultural plants and humans. This book is of value to all who are interested in DNA sequencing and bioinformatics across all fields of the life sciences.




Mapping and Sequencing the Human Genome


Book Description

There is growing enthusiasm in the scientific community about the prospect of mapping and sequencing the human genome, a monumental project that will have far-reaching consequences for medicine, biology, technology, and other fields. But how will such an effort be organized and funded? How will we develop the new technologies that are needed? What new legal, social, and ethical questions will be raised? Mapping and Sequencing the Human Genome is a blueprint for this proposed project. The authors offer a highly readable explanation of the technical aspects of genetic mapping and sequencing, and they recommend specific interim and long-range research goals, organizational strategies, and funding levels. They also outline some of the legal and social questions that might arise and urge their early consideration by policymakers.




Genome-Scale Algorithm Design


Book Description

Guided by standard bioscience workflows in high-throughput sequencing analysis, this book for graduate students, researchers, and professionals in bioinformatics and computer science offers a unified presentation of genome-scale algorithms. This new edition covers the use of minimizers and other advanced data structures in pangenomics approaches.
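As a rough illustration of the minimizer idea mentioned above (a sketch under simplifying assumptions, not code from the book): within each window of w consecutive k-mers only the smallest k-mer is kept, so overlapping sequences tend to select the same seeds.

```python
def minimizers(seq, k=5, w=4):
    """(w,k)-minimizers of seq: for every window of w consecutive k-mers,
    keep the lexicographically smallest one. Real implementations hash the
    k-mers and use rolling windows; this is a toy version for illustration."""
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        selected.add(min(kmers[start:start + w]))
    return selected

print(sorted(minimizers("ACGTACGTGGACGTT")))
```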




Computational Genomics with R


Book Description

Computational Genomics with R provides a starting point for beginners in genomic data analysis and also guides more advanced practitioners toward sophisticated data analysis techniques in genomics. The book covers topics ranging from R programming to machine learning and statistics to the latest genomic data analysis techniques. The text provides accessible information and explanations, always with the genomics context in the background, and contains practical, well-documented examples in R so readers can analyze their own data by simply reusing the code presented. Because computational genomics is interdisciplinary, the book offers different starting points for people with different backgrounds: for example, a biologist might skip the sections on basic genome biology and start with R programming, whereas a computer scientist might want to start with genome biology.

After reading this book:
- You will have the basics of R and be able to dive right into specialized uses of R for computational genomics, such as using Bioconductor packages.
- You will be familiar with statistics and with the supervised and unsupervised learning techniques that are important in data modeling and in exploratory analysis of high-dimensional data.
- You will understand genomic intervals and the operations on them that are used for tasks such as aligned read counting and genomic feature annotation.
- You will know the basics of processing and quality-checking high-throughput sequencing data.
- You will be able to do sequence analysis, such as calculating GC content for parts of a genome or finding transcription factor binding sites (see the short sketch after this description).
- You will know about visualization techniques used in genomics, such as heatmaps, meta-gene plots, and genomic track visualization.
- You will be familiar with the analysis of different high-throughput sequencing data sets, such as RNA-seq, ChIP-seq, and BS-seq.
- You will know basic techniques for integrating and interpreting multi-omics datasets.

Altuna Akalin is a group leader and head of the Bioinformatics and Omics Data Science Platform at the Berlin Institute of Medical Systems Biology, Max Delbrück Center, Berlin. He has been developing computational methods for analyzing and integrating large-scale genomics data sets since 2002 and has published an extensive body of work in this area. The framework for this book grew out of the yearly computational genomics courses he has been organizing and teaching since 2015.
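The sequence-analysis sketch referenced in the list above: the book itself works in R with Bioconductor, but the underlying calculation is language-agnostic, so here is a toy Python version of windowed GC content (the function name and parameters are illustrative, not the book's API).

```python
def gc_content(seq, window=100, step=100):
    """Fraction of G/C bases in successive windows of a sequence.
    Toy version of the GC-content scan described above; the book's own
    examples use R/Bioconductor instead."""
    seq = seq.upper()
    fractions = []
    for start in range(0, len(seq) - window + 1, step):
        chunk = seq[start:start + window]
        gc = chunk.count("G") + chunk.count("C")
        fractions.append((start, gc / window))
    return fractions

print(gc_content("ACGT" * 50))  # every 100-bp window of this toy sequence is 50% GC
```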




High-performance Processing of Next-generation Sequencing Data on CUDA-enabled GPUs


Book Description

With the technological advances in the field of genomics and sequencing, the processing of the vast amounts of generated data becomes more and more challenging. Nowadays, software for processing large-scale datasets of sequencing reads may take hours to days to complete, even on high-end workstations. This explains the need for new approaches that achieve faster, high-performance applications. In contrast to traditional CPU-based software, algorithms that exploit the massively parallel many-core architecture and fast memory of GPUs can potentially deliver the desired performance in many fields. In this thesis, we introduce two novel GPU-accelerated applications, CARE and CAREx, for two common steps in sequence processing pipelines, error correction and read extension of Next Generation Sequencing (NGS) Illumina data, in order to improve the results of downstream data analysis. To the best of our knowledge, CARE and CAREx are the first modern GPU-accelerated solutions for the respective problems.

A key component of our algorithm is the identification of similar DNA sequences within a dataset. For this purpose, we developed a minhashing-based index data structure for large-scale read datasets. In conjunction with our fast bit-parallel shifted Hamming distance computations, this allows for the efficient identification of similar reads. The resulting set of similar sequences is subsequently arranged into a gap-free multiple-sequence alignment to solve the problem at hand.

Sequencing machines introduce both systematic and random errors. CARE, the Context-Aware Read Error corrector, accurately removes errors introduced by NGS sequencing machines during the initial sequencing of a biological sample. With the help of a pre-trained random forest, CARE generates two orders of magnitude fewer false positives than its competitors while showing similar numbers of true positives. Read extension describes the process of elongating DNA sequences; the presence of longer sequences improves the resolution of more, and larger, structures within a genome. CAREx, the Context-Aware Read Extender, produces longer sequences, so-called pseudo-long reads, by connecting the two reads of read pairs that were sequenced in close proximity. Evaluation shows that CAREx produces significantly more highly accurate pseudo-long reads than the state of the art.

With algorithms tailored towards high-performance GPU computation, both CARE and CAREx run significantly faster than their CPU-based competitors while at the same time producing more accurate results. Processing a large human dataset with 30x coverage with CARE requires less than 30 minutes on a single A100 GPU; this time can be further reduced to 10 minutes on multi-GPU systems. In contrast, CPU-based tools like Musket or BFC take 3 hours and 1.5 hours, respectively. Read extension of a human dataset with CAREx takes 3.3 hours to complete on a single GPU, whereas Konnector2 requires over a day to complete. This shows that large-scale sequence processing can greatly benefit from the use of GPUs, and that multiple-sequence-alignment-based algorithms should be considered despite their increased complexity because they provide great accuracy. While our general building blocks have been tailored towards our needs for error correction and read extension, they could also prove useful in other GPU-accelerated applications that process sequence data.
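To make the minhashing step more concrete, here is a schematic Python sketch of the underlying idea (this is not the thesis's CUDA implementation, and the function name and parameters are illustrative): each read is summarized by the minimum hash value of its k-mers under several salted hash functions, so reads that share many k-mers tend to collide in a signature-keyed hash table and become candidates for the subsequent shifted-Hamming-distance check and multiple-sequence alignment.

```python
import random

def minhash_signature(read, k=16, num_hashes=8, seed=42):
    """MinHash signature of a read's k-mer set: for each of num_hashes
    salted hash functions, keep the smallest value over all k-mers.
    Reads with similar k-mer content share many signature entries.
    Schematic only; CARE/CAREx use a GPU-accelerated variant."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(64) for _ in range(num_hashes)]
    kmers = {read[i:i + k] for i in range(len(read) - k + 1)}
    return tuple(min(hash(km) ^ salt for km in kmers) for salt in salts)

r1 = "ACGTACGTGGATCCTAGGATCCAATTGGCCA"
r2 = "ACGTACGTGGATCCTAGGATCCAATTGGCCT"  # differs from r1 only in the last base
matches = sum(a == b for a, b in zip(minhash_signature(r1), minhash_signature(r2)))
print(f"{matches}/8 signature entries agree")
```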




Algorithms for Next-Generation Sequencing


Book Description

Advances in sequencing technology have allowed scientists to study the human genome in greater depth and on a larger scale than ever before – as many as hundreds of millions of short reads in the course of a few days. But what are the best ways to deal with this flood of data? Algorithms for Next-Generation Sequencing is an invaluable resource for students and researchers in bioinformatics and computational biology, as well as for biologists seeking to process and manage the data generated by next-generation sequencing, and it serves as both a textbook and a self-study resource. In addition to offering an in-depth description of the algorithms for processing sequencing data, it also presents useful case studies describing the applications of this technology.




Bringing Large-scale Multiple Genome Analysis One Step Closer


Book Description

Genome sequence comparisons of exponentially growing data sets form the foundation for the comparative analysis tools provided by community biological data resources such as the Integrated Microbial Genomes (IMG) system at the Joint Genome Institute (JGI). We present an example of how ScalaBLAST, a high-throughput sequence analysis program, harnesses increasingly critical high-performance computing to perform the sequence analysis that is a central component of maintaining a state-of-the-art sequence data repository.

The Integrated Microbial Genomes (IMG) system [1] is a data management and analysis platform for microbial genomes hosted at the JGI. IMG contains both draft and complete JGI genomes integrated with other publicly available microbial genomes from all three domains of life. IMG provides tools and viewers for interactive analysis of genomes, genes, and functions, individually or in a comparative context. Most of these tools are based on pre-computed pairwise sequence similarities involving millions of genes. These computations are becoming prohibitively time-consuming with the rapid increase in the number of newly sequenced genomes incorporated into IMG and the need to regularly refresh the content of IMG in order to reflect changes in the annotations of existing genomes. Thus, building IMG 2.0 (released on December 1st, 2006) entailed reloading from NCBI's RefSeq all the genomes in the previous version of IMG (IMG 1.6, as of September 1st, 2006) together with 1,541 new public microbial, viral, and eukaryal genomes, bringing the total number of IMG genomes to 2,301. A critical part of building IMG 2.0 involved using PNNL's ScalaBLAST software to compute pairwise similarities for over 2.2 million genes in under 26 hours on 1,000 processors, illustrating the impact that new-generation bioinformatics tools are poised to make in biology.

The BLAST algorithm [2, 3] is a familiar bioinformatics application for computing sequence similarity and has become a workhorse in large-scale genomics projects. The rapid growth of genome resources such as IMG cannot be sustained without more powerful tools such as ScalaBLAST that use large-scale computing resources more effectively to perform the core BLAST calculations. ScalaBLAST is a high-performance computing application designed to deliver high-throughput BLAST results on high-end supercomputers. Other parallel sequence comparison applications have been developed [4-6]; however, problems with scaling generally prevent these applications from being used for very large searches. ScalaBLAST [7] is the first BLAST application to be highly scalable with respect to both the size of the database and the number of processors, on high-end hardware as well as on commodity clusters. ScalaBLAST achieves high throughput by parsing a large collection of query sequences into independent subgroups; these smaller tasks are assigned to independent process groups. Efficient scaling is achieved by sharing (transparently to the user) only one copy of the target database across all processors using the Global Arrays toolkit [8, 9], which provides a software implementation of a shared-memory interface. ScalaBLAST was initially deployed on the 1,960-processor MPP2 cluster in the William R. Wiley Environmental Molecular Sciences Laboratory at Pacific Northwest National Laboratory, and has since been ported to a variety of Linux-based clusters and shared-memory architectures, including SGI Altix, AMD Opteron, and Intel Xeon-based clusters. Future targets include IBM BlueGene, Cray, and SGI Altix XE architectures.

The importance of performing high-throughput calculations rapidly lies in the rate of growth of sequence data. For a genome sequencing center to provide multiple-genome comparison capabilities, it must keep pace with the exponentially growing collection of protein data, both from its own genomes and from public genome information. As sequence data continues to grow exponentially, this challenge will only increase with time. Solving the BLAST throughput challenge for centralized data resources like IMG has the potential to unlock the power of emerging analysis methods which, until recently, were limited by the availability of multiple-genome comparison data. Fig. 1 illustrates how the run time achieved by efficient scaling in ScalaBLAST enabled the IMG all-vs-all BLAST calculations to complete in roughly one day. Note that to keep pace with the growing IMG database, we will have to double the number of processors used in these calculations during the upcoming year. Grid-based solutions for improving the throughput of BLAST searches have become a popular and attractive option for some centers. The Institute for Genomic Research (http://www.tigr.org/), for instance, has implemented a grid-based BLAST tool that allows users to submit requests to be farmed out to available computers on demand.
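The partitioning strategy described above can be illustrated with a small, hedged Python sketch (this is not ScalaBLAST code; the names, the trivial scoring function, and the use of multiprocessing are all stand-ins): queries are split across independent worker processes, each of which scans the same target database, mirroring how ScalaBLAST assigns query subgroups to process groups while sharing one copy of the database via the Global Arrays toolkit.

```python
from multiprocessing import Pool

# Toy stand-in for the shared target database; ScalaBLAST shares a single
# copy across processors through the Global Arrays toolkit instead.
TARGET_DB = ["ACGTACGTGG", "TTGACCATGA", "GGCATCCGTA"]

def score(query, target):
    """Trivial similarity score (count of matching positions); a real
    pipeline would run a BLAST comparison here."""
    return sum(q == t for q, t in zip(query, target))

def best_hit(query):
    """Each query is processed independently against the whole database."""
    return query, max(TARGET_DB, key=lambda target: score(query, target))

if __name__ == "__main__":
    queries = ["ACGTACGTGA", "GGCATCCGAA", "TTGACCATGT", "ACGTTTTTGG"]
    with Pool(processes=2) as pool:  # independent worker processes, akin to process groups
        for query, hit in pool.map(best_hit, queries):
            print(query, "->", hit)
```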