Large-Scale Parallel Data Mining


Book Description

With the unprecedented growth-rate at which data is being collected and stored electronically today in almost all fields of human endeavor, the efficient extraction of useful information from the data available is becoming an increasing scientific challenge and a massive economic need. This book presents thoroughly reviewed and revised full versions of papers presented at a workshop on the topic held during KDD'99 in San Diego, California, USA in August 1999 complemented by several invited chapters and a detailed introductory survey in order to provide complete coverage of the relevant issues. The contributions presented cover all major tasks in data mining including parallel and distributed mining frameworks, associations, sequences, clustering, and classification. All in all, the volume presents the state of the art in the young and dynamic field of parallel and distributed data mining methods. It will be a valuable source of reference for researchers and professionals.




Mining Very Large Databases with Parallel Processing


Book Description

Mining Very Large Databases with Parallel Processing addresses the problem of large-scale data mining. It is an interdisciplinary text, describing advances in the integration of three computer science areas, namely `intelligent' (machine learning-based) data mining techniques, relational databases and parallel processing. The basic idea is to use concepts and techniques of the latter two areas - particularly parallel processing - to speed up and scale up data mining algorithms. The book is divided into three parts. The first part presents a comprehensive review of intelligent data mining techniques such as rule induction, instance-based learning, neural networks and genetic algorithms. Likewise, the second part presents a comprehensive review of parallel processing and parallel databases. Each of these parts includes an overview of commercially-available, state-of-the-art tools. The third part deals with the application of parallel processing to data mining. The emphasis is on finding generic, cost-effective solutions for realistic data volumes. Two parallel computational environments are discussed, the first excluding the use of commercial-strength DBMS, and the second using parallel DBMS servers. It is assumed that the reader has a knowledge roughly equivalent to a first degree (BSc) in accurate sciences, so that (s)he is reasonably familiar with basic concepts of statistics and computer science. The primary audience for Mining Very Large Databases with Parallel Processing is industry data miners and practitioners in general, who would like to apply intelligent data mining techniques to large amounts of data. The book will also be of interest to academic researchers and postgraduate students, particularly database researchers, interested in advanced, intelligent database applications, and artificial intelligence researchers interested in industrial, real-world applications of machine learning.




Scaling Up Machine Learning


Book Description

This integrated collection covers a range of parallelization platforms, concurrent programming frameworks and machine learning settings, with case studies.




Mining of Massive Datasets


Book Description

Now in its second edition, this book focuses on practical algorithms for mining data from even the largest datasets.




Large-Scale Computing Techniques for Complex System Simulations


Book Description

Complex systems modeling and simulation approaches are being adopted in a growing number of sectors, including finance, economics, biology, astronomy, and many more. Technologies ranging from distributed computing to specialized hardware are explored and developed to address the computational requirements arising in complex systems simulations. The aim of this book is to present a representative overview of contemporary large-scale computing technologies in the context of complex systems simulations applications. The intention is to identify new research directions in this field and to provide a communications platform facilitating an exchange of concepts, ideas and needs between the scientists and technologist and complex system modelers. On the application side, the book focuses on modeling and simulation of natural and man-made complex systems. On the computing technology side, emphasis is placed on the distributed computing approaches, but supercomputing and other novel technologies are also considered.




Data Mining in Large Sets of Complex Data


Book Description

The amount and the complexity of the data gathered by current enterprises are increasing at an exponential rate. Consequently, the analysis of Big Data is nowadays a central challenge in Computer Science, especially for complex data. For example, given a satellite image database containing tens of Terabytes, how can we find regions aiming at identifying native rainforests, deforestation or reforestation? Can it be made automatically? Based on the work discussed in this book, the answers to both questions are a sound “yes”, and the results can be obtained in just minutes. In fact, results that used to require days or weeks of hard work from human specialists can now be obtained in minutes with high precision. Data Mining in Large Sets of Complex Data discusses new algorithms that take steps forward from traditional data mining (especially for clustering) by considering large, complex datasets. Usually, other works focus in one aspect, either data size or complexity. This work considers both: it enables mining complex data from high impact applications, such as breast cancer diagnosis, region classification in satellite images, assistance to climate change forecast, recommendation systems for the Web and social networks; the data are large in the Terabyte-scale, not in Giga as usual; and very accurate results are found in just minutes. Thus, it provides a crucial and well timed contribution for allowing the creation of real time applications that deal with Big Data of high complexity in which mining on the fly can make an immeasurable difference, such as supporting cancer diagnosis or detecting deforestation.




Frontiers in Massive Data Analysis


Book Description

Data mining of massive data sets is transforming the way we think about crisis response, marketing, entertainment, cybersecurity and national intelligence. Collections of documents, images, videos, and networks are being thought of not merely as bit strings to be stored, indexed, and retrieved, but as potential sources of discovery and knowledge, requiring sophisticated analysis techniques that go far beyond classical indexing and keyword counting, aiming to find relational and semantic interpretations of the phenomena underlying the data. Frontiers in Massive Data Analysis examines the frontier of analyzing massive amounts of data, whether in a static database or streaming through a system. Data at that scale-terabytes and petabytes-is increasingly common in science (e.g., particle physics, remote sensing, genomics), Internet commerce, business analytics, national security, communications, and elsewhere. The tools that work to infer knowledge from data at smaller scales do not necessarily work, or work well, at such massive scale. New tools, skills, and approaches are necessary, and this report identifies many of them, plus promising research directions to explore. Frontiers in Massive Data Analysis discusses pitfalls in trying to infer knowledge from massive data, and it characterizes seven major classes of computation that are common in the analysis of massive data. Overall, this report illustrates the cross-disciplinary knowledge-from computer science, statistics, machine learning, and application disciplines-that must be brought to bear to make useful inferences from massive data.




Proceedings of the Fourth SIAM International Conference on Data Mining


Book Description

The Fourth SIAM International Conference on Data Mining continues the tradition of providing an open forum for the presentation and discussion of innovative algorithms as well as novel applications of data mining. This is reflected in the talks by the four keynote speakers who discuss data usability issues in systems for data mining in science and engineering, issues raised by new technologies that generate biological data, ways to find complex structured patterns in linked data, and advances in Bayesian inference techniques. This proceedings includes 61 research papers.




Parallel Processing for Scientific Computing


Book Description

Parallel processing has been an enabling technology in scientific computing for more than 20 years. This book is the first in-depth discussion of parallel computing in 10 years; it reflects the mix of topics that mathematicians, computer scientists, and computational scientists focus on to make parallel processing effective for scientific problems. Presently, the impact of parallel processing on scientific computing varies greatly across disciplines, but it plays a vital role in most problem domains and is absolutely essential in many of them. Parallel Processing for Scientific Computing is divided into four parts: The first concerns performance modeling, analysis, and optimization; the second focuses on parallel algorithms and software for an array of problems common to many modeling and simulation applications; the third emphasizes tools and environments that can ease and enhance the process of application development; and the fourth provides a sampling of applications that require parallel computing for scaling to solve larger and realistic models that can advance science and engineering.




Metaheuristics for Big Data


Book Description

Big Data is a new field, with many technological challenges to be understood in order to use it to its full potential. These challenges arise at all stages of working with Big Data, beginning with data generation and acquisition. The storage and management phase presents two critical challenges: infrastructure, for storage and transportation, and conceptual models. Finally, to extract meaning from Big Data requires complex analysis. Here the authors propose using metaheuristics as a solution to these challenges; they are first able to deal with large size problems and secondly flexible and therefore easily adaptable to different types of data and different contexts. The use of metaheuristics to overcome some of these data mining challenges is introduced and justified in the first part of the book, alongside a specific protocol for the performance evaluation of algorithms. An introduction to metaheuristics follows. The second part of the book details a number of data mining tasks, including clustering, association rules, supervised classification and feature selection, before explaining how metaheuristics can be used to deal with them. This book is designed to be self-contained, so that readers can understand all of the concepts discussed within it, and to provide an overview of recent applications of metaheuristics to knowledge discovery problems in the context of Big Data.