Mining Imperfect Data


Book Description

Data mining is concerned with the analysis of databases large enough that various anomalies, including outliers, incomplete data records, and more subtle phenomena such as misalignment errors, are virtually certain to be present. Mining Imperfect Data describes in detail a number of these problems, as well as their sources, their consequences, their detection, and their treatment. Specific strategies for data pretreatment and analytical validation that are broadly applicable are described, making them useful in conjunction with most data mining analysis methods. Examples are presented to illustrate the performance of the pretreatment and validation methods in a variety of situations, both simulation based, where "correct" results are known unambiguously, and real data examples that illustrate typical cases met in practice.




Mining Imperfect Data


Book Description

It has been estimated that as much as 80% of the total effort in a typical data analysis project is taken up with data preparation, including reconciling and merging data from different sources, identifying and interpreting various data anomalies, and selecting and implementing appropriate treatment strategies for the anomalies that are found. This book focuses on the identification and treatment of data anomalies, including examples that highlight different types of anomalies, their potential consequences if left undetected and untreated, and options for dealing with them. As both data sources and free, open-source data analysis software environments proliferate, more people and organizations are motivated to extract useful insights and information from data of many different kinds (e.g., numerical, categorical, and text). The book emphasizes the range of open-source tools available for identifying and treating data anomalies, mostly in R but also with several examples in Python. Mining Imperfect Data: With Examples in R and Python, Second Edition presents a unified coverage of 10 different types of data anomalies (outliers, missing data, inliers, metadata errors, misalignment errors, thin levels in categorical variables, noninformative variables, duplicated records, coarsening of numerical data, and target leakage). It includes an in-depth treatment of time-series outliers and simple nonlinear digital filtering strategies for dealing with them, and it provides a detailed introduction to several useful mathematical characteristics of important data characterizations that do not appear to be widely known among practitioners, such as functional equations and key inequalities. While this book is primarily for data scientists, researchers in a variety of fields—namely statistics, machine learning, physics, engineering, medicine, social sciences, economics, and business—will also find it useful.




Managing and Mining Sensor Data


Book Description

Advances in hardware technology have lead to an ability to collect data with the use of a variety of sensor technologies. In particular sensor notes have become cheaper and more efficient, and have even been integrated into day-to-day devices of use, such as mobile phones. This has lead to a much larger scale of applicability and mining of sensor data sets. The human-centric aspect of sensor data has created tremendous opportunities in integrating social aspects of sensor data collection into the mining process. Managing and Mining Sensor Data is a contributed volume by prominent leaders in this field, targeting advanced-level students in computer science as a secondary text book or reference. Practitioners and researchers working in this field will also find this book useful.




Data Preprocessing and Cleansing Through Missing Value Imputation Techniques


Book Description

With data-mining becoming a hot topic in the world of business and government, millions of dollars are being invested each year in data collection for the purpose of large scale analysis using data-mining techniques. Unfortunately, data collection is imperfect for a wide variety of reasons, which can mean this investment goes underutilised. In response to this, the field of data cleansing, particularly missing value imputation, has become increasingly important. In this dissertation, we review many existing techniques used in data mining, and a wide variety of missing value imputation techniques. This is followed by a journal article which explores some of the strengths and weaknesses in a newly categorised class of missing value imputation techniques, and gives the reader both a solid foundation in the field and a variety of research directions backed up by empirical analysis. Overall, the dissertation provides a comprehensive understanding of the past, present and future of the field of missing value imputation as it relates to data-mining, and provides a groundwork summary of the literature for both fields.




Knowledge Discovery and Data Mining: Challenges and Realities


Book Description

"This book provides a focal point for research and real-world data mining practitioners that advance knowledge discovery from low-quality data; it presents in-depth experiences and methodologies, providing theoretical and empirical guidance to users who have suffered from underlying low-quality data. Contributions also focus on interdisciplinary collaborations among data quality, data processing, data mining, data privacy, and data sharing"--Provided by publisher.




Data Mining in Public and Private Sectors: Organizational and Government Applications


Book Description

The need for both organizations and government agencies to generate, collect, and utilize data in public and private sector activities is rapidly increasing, placing importance on the growth of data mining applications and tools. Data Mining in Public and Private Sectors: Organizational and Government Applications explores the manifestation of data mining and how it can be enhanced at various levels of management. This innovative publication provides relevant theoretical frameworks and the latest empirical research findings useful to governmental agencies, practicing managers, and academicians.




Mind the Gap


Book Description

The application of predictive data mining techniques in Information Systems research has grown in recent years, likely due to their effectiveness and scalability in extracting information from large amounts of data. A number of scholars have sought to combine data mining with traditional econometric analyses. Typically, data mining methods are first used to generate new variables (e.g., text sentiment), which are added into subsequent econometric models as independent regressors. However, because prediction is almost always imperfect, variables generated from the first stage data mining models inevitably contain measurement error or misclassification. These errors, if ignored, can introduce systematic biases into the second stage econometric estimations and threaten the validity of statistical inference. In this commentary, we examine the nature of this bias, both analytically and empirically, and show that it can be severe even when data mining models exhibit relatively high performance. We then show that this bias becomes increasingly difficult to anticipate as the functional form of the measurement error grows more complex, or as the number of covariates in the econometric model increases. We review several methods for error correction and focus on two simulation-based methods, SIMEX and MC-SIMEX, which can be easily parameterized using standard performance metrics from data mining models, such as error variance or the confusion matrix, and can be applied under a wide range of econometric specifications. Finally, we demonstrate the effectiveness of SIMEX and MC-SIMEX by simulations and subsequent application of the methods to econometric estimations employing variables mined from three real world datasets related to travel, social networking, and crowdfunding campaign websites.




Networked Digital Technologies


Book Description

This two-volume-set (CCIS 293 and CCIS 294) constitutes the refereed proceedings of the International Conference on Networked Digital Technologies, NDT 2012, held in Dubai, UAE, in April 2012. The 96 papers presented in the two volumes were carefully reviewed and selected from 228 submissions. The papers are organized in topical sections on collaborative systems for e-sciences; context-aware processing and ubiquitous systems; data and network mining; grid and cloud computing; information and data management; intelligent agent-based systems; internet modeling and design; mobile, ad hoc and sensor network management; peer-to-peer social networks; quality of service for networked systems; semantic Web and ontologies; security and access control; signal processing and computer vision for networked systems; social networks; Web services.




Data Mining and Knowledge Discovery for Big Data


Book Description

The field of data mining has made significant and far-reaching advances over the past three decades. Because of its potential power for solving complex problems, data mining has been successfully applied to diverse areas such as business, engineering, social media, and biological science. Many of these applications search for patterns in complex structural information. In biomedicine for example, modeling complex biological systems requires linking knowledge across many levels of science, from genes to disease. Further, the data characteristics of the problems have also grown from static to dynamic and spatiotemporal, complete to incomplete, and centralized to distributed, and grow in their scope and size (this is known as big data). The effective integration of big data for decision-making also requires privacy preservation. The contributions to this monograph summarize the advances of data mining in the respective fields. This volume consists of nine chapters that address subjects ranging from mining data from opinion, spatiotemporal databases, discriminative subgraph patterns, path knowledge discovery, social media, and privacy issues to the subject of computation reduction via binary matrix factorization.




Transactions on Rough Sets V


Book Description

The LNCS journal Transactions on Rough Sets is devoted to the entire spectrum of rough sets related issues, from logical and mathematical foundations, through all aspects of rough set theory and its applications, such as data mining, knowledge discovery, and intelligent information processing, to relations between rough sets and other approaches to uncertainty, vagueness, and incompleteness, such as fuzzy sets and theory of evidence.This fifth volume of the Transactions on Rough Sets is dedicated to the monumental life, work and creative genius of Zdzis{l}aw Pawlak, the originator of rough sets, who passed away in April 2006. It opens with a commemorative article that gives a brief coverage of Pawlak's works in rough set theory, molecular computing, philosophy, painting and poetry. Fifteen papers explore the theory of rough sets in various domains as well as new applications of rough sets. In addition, this volume of the TRS includes a complete monograph on rough sets and approximate Boolean reasoning systems that includes both the foundations as well as applications of data mining.