Query-driven Analysis and Visualization for Large-scale Scientific Dataset Using Geometry Summarization and Bitmap Indexing


Book Description

The computational power of modern supercomputers is growing rapidly, enabling scientists to produce high-resolution datasets when simulating physical or weather models; these simulations typically generate extreme-scale data with multiple variables. However, storing, transmitting, and exploring such large-scale data is challenging. In the past decades, several visualization approaches have been developed to explore datasets effectively by displaying their underlying information. Query-driven visualization is one of the prominent approaches, as it significantly reduces visual exploration time by focusing only on interesting or important features for further analysis and decision making. However, as scientific datasets become too large, traditional data exploration approaches become ineffective. An emerging approach is to first create data summarizations that reduce the size of the dataset, and then perform data exploration on the summarizations. An ideal data summarization preserves the characteristics of the raw data as much as possible while keeping the size small. However, retrieving salient features from the raw data and creating such importance-based data summarizations is challenging. In this dissertation, we address the issues that need to be solved when applying query-driven analysis and visualization using data summarizations.
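The title of this dissertation names bitmap indexing, which the description does not spell out. As a rough illustration only, the following Python sketch shows how a binned bitmap index can answer a multi-variable range query with bitwise operations. The function names, bin edges, and sample values are all hypothetical and are not taken from the dissertation.

```python
def build_bitmap_index(values, bin_edges):
    """Build one bitmap per bin; bit i is set when values[i] falls in that bin."""
    bitmaps = [0] * (len(bin_edges) - 1)
    for i, v in enumerate(values):
        for b in range(len(bitmaps)):
            if bin_edges[b] <= v < bin_edges[b + 1]:
                bitmaps[b] |= 1 << i  # Python ints serve as arbitrary-length bitsets
                break
    return bitmaps


def query_bins(bitmaps, first_bin, last_bin):
    """OR together the bitmaps of a contiguous, inclusive range of bins."""
    result = 0
    for b in range(first_bin, last_bin + 1):
        result |= bitmaps[b]
    return result


def matching_indices(bitmap):
    """List the record indices whose bit is set."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical two-variable dataset with five records.
temperature = [12.0, 35.5, 28.1, 5.3, 31.0]
pressure = [1.1, 0.4, 0.9, 1.3, 0.2]

t_index = build_bitmap_index(temperature, [0, 10, 20, 30, 40])
p_index = build_bitmap_index(pressure, [0.0, 0.5, 1.0, 1.5])

# Query: temperature in [20, 40) AND pressure in [0.0, 0.5).
hits = matching_indices(query_bins(t_index, 2, 3) & query_bins(p_index, 0, 0))
print(hits)
```

The query is answered from the bitmaps alone, without rescanning the raw values. Production bitmap indexes usually compress the bitmaps as well, which this sketch omits.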




Distribution-based Exploration and Visualization of Large-scale Vector and Multivariate Fields


Book Description

Owing to the steady increase in computing power over the last few decades, the size of the data produced by scientific simulations has been growing rapidly. As a result, effective techniques to visualize and explore such large-scale scientific data are becoming more and more important for understanding the data. However, for data at such a large scale, effective analysis and visualization is a non-trivial task for several reasons. First, it is often time consuming and memory intensive to perform visualization and analysis directly on the original data. Second, as the data become large and complex, visualization usually suffers from visual cluttering and occlusion, which makes it difficult for users to understand the data. To address these challenges, this dissertation proposes a distribution-based query-driven framework to visualize and analyze large-scale scientific data. We propose to use statistical distributions to summarize large-scale data sets. The summarized data is then used in place of the original data to support efficient and interactive query-driven visualization, which is often free of occlusion. In this dissertation, the proposed framework is applied to flow fields and multivariate scalar fields. We first demonstrate the application of the proposed framework to flow fields. For a flow field, the statistical data summarization is computed from geometries such as streamlines and stream surfaces computed from the flow field. Stream surfaces and streamlines are two popular methods for visualizing flow fields. When the data size is large, distributed memory parallelism is usually needed. In this dissertation, a new scalable algorithm is proposed to compute stream surfaces from large-scale flow fields efficiently on distributed memory machines.
After we obtain a large number of computed streamlines or stream surfaces, a direct visualization of all the densely computed geometries is seldom useful due to visual cluttering and occlusion. To solve the visual cluttering problem, a distribution-based query-driven framework to explore those densely computed streamlines is presented. Then, the proposed framework is applied to multivariate scalar fields. When dealing with multivariate data, in order to understand the data, it is often useful to show the regions of interest based on user specified criteria. In the presence of large-scale multivariate data, efficient techniques to summarize the data and answer users’ queries are needed. In this dissertation, we first propose to use multivariate histograms to summarize the data and demonstrate how effective query-driven visualization can be achieved based on those multivariate histograms. However, storing multivariate histograms in the form of multi-dimensional arrays is very expensive. To enable efficient visualization and exploration of multivariate data sets, we present a compact structure to store multivariate histograms to reduce their huge space cost while supporting different kinds of histogram query operations efficiently. We also present an interactive system to assist users to effectively design multivariate transfer functions. Multiple regions of interest could be highlighted through multivariate volume rendering based on the user specified multivariate transfer function.
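To illustrate why a compact structure helps, the following minimal Python sketch stores a multivariate histogram sparsely, keeping only non-empty bins, and answers a range query over bin indices. It is an assumption-laden illustration, not the structure proposed in the dissertation; the function names, bin counts, and sample data are hypothetical.

```python
from collections import Counter


def build_sparse_histogram(samples, mins, maxs, bins):
    """Count samples per multi-dimensional bin, storing only non-empty bins."""
    hist = Counter()
    for sample in samples:
        key = []
        for value, lo, hi, n in zip(sample, mins, maxs, bins):
            # Map the value to a bin index and clamp it into [0, n - 1].
            idx = int((value - lo) / (hi - lo) * n)
            key.append(min(max(idx, 0), n - 1))
        hist[tuple(key)] += 1
    return hist


def range_query(hist, bin_ranges):
    """Sum the counts of bins whose index in every variable lies in [lo, hi]."""
    return sum(
        count
        for key, count in hist.items()
        if all(lo <= k <= hi for k, (lo, hi) in zip(key, bin_ranges))
    )


# Hypothetical two-variable samples, binned on a 4 x 4 grid.
samples = [(0.1, 10.0), (0.15, 12.0), (0.9, 95.0), (0.5, 50.0)]
hist = build_sparse_histogram(
    samples, mins=(0.0, 0.0), maxs=(1.0, 100.0), bins=(4, 4)
)

# Count the samples that are low in both variables (bin 0 in each dimension).
low_low = range_query(hist, [(0, 0), (0, 0)])
print(len(hist), low_low)
```

A dense 4 x 4 array would hold 16 cells, while the sparse form keeps only the occupied ones. The gap widens quickly as the number of variables grows, which is the cost the text refers to.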




Query-driven Visualization Strategies for the Analysis and Visualization of Large, Complex Datasets


Book Description

There is an urgent need in scientific communities, driven by their ability to generate ever-larger, increasingly complex data, for scalable analysis methods that rapidly identify salient trends in scientific data. Query-Driven Visualization (QDV) methods are among the small subset of techniques that are able to address both large and highly complex datasets, e.g., multivariate, multitemporal, and multiresolution representations of scalar, vector, and function field data. This dissertation presents new methods that either directly extend the utility and accelerate the performance of QDV as a whole, or enable QDV's substantial and flexible analysis strengths to be applied to new areas of scientific research. The first part of this dissertation presents a new data-parallel strategy that accelerates the most fundamental task performed by QDV: the evaluation of user defined, ad hoc queries. The second part of this dissertation extends QDV strategies to analyze and visualize time-varying adaptive mesh refinement (AMR) data. AMR techniques are used in many scientific communities to efficiently and accurately model complex, continuous physical phenomena. By extending QDV methods to address the dynamic spatiotemporal properties of time-varying AMR data, I provide scientists with a powerful tool for visually analyzing the data generated from these important simulations. The final part of this dissertation leverages statistical analysis methods to generate deeper insight into the regions that are selected by a user's query. In this effort I introduce two new methods that increase the utility of query-driven strategies. The first strategy uses correlation fields, created between pairs of variables, in conjunction with the cumulative distribution functions (CDF) of variables expressed in a user's query. This strategy identifies important variable interactions within query regions.
The second strategy forms a statistical-based segmentation within the query-region to generate deeper insight into the "statistical structure" of a user's query. In this approach, segments indicate which variable contributes most to the underlying joint density distribution of the user's query. These segments, when used in conjunction with each variable's CDF, intuitively aid users in refining the constraints over the variables in their query.




Enhancing Query-driven Visualization with Distribution Information


Book Description

Scientific visualization is concerned with the graphical portrayal of data. Using symbols, color, and natural perceptual cues, humans gain insight into collections of raw numbers that may not be as efficiently processed in non-graphical formats. For simple data, visualization may require only a simple mapping between numeric values and a color scale. But modern scientific, economic, and social data is far from simple. Multiple variables, duration of time, fine resolutions, and wide sampling have yielded data sets of unprecedented complexity. The mapping between such data and its visual appearance is difficult to define. The works described here attempt to make defining this mapping more manageable to users of visualization software. We describe three tools to assist users in finding features of interest in familiar, probabilistic terms. The first tool is an expressive query language in which the user can describe features using criteria based on the frequency distribution of values in local neighborhoods. The query language is closely integrated with the visualization to provide meaningful insight into the matching data and feedback to guide modification of query parameters. The second tool targets the investigation of high-dimensional aspatial data. A common technique called parallel coordinates is extended to three dimensions in order to make relationships between variables more apparent. A semiautomatic method of finding compelling viewpoints of the data in 3-D space is introduced. The user defines what features are compelling in terms of a view's image space distribution of color and depth values. Lastly, we describe a novel frequency histogram called a periograph for displaying cyclic temporal data. Through periographs, users inspect seasonal trends and their variations through time.
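As a rough illustration of the first tool's idea, the following Python sketch selects positions in a 1-D signal by the frequency distribution of values in their local neighborhood rather than by the value at the position alone. The function names, the 1-D setting, and the thresholds are hypothetical simplifications and do not reproduce the query language described above.

```python
def neighborhood_fraction(values, center, radius, lo, hi):
    """Fraction of values within `radius` of `center` that fall in [lo, hi)."""
    start = max(0, center - radius)
    stop = min(len(values), center + radius + 1)
    window = values[start:stop]
    return sum(lo <= v < hi for v in window) / len(window)


def distribution_query(values, radius, lo, hi, min_fraction):
    """Return positions whose neighborhood has at least `min_fraction` of its
    values inside [lo, hi)."""
    return [
        i
        for i in range(len(values))
        if neighborhood_fraction(values, i, radius, lo, hi) >= min_fraction
    ]


# Hypothetical signal: a short run of high values surrounded by low ones.
signal = [0.1, 0.2, 0.9, 0.8, 0.95, 0.1, 0.2, 0.1]

# Query: positions where at least 60% of the nearby values are in [0.75, 1.0).
matches = distribution_query(signal, radius=1, lo=0.75, hi=1.0, min_fraction=0.6)
print(matches)
```

Because the criterion is a proportion over a neighborhood, a position can match even when its own value is borderline, which is what lets users phrase features in probabilistic terms.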




High Performance Visualization Using Query-Driven Visualization and Analytics


Book Description

Query-driven visualization and analytics is a unique approach for high-performance visualization that offers new capabilities for knowledge discovery and hypothesis testing. The new capabilities, akin to finding needles in haystacks, are the result of combining technologies from the fields of scientific visualization and scientific data management. This approach is crucial for rapid data analysis and visualization in the petascale regime. This article describes how query-driven visualization is applied to a hero-sized network traffic analysis problem.




Scalable Extraction and Visualization of Scientific Features with Load-balanced Parallelism


Book Description

Extracting and visualizing features from scientific data can help scientists derive valuable insights. An extraction and visualization pipeline usually includes three steps: (1) scientific feature detection, (2) union-find for features' connected component labeling, and (3) visualization and analysis. As the scale of scientific data generated by experiments and simulations grows, it becomes a common practice to use distributed computing to handle large-scale data with data-parallelism, where data is partitioned and distributed over parallel processors. Three challenges arise for feature extraction and visualization on scientific applications. First, traditional feature detectors may not be effective and robust enough to capture features of interest across different scientific settings, because scientific features usually are highly nonlinear and recognized by domain scientists' soft knowledge. Second, existing union-find algorithms are either serial or not scalable enough to deal with extreme-scale datasets generated in the modern era. Third, existing parallel feature extraction and visualization algorithms fail to automatically reduce communication costs when optimizing the performance of processing units. This dissertation studies scalable scientific feature extraction and visualization to tackle the three challenges. First, we design human-centric interactive visual analytics based on scientists' requirements to address domain-specific feature detection and tracking. We focus on an essential problem in earth sciences: spatiotemporal analysis of viscous and gravitational fingers. Viscous and gravitational flow instabilities cause a displacement front to break up into finger-like fluids. Previously, scientists mainly detected the finger features using density thresholding, where scientists specify certain density thresholds and extract super-level sets from input density scalar fields. 
However, the results of density thresholding are sensitive to the selected threshold values, and a few single threshold values are usually not sufficient to extract and track satisfactory time-varying finger features. In our study, scientists can detect and visualize spatiotemporal fingers interactively to elucidate the dynamics of the flow instabilities. Our study has two main contributions. (1) We propose a ridge-guided detection to extract the curvilinear geometry and branching topology of fingers, which provides richer geometric structures than density thresholding. (2) We devise an interactive visual-analytics system with geometric-glyph augmented tracking graphs to allow scientists to navigate how the fingers and their branches grow, merge, and split over both space and time. Feedback from earth scientists demonstrates the efficacy of our approach for spatiotemporal geometry-driven analyses of fingers. Second, we improve the scalability of union-find algorithms using asynchronous and load-balanced parallelism. Union-find is widely used in scientific feature extraction and visualization techniques, such as tracking critical points and extracting level sets. However, distributed and parallel union-find can suffer from high synchronization costs and imbalanced workloads of participating processors. In our study, we present a novel distributed union-find algorithm that features asynchronous parallelism and k-d tree based load balancing for scalable scientific feature extraction and visualization. We prove that global synchronizations in existing distributed union-find can be eliminated without changing final results, allowing overlapped communications and computations for scalable processing. We also use a k-d tree decomposition to redistribute inputs in order to improve workload balancing. We benchmark the scalability of our algorithm with up to 1,024 processors using both synthetic and application data.
We demonstrate the use of our algorithm in critical point tracking and super-level set extraction with high-speed imaging experiments and fusion plasma simulations, respectively. Third, we account for communication costs in parallel algorithm design. We explore an online reinforcement learning (RL) paradigm to optimize parallel particle tracing performance dynamically in distributed-memory systems while reducing I/O and communication costs. Our method combines three novel components: (1) a workload donation model, (2) a high-order workload estimation model, and (3) a communication cost model. First, our RL-based workload donation model monitors the workloads of processors and creates RL agents to donate particles and data blocks from high-workload processors to low-workload processors to minimize the execution time. The RL agents learn the donation strategy on-the-fly based on reward and cost functions. The reward and cost functions are designed to consider processors' workload changes and data transfer costs for every donation action. Second, we propose an online workload estimation model to help our RL model estimate the workload distribution of processors in future computations. Third, we use the communication cost model, which considers both block and particle data exchange costs, to help the agents make effective decisions with minimized communication costs. We demonstrate that our algorithm adapts to different flow behaviors in large-scale fluid dynamics, ocean, and weather simulation data. Our algorithm improves parallel particle tracing performance in terms of parallel efficiency, load balance, and costs of I/O and communication for evaluations up to 16,384 processors.
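For readers unfamiliar with the second step of the pipeline, the following Python sketch shows a serial union-find used to label the connected components of a super-level set extracted by thresholding. It is a textbook single-process version with hypothetical names and data, not the asynchronous, distributed algorithm described above.

```python
def find(parent, x):
    """Return the root of x, compressing the path along the way."""
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def union(parent, a, b):
    """Merge the sets containing a and b; the smaller root becomes the root."""
    ra, rb = find(parent, a), find(parent, b)
    if ra != rb:
        parent[max(ra, rb)] = min(ra, rb)


def label_super_level_set(field, threshold):
    """Label the 4-connected components of cells with value >= threshold."""
    rows, cols = len(field), len(field[0])
    selected = {
        (r, c)
        for r in range(rows)
        for c in range(cols)
        if field[r][c] >= threshold
    }
    parent = {cell: cell for cell in selected}
    for r, c in selected:
        # Checking the lower and right neighbors covers every adjacent pair once.
        for neighbor in ((r + 1, c), (r, c + 1)):
            if neighbor in selected:
                union(parent, (r, c), neighbor)
    return {cell: find(parent, cell) for cell in selected}


# Hypothetical density field containing two separate high-density features.
density = [
    [0.9, 0.8, 0.1, 0.0],
    [0.1, 0.7, 0.1, 0.9],
    [0.0, 0.1, 0.1, 0.8],
]
labels = label_super_level_set(density, threshold=0.5)
num_features = len(set(labels.values()))
print(num_features)
```

In a distributed setting the grid is partitioned across processors, so unions that cross partition boundaries require communication. That is where the synchronization and load-balancing costs discussed above arise.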




Graphics of Large Datasets


Book Description

This book shows how to look at ways of visualizing large datasets, whether large in numbers of cases, or large in numbers of variables, or large in both. All ideas are illustrated with displays from analyses of real datasets and the importance of interpreting displays effectively is emphasized. Graphics should be drawn to convey information and the book includes many insightful examples. New approaches to graphics are needed to visualize the information in large datasets and most of the innovations described in this book are developments of standard graphics. The book is accessible to readers with some experience of drawing statistical graphics.




An Application of Multivariate Statistical Analysis for Query-Driven Visualization


Book Description

Abstract: Driven by the ability to generate ever-larger, increasingly complex data, there is an urgent need in the scientific community for scalable analysis methods that can rapidly identify salient trends in scientific data. Query-Driven Visualization (QDV) strategies are among the small subset of techniques that can address both large and highly complex datasets. This paper extends the utility of QDV strategies with a statistics-based framework that integrates non-parametric distribution estimation techniques with a new segmentation strategy to visually identify statistically significant trends and features within the solution space of a query. In this framework, query distribution estimates help users to interactively explore their query's solution and visually identify the regions where the combined behavior of constrained variables is most important, statistically, to their inquiry. Our new segmentation strategy extends the distribution estimation analysis by visually conveying the individual importance of each variable to these regions of high statistical significance. We demonstrate the analysis benefits these two strategies provide and show how they may be used to facilitate the refinement of constraints over variables expressed in a user's query. We apply our method to datasets from two different scientific domains to demonstrate its broad applicability.







Concept-driven Visualization for Terascale Data Analytics


Book Description

Over the past couple of decades the volume of scientific data has exploded. The science community has since been facing the common problem of being drowned in data, and yet starved of information. Identification and extraction of meaningful features from large data sets has become one of the central problems of scientific research, for both simulation and sensor data sets. The problems at hand are manifold and need to be addressed concurrently to provide scientists with the necessary tools, methods, and systems. Firstly, the underlying data structures and management need to be optimized for the kind of data most commonly used in scientific research, i.e. terascale time-varying, multi-dimensional, multi-variate data on potentially non-uniform grids. This implies avoidance of data duplication, utilization of a transparent query structure, and use of sophisticated underlying data structures and algorithms. Secondly, in the case of scientific data sets, simplistic queries are not a sufficient method to describe subsets or features. For time-varying data sets, many features can generally be described as local events, i.e. spatially and temporally limited regions with characteristic properties in value space. While scientists most often know quite well what they are looking for in a data set, at times they cannot formally or definitively describe their concept to computer science experts, especially when it is based on partially substantiated knowledge. Scientists need to be enabled to query and extract such features or events directly, without having to rewrite their hypothesis in an inadequately simple query language. Thirdly, tools to analyze the quality and sensitivity of these event queries themselves are required. Understanding local data sensitivity is a necessity for enabling scientists to refine query parameters as needed to produce more meaningful findings.
Query sensitivity analysis can also be utilized to establish trends for event-driven queries, i.e. how does the query sensitivity differ between locations and over a series of data sets. In this dissertation, we present an approach to apply these interdependent measures to aid scientists in better understanding their data sets. An integrated system containing all of the above tools and system parts is presented.