The Estimation and Inference of Complex Models


Book Description

In this thesis, we investigate estimation and inference problems for complex models. We emphasize two major categories of complex models: generalized linear models and time series models. For generalized linear models, we consider a fundamental problem, sure screening for interaction terms in ultra-high dimensional feature space; for time series models, we consider an important model assumption, the Markov property. The first part of this thesis addresses the significant interaction pursuit problem for ultra-high dimensional models with two-way interaction effects. We propose a simple sure screening procedure (SSI) to detect significant interactions between the explanatory variables and the response variable in high or ultra-high dimensional generalized linear regression models. Sure screening is a simple but powerful tool for the first step of feature or variable selection for ultra-high dimensional data. We investigate the sure screening properties of the proposed method from a theoretical standpoint. Furthermore, we show that our proposed method can control the false discovery rate at a reasonable size, so regularized variable selection methods can then be applied to obtain more accurate feature selection in subsequent model selection procedures. Moreover, for computational efficiency, we suggest a much more efficient algorithm, discretized SSI (DSSI), to realize the proposed sure screening method in practice. We investigate the properties of the two algorithms, SSI and DSSI, in simulation studies and apply them to real data analyses for illustration. The second part concerns testing the Markov property in time series processes. The Markov assumption plays an extremely important role in time series analysis and is also a fundamental assumption in economic and financial models. However, little existing research has focused on how to test the Markov property in time series processes. We therefore propose a new procedure to test whether a beta-mixing time series possesses the Markov property. Our test is based on the conditional distance covariance (CDCov). We investigate the theoretical properties of the proposed method: the asymptotic distribution of the test statistic under the null hypothesis is obtained, and the power of the test under local alternative hypotheses is studied. Simulation studies demonstrate the finite sample performance of our test.
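
To make the screening idea concrete, here is a minimal Python sketch of a marginal interaction-screening step: rank all pairwise products of features by the magnitude of their marginal correlation with the response and keep the strongest pairs. The function name and ranking rule are illustrative assumptions only; the thesis's SSI/DSSI procedures are defined for generalized linear models and carry theoretical guarantees this toy version does not claim.

```python
import numpy as np

def screen_interactions(X, y, top_k=20):
    """Toy stand-in for marginal interaction screening: score each pair
    (j, k) by |corr(X_j * X_k, y)| and return the top_k pairs."""
    n, p = X.shape
    yc = y - y.mean()
    scores = []
    for j in range(p):
        for k in range(j + 1, p):
            z = X[:, j] * X[:, k]
            z = (z - z.mean()) / (z.std() + 1e-12)
            scores.append((abs(z @ yc) / n, (j, k)))
    scores.sort(reverse=True)
    return [pair for _, pair in scores[:top_k]]

# Toy usage: 200 observations, 50 features, one true interaction.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = 2.0 * X[:, 3] * X[:, 7] + rng.standard_normal(200)
print(screen_interactions(X, y, top_k=5))  # (3, 7) should rank near the top
```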




Targeted Learning in Data Science


Book Description

This textbook for graduate students in statistics, data science, and public health deals with the practical challenges that come with big, complex, and dynamic data. It presents a scientific roadmap to translate real-world data science applications into formal statistical estimation problems by using the general template of targeted maximum likelihood estimators. These targeted machine learning algorithms estimate quantities of interest while still providing valid inference. Targeted learning methods within data science are a critical component for solving scientific problems in the modern age. The techniques can answer complex questions including optimal rules for assigning treatment based on longitudinal data with time-dependent confounding, as well as other estimands in dependent data structures, such as networks. Included in Targeted Learning in Data Science are demonstrations with software packages and real data sets that present a case that targeted learning is crucial for the next generation of statisticians and data scientists. This book is a sequel to the first textbook on machine learning for causal inference, Targeted Learning, published in 2011. Mark van der Laan, PhD, is the Jiann-Ping Hsu/Karl E. Peace Professor of Biostatistics and Statistics at UC Berkeley. His research interests include statistical methods in genomics, survival analysis, censored data, machine learning, semiparametric models, causal inference, and targeted learning. Dr. van der Laan received the 2004 Mortimer Spiegelman Award, the 2005 Van Dantzig Award, the 2005 COPSS Snedecor Award, the 2005 COPSS Presidential Award, and has graduated over 40 PhD students in biostatistics and statistics. Sherri Rose, PhD, is Associate Professor of Health Care Policy (Biostatistics) at Harvard Medical School. Her work is centered on developing and integrating innovative statistical approaches to advance human health. Dr. Rose’s methodological research focuses on nonparametric machine learning for causal inference and prediction. She co-leads the Health Policy Data Science Lab and currently serves as an associate editor for the Journal of the American Statistical Association and Biostatistics.
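
As a flavor of what a targeted estimator does, the following is a bare-bones Python sketch of a one-step TMLE for the average treatment effect with a binary outcome. It uses plain logistic regressions where the book would use Super Learner, and all modeling choices below are simplifying assumptions for illustration, not the book's recommended workflow.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

def tmle_ate(W, A, Y):
    """Bare-bones one-step TMLE for the average treatment effect of a
    binary treatment A on a binary outcome Y, given covariates W."""
    expit = lambda x: 1 / (1 + np.exp(-x))
    logit = lambda p: np.log(p / (1 - p))
    # 1) Initial outcome regression Q(A, W) = E[Y | A, W].
    q = LogisticRegression(max_iter=1000).fit(np.column_stack([A, W]), Y)
    clip = lambda p: np.clip(p, 1e-3, 1 - 1e-3)
    QA = clip(q.predict_proba(np.column_stack([A, W]))[:, 1])
    Q1 = clip(q.predict_proba(np.column_stack([np.ones_like(A), W]))[:, 1])
    Q0 = clip(q.predict_proba(np.column_stack([np.zeros_like(A), W]))[:, 1])
    # 2) Propensity score g(W) = P(A = 1 | W), bounded away from 0 and 1.
    g = np.clip(LogisticRegression(max_iter=1000).fit(W, A)
                .predict_proba(W)[:, 1], 0.01, 0.99)
    # 3) Fluctuation step: regress Y on the "clever covariate"
    #    H = A/g - (1-A)/(1-g) with offset logit(Q).
    H = (A / g - (1 - A) / (1 - g)).reshape(-1, 1)
    eps = sm.GLM(Y, H, offset=logit(QA),
                 family=sm.families.Binomial()).fit().params[0]
    # 4) Targeted update of the counterfactual predictions, then plug in.
    return np.mean(expit(logit(Q1) + eps / g) - expit(logit(Q0) - eps / (1 - g)))

# Toy data with confounding through W.
rng = np.random.default_rng(1)
n = 2000
W = rng.standard_normal((n, 2))
A = rng.binomial(1, 1 / (1 + np.exp(-W[:, 0])))
Y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 * A + W[:, 1]))))
print(tmle_ate(W, A, Y))
```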




Complex Models and Computational Methods in Statistics


Book Description

The use of computational methods in statistics to tackle complex problems and high-dimensional data, as well as the widespread availability of computer technology, is nothing new. The range of applications, however, is unprecedented. As often occurs, new and complex data types require new strategies, demanding the development of novel statistical methods and suggesting stimulating mathematical problems. This book is addressed to researchers working at the forefront of the statistical analysis of complex systems and using computationally intensive statistical methods.




Advances in Complex Data Modeling and Computational Methods in Statistics


Book Description

The book is addressed to statisticians working at the forefront of the statistical analysis of complex and high dimensional data and offers a wide variety of statistical models, computer-intensive methods and applications: network inference from the analysis of high dimensional data; new developments for bootstrapping complex data; regression analysis for measuring downside reputational risk; statistical methods for research on the human genome dynamics; inference in non-Euclidean settings and for shape data; Bayesian methods for reliability and the analysis of complex data; methodological issues in using administrative data for clinical and epidemiological research; regression models with differential regularization; geostatistical methods for mobility analysis through mobile phone data exploration. This volume is the result of a careful selection among the contributions presented at the conference "S.Co.2013: Complex data modeling and computationally intensive methods for estimation and prediction" held at the Politecnico di Milano in 2013. All the papers published here have been rigorously peer-reviewed.




Model-Free Prediction and Regression


Book Description

The Model-Free Prediction Principle expounded upon in this monograph is based on the simple notion of transforming a complex dataset to one that is easier to work with, e.g., i.i.d. or Gaussian. As such, it restores the emphasis on observable quantities, i.e., current and future data, as opposed to unobservable model parameters and estimates thereof, and yields optimal predictors in diverse settings such as regression and time series. Furthermore, the Model-Free Bootstrap takes us beyond point prediction in order to construct frequentist prediction intervals without resort to unrealistic assumptions such as normality. Prediction has been traditionally approached via a model-based paradigm, i.e., (a) fit a model to the data at hand, and (b) use the fitted model to extrapolate/predict future data. Due to both mathematical and computational constraints, 20th century statistical practice focused mostly on parametric models. Fortunately, with the advent of widely accessible powerful computing in the late 1970s, computer-intensive methods such as the bootstrap and cross-validation freed practitioners from the limitations of parametric models, and paved the way towards the "big data" era of the 21st century. Nonetheless, there is a further step one may take, i.e., going beyond even nonparametric models; this is where the Model-Free Prediction Principle is useful. Interestingly, being able to predict a response variable Y associated with a regressor variable X taking on any possible value seems to inadvertently also achieve the main goal of modeling, i.e., trying to describe how Y depends on X. Hence, as prediction can be treated as a by-product of model-fitting, key estimation problems can be addressed as a by-product of being able to perform prediction. In other words, a practitioner can use Model-Free Prediction ideas in order to additionally obtain point estimates and confidence intervals for relevant parameters leading to an alternative, transformation-based approach to statistical inference.
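
The flavor of the approach can be caricatured in a few lines of Python: transform the data to (approximately) i.i.d. residuals via a nonparametric fit, resample them, and read off a prediction interval. This residual-based sketch is an assumption-laden simplification; the book's Model-Free Prediction Principle transforms the data themselves, not just residuals, and the smoother and bandwidth below are invented for illustration.

```python
import numpy as np

def modelfree_style_interval(x, y, x0, h=0.5, B=2000, alpha=0.1, seed=0):
    """Prediction interval at x0: fit a kernel smoother, reduce the data
    to (approximately) i.i.d. residuals, resample those residuals, and
    take quantiles of the resampled predictions."""
    rng = np.random.default_rng(seed)
    def m_hat(t):
        # Nadaraya-Watson estimate of E[Y | X = t] with a Gaussian kernel.
        w = np.exp(-0.5 * ((x - t) / h) ** 2)
        return np.sum(w * y) / np.sum(w)
    resid = y - np.array([m_hat(t) for t in x])   # the "i.i.d.-ifying" step
    draws = m_hat(x0) + rng.choice(resid, size=B, replace=True)
    return tuple(np.quantile(draws, [alpha / 2, 1 - alpha / 2]))

# Toy usage on a nonlinear signal with additive noise.
rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 300)
y = np.sin(2 * x) + 0.3 * rng.standard_normal(300)
print(modelfree_style_interval(x, y, x0=0.5))
```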




Estimation and Inference in High-dimensional Models


Book Description

A wide variety of problems encountered in different fields can be formulated as inference problems. Common examples of such inference problems include estimating parameters of a model from some observations, inverse problems where an unobserved signal is to be estimated based on a given model and some measurements, or a combination of the two, where hidden signals along with some parameters of the model are to be estimated jointly. For example, various tasks in machine learning such as image inpainting and super-resolution can be cast as inverse problems over deep neural networks. Similarly, in computational neuroscience, a common task is to estimate the parameters of a nonlinear dynamical system from neuronal activities. Despite the wide application of different models and algorithms to solve these problems, our theoretical understanding of how these algorithms work is often incomplete. In this work, we try to bridge the gap between theory and practice by providing theoretical analysis of three different estimation problems. First, we consider the problem of estimating the input and hidden layer signals in a given multi-layer stochastic neural network with all the signals being matrix valued. Various problems, such as multitask regression and classification, and inverse problems that use deep generative priors, can be modeled as inference problems over multi-layer neural networks. We consider different types of estimators for such problems and exactly analyze the performance of these estimators in a certain high-dimensional regime known as the large system limit. Our analysis allows us to obtain the estimation error of all the hidden signals in the deep neural network as expectations over low-dimensional random variables that are characterized via a set of equations called the state evolution. Next, we analyze the problem of estimating a signal from convolutional observations via ridge estimation. Such convolutional inverse problems arise naturally in several fields such as imaging and seismology. The shared weights of the convolution operator introduce dependencies in the observations that make analysis of such estimators difficult. By looking at the problem in the Fourier domain and using results about the Fourier transform of a class of random processes, we show that this problem can be reduced to the analysis of multiple ordinary ridge estimators, one for each frequency. This allows us to write the estimation error of the ridge estimator as an integral that depends on the spectrum of the underlying random process that generates the input features. Finally, we conclude this work by considering the problem of estimating the parameters of a multi-dimensional autoregressive generalized linear model with discrete values. Such processes take a linear combination of the past outputs of the process as the mean parameter of a generalized linear model that generates the future values. The coefficients of the linear combination are the parameters of the model, and we seek to estimate these parameters under the assumption that they are sparse. This model can be used, for example, to model the spiking activity of neurons. In this problem, we obtain a high-probability upper bound for the estimation error of the parameters. Our experiments further support these theoretical results.
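
The second reduction is concrete enough to sketch. Assuming circular (periodic) convolution, the DFT diagonalizes the convolution operator, so the ridge problem decouples into one scalar problem per frequency. The Python toy below illustrates that reduction only; the dissertation's analysis goes much further, characterizing the estimation error as an integral over the spectrum of the input process.

```python
import numpy as np

def circular_ridge(y, k, lam):
    """Ridge estimate of x from y = k (*) x + noise, where (*) is circular
    convolution.  The DFT diagonalizes circular convolution, so the ridge
    solution is computed frequency by frequency:
        X_hat(f) = conj(K(f)) * Y(f) / (|K(f)|^2 + lam)."""
    K, Yf = np.fft.fft(k), np.fft.fft(y)
    return np.real(np.fft.ifft(np.conj(K) * Yf / (np.abs(K) ** 2 + lam)))

# Toy usage: recover a signal from a blurred, noisy observation.
rng = np.random.default_rng(3)
n = 256
x = rng.standard_normal(n)
k = np.zeros(n)
k[:5] = [0.5, 0.25, 0.12, 0.08, 0.05]                     # short blur kernel
y = np.real(np.fft.ifft(np.fft.fft(k) * np.fft.fft(x)))   # k (*) x
y += 0.1 * rng.standard_normal(n)
print(np.mean((circular_ridge(y, k, lam=0.1) - x) ** 2))  # reconstruction MSE
```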




Maximum Likelihood Estimation and Inference


Book Description

This book takes a fresh look at the popular and well-established method of maximum likelihood for statistical estimation and inference. It begins with an intuitive introduction to the concepts and background of likelihood, and moves through to the latest developments in maximum likelihood methodology, including general latent variable models and new material on the practical implementation of integrated likelihood using the free ADMB software. Fundamental issues of statistical inference are also examined, with a presentation of some of the philosophical debates underlying the choice of statistical paradigm. Key features:
- Provides an accessible introduction to pragmatic maximum likelihood modelling.
- Covers more advanced topics, including general forms of latent variable models (including non-linear and non-normal mixed-effects and state-space models) and the use of maximum likelihood variants, such as estimating equations, conditional likelihood, restricted likelihood and integrated likelihood.
- Adopts a practical approach, with a focus on providing the relevant tools required by researchers and practitioners who collect and analyze real data.
- Presents numerous examples and case studies across a wide range of applications including medicine, biology and ecology.
- Features applications from a range of disciplines, with implementation in R, SAS and/or ADMB.
- Provides all program code and software extensions on a supporting website.
- Confines supporting theory to the final chapters to maintain a readable and pragmatic focus in the preceding chapters.
This book is not just an accessible and practical text about maximum likelihood; it is a comprehensive guide to modern maximum likelihood estimation and inference. It will be of interest to readers at all levels, from novice to expert, and of great benefit to researchers and to students of statistics from senior undergraduate to graduate level. For use as a course text, exercises are provided at the end of each chapter.
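
For readers new to the idea, maximum likelihood in its computational form is simply numerical optimization of a log-likelihood, with standard errors read off from the curvature at the optimum. The book's examples are implemented in R, SAS and ADMB; the toy below shows the same recipe in Python for an exponential sample (the parameterization and data are invented for illustration).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(4)
t = rng.exponential(scale=0.5, size=500)   # waiting times, true rate 2.0

def neg_log_lik(theta):
    """Negative log-likelihood of an exponential sample with rate
    exp(theta); the log parameterization keeps the search unconstrained."""
    rate = np.exp(theta[0])
    return -(t.size * np.log(rate) - rate * t.sum())

fit = minimize(neg_log_lik, x0=np.array([0.0]))          # BFGS by default
rate_hat = float(np.exp(fit.x[0]))
se_rate = rate_hat * float(np.sqrt(fit.hess_inv[0, 0]))  # delta method
print(f"MLE {rate_hat:.3f} (SE {se_rate:.3f}); closed form {1 / t.mean():.3f}")
```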




Semi-Parametric Estimation in Network Data and Tools for Conducting Complex Simulation Studies in Causal Inference


Book Description

This dissertation is concerned with the application of robust semi-parametric methods to problems of estimation in network-dependent data and with the conduct of large-scale simulation studies for causal inference research in epidemiological and medical data. Specifically, Chapter 1 presents a modern semi-parametric approach to estimation of causal effects in a population connected by a single social network. The connectivity of the population units will typically imply that the observed data on these units are no longer independent and identically distributed. Moreover, such social settings typically result in high-dimensional data. This chapter contributes to current statistical methodology by presenting an approach that allows valid estimation and inference and addresses the statistical issues specific to such networked population datasets. The framework of semi-parametric estimation, called targeted maximum likelihood estimation (TMLE), is presented. This framework improves upon existing methods by offering robustness, weakened sensitivity to near-positivity violations, and the ability to deal with the high-dimensionality issues of social network data. In particular, this approach relies on an accurate reflection of the background knowledge available for a given scientific problem, allowing estimation and inference without having to make unrealistic assumptions about the structure of the data. In addition, this chapter generalizes previous work describing estimation of complex causal parameters, such as the direct treatment effects under interference and the causal effects of interventions on social network structure. Although the past decade has produced many contributions towards estimation of causal effects in social network settings, there has been considerably less research on the topic of variance estimation for such highly dependent data. This chapter presents an approach to constructing valid inference, providing a variance estimator that is scalable to very large datasets with highly connected observations. An efficient open-source software implementation of these methods also accompanies this chapter. Chapter 2 presents open-source software tools for the conduct of reproducible simulation studies for complex parameters that emerge from the application of causal inference methods in epidemiological and medical research. This simulation software is built on the framework of non-parametric structural equation modeling. This chapter also studies simulation-based testing of statistical methods in causal inference for longitudinal data with time-varying exposure and confounding. It contributes to the existing literature by presenting a unified syntax for non-parametrically defining complex causal parameters, which can be used as a model-free and agnostic gold standard for the comparison of different statistical methods for causal inference. For instance, this chapter provides various examples of the specification and evaluation of causal parameters that arise naturally in longitudinal causal effect analyses when using marginal structural models (MSMs). The application of these newly developed software tools to the replication of several previously published simulation studies in causal inference is also described. Chapter 3 builds on the work described in Chapter 2 and addresses the issue of dependent-data simulation for causal inference research in social network data.
In particular, it provides a model-free approach to test the validity of various estimation procedures in simulated network settings. This chapter first outlines a non-parametric causal model for units connected by a network and provides various applied examples of simulations with social network data. This chapter also showcases a possible application of the highly scalable open-source software implementation of the semi-parametric estimation methods described in Chapter 1. In particular, a large-scale social network simulation study is described, and the performance of three dependent-data estimators from Chapter 1 is examined. This simulation study also examines the problem of inference for network-dependent data, specifically by comparing the performance of the dependent-data TMLE variance estimator from Chapter 1 to the true TMLE variance derived from simulations. Finally, Chapter 3 concludes with a simulation study of an HIV epidemic, described in terms of a longitudinal process which evolves over a static network in discrete time steps among several highly interconnected communities. The abstracts of the three works which make up this dissertation are reproduced below. Chapter 1: This chapter describes a robust semi-parametric approach towards estimation and inference for the sample average treatment-specific mean in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many of the currently used statistical methods rely on the assumption of a specific parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in observational network data settings, resulting in invalid and anti-conservative statistical inference. In this chapter, we rely on recent methodological advances in targeted maximum likelihood estimation (TMLE) for data collected on a single population of causally connected units to describe an estimation approach that permits more realistic classes of data-generative models and provides valid statistical inference in the context of such network-dependent data. The approach is applied to an observational setting with a single-time-point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semi-parametric statistical models. We then impose some restrictions on the possible set of data-generative distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical IID data distribution, where the latter distribution is a function of the observed network data-generating distribution. With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain IID-data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study.
We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure. Chapter 2: This chapter introduces the simcausal R package, an open-source software tool for the specification and simulation of complex longitudinal data structures that are based on non-parametric structural equation models. The package aims to provide a flexible tool for simplifying the conduct of transparent and reproducible simulation studies, with a particular emphasis on the types of data and interventions frequently encountered in real-world causal inference problems, such as observational data with time-dependent confounding, selection bias, and random monitoring processes. The package interface allows for concise expression of complex functional dependencies between a large number of nodes, where each node may represent a measurement at a specific time point. The package allows for the specification and simulation of counterfactual data under various user-specified interventions (e.g., static, dynamic, deterministic, or stochastic). In particular, the interventions may represent exposures to treatment regimens, the occurrence or non-occurrence of right-censoring events, or of clinical monitoring events. Finally, the package enables the computation of a selected set of user-specified features of the distribution of the counterfactual data that represent common causal quantities of interest, such as treatment-specific means, average treatment effects, and coefficients from working marginal structural models. The applicability of simcausal is demonstrated by replicating the results of two published simulation studies. Chapter 3: The past decade has seen an increasing body of literature devoted to the estimation of causal effects in network-dependent data. However, the validity of many classical statistical methods in such data is often questioned. There is an emerging need for objective and practical ways to assess which causal methodologies might be applicable and valid in such novel network-based datasets. In this chapter we describe a set of tools, implemented as part of the simcausal R package, that allow simulating data based on the non-parametric structural equation model for connected units. We also provide examples of how these simulations may be applied to the evaluation of different statistical methods for estimation of causal effects in such data. In particular, these simulation tools are targeted to the types of data and interventions frequently encountered in real-world causal inference research in social networks, such as observational studies with spill-over or interference. We developed a novel R language interface which simplifies the specification of network-based functional relationships between connected units. Moreover, this network-based syntax can be combined with.
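
To illustrate what "simulating counterfactual data under a user-specified intervention" means in practice, here is a tiny Python analogue of the idea. simcausal itself is an R package with its own node-based syntax; the structural equations and numbers below are invented for illustration only.

```python
import numpy as np

rng = np.random.default_rng(5)
expit = lambda v: 1 / (1 + np.exp(-v))

def simulate(n, intervention=None):
    """Draw from a toy structural equation model W -> A -> Y.  Passing
    intervention=0 or 1 replaces the structural equation for A with a
    static intervention do(A = a), yielding counterfactual data."""
    W = rng.standard_normal(n)                 # baseline covariate node
    if intervention is None:
        A = rng.binomial(1, expit(W))          # observed treatment node
    else:
        A = np.full(n, intervention)           # intervened treatment node
    Y = rng.binomial(1, expit(0.8 * A + W))    # outcome node
    return W, A, Y

# Model-free gold standard: counterfactual treatment-specific means.
_, _, Y1 = simulate(200_000, intervention=1)
_, _, Y0 = simulate(200_000, intervention=0)
print(Y1.mean() - Y0.mean())   # "true" average treatment effect
```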




Models for Probability and Statistical Inference


Book Description

This concise, yet thorough, book is enhanced with simulations and graphs to build the intuition of readers. Models for Probability and Statistical Inference was written over a five-year period and serves as a comprehensive treatment of the fundamentals of probability and statistical inference. With detailed theoretical coverage found throughout the book, readers acquire the fundamentals needed to advance to more specialized topics, such as sampling, linear models, design of experiments, statistical computing, survival analysis, and bootstrapping. Ideal as a textbook for a two-semester sequence on probability and statistical inference, early chapters provide coverage on probability and include discussions of: discrete models and random variables; discrete distributions including binomial, hypergeometric, geometric, and Poisson; continuous, normal, gamma, and conditional distributions; and limit theory. Since limit theory is usually the most difficult topic for readers to master, the author thoroughly discusses modes of convergence of sequences of random variables, with special attention to convergence in distribution. The second half of the book addresses statistical inference, beginning with a discussion on point estimation and followed by coverage of consistency and confidence intervals. Further areas of exploration include: distributions defined in terms of the multivariate normal, chi-square, t, and F (central and non-central); the one- and two-sample Wilcoxon test, together with methods of estimation based on both; linear models with a linear space-projection approach; and logistic regression. Each section contains a set of problems ranging in difficulty from simple to more complex, and selected answers as well as proofs to almost all statements are provided. An abundant number of figures, in addition to helpful simulations and graphs produced by the statistical package S-Plus®, are included to help build the intuition of readers.
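
Since the book leans on simulation to build intuition for convergence in distribution (its own examples use S-Plus), here is a sketch of the kind of experiment it has in mind, written in Python: standardized means of a skewed sample approach the standard normal as the sample size grows.

```python
import numpy as np

# Standardized means of skewed Exp(1) samples: as n grows, the sampling
# distribution approaches N(0, 1) -- convergence in distribution, made
# visible by comparing an empirical tail probability with the normal 0.05.
rng = np.random.default_rng(6)
for n in (2, 10, 50, 250):
    means = rng.exponential(size=(50_000, n)).mean(axis=1)
    z = (means - 1.0) * np.sqrt(n)     # Exp(1) has mean 1 and variance 1
    print(n, np.mean(z > 1.645))       # -> 0.05 as n grows
```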




Mixed Effects Models for Complex Data


Book Description

Although standard mixed effects models are useful in a range of studies, other approaches must often be used in conjunction with them when studying complex or incomplete data. Mixed Effects Models for Complex Data discusses commonly used mixed effects models and presents appropriate approaches to address dropouts, missing data, measurement errors, censoring, and outliers. For each class of mixed effects model, the author reviews the corresponding class of regression model for cross-sectional data.

An overview of general models and methods, along with motivating examples: After presenting real data examples and outlining general approaches to the analysis of longitudinal/clustered data and incomplete data, the book introduces linear mixed effects (LME) models, generalized linear mixed models (GLMMs), nonlinear mixed effects (NLME) models, and semiparametric and nonparametric mixed effects models. It also includes general approaches for the analysis of complex data with missing values, measurement errors, censoring, and outliers.

Self-contained coverage of specific topics: Subsequent chapters delve more deeply into missing data problems, covariate measurement errors, and censored responses in mixed effects models. Focusing on incomplete data, the book also covers survival and frailty models, joint models of survival and longitudinal data, robust methods for mixed effects models, marginal generalized estimating equation (GEE) models for longitudinal or clustered data, and Bayesian methods for mixed effects models.

Background material: In the appendix, the author provides background information, such as likelihood theory, the Gibbs sampler, rejection and importance sampling methods, numerical integration methods, optimization methods, the bootstrap, and matrix algebra.

Failure to properly address missing data, measurement errors, and other issues in statistical analyses can lead to severely biased or misleading results. This book explores the biases that arise when naïve methods are used and shows which approaches should be used to achieve accurate results in longitudinal data analysis.
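
As a concrete starting point for the LME models the book opens with, the following Python sketch simulates clustered longitudinal data with a subject-specific random intercept and fits a random-intercept model with statsmodels. The book itself is software-agnostic; the data-generating numbers and model specification here are invented for illustration.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated longitudinal data: 40 subjects, 6 visits each, with a
# subject-specific random intercept on top of a common time trend.
rng = np.random.default_rng(7)
n_subj, n_vis = 40, 6
df = pd.DataFrame({
    "subject": np.repeat(np.arange(n_subj), n_vis),
    "time": np.tile(np.arange(n_vis), n_subj),
})
b = rng.normal(0.0, 1.5, n_subj)                  # random intercepts b_i
df["y"] = (2.0 + 0.5 * df["time"]
           + b[df["subject"].to_numpy()]
           + rng.normal(0.0, 0.5, len(df)))

# Random-intercept LME: y_ij = beta0 + beta1 * time_ij + b_i + e_ij.
fit = smf.mixedlm("y ~ time", df, groups=df["subject"]).fit()
print(fit.summary())
```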