Chapter 1

Putting Saddle on the Beast: Saddlepoint Approximation to the Wild Bootstrap (job market paper) studies applications of the saddlepoint approximation to bootstrapping and proposes a method that saves practitioners significant computation time. Bootstrapping is typically employed as a simulation approach to statistical inference and has been shown to be considerably more reliable than asymptotic theory in many settings. Despite this advantage, bootstrap methods often pose computational challenges, especially the more sophisticated ones such as the iterated bootstrap, or when they are applied to large datasets. Depending on the context, a bootstrap simulation can take hours, days, or even weeks.


In this context, I observe that the results of Lieberman (Econometric Theory, 1997) can have wide applications to the bootstrap, a potential that has received little attention from Lieberman himself, from authors who cite him, and in the recent literature in general. The idea is that for numerous combinations of bootstrap schemes (including the parametric bootstrap, the semiparametric bootstrap, the wild bootstrap, and the sieve bootstrap) and statistical tests, the bootstrap test statistic can be written as a ratio of two quadratic forms. In a typical practical situation, a simulation loop is run to estimate the probability that this ratio is less than or equal to the value of the test statistic computed from the observed data. By design, however, the bootstrap data generating process is known, so this probability can in fact be estimated without a loop, by the saddlepoint approximation. The strength of the latter is that its error in estimating a probability is relative to that probability, which makes the saddlepoint approximation particularly reliable for probabilities in the tails of a distribution.
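
As a rough illustration of the idea above (a sketch under stated assumptions, not the implementation used in the paper), suppose the bootstrap errors e are standard normal, so that the statistic is R = e'Ae / e'Be with B positive definite. Then P(R <= x) = P(e'(A - xB)e <= 0), and the right-hand side is the distribution function of a single quadratic form, which can be approximated with the Lugannani-Rice saddlepoint formula. The function name `saddlepoint_ratio_cdf`, the normality assumption, and the use of simple bisection to solve the saddlepoint equation are all choices made for this example.

```python
import math
import numpy as np

def saddlepoint_ratio_cdf(A, B, x):
    """Approximate P(e'Ae / e'Be <= x) for e ~ N(0, I), B positive definite.

    Illustrative sketch: normal errors and the Lugannani-Rice formula
    are assumptions of this example, not details taken from the paper.
    """
    A = np.asarray(A, dtype=float)
    B = np.asarray(B, dtype=float)
    D = A - x * B
    D = 0.5 * (D + D.T)                      # symmetrise
    lam = np.linalg.eigvalsh(D)              # Q = sum_i lam_i * z_i^2
    lam = lam[np.abs(lam) > 1e-12 * np.abs(lam).max()]
    if lam.size == 0 or lam.max() < 0:       # Q <= 0 with probability one
        return 1.0
    if lam.min() > 0:                        # Q > 0 with probability one
        return 0.0

    # Cumulant generating function of Q and its derivatives.
    K = lambda s: -0.5 * np.sum(np.log1p(-2.0 * s * lam))
    K1 = lambda s: np.sum(lam / (1.0 - 2.0 * s * lam))
    K2 = lambda s: 2.0 * np.sum((lam / (1.0 - 2.0 * s * lam)) ** 2)
    K3 = lambda s: 8.0 * np.sum((lam / (1.0 - 2.0 * s * lam)) ** 3)

    # Solve K'(s) = 0 by bisection; K' is increasing on (lo, hi).
    lo, hi = 1.0 / (2.0 * lam.min()), 1.0 / (2.0 * lam.max())
    a, b = lo * (1.0 - 1e-10), hi * (1.0 - 1e-10)
    for _ in range(200):
        mid = 0.5 * (a + b)
        if K1(mid) < 0:
            a = mid
        else:
            b = mid
    s = 0.5 * (a + b)

    # Near s = 0 the formula is singular; use its known limiting value.
    if abs(s) * math.sqrt(K2(0.0)) < 1e-4:
        return 0.5 + K3(0.0) / (6.0 * math.sqrt(2.0 * math.pi) * K2(0.0) ** 1.5)

    w = math.copysign(math.sqrt(max(-2.0 * K(s), 0.0)), s)
    u = s * math.sqrt(K2(s))
    Phi = 0.5 * (1.0 + math.erf(w / math.sqrt(2.0)))
    phi = math.exp(-0.5 * w * w) / math.sqrt(2.0 * math.pi)
    return Phi + phi * (1.0 / w - 1.0 / u)
```

For example, with `A = np.diag([1.0, 0.0])` and `B = np.eye(2)` the ratio follows an arcsine distribution with exact distribution function (2/pi) arcsin(sqrt(x)), which gives a simple check on the approximation even in this very small two-dimensional case.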


I explore the insight above through extensive simulations. To focus the presentation, these simulations all concern variants of the wild bootstrap, which is applicable when there is heteroskedasticity of unknown form. I find that the saddlepoint approximation mimics the bootstrap very accurately and can cut the simulation time by a factor of 10 to 25. This means that a bootstrap simulation that takes 2 hours can be replaced by a procedure that takes about 5 minutes and yields essentially the same statistical inferences. When the bootstrap is iterated, I show that the saddlepoint approximation can in fact be more reliable than the fast double bootstrap, an existing method also developed to reduce the computational cost of the bootstrap.
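
For readers unfamiliar with the simulation loop that the saddlepoint approximation is meant to replace, the following is a minimal sketch of a generic wild bootstrap test of a single regression coefficient. It assumes Rademacher weights, restricted residuals, and an HC0 robust t-statistic; these choices, and the name `wild_bootstrap_pvalue`, are illustrative and need not match the variants studied in the paper.

```python
import numpy as np

def wild_bootstrap_pvalue(y, X, j, n_boot=999, seed=0):
    """Wild bootstrap p-value for H0: beta_j = 0 in y = X beta + u.

    Generic sketch (Rademacher weights, restricted residuals, HC0
    t-statistic); not necessarily the scheme used in the paper.
    """
    rng = np.random.default_rng(seed)
    n = len(y)
    XtX_inv = np.linalg.inv(X.T @ X)

    def t_stat(y_):
        beta = XtX_inv @ (X.T @ y_)
        resid = y_ - X @ beta
        # Heteroskedasticity-robust (HC0) sandwich covariance.
        meat = (X * resid[:, None] ** 2).T @ X
        V = XtX_inv @ meat @ XtX_inv
        return beta[j] / np.sqrt(V[j, j])

    t_obs = t_stat(y)

    # Restricted fit that imposes the null beta_j = 0.
    X0 = np.delete(X, j, axis=1)
    b0 = np.linalg.lstsq(X0, y, rcond=None)[0]
    fit0 = X0 @ b0
    u0 = y - fit0

    # The loop below is the computational cost of the bootstrap.
    t_boot = np.empty(n_boot)
    for b in range(n_boot):
        v = rng.choice([-1.0, 1.0], size=n)   # Rademacher weights
        t_boot[b] = t_stat(fit0 + u0 * v)

    # Two-sided bootstrap p-value.
    return float(np.mean(np.abs(t_boot) >= abs(t_obs)))
```

Each pass through the loop recomputes the test statistic on a new bootstrap sample, so the running time grows linearly in `n_boot` (and much faster when the bootstrap is iterated).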


In addition to the simulations, I provide an empirical illustration that uses the data of Angrist and Kugler (Review of Economics and Statistics, 2008) and the wild cluster bootstrap to study the economic impact of an exogenous shock to the domestic supply of coca in Colombia. The data set consists of hundreds of thousands of observations, and I show that the saddlepoint approximation to the wild cluster bootstrap saves considerable time.

Download link: will be provided in the near future.


Chapter 2

In my second chapter, Identification Robust Inferences for Endogeneity Parameters in Simultaneous Equation Models with Incomplete Reduced Form, I address another problem faced in empirical work: linear regressions with endogenous variables for which some of the instruments are weak and some are unobserved.

When there are endogenous variables (those that are correlated with the structural errors), the least squares estimators are biased and inconsistent. In such contexts, instruments are the usual remedy. However, when the instruments are weak (that is, they are barely correlated with the endogenous variables), instrumental variable results can be very misleading. As a result, there has been interest since the early 1990s in drawing statistical inferences in a way that is robust to the strength of the instruments. (This kind of robustness is typically called “identification robustness.”) In particular, Tchatoka and Dufour (Econometrics Journal, 2014) assess the degree of endogeneity of the endogenous variables in a framework that is identification robust. However, they assume a complete reduced-form equation: that is, the linear dependence of the endogenous variables on the instruments is explicitly modelled and all relevant instruments are observed.

In practice, a complete reduced-form equation may be unreasonable: some instruments may not be observed, and there may be no good reason to suppose that the endogenous variables depend linearly on the instruments. In recognition of this, I propose a two-stage framework that retains identification robustness when drawing inferences on the degree of endogeneity of the endogenous variables but does not impose a complete reduced-form equation. To my knowledge, such a framework is new. The two stages work as follows. In the first stage, one draws inferences on the regression parameters of the endogenous variables in a way that is identification robust. In the second stage, one uses the results of the first stage to draw inferences on the degree of endogeneity. To connect the stages, I construct a flexible asymptotic theory that uses Isserlis (Biometrika, 1918) and that allows for joint inferences on the regression parameters of the endogenous variables and the degree of endogeneity, and I employ the projection method of Dufour (Econometrica, 1990).
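
For reference, the result of Isserlis mentioned above is usually stated as follows for zero-mean, jointly normal variables (the notation $\sigma_{ij}$ is introduced here for illustration; how the result enters the chapter's asymptotic theory is not detailed in this summary):

```latex
\mathbb{E}[X_1 X_2 X_3 X_4]
  = \sigma_{12}\sigma_{34} + \sigma_{13}\sigma_{24} + \sigma_{14}\sigma_{23},
\qquad \sigma_{ij} = \mathbb{E}[X_i X_j].
% Special case: the variance of a product of two such variables.
\operatorname{Var}(X_1 X_2) = \sigma_{11}\sigma_{22} + \sigma_{12}^{2}.
```

Identities of this kind express fourth moments in terms of covariances, which is typically what is needed to obtain the asymptotic variance of estimators built from sample second moments.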


The theoretical contributions are illustrated by extensive simulations and an empirical study of the classic returns to schooling problem (for a survey, see Card (Econometrica, 2000)), where some of the instruments used in the literature suffer from the weak instrument problem.

Download link: will be provided in the near future.


Chapter 3

My third chapter, Missing Variables and Causal Analysis in Linear Regression: Interpretation and Distributional Theory (with Jean-Marie Dufour), takes a different approach to the endogenous variables problem. There, we do not assume the availability of any instruments, which may in fact be difficult to obtain in practice. In the absence of instruments, it is typically argued that the total effect of the endogenous variables may be inferred by treating the endogeneity problem as a missing variable problem. We demonstrate rigorously that this can in fact be done, but not always, since care must be given to the underlying statistical assumptions. Our framework builds on the properties of chi-squared distributions, including the chi-squared distribution with zero degrees of freedom, an object not typically encountered in the econometrics literature. In particular, we show the striking result that a test with missing variables can in fact be more powerful than the one with those variables observed. Our theoretical results are confirmed in simulations, and we also illustrate them in an empirical study that uses data from Tal-Or, Cohen, Tsfati, and Gunther (Communication Research, 2010) to study the effect of presumed media influence.

Download link: will be provided in the near future.