Treatment effects & delayed outcomes

I have a new paper with my colleague Federico Bugni and Steve McBride at ROBLOX. The paper, “Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes,” studies settings where the analyst is interested in identifying and estimating the average causal effect of a binary treatment on an outcome. We consider a setup in which the outcome is not realized immediately after treatment assignment, a feature that is ubiquitous in empirical settings. The period between the treatment and the realization of the outcome allows other observed actions to occur and affect the outcome, as illustrated in Figure 1 of the paper. In this context, we study several regression-based estimands routinely used in empirical work to capture the average treatment effect and shed light on how to interpret them in terms of ceteris paribus effects, indirect causal effects, and selection terms. We obtain three main and related takeaways. First, the three most popular estimands do not generally satisfy what we call strong sign preservation, in the sense that these estimands may be negative even when the treatment positively affects the outcome conditional on any possible combination of the other actions. Second, the most popular regression, which includes the other actions as controls, satisfies strong sign preservation if and only if these actions are mutually exclusive binary variables. Finally, we show that a linear regression that fully stratifies on the other actions leads to estimands that satisfy strong sign preservation.
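
To fix ideas on the two regression specifications being compared, here is a minimal, hypothetical simulation (ours, not the paper's): a binary treatment D, a single binary post-treatment action M, and an outcome Y realized after M. The first regression includes M as a control; the second fully stratifies on M by interacting D with indicators for each value of M. The sketch is only meant to make the two estimands concrete, not to illustrate the sign-preservation results.

```python
import numpy as np

# Toy data: binary treatment D, one binary post-treatment action M, outcome Y.
rng = np.random.default_rng(0)
n = 100_000
D = rng.integers(0, 2, n)
M = (rng.random(n) < 0.3 + 0.4 * D).astype(float)   # treatment shifts the action
Y = 1.0 * D + 2.0 * M + rng.standard_normal(n)       # outcome realized after M

def ols(y, X):
    # OLS coefficients via least squares
    return np.linalg.lstsq(X, y, rcond=None)[0]

# (i) Regression that simply includes the other action as a control.
b_ctrl = ols(Y, np.column_stack([np.ones(n), D, M]).astype(float))
print("coefficient on D with M as a control:", b_ctrl[1])

# (ii) Fully stratified regression: interact D with indicators for each value of M.
X_strat = np.column_stack([M == 0, (M == 0) * D, M == 1, (M == 1) * D]).astype(float)
b_strat = ols(Y, X_strat)
print("effect of D within the M = 0 and M = 1 strata:", b_strat[1], b_strat[3])
```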

Non-Ignorable Clusters and CAR

In collaboration with Federico Bugni, Azeem Shaikh, and Max Tabord-Meehan, we have released a new paper that considers the problem of inference in cluster randomized experiments when cluster sizes are non-ignorable. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the cluster; by non-ignorable cluster sizes we mean that “large” clusters and “small” clusters may be heterogeneous, and, in particular, the effects of the treatment may vary across clusters of differing sizes.
In order to permit this sort of flexibility, we consider a sampling framework in which cluster sizes themselves are random. In this way, our analysis departs from earlier analyses of cluster randomized experiments in which cluster sizes are treated as non-random. We distinguish between two different parameters of interest: the equally-weighted cluster-level average treatment effect and the size-weighted cluster-level average treatment effect. For each parameter, we provide methods for inference in an asymptotic framework where the number of clusters tends to infinity and treatment is assigned using a covariate-adaptive stratified randomization procedure. We additionally permit the experimenter to sample only a subset of the units within each cluster rather than the entire cluster, and demonstrate the implications of such sampling for some commonly used estimators. A small simulation study and an empirical demonstration show the practical relevance of our theoretical results. You can download the paper here.
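
In the notation we use here (which may differ from the paper's), with clusters indexed by g, random cluster size N_g, and cluster-level average potential outcomes \bar Y_g(1) and \bar Y_g(0), the two parameters can be written as

```latex
\theta_{\text{eq}} \;=\; E\bigl[\,\bar{Y}_g(1) - \bar{Y}_g(0)\,\bigr],
\qquad
\theta_{\text{size}} \;=\; \frac{E\bigl[\,N_g\,\bigl(\bar{Y}_g(1) - \bar{Y}_g(0)\bigr)\bigr]}{E\bigl[N_g\bigr]}.
```

The first weights every cluster equally regardless of its size; the second weights each cluster by its (random) size, so larger clusters count more. When treatment effects vary with cluster size, which is the non-ignorable case described above, the two parameters generally differ.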

Guide to moment inequalities

Models defined by moment inequalities have become a standard modeling framework for empirical economists, spanning a wide range of fields within economics. From the point of view of an empirical researcher, the literature on inference in moment inequality models is large and complex, including multiple survey papers that document the non-standard features these models possess, the main novel concepts behind inference in these models, and the most recent developments that bring advances in accuracy and computational tractability. In a recent paper with my colleague Gaston Illanes and my student Amilcar Velez, we present a guide to empirical practice intended to help applied researchers navigate all the decisions required to frame a model as a moment inequality model and then construct confidence intervals for the parameters of interest. We divide our template into four main steps: (a) a behavioral decision model, (b) moving from the decision model to a moment inequality model, (c) choosing a test statistic and critical value, and (d) accounting for computational challenges. We split each of these steps into a discussion of the “how” and the “why,” and then illustrate how to put these steps into practice in an empirical application that studies identification of the expected sunk costs of offering a product in a market. A GitHub repository with all the code needed to implement our recommendations in R, Matlab, and Python will be available soon.
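
As a rough illustration of step (c), here is a deliberately simplified sketch (ours, not the specific recommendations in the guide) of a test of the hypothesis that all moment inequalities E[m_j(W, theta)] <= 0 hold at a candidate parameter value, using a max-of-studentized-moments statistic and a least-favorable bootstrap critical value. The moment functions below are hypothetical placeholders; a real application would derive them from steps (a) and (b).

```python
import numpy as np

def moments(W, theta):
    # Hypothetical moment functions m_j(W, theta), j = 1, 2 (placeholders only).
    return np.column_stack([W[:, 0] - theta, theta - W[:, 1]])

def test_statistic(W, theta):
    # Maximum of the studentized sample moments.
    m = moments(W, theta)
    n = m.shape[0]
    return np.max(np.sqrt(n) * m.mean(axis=0) / m.std(axis=0, ddof=1))

def bootstrap_critical_value(W, theta, alpha=0.05, B=999, seed=0):
    # Least-favorable construction: recenter the moments at zero and resample.
    rng = np.random.default_rng(seed)
    m = moments(W, theta)
    n = m.shape[0]
    m_centered = m - m.mean(axis=0)
    stats = np.empty(B)
    for b in range(B):
        mb = m_centered[rng.integers(0, n, n)]
        stats[b] = np.max(np.sqrt(n) * mb.mean(axis=0) / mb.std(axis=0, ddof=1))
    return np.quantile(stats, 1 - alpha)

# A candidate theta is retained in the confidence set whenever
# test_statistic(W, theta) <= bootstrap_critical_value(W, theta).
```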

On Outcome Tests for Detecting Bias

The paper On the Use of Outcome Tests for Detecting Bias in Decision Making, joint with Magne Mogstad and Jack Mountjoy, is now available. This paper starts with the observation that the decisions of judges, lenders, journal editors, and other gatekeepers often lead to disparities in outcomes across affected groups. An important question is whether, and to what extent, these group-level disparities are driven by relevant differences in underlying individual characteristics or by biased decision makers. Becker (1957) proposed an outcome test for bias that has led to a large body of related empirical work, with recent innovations in settings where decision makers are exogenously assigned to cases and vary progressively in their decision tendencies. We carefully examine what can be learned about bias in decision making in such settings. Our results call into question recent conclusions about racial bias among bail judges and, more broadly, yield four lessons for researchers considering the use of outcome tests of bias. First, the so-called generalized Roy model, which is a workhorse of applied economics, does not deliver a logically valid outcome test without further restrictions, since it does not require an unbiased decision maker to equalize marginal outcomes across groups. Second, the more restrictive “extended” Roy model, which isolates potential outcomes as the sole admissible source of analyst-unobserved variation driving decisions, delivers both a logically valid and econometrically viable outcome test. Third, this extended Roy model places strong restrictions on behavior and the data generating process, so detailed institutional knowledge is essential for justifying such restrictions. Finally, because the extended Roy model imposes restrictions beyond those required to identify marginal outcomes across groups, it has testable implications that may help assess its suitability across empirical settings.
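
In its simplest form (the notation is ours, not the paper's), the outcome-test logic is that an unbiased decision maker should equalize outcomes for marginal cases across groups, so that a finding of

```latex
E\bigl[\,Y \mid \text{marginal case},\ \text{group } a\,\bigr]
\;\neq\;
E\bigl[\,Y \mid \text{marginal case},\ \text{group } b\,\bigr]
```

would be read as evidence of bias. The first lesson above is precisely that the generalized Roy model does not by itself imply this equalization for unbiased decision makers.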

A few days after our paper became public, the authors of the paper “Racial Bias in Bail Decisions,” The Quarterly Journal of Economics 133.4 (November 2018): 1885-1932, wrote a correction appendix to their paper and a note with comments on our paper. You can find both files on the authors’ websites or appended to the end of the reply we discuss below. We found these comments unclear, so we decided to write the reply linked below to help interested readers understand both sides of the argument:

Reply to “Comment on Canay, Mogstad, and Mountjoy (2020)” by Arnold, Dobbie, and Yang (ADY).

We divide the arguments into three points. First, we do not mischaracterize the definition of racial bias in the published version of ADY. If the authors wrote the published definition but actually meant a substantially different definition (such as the one that now appears in the new “Correction Appendix,” also appended to the reply), then that is clearly the relevant mischaracterization. Second, focusing on clear-cut cases of (un)biased behavior is a feature of our argument, not a bug. The point is that even in the starkest, most unambiguous cases of unbiased and biased behavior, the outcome test can deliver the wrong conclusion. This logical invalidity of the outcome test also extends to intermediate cases where judges are biased against some defendants but not others. Third, to restore the logical validity of the outcome test, instead of invoking a decision model that justifies the test, ADY choose to redefine racial bias. Problematically, their substantial post-publication change in the definition of (un)biased judge behavior matters greatly for the interpretation and implications of their findings. The new definition is reverse-engineered, difficult to justify, and at odds not only with the work by Becker that ADY cite frequently, but also with more recent work by a subset of the authors of ADY.

New paper and Stata package for continuity in RDD

In the regression discontinuity design (RDD), it is common practice to assess the credibility of the design by testing the continuity of the density of the running variable at the cut-off, e.g., McCrary (2008). In joint work with Federico Bugni, we propose a new test for continuity of a density at a point based on the so-called g-order statistics, and study its properties under a novel asymptotic framework. The asymptotic framework is intended to approximate a small-sample phenomenon: even though the total number n of observations may be large, the number of effective observations local to the cut-off is often small. Thus, while traditional asymptotics in RDD require a growing number of observations local to the cut-off as n grows, our framework allows the number q of observations local to the cut-off to be fixed as n grows. The new test is easy to implement, asymptotically valid under weaker conditions than those used by competing methods, exhibits finite-sample validity under stronger conditions than those needed for its asymptotic validity, and has favorable power properties against certain alternatives. You can find a copy of the paper here.
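
To convey the fixed-q intuition (this is a stylized illustration only, not the g-order-statistics test proposed in the paper): if the density of the running variable is continuous at the cut-off, the q observations closest to the cut-off should fall on either side with roughly equal probability.

```python
import numpy as np
from scipy.stats import binomtest

def local_sign_check(running, cutoff, q=40):
    """Stylized check: among the q observations closest to the cutoff, the number
    falling to the right should be approximately Binomial(q, 1/2) when the density
    of the running variable is continuous at the cutoff."""
    closest = running[np.argsort(np.abs(running - cutoff))[:q]]
    n_right = int(np.sum(closest >= cutoff))
    return binomtest(n_right, n=q, p=0.5).pvalue
```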

We have also finished the first version of a Stata package that implements the new test we propose. You can download the package from the Bitbucket repository (Rdcont), which includes the ado file with an example of how to use it. Visit the software page here for additional Stata and R packages.

Wild Bootstrap and Few Clusters

We just finished a paper, joint with Azeem Shaikh and Andres Santos, on the formal properties of the Wild Cluster Bootstrap when the data contains few, but large, clusters [See paper here].

Cameron et al. (2008) provide simulations that suggest the wild bootstrap test works well even in settings with as few as five clusters, but existing theoretical analyses of its properties all rely on an asymptotic framework in which the number of clusters is “large.”

In contrast to these analyses, we employ an asymptotic framework in which the number of clusters is “small,” but the number of observations per cluster is “large.” In this framework, we provide conditions under which the limiting rejection probability of an un-Studentized version of the test does not exceed the nominal level. Importantly, these conditions require, among other things, certain homogeneity restrictions on the distribution of covariates. The practical relevance of these conditions in finite samples is confirmed via a small simulation study. In addition, our results can help explain the remarkable behavior of these tests in the simulations of Cameron et al. (2008). It follows from our results that when these conditions are implausible and there are few clusters, researchers may wish to consider methods that do not impose such conditions, such as Ibragimov and Muller (2010) and Canay, Romano, and Shaikh (2017).
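
For concreteness, here is a minimal sketch (ours, with a single regressor rather than the paper's general setup) of the un-Studentized wild cluster bootstrap test of a zero slope, with the null imposed when constructing residuals and Rademacher weights drawn at the cluster level.

```python
import numpy as np

def wild_cluster_bootstrap_pvalue(y, x, cluster, B=999, seed=0):
    """Un-Studentized wild cluster bootstrap test of H0: slope = 0 in y = a + b*x + u.
    The null is imposed when forming residuals; Rademacher weights are drawn per cluster."""
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones_like(x, dtype=float), x])

    def slope(yy):
        return np.linalg.lstsq(X, yy, rcond=None)[0][1]

    b_hat = slope(y)
    u_restricted = y - y.mean()                      # residuals with b = 0 imposed
    labels = np.unique(cluster)
    cluster_idx = np.searchsorted(labels, cluster)   # map cluster labels to 0..G-1
    boot = np.empty(B)
    for b in range(B):
        w = rng.choice([-1.0, 1.0], size=labels.size)      # Rademacher weights
        y_star = y.mean() + w[cluster_idx] * u_restricted  # bootstrap outcome
        boot[b] = slope(y_star)
    return np.mean(np.abs(boot) >= np.abs(b_hat))
```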

Paper for the 11th World Congress

We have finished the paper that was prepared for an invited talk at the 11th World Congress of the Econometric Society in Montreal, Canada. A PDF copy of the paper is available here and the slides of the talk are here. This paper surveys some of the recent literature on inference in partially identified models. After reviewing some basic concepts, including the definition of a partially identified model and the identified set, we turn our attention to the construction of confidence regions in partially identified settings. In our discussion, we emphasize the importance of requiring confidence regions to be uniformly consistent in level over relevant classes of distributions. Due to space limitations, our survey is mainly limited to the class of partially identified models in which the identified set is characterized by a finite number of moment inequalities, or the closely related class of partially identified models in which the identified set is a function of such a set. The latter class of models most commonly arises when interest focuses on a subvector of a vector-valued parameter whose values are limited by a finite number of moment inequalities. We then briefly review some important parts of the broader literature on inference in partially identified models and conclude by providing some thoughts on fruitful directions for future research.
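
In standard notation (ours, not necessarily the paper's), the main class of models covered is one in which the identified set takes the form

```latex
\Theta_0(P) \;=\; \bigl\{\,\theta \in \Theta \;:\; E_P\bigl[m_j(W_i,\theta)\bigr] \le 0 \ \text{ for all } j = 1,\dots,k \,\bigr\},
```

and the closely related class concerns functions of this set, such as the identified set for a subvector, \(\{\lambda(\theta) : \theta \in \Theta_0(P)\}\).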

Inference under Covariate-Adaptive Randomization

The paper Inference under Covariate-Adaptive Randomization, joint with Federico Bugni and Azeem Shaikh, is now available. This paper studies inference for the average treatment effect in randomized controlled trials with covariate-adaptive randomization. Here, by covariate-adaptive randomization, we mean randomization schemes that first stratify according to baseline covariates and then assign treatment status so as to achieve “balance” within each stratum. Such schemes include, for example, Efron’s biased-coin design and stratified block randomization. When testing the null hypothesis that the average treatment effect equals a pre-specified value in such settings, we first show that the usual two-sample t-test is conservative, in the sense that it has limiting rejection probability under the null hypothesis no greater than, and typically strictly less than, the nominal level. In a simulation study, we find that the rejection probability may in fact be dramatically less than the nominal level. We show further that these same conclusions remain true for a naïve permutation test, but that a modified version of the permutation test yields a test that is non-conservative, in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. The modified version of the permutation test has the additional advantage that its rejection probability is exactly equal to the nominal level for some distributions satisfying the null hypothesis. Finally, we show that the usual t-test (on the coefficient on treatment assignment) in a linear regression of outcomes on treatment assignment and indicators for each of the strata yields a non-conservative test as well. In a simulation study, we find that the non-conservative tests have substantially greater power than the usual two-sample t-test.
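
As a rough illustration of the two tests being compared (the implementation details here are ours, not the paper's), the following contrasts the usual two-sample t-test with the t-test on the treatment coefficient in a regression of outcomes on treatment assignment and strata indicators.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

def compare_tests(y, d, strata):
    """y: outcomes; d: 0/1 treatment assignment; strata: stratum labels from the
    covariate-adaptive randomization. Returns the two p-values."""
    # Usual two-sample t-test comparing treated and control means.
    _, p_two_sample = stats.ttest_ind(y[d == 1], y[d == 0])

    # t-test on d in a regression of y on d and a full set of strata indicators.
    labels, idx = np.unique(strata, return_inverse=True)
    S = np.eye(labels.size)[idx]          # strata dummies (no separate constant)
    res = sm.OLS(y, np.column_stack([d, S])).fit()
    return p_two_sample, res.pvalues[0]
```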

Approximate Permutation tests for RDD

We just finished the first draft of the paper Approximate Permutation Tests and Induced Order Statistics in the Regression Discontinuity Design, which is joint work with my student Vishal Kamat. This paper proposes an asymptotically valid permutation test for a testable implication of the identification assumption in the regression discontinuity design (RDD). Here, by testable implication, we mean the requirement that the distribution of observed baseline covariates should not change discontinuously at the threshold of the so-called running variable. This contrasts with the common practice of testing the weaker implication of continuity of the means of the covariates at the threshold. When testing our null hypothesis using observations that are “close” to the threshold, the standard requirement for the finite-sample validity of a permutation test does not necessarily hold. We therefore propose an asymptotic framework in which the number of observations closest to the threshold is fixed as the sample size goes to infinity, and propose a permutation test based on the so-called induced order statistics that controls the limiting rejection probability under the null hypothesis. In a simulation study, we find that the new test controls size remarkably well in most designs. Finally, we use our test to evaluate the validity of the design in Lee (2008), a well-known application of the RDD to study incumbency advantage.
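
The following sketch conveys the construction in a simplified form (our notation and choices, not the exact proposal in the paper): take the q observations closest to the threshold on each side, and permutation-test whether a baseline covariate has the same distribution on the two sides using a Cramér-von Mises-type statistic.

```python
import numpy as np

def rdd_covariate_permutation_test(running, w, cutoff, q=25, n_perm=999, seed=0):
    """Permutation test (simplified sketch) of whether a baseline covariate w has the
    same distribution just to the left and just to the right of the cutoff, using the
    q closest observations on each side of the threshold."""
    rng = np.random.default_rng(seed)
    left_mask = running < cutoff
    w_left, r_left = w[left_mask], running[left_mask]
    w_right, r_right = w[~left_mask], running[~left_mask]
    left = w_left[np.argsort(cutoff - r_left)[:q]]       # q closest from the left
    right = w_right[np.argsort(r_right - cutoff)[:q]]    # q closest from the right

    def cvm(a, b):
        # Cramer-von Mises-type distance between the two empirical CDFs
        pooled = np.concatenate([a, b])
        Fa = np.searchsorted(np.sort(a), pooled, side="right") / a.size
        Fb = np.searchsorted(np.sort(b), pooled, side="right") / b.size
        return np.mean((Fa - Fb) ** 2)

    observed = cvm(left, right)
    pooled = np.concatenate([left, right])
    perm_stats = np.empty(n_perm)
    for b in range(n_perm):
        z = rng.permutation(pooled)
        perm_stats[b] = cvm(z[:left.size], z[left.size:])
    return (1 + np.sum(perm_stats >= observed)) / (1 + n_perm)
```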