Conditional Stochastic Dominance

I’m excited to share a new working paper that I’ve recently completed, which is joint work with Federico Bugni and Deborah Kim. The paper delves into an important aspect of stochastic dominance: testing conditional stochastic dominance (CSD) at specific values of a conditioning covariate. This concept is crucial in many applied fields, including evaluating treatment effects in social programs, investigating economic disparities, and exploring potential discrimination in decision-making processes.

What’s New?

In this paper, we focus on the problem of testing whether the conditional cumulative distribution function (CDF) of one variable stochastically dominates that of another at specific values of a conditioning variable. Formally, we test the null hypothesis:

H_0: F_Y(t | z) ≤ F_X(t | z) for all (t, z) ∈ R × 𝒵

against the alternative hypothesis:

H_1: F_Y(t | z) > F_X(t | z) for some (t, z) ∈ R × 𝒵

Here the set of target points, denoted 𝒵, is a finite set of values rather than the entire support of the conditioning variable Z, and the paper focuses on the case where 𝒵 consists of a small number of specific points.

Key Contributions

  • A Novel Test for CSD: The primary contribution of this paper is the introduction of a novel test statistic based on induced order statistics. The test compares empirical CDFs built from the observations closest to each target point (see the sketch after this list). Unlike traditional tests, our method requires neither kernel smoothing nor parametric assumptions on the conditional distributions, which keeps it computationally simple.
  • Asymptotic Properties: We establish the asymptotic validity of the proposed test, showing that it controls size in large samples under mild regularity conditions. This is significant because many existing methods assume continuous conditional distributions, whereas our approach accommodates finitely many discontinuities in the distributions.
  • Connection to Permutation-Based Inference: We show that, when the random variables Y and X are both continuous, the critical value for our test coincides with that of a permutation-based test, thereby establishing a formal connection between our method and the broader literature on rank-based inference.
  • Refinement for Discrete Data: For cases where Y or X is discrete, we introduce a refined critical value that enhances power, albeit with increased computational complexity.
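For readers who want a feel for the mechanics, here is a minimal Python sketch of a test in this spirit. It is not the exact construction in the paper: the data layout (i.i.d. triples (Y, X, Z)), the choice of the q nearest neighbors, and the within-pair permutation scheme are all simplifying assumptions on my part.

```python
import numpy as np

def csd_test_stat(y, x, z, targets, q):
    """Sup-type statistic: at each target point, compare local empirical
    CDFs of Y and X built from the q observations nearest in Z."""
    stat = -np.inf
    for z0 in targets:
        idx = np.argsort(np.abs(z - z0))[:q]       # induced order: neighbors in Z
        yl, xl = y[idx], x[idx]
        grid = np.concatenate([yl, xl])            # evaluate CDFs on pooled points
        Fy = (yl[None, :] <= grid[:, None]).mean(axis=1)
        Fx = (xl[None, :] <= grid[:, None]).mean(axis=1)
        stat = max(stat, float(np.max(Fy - Fx)))   # violations of F_Y <= F_X
    return stat

def csd_test(y, x, z, targets, q, alpha=0.05, B=999, seed=None):
    """Permutation critical value: swap the (Y, X) labels within pairs,
    a natural scheme under the least favorable null F_Y = F_X."""
    rng = np.random.default_rng(seed)
    T0 = csd_test_stat(y, x, z, targets, q)
    perm = np.empty(B)
    for b in range(B):
        swap = rng.integers(0, 2, size=len(y)).astype(bool)
        yb = np.where(swap, x, y)
        xb = np.where(swap, y, x)
        perm[b] = csd_test_stat(yb, xb, z, targets, q)
    crit = float(np.quantile(perm, 1 - alpha))
    return T0, crit, T0 > crit
```

The connection noted above is why a permutation critical value is natural here: under the least favorable null with F_Y = F_X, the (Y, X) labels are exchangeable.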

What’s Next?

The remainder of the paper explores the theory, provides extensions, and discusses practical implementation through Monte Carlo simulations. The methods developed here have implications for fields ranging from economics to political science and public policy, offering a robust and computationally efficient approach to testing stochastic dominance. If this sounds interesting, please read the paper.

Treatment effects & delayed outcomes

Federico Bugni, Steve McBride (at ROBLOX), and I have a new paper called “Decomposition and Interpretation of Treatment Effects in Settings with Delayed Outcomes”. It studies settings where the analyst is interested in identifying and estimating the average causal effect of a binary treatment on an outcome. We consider a setup in which the outcome is not realized immediately after treatment assignment, a feature that is ubiquitous in empirical settings. The period between the treatment and the realization of the outcome allows other observed actions to occur and affect the outcome (see Figure 1 in the paper). In this context, we study several regression-based estimands routinely used in empirical work to capture the average treatment effect and shed light on how to interpret them in terms of ceteris paribus effects, indirect causal effects, and selection terms.

We obtain three main and related takeaways. First, the three most popular estimands do not generally satisfy what we call strong sign preservation, in the sense that these estimands may be negative even when the treatment positively affects the outcome conditional on any possible combination of the other actions. Second, the most popular regression, which includes the other actions as controls, satisfies strong sign preservation if and only if these actions are mutually exclusive binary variables. Finally, we show that a linear regression that fully stratifies the other actions leads to estimands that satisfy strong sign preservation.
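To make the objects concrete, here is a minimal Python sketch that computes the three regression estimands discussed above on simulated data. The data-generating process is entirely hypothetical and chosen only for illustration; the paper's decomposition results characterize exactly when each estimand preserves the sign of the underlying effects.

```python
import numpy as np

def ols(cols, y):
    """OLS coefficients for the design matrix formed by stacking cols."""
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta

rng = np.random.default_rng(0)
n = 100_000
d = rng.integers(0, 2, n)                        # randomized binary treatment
u = rng.integers(0, 2, n)                        # unobserved type (drives selection)
m = ((rng.random(n) < 0.2 + 0.5 * d) & (u == 1)).astype(float)  # post-treatment action
y = 1.0 * d + 1.0 * m - 4.0 * u + rng.normal(size=n)

ones = np.ones(n)
b_short = ols([ones, d], y)[1]                   # short regression: Y on D
b_ctrl = ols([ones, d, m], y)[1]                 # Y on D controlling for the action M
b_strat = [ols([np.ones(int((m == k).sum())), d[m == k]], y[m == k])[1]
           for k in (0.0, 1.0)]                  # fully stratified: Y on D within each M stratum
print(b_short, b_ctrl, b_strat)
```

In this toy DGP the action is taken mostly by an unobserved type, so specifications that condition on M mix the ceteris paribus effect of D with selection terms, which is precisely the decomposition the paper studies.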

Non-Ignorable Clusters and CAR

In collaboration with Federico Bugni, Azeem Shaikh, and Max Tabord-Meehan, we have released a new paper that considers the problem of inference in cluster randomized experiments when cluster sizes are non-ignorable. Here, by a cluster randomized experiment, we mean one in which treatment is assigned at the level of the cluster; by non-ignorable cluster sizes we mean that “large” clusters and “small” clusters may be heterogeneous, and, in particular, the effects of the treatment may vary across clusters of differing sizes.
In order to permit this sort of flexibility, we consider a sampling framework in which cluster sizes themselves are random. In this way, our analysis departs from earlier analyses of cluster randomized experiments in which cluster sizes are treated as non-random. We distinguish between two different parameters of interest: the equally-weighted cluster-level average treatment effect, and the size-weighted cluster-level average treatment effect. For each parameter, we provide methods for inference in an asymptotic framework where the number of clusters tends to infinity and treatment is assigned using a covariate-adaptive stratified randomization procedure. We additionally permit the experimenter to sample only a subset of the units within each cluster rather than the entire cluster and demonstrate the implications of such sampling for some commonly used estimators. A small simulation study and empirical demonstration show the practical relevance of our theoretical results. You can download the paper here.
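As a quick illustration of the two parameters, here is a minimal Python sketch of the natural difference-in-means estimators at the cluster level. The function name and data layout are my own; the paper develops the corresponding inference procedures under covariate-adaptive stratified randomization.

```python
import numpy as np

def cluster_level_ates(cluster_means, cluster_sizes, treated):
    """Difference-in-means estimators of the equally-weighted and the
    size-weighted cluster-level average treatment effects.

    cluster_means : average outcome among (possibly subsampled) units in each cluster
    cluster_sizes : number of units in each cluster
    treated       : 1 if the cluster was assigned to treatment, 0 otherwise
    """
    t = treated.astype(bool)
    eq = cluster_means[t].mean() - cluster_means[~t].mean()
    sw = (np.average(cluster_means[t], weights=cluster_sizes[t])
          - np.average(cluster_means[~t], weights=cluster_sizes[~t]))
    return eq, sw
```

When cluster sizes are non-ignorable, these two estimands can differ substantially, which is why the paper treats them as distinct parameters of interest.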

Guide to moment inequalities

Models defined by moment inequalities have become a standard modeling framework for empirical economists, spanning a wide range of fields within economics. From the point of view of an empirical researcher, the literature on inference in moment inequality models is large and complex, including multiple survey papers that document the non-standard features these models possess, the main novel concepts behind inference in these models, and the most recent developments that bring advances in accuracy and computational tractability. In a recent paper with my colleague Gaston Illanes and my student Amilcar Velez, we present a guide to empirical practice intended to help applied researchers navigate all the decisions required to frame a model as a moment inequality model and then to construct confidence intervals for the parameters of interest. We divide our template into four main steps: (a) a behavioral decision model, (b) moving from the decision model to a moment inequality model, (c) choosing a test statistic and critical value, and (d) accounting for computational challenges. We split each of these steps into a discussion of the “how” and the “why”, and then illustrate how to put these steps into practice in an empirical application that studies identification of the expected sunk costs of offering a product in a market. A Github repository with all the code needed to implement our recommendations in R, Matlab, and Python will be available soon.
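To give a flavor of step (c), here is a minimal Python sketch of one standard choice: a max studentized statistic with a least-favorable, recentered bootstrap critical value. This is a textbook construction, not the specific recommendation in our guide, and the function name is mine.

```python
import numpy as np

def mi_test(m, alpha=0.05, B=999, seed=None):
    """Test H0: E[m_j] <= 0 for all j, where m is the n x J matrix of moment
    functions evaluated at a candidate parameter value.  Uses the max
    studentized statistic with a recentered (least favorable) bootstrap."""
    rng = np.random.default_rng(seed)
    n, J = m.shape
    mbar = m.mean(axis=0)
    s = m.std(axis=0, ddof=1)                      # assumes non-degenerate moments
    T = np.sqrt(n) * np.max(mbar / s)
    Tb = np.empty(B)
    for b in range(B):
        mb = m[rng.integers(0, n, size=n)]         # nonparametric bootstrap draw
        Tb[b] = np.sqrt(n) * np.max((mb.mean(axis=0) - mbar) / mb.std(axis=0, ddof=1))
    crit = float(np.quantile(Tb, 1 - alpha))
    return T, crit, T > crit
```

A confidence set for the parameter then collects all candidate values whose statistic falls below the critical value, which is where the computational challenges of step (d) enter.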

On Outcome Tests for Detecting Bias

The paper On the Use of Outcome Tests for Detecting Bias in Decision Making, joint with Magne Mogstad and Jack Mountjoy, is now available. This paper starts with the observation that the decisions of judges, lenders, journal editors, and other gatekeepers often lead to disparities in outcomes across affected groups. An important question is whether, and to what extent, these group-level disparities are driven by relevant differences in underlying individual characteristics, or by biased decision makers. Becker (1957) proposed an outcome test for bias, leading to a large body of related empirical work, with recent innovations in settings where decision makers are exogenously assigned to cases and vary in their decision tendencies. We carefully examine what can be learned about bias in decision making in such settings. Our results call into question recent conclusions about racial bias among bail judges, and, more broadly, yield four lessons for researchers considering the use of outcome tests of bias.

First, the so-called generalized Roy model, which is a workhorse of applied economics, does not deliver a logically valid outcome test without further restrictions, since it does not require an unbiased decision maker to equalize marginal outcomes across groups. Second, the more restrictive “extended” Roy model, which isolates potential outcomes as the sole admissible source of analyst-unobserved variation driving decisions, delivers both a logically valid and econometrically viable outcome test. Third, this extended Roy model places strong restrictions on behavior and the data generating process, so detailed institutional knowledge is essential for justifying such restrictions. Finally, because the extended Roy model imposes restrictions beyond those required to identify marginal outcomes across groups, it has testable implications that may help assess its suitability across empirical settings.

A few days after our paper became public, the authors of the paper “Racial Bias in Bail Decisions,” The Quarterly Journal of Economics 133.4 (November 2018): 1885-1932, wrote a correction appendix to their paper and a note with comments on our paper. You can find both files on the authors’ websites or appended to the end of the reply we discuss below. We found these comments unclear, so we wrote the reply linked below to help the interested reader understand both sides of the argument:

Reply to “Comment on Canay, Mogstad, and Mountjoy (2020)” by Arnold, Dobbie, and Yang (ADY).

We divide the arguments into three points. First, we do not mischaracterize the definition of racial bias in the published version of ADY. If the authors wrote the published definition but actually meant a substantially different one (such as the definition that now appears in the new “Correction Appendix,” also appended to this reply), then the mischaracterization is theirs, not ours. Second, focusing on clear-cut cases of (un)biased behavior is a feature of our argument, not a bug. The point is that even in the starkest, most unambiguous cases of unbiased and biased behavior, the outcome test can deliver the wrong conclusion. This logical invalidity of the outcome test also extends to intermediate cases where judges are biased against some defendants but not others. Third, to restore the logical validity of the outcome test, instead of invoking a decision model that justifies the test, ADY choose to redefine racial bias. Problematically, their substantial post-publication change in the definition of (un)biased judge behavior matters greatly for the interpretation and implications of their findings. The new definition is reverse-engineered, difficult to justify, and at odds not only with the work by Becker that ADY cite frequently, but also with more recent work by a subset of the authors of ADY.

New paper and Stata package for continuity in RDD

In the regression discontinuity design (RDD), it is common practice to assess the credibility of the design by testing the continuity of the density of the running variable at the cut-off, e.g., McCrary (2008). In joint work with Federico Bugni, we propose a new test for continuity of a density at a point based on the so-called g-order statistics, and study its properties under a novel asymptotic framework. The asymptotic framework is intended to approximate a small sample phenomenon: even though the total number n of observations may be large, the number of effective observations local to the cut-off is often small. Thus, while traditional asymptotics in RDD require a growing number of observations local to the cut-off as n grows, our framework allows for the number q of observations local to the cut-off to be fixed as n grows. The new test is easy to implement, asymptotically valid under weaker conditions than those used by competing methods, exhibits finite sample validity under stronger conditions than those needed for its asymptotic validity, and has favorable power properties against certain alternatives. You can find a copy of the paper here.
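For intuition, here is a minimal Python sketch of a sign-test-style check in this spirit: if the density of the running variable is continuous at the cut-off, then among the q observations closest to it, the number falling to the right should behave approximately like a Binomial(q, 1/2) draw. This is a simplified illustration rather than the exact procedure in the paper, and the function name and defaults are mine.

```python
import numpy as np
from scipy.stats import binom

def rd_continuity_sign_test(running, cutoff=0.0, q=30):
    """Among the q observations closest to the cutoff, count how many lie
    to the right and compare against Binomial(q, 1/2)."""
    x = np.asarray(running, dtype=float) - cutoff
    nearest = x[np.argsort(np.abs(x))[:q]]
    k = int((nearest >= 0).sum())
    # two-sided p-value for H0: P(right of cutoff) = 1/2 among the q nearest
    p = 2 * min(binom.cdf(k, q, 0.5), binom.sf(k - 1, q, 0.5))
    return k, min(p, 1.0)
```

Note how q stays fixed in this construction, matching the fixed-q asymptotic framework described above.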

We have also finished the first version of a Stata package that implements the new test we propose. You can download the package from the Bitbucket repository (Rdcont), which includes the ado file with an example of how to use it. Visit the software page here for additional Stata and R packages.

Wild Bootstrap and Few Clusters

We just finished a paper, joint with Azeem Shaikh and Andres Santos, on the formal properties of the Wild Cluster Bootstrap when the data contains few, but large, clusters [See paper here].

Cameron et al. (2008) provide simulations that suggest the wild bootstrap test works well even in settings with as few as five clusters, but existing theoretical analyses of its properties all rely on an asymptotic framework in which the number of clusters is “large.”

In contrast to these analyses, we employ an asymptotic framework in which the number of clusters is “small,” but the number of observations per cluster is “large.” In this framework, we provide conditions under which the limiting rejection probability of an un-Studentized version of the test does not exceed the nominal level. Importantly, these conditions require, among other things, certain homogeneity restrictions on the distribution of covariates. The practical relevance of these conditions in finite samples is confirmed via a small simulation study. In addition, our results can help explain the remarkable behavior of these tests in the simulations of Cameron et al. (2008). It follows from our results that when these conditions are implausible and there are few clusters, researchers may wish to consider methods that do not impose such conditions, such as Ibragimov and Muller (2010) and Canay, Romano, and Shaikh (2017).
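For concreteness, here is a minimal Python sketch of the wild cluster bootstrap for testing that a single regression coefficient is zero, with the null imposed and cluster-level Rademacher weights. It uses the un-Studentized statistic discussed above; the function name and data layout are my own simplifications, and numeric cluster labels are assumed.

```python
import numpy as np

def wild_cluster_bootstrap(y, X, cluster, k, B=999, seed=None):
    """Wild cluster bootstrap p-value for H0: beta_k = 0, with cluster-level
    Rademacher weights and the null imposed (restricted residuals)."""
    rng = np.random.default_rng(seed)
    ids = np.unique(cluster)                         # sorted cluster labels
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    stat = abs(beta[k])                              # un-Studentized statistic
    Xr = np.delete(X, k, axis=1)                     # restricted model under H0
    br, *_ = np.linalg.lstsq(Xr, y, rcond=None)
    fit_r, e_r = Xr @ br, y - Xr @ br
    pos = np.searchsorted(ids, cluster)              # map each obs to its cluster
    count = 0
    for _ in range(B):
        w = rng.choice([-1.0, 1.0], size=len(ids))   # one sign flip per cluster
        ystar = fit_r + e_r * w[pos]
        bstar, *_ = np.linalg.lstsq(X, ystar, rcond=None)
        count += abs(bstar[k]) >= stat
    return (1 + count) / (1 + B)
```

In the few-cluster framework of the paper, whether this test controls size hinges on the homogeneity restrictions on the covariates mentioned above.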

Paper for the 11th World Congress

We have finished the paper that was prepared for an invited talk at the 11th World Congress of the Econometric Society in Montreal, Canada. A PDF copy of the paper is available here and the slides of the talk are here. This paper surveys some of the recent literature on inference in partially identified models. After reviewing some basic concepts, including the definition of a partially identified model and the identified set, we turn our attention to the construction of confidence regions in partially identified settings. In our discussion, we emphasize the importance of requiring confidence regions to be uniformly consistent in level over relevant classes of distributions. Due to space limitations, our survey is mainly limited to the class of partially identified models in which the identified set is characterized by a finite number of moment inequalities, or the closely related class of partially identified models in which the identified set is a function of such a set. The latter class of models most commonly arises when interest focuses on a subvector of a vector-valued parameter whose values are limited by a finite number of moment inequalities. We then briefly review some important parts of the broader literature on inference in partially identified models and conclude by providing some thoughts on fruitful directions for future research.

Inference under Covariate-Adaptive Randomization

The paper Inference under Covariate-Adaptive Randomization, joint with Federico Bugni and Azeem Shaikh, is now available. This paper studies inference for the average treatment effect in randomized controlled trials with covariate-adaptive randomization. Here, by covariate-adaptive randomization, we mean randomization schemes that first stratify according to baseline covariates and then assign treatment status so as to achieve “balance” within each stratum. Such schemes include, for example, Efron’s biased-coin design and stratified block randomization. When testing the null hypothesis that the average treatment effect equals a pre-specified value in such settings, we first show that the usual two-sample t-test is conservative in the sense that it has limiting rejection probability under the null hypothesis no greater than, and typically strictly less than, the nominal level. In a simulation study, we find that the rejection probability may in fact be dramatically less than the nominal level.

We show further that these same conclusions remain true for a naïve permutation test, but that a modified version of the permutation test yields a test that is non-conservative in the sense that its limiting rejection probability under the null hypothesis equals the nominal level. The modified version of the permutation test has the additional advantage that it has rejection probability exactly equal to the nominal level for some distributions satisfying the null hypothesis. Finally, we show that the usual t-test (on the coefficient on treatment assignment) in a linear regression of outcomes on treatment assignment and indicators for each of the strata yields a non-conservative test as well. In a simulation study, we find that the non-conservative tests have substantially greater power than the usual two-sample t-test.
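As a concrete illustration of the last result, here is a minimal Python sketch of the strata-fixed-effects regression test: regress outcomes on the treatment indicator and a full set of stratum dummies, then apply the usual t-test to the treatment coefficient. The function name and implementation details are mine, not the paper's, and the inputs are assumed to be numpy arrays.

```python
import numpy as np
from scipy import stats

def strata_fe_ttest(y, d, strata):
    """Regress Y on the treatment indicator and a full set of stratum
    dummies; return the usual t-statistic and p-value for the treatment
    coefficient."""
    levels = np.unique(strata)
    S = (strata[:, None] == levels[None, :]).astype(float)  # stratum dummies
    X = np.column_stack([d, S])                   # dummies absorb the intercept
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)              # usual homoskedastic variance
    se = np.sqrt(sigma2 * np.linalg.inv(X.T @ X)[0, 0])
    t = beta[0] / se
    return t, 2 * stats.t.sf(abs(t), n - p)
```

Unlike the two-sample t-test, this strata-adjusted test is non-conservative under covariate-adaptive randomization in the sense described above.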