Neyman-Rubin Potential Outcomes Framework

In the potential outcomes framework we assume that there is a population of units, typically denoted by i, and that each unit can potentially experience two outcomes:

  • If the unit is assigned to treatment, then it would have an outcome $Y_t(i)$
  • If the unit is assigned to control, then it would have an outcome $Y_c(i)$

It is common to write either $Y_t(i)/Y_c(i)$ or $Y_1(i)/Y_0(i)$. In practice, we can only ever observe one outcome for each unit; it is impossible to observe both. This is the fundamental problem of causal inference.
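To make the fundamental problem concrete, here is a small sketch using made-up potential outcomes (the column names and numbers are purely illustrative); in reality only one of the two potential-outcome columns could ever be filled in for each unit:

```python
import numpy as np
import pandas as pd

# Hypothetical potential outcomes for 4 units. Both columns are never
# observable together in reality; this table is purely illustrative.
df = pd.DataFrame({
    "unit": [1, 2, 3, 4],
    "treated": [1, 0, 1, 0],        # actual treatment assignment
    "y_t": [10.0, 8.0, 9.0, 7.0],   # potential outcome under treatment
    "y_c": [7.0, 6.0, 9.5, 5.0],    # potential outcome under control
})

# What we actually get to see: the factual outcome only.
df["y_observed"] = np.where(df["treated"] == 1, df["y_t"], df["y_c"])

print(df[["unit", "treated", "y_observed"]])
```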

Unit-level Treatment Effect

For any unit $i$ in the population, its treatment effect is the difference between the outcome under treatment and the outcome under control, $TE(i) = y_t(i) - y_c(i)$. For $TE(i)$ to be a proper treatment effect we require some assumptions:

  1. There must be a non-zero probability that unit $i$ is assigned to either treatment or control. If neither assignment is possible, then we are not in the Potential Outcomes Framework and it makes no sense to talk about a Treatment Effect within the Rubin causal model.
  2. All variables other than the treatment assignment are either held constant or irrelevant.

The second assumption is very important: if there are variables that are not held constant between the two “hypothetical” scenarios and these variables are not irrelevant, then these variables are confounders. The first assumption is related to the well-known slogan no causation without manipulation: a treatment effect is only well defined when the treatment could, at least in principle, be assigned to the unit.

Average Treatment Effect (ATE)

If we could observe both alternative futures for each unit, then we would be able to compute the Average Treatment Effect $$ATE = \frac{1}{N_{population}} \sum_{i=1}^{N_{population}} \big( y_t(i) - y_c(i) \big),$$ where $N_{population}$ is the total number of units in the population. ATE is simply the average of all unit-level treatment effects in the population. A typical assumption of the Rubin Causal Model is known as the SUTVA assumption: $y(i)$ is unaffected by the mechanism used to assign the treatment, and it is also unaffected by the treatment assignment of other units.

Of course, this is impossible to compute; however, it can be estimated under certain conditions, as we will see later.

Conditional Average Treatment Effect (CATE)

If we divide the population into subgroups (e.g. by gender, age, etc.) and compute ATE for these groups, then we obtain the Conditional ATE $$CATE(subgroup) = \frac{1}{N_{subgroup}} \sum_{i=1}^{N_{subgroup}} \big( y_t(i) - y_c(i) \big),$$ where $N_{subgroup}$ is the number of units in the subgroup. As mentioned earlier, the Rubin causal model relies on the SUTVA assumption. Dividing the population into subgroups and computing CATE is also a useful diagnostic: if CATE differs across subgroups, the treatment effect is not constant across the population and we say that there are heterogeneous treatment effects.
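As a toy illustration, the following sketch simulates a population where both potential outcomes are known by construction (something that is never possible with real data) and the true effect differs between two hypothetical subgroups, so the subgroup CATEs differ from the overall ATE:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 10_000

# Simulated population: the true effect depends on the subgroup,
# i.e. treatment effects are heterogeneous by construction.
subgroup = rng.choice(["A", "B"], size=n)
y_c = rng.normal(loc=10.0, scale=1.0, size=n)
true_effect = np.where(subgroup == "A", 2.0, 0.5)
y_t = y_c + true_effect

df = pd.DataFrame({"subgroup": subgroup, "y_t": y_t, "y_c": y_c})

# ATE over the whole (simulated) population and CATE per subgroup.
print("ATE:", (df["y_t"] - df["y_c"]).mean())
print(df.assign(te=df["y_t"] - df["y_c"]).groupby("subgroup")["te"].mean())
```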

Factuals and Counterfactuals

In practice, we only observe one outcome for each unit. Suppose that unit $i$ gets assigned to the treatment; then we only observe $Y_t(i)$. The observed outcome $y_t(i)$ is the factual outcome, whereas the outcome that we do not (and never will) observe is the counterfactual outcome.

Estimating ATE

  1. In a Randomized Controlled Trial (RCT), units are randomly assigned to treatment or control. In this case estimating ATE is possible by simply taking the difference of the means for treatment and control units, $\widehat{ATE} = \left( \frac{1}{N_t} \sum_{i=1}^{N_t} y_t(i) \right) - \left( \frac{1}{N_c} \sum_{i=1}^{N_c} y_c(i) \right)$ (see the sketch below).
  2. With observational data, if one can show (through various techniques) that the data behaves as if it came from an RCT, then the same estimator can be used.

Otherwise, it is harder to estimate ATE and there are many techniques to do so.
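For case 1, here is a minimal sketch of the difference-of-means estimator on a simulated RCT; the variable names and the data-generating process are assumptions made only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 5_000

# Simulated RCT: assignment is random, so treated and control groups
# are comparable and the difference of means is unbiased for ATE.
w = rng.integers(0, 2, size=n)          # random treatment assignment
y = 1.0 + 2.0 * w + rng.normal(size=n)  # true ATE is 2.0 by construction

ate_hat = y[w == 1].mean() - y[w == 0].mean()
print(f"Estimated ATE: {ate_hat:.3f}")   # should be close to 2.0
```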

Propensity Score Estimation

  1. Propensity Estimation: Use logistic regression to estimate each unit's probability of receiving the treatment given its covariates (the propensity score).
  2. Matching: Match each unit in the treatment group to one (or more) unit(s) in the control group. There are multiple ways of doing this, one is based on Nearest Neighbors.
  3. Stratify: Stratify the propensity scores into groups (e.g. [0.0,0.33),[0.33,0.66),[0.66,1.0)) and check that covariates are balanced within these strata.
  4. Estimate ATE: Estimate ATE using a weighted mean within each stratum (a sketch of this workflow follows the list).
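Below is a minimal sketch of steps 1, 3 and 4 on simulated observational data, using scikit-learn's LogisticRegression for the propensity model; the data-generating process and all names are illustrative assumptions, and a real analysis would also check overlap and covariate balance within strata:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 20_000

# Simulated observational data: the treatment probability depends on a
# confounder x, which also affects the outcome. The true ATE is 2.0.
x = rng.normal(size=n)
w = rng.binomial(1, 1.0 / (1.0 + np.exp(-x)))   # confounded treatment assignment
y = 2.0 * w + 1.5 * x + rng.normal(size=n)

# Step 1: estimate propensity scores with logistic regression.
e_hat = LogisticRegression().fit(x.reshape(-1, 1), w).predict_proba(x.reshape(-1, 1))[:, 1]

# Step 3: stratify units by estimated propensity score.
df = pd.DataFrame({"y": y, "w": w})
df["stratum"] = pd.cut(e_hat, bins=[0.0, 0.33, 0.66, 1.0])

# Step 4: difference of means within each stratum, averaged with weights
# proportional to the stratum sizes.
treated_mean = df[df.w == 1].groupby("stratum", observed=True)["y"].mean()
control_mean = df[df.w == 0].groupby("stratum", observed=True)["y"].mean()
effects = treated_mean - control_mean
weights = df["stratum"].value_counts(normalize=True)

naive = y[w == 1].mean() - y[w == 0].mean()
print(f"Naive difference of means: {naive:.2f}")                      # biased by confounding
print(f"Stratified ATE estimate:   {(effects * weights).sum():.2f}")  # closer to the true ATE of 2.0
```

With only three coarse strata some residual confounding can remain; finer strata (or matching on the propensity score) typically reduce it further.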

R-Learning

Suppose that we have $n$ units available, each with an associated $p$-dimensional vector of covariates $x_i$, which we store in an $n \times p$ matrix $X$. CATE, also known as treatment effect heterogeneity, can be defined as $$\tau(x) := \mathbb{E}\left[ Y(1) - Y(0) \mid X = x \right],$$ where $Y(1)$ is the potential outcome under treatment and $Y(0)$ the potential outcome under control. One tries to learn the function $\tau(x)$ in a variety of ways. A popular approach is to use meta-learners such as the S-, T- and R-learners. They are called meta-learners because they reduce the problem to standard regression tasks and are not tied to a specific model. Compared to the other approaches, R-learning tries to learn $\tau$ directly.

Treatment propensities and conditional response surfaces are defined respectively as
$$e(x) = P(W = 1 \mid X = x), \qquad \mu_{(w)}(x) = \mathbb{E}\left[ Y(w) \mid X = x \right],$$
where we assume we have access to i.i.d. and unconfounded observations $(X_i, W_i, Y_i)$, with $X_i$ the covariates, $W_i$ the treatment assignment and $Y_i$ the outcome. The propensity is just the distribution of the treatment assignment conditional on the covariates. The two conditional response surfaces $\mu_{(1)}(x)$ and $\mu_{(0)}(x)$ are the conditional expectations of the potential outcomes under treatment and control respectively, given $X = x$, so we can write the CATE function as
$$\tau(x) = \mu_{(1)}(x) - \mu_{(0)}(x).$$
We also define observation noises
$$\epsilon_i(w) = Y_i(w) - \big( \mu_{(0)}(X_i) + w\, \tau(X_i) \big).$$
Notice that $w \in \{0, 1\}$, so when $w = 0$ (control) we have $\epsilon_i(0) = Y_i(0) - \mu_{(0)}(X_i)$, i.e. the noise is the difference between the potential outcome for $X_i$ under control and the expected potential outcome under control, given $X_i$. When $w = 1$ (treatment) we get $\epsilon_i(1) = Y_i(1) - \mu_{(1)}(X_i)$, which is the same but for the treatment scenario. We also define the conditional mean outcome $m(x) = \mathbb{E}[Y \mid X = x]$ and we notice that by definition (and using unconfoundedness) this can be written as follows:
$$m(x) = \mathbb{E}[Y \mid X = x] = \sum_{w \in \{0,1\}} \mu_{(w)}(x)\, P(W = w \mid X = x) = \mu_{(1)}(x)\, e(x) + \mu_{(0)}(x)\, \big(1 - e(x)\big) = \mu_{(0)}(x) + e(x)\, \tau(x).$$
We then define the residuals as the difference between the observed outcome and the conditional mean outcome, $r_i = Y_i - m(X_i)$. By using the definition of the observation noise and the derivation above we can write
$$r_i(W_i) = \epsilon_i(W_i) + \big( \mu_{(0)}(X_i) + W_i\, \tau(X_i) \big) - \big( \mu_{(0)}(X_i) + e(X_i)\, \tau(X_i) \big) = \big( W_i - e(X_i) \big)\, \tau(X_i) + \epsilon_i(W_i).$$
The key to R-learning is that the CATE function satisfying the expression above can equivalently be written as the solution of the optimization problem
$$\tau(\cdot) = \operatorname{argmin}_{\tau} \; \mathbb{E}\Big[ \big( (Y_i - m(X_i)) - (W_i - e(X_i))\, \tau(X_i) \big)^2 \Big],$$
that is, the expected squared distance between $Y_i - m(X_i)$ (the difference between the observed outcome and the conditional mean outcome) and $(W_i - e(X_i))\, \tau(X_i)$. The idea is that if one had access to the propensities and the conditional mean outcome, then one could estimate the above by empirical loss minimization after adding a regularizer $\Lambda_n(\tau)$.

R-learning then divides the data into folds (cross-fitting) and estimates $\hat{m}$ and $\hat{e}$ for each fold, typically using only the data from the other folds so that the nuisance estimates are out-of-fold; these estimates are then plugged into the empirical loss minimization version of the optimization problem above.

Suppose we have indeed estimated $\hat{m}$ and $\hat{e}$, and for now suppose that we are working on a single fold of the data. The R-learning objective becomes $$\min_{\tau} \; \frac{1}{n} \sum_{i=1}^{n} \big( r_i - \alpha_i\, \tau(x_i) \big)^2 + \Lambda(\tau),$$ where

  • $r_i = y_i - \hat{m}_i$ is the difference between the observed outcome and the estimated conditional mean outcome for $x_i$
  • $\alpha_i = w_i - \hat{e}_i$ is the difference between the treatment assignment and the estimated propensity for $x_i$. Typically the matrix $X$ will have already undergone some feature learning/transformation, so one can assume a linear treatment effect $\tau(x_i) = B x_i$ (see the sketch below).
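The following is a minimal sketch of this recipe on simulated data. The choice of gradient boosting for $\hat{m}$, logistic regression for $\hat{e}$, cross-fitting via scikit-learn's cross_val_predict, and the linear form for $\tau$ are all illustrative assumptions rather than requirements of the method:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
n, p = 10_000, 3
X = rng.normal(size=(n, p))

# Simulated data with a heterogeneous treatment effect tau(x) = 1 + 2 * x_0
# and a confounded treatment assignment.
e = 1.0 / (1.0 + np.exp(-X[:, 0]))                 # true propensity
w = rng.binomial(1, e)
tau = 1.0 + 2.0 * X[:, 0]
y = X[:, 1] + w * tau + rng.normal(size=n)

# Out-of-fold (cross-fitted) nuisance estimates m_hat and e_hat.
m_hat = cross_val_predict(GradientBoostingRegressor(), X, y, cv=5)
e_hat = cross_val_predict(LogisticRegression(), X, w, cv=5, method="predict_proba")[:, 1]

# R-learner residuals, as in the objective above.
r = y - m_hat                # observed outcome minus conditional mean outcome
alpha = w - e_hat            # treatment assignment minus propensity

# Assume a linear effect tau(x) = b0 + B x: minimizing sum (r_i - alpha_i * tau(x_i))^2
# is ordinary least squares of r on alpha * [1, x] with no extra intercept.
design = alpha[:, None] * np.column_stack([np.ones(n), X])
fit = LinearRegression(fit_intercept=False).fit(design, r)
print("Estimated tau coefficients:", np.round(fit.coef_, 2))  # roughly [1, 2, 0, 0]
```

Regressing $r_i$ on $\alpha_i \cdot [1, x_i]$ without an intercept is just a convenient way to solve the least squares problem above for a linear $\tau$; any other regression method could be plugged in, possibly together with the regularizer $\Lambda(\tau)$.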