Fundamental Challenges in Causality
May 10, 2023
DML is a general framework for causal inference and estimation of causal parameters based on machine learning
Summarized in Chernozhukov et al. (2018)
Combines the strengths of machine learning and econometrics
Obtain a DML estimate of the causal parameter with asymptotically valid confidence intervals
DML estimator has good theoretical statistical properties like √N rate of convergence, unbiasedness, approximate normality
DML can be generalized to other causal models and settings (multiple treatments, heterogeneous treatment effects, … )
Partially linear regression model (PLR)
$Y = D\theta_0 + g_0(X) + \zeta, \quad E[\zeta \mid D, X] = 0,$
$D = m_0(X) + V, \quad E[V \mid X] = 0,$
with $Y$ the outcome variable, $D$ the treatment variable of interest, $X$ a vector of covariates, and $\zeta$, $V$ error terms
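A minimal simulation sketch of such a PLR data-generating process (the specific functional forms for $g_0$ and $m_0$ below are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def make_plr_data(n=1000, p=20, theta0=0.5, seed=42):
    """Simulate data from a partially linear regression (PLR) model:
    Y = D*theta0 + g0(X) + zeta,  D = m0(X) + V.
    The particular g0 and m0 below are illustrative choices only."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, p))
    g0 = np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2     # assumed nuisance function g0
    m0 = 0.5 * X[:, 0] + np.tanh(X[:, 2])         # assumed treatment regression m0
    D = m0 + rng.normal(size=n)                   # treatment with E[V|X] = 0
    Y = D * theta0 + g0 + rng.normal(size=n)      # outcome with E[zeta|D,X] = 0
    return X, D, Y

X, D, Y = make_plr_data()
```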
Inference is based on a score function $\psi(W;\theta,\eta)$ that satisfies the moment condition $E[\psi(W;\theta_0,\eta_0)] = 0$,
where $W := (Y, D, X, Z)$, with $\theta_0$ the unique solution, and where the score obeys the Neyman orthogonality condition $\partial_\eta E[\psi(W;\theta_0,\eta)]\big|_{\eta=\eta_0} = 0$.
$\partial_\eta$ denotes the pathwise (Gateaux) derivative operator
Neyman orthogonality ensures that the moment condition identifying $\theta_0$ is insensitive to small perturbations of the nuisance function $\eta$ around $\eta_0$
Using a Neyman-orthogonal score eliminates the first-order bias arising from replacing $\eta_0$ with an ML estimator $\hat{\eta}_0$
PLR example: partialling-out score function (cf. Section 5.3, Appendix) $\psi(\cdot) = \big(Y - E[Y \mid X] - \theta\,(D - E[D \mid X])\big)\,\big(D - E[D \mid X]\big)$
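Solving the moment condition $E[\psi(W;\theta,\eta_0)] = 0$ for $\theta$ makes the partialling-out logic explicit (a standard rearrangement, spelled out here for completeness, with $\tilde{Y} := Y - E[Y \mid X]$ and $\tilde{D} := D - E[D \mid X]$):
$\theta_0 = \dfrac{E[\tilde{Y}\,\tilde{D}]}{E[\tilde{D}^2]}$,
i.e., $\theta_0$ is the population coefficient of a regression of the outcome residual on the treatment residual.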
The nuisance parameters are estimated with high-quality (fast-enough converging) machine learning methods.
Different structural assumptions on $\eta_0$ lead to the use of different machine-learning tools for estimating $\eta_0$; see Chernozhukov et al. (2018), Section 3
Rate requirements depend on the causal model and the orthogonal score; e.g., in the PLR model it suffices that the nuisance estimators converge at rate $o(N^{-1/4})$ (see Chernozhukov et al. 2018)
To avoid the biases arising from overfitting, a form of sample splitting is used at the stage of producing the estimator of the main parameter θ0.
Efficiency gains by using cross-fitting (swapping the roles of the train and hold-out samples)
There exist regularity conditions such that the DML estimator $\tilde{\theta}_0$ concentrates in a $1/\sqrt{N}$-neighborhood of $\theta_0$ and the sampling error is approximately normal, $\sqrt{N}(\tilde{\theta}_0 - \theta_0) \sim N(0, \sigma^2)$, with $\sigma^2 := J_0^{-2}\, E[\psi^2(W;\theta_0,\eta_0)]$ and $J_0 = E[\psi_a(W;\eta_0)]$.
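A compact cross-fitting sketch for the PLR partialling-out score using scikit-learn (the learner choice and fold count are illustrative assumptions, not the paper's exact implementation):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.ensemble import RandomForestRegressor

def dml_plr(X, D, Y, n_folds=5, seed=123):
    """Cross-fitted DML estimate of theta_0 in the PLR model with the
    partialling-out score; returns the point estimate and a standard error."""
    res_y = np.zeros_like(Y, dtype=float)   # out-of-fold residuals Y - E[Y|X]
    res_d = np.zeros_like(D, dtype=float)   # out-of-fold residuals D - E[D|X]
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, test_idx in folds.split(X):
        # nuisance learners (illustrative choice; any ML regressor can be plugged in)
        ml_l = RandomForestRegressor(random_state=seed).fit(X[train_idx], Y[train_idx])
        ml_m = RandomForestRegressor(random_state=seed).fit(X[train_idx], D[train_idx])
        res_y[test_idx] = Y[test_idx] - ml_l.predict(X[test_idx])
        res_d[test_idx] = D[test_idx] - ml_m.predict(X[test_idx])
    # solve the empirical moment condition for theta
    theta_hat = np.sum(res_d * res_y) / np.sum(res_d ** 2)
    # plug-in variance estimate based on the score psi
    psi = (res_y - theta_hat * res_d) * res_d
    J0 = np.mean(-res_d ** 2)
    sigma2 = np.mean(psi ** 2) / J0 ** 2
    se = np.sqrt(sigma2 / len(Y))
    return theta_hat, se

# usage with the simulated PLR data from above
theta_hat, se = dml_plr(X, D, Y)
print(f"theta_hat = {theta_hat:.3f}, "
      f"95% CI = [{theta_hat - 1.96*se:.3f}, {theta_hat + 1.96*se:.3f}]")
```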
Which learner should be used for estimation of the nuisance parameter η0?
How should the hyperparameters of the learners be tuned? Which sample (splitting) to use for tuning?
Can AutoML frameworks be used? How well do they perform in practice?
Which causal model should be used?
How do more/less accurate estimators for $\eta_0$ affect the causal estimate $\hat{\theta}$?
Relationship of nuisance fit and estimation quality
Simulated example
Lasso learner for $\eta_0 = (\ell_0, m_0)$ evaluated over a grid of $\lambda = (\lambda_\ell, \lambda_m)$ values ($\ell_1$-penalty); a sketch of plugging in such learners follows below
Learners compared:
scikit-learn's cross-validated Lasso (Pedregosa et al. 2011)
XGBoost, tuned (Chen et al. 2015)
XGBoost, default
flaml (Wang et al. 2021)
[Figures: results for flaml, DGP 2 and flaml, all DGPs]
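All of these learners expose a scikit-learn-style fit/predict interface, so they can be plugged into the nuisance estimation step of the cross-fitting routine sketched earlier. A minimal, illustrative sketch of the package usage (not the study's exact configuration):

```python
from sklearn.linear_model import LassoCV
from xgboost import XGBRegressor
from flaml import AutoML

# cross-validated Lasso for a nuisance regression (here: E[Y|X])
ml_l = LassoCV(cv=5).fit(X, Y)

# XGBoost with default hyperparameters for E[D|X]
ml_m = XGBRegressor().fit(X, D)

# flaml AutoML searching over learners/hyperparameters under a time budget
automl = AutoML()
automl.fit(X_train=X, y_train=Y, task="regression", time_budget=60)
y_hat = automl.predict(X)
```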
Careful choice of learners and hyperparameter tuning is important for estimation quality for causal parameter θ0
Full-sample and on-fold tuning seem to outperform the split-data approach in terms of the quality of the estimate for $\theta_0$ (bias and MSE)
On-fold tuning is computationally more expensive
Lower (combined) first stage error found to be associated with better performance
Choice of causal model / score
Scale and summarize results, more benchmarks
Evaluation of additional strategies:
Refined hyperparameter tuning
Solution to regularization bias: Orthogonalization
Remember the Frisch-Waugh-Lovell (FWL) Theorem in a linear regression model
$Y = D\theta_0 + X'\beta + \varepsilon$
$\theta_0$ can be consistently estimated by partialling out $X$, i.e.,
OLS regression of $Y$ on $X$: $\tilde{\beta} = (X'X)^{-1}X'Y$ → residuals $\hat{\varepsilon}$
OLS regression of $D$ on $X$: $\tilde{\gamma} = (X'X)^{-1}X'D$ → residuals $\hat{\zeta}$
Final OLS regression of $\hat{\varepsilon}$ on $\hat{\zeta}$ yields the same coefficient on $D$ as the full regression (a numerical check follows below)
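A quick numerical check of the FWL logic (the simulated data and dimensions below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, theta0 = 500, 5, 0.5
X = rng.normal(size=(n, p))
D = X @ rng.normal(size=p) + rng.normal(size=n)
Y = D * theta0 + X @ rng.normal(size=p) + rng.normal(size=n)

# full OLS regression of Y on (D, X): coefficient on D
Z = np.column_stack([D, X])
theta_full = np.linalg.lstsq(Z, Y, rcond=None)[0][0]

# partialling out: residualize Y and D on X, then regress residual on residual
eps_hat = Y - X @ np.linalg.lstsq(X, Y, rcond=None)[0]
zeta_hat = D - X @ np.linalg.lstsq(X, D, rcond=None)[0]
theta_fwl = np.sum(zeta_hat * eps_hat) / np.sum(zeta_hat ** 2)

print(theta_full, theta_fwl)  # identical up to numerical precision (FWL)
```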
Orthogonalization: The idea of the FWL Theorem can be generalized to using ML estimators instead of OLS
Regression adjustment score:
$\psi(W;\theta_0,\eta) = (Y - D\theta_0 - g(X))\,D$, with $\eta = g(X)$, $\eta_0 = g_0(X)$
Neyman-orthogonal score (Frisch-Waugh-Lovell):
$\psi(W;\theta_0,\eta_0) = \big((Y - E[Y \mid X]) - (D - E[D \mid X])\,\theta_0\big)\,(D - E[D \mid X])$, with $\eta = (\ell(X), m(X))$, $\eta_0 = (\ell_0(X), m_0(X)) = (E[Y \mid X], E[D \mid X])$
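Spelling out the Gateaux derivatives makes the contrast explicit (a standard computation under the PLR assumptions):
For the regression adjustment score, perturbing the nuisance in a direction $\Delta g$ gives $\partial_r E\big[(Y - D\theta_0 - (g_0 + r\Delta g)(X))\,D\big]\big|_{r=0} = -E[D\,\Delta g(X)] = -E[m_0(X)\,\Delta g(X)]$, which is nonzero in general, so first-order errors in $\hat{g}_0$ feed directly into the moment condition.
For the orthogonal score, write $D - m_0(X) = V$ and $Y - \ell_0(X) = \theta_0 V + \zeta$; perturbing $\ell_0$ by $\Delta\ell$ yields $-E[\Delta\ell(X)\,V] = 0$ (since $E[V \mid X] = 0$), and perturbing $m_0$ by $\Delta m$ yields $\theta_0 E[\Delta m(X)\,V] - E[\Delta m(X)\,\zeta] = 0$ (since $E[\zeta \mid D, X] = 0$), so both Gateaux derivatives vanish and the Neyman orthogonality condition holds.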
[Figures: rel. MSE($\hat{\theta}$) and absolute bias of the estimators]
For a nontechnical introduction to DML: Bach et al. (2021)
Software implementation:
Paper draft to be uploaded to arXiv soon
Practical Aspects of DML