Fundamental Challenges in Causality
May 10, 2023
DML is a general framework for causal inference and estimation of causal parameters based on machine learning
Summarized in Chernozhukov et al. (2018)
Combines the strengths of machine learning and econometrics
Obtain a DML estimate of the causal parameter with asymptotically valid confidence intervals
DML estimator has good theoretical statistical properties like \(\sqrt{N}\) rate of convergence, unbiasedness, approximate normality
DML can be generalized to other causal models and settings (multiple treatments, heterogeneous treatment effects, \(\ldots\) )
Partially linear regression model (PLR)
\[\begin{align*} &Y = D \theta_0 + g_0(X) + \zeta, & &\mathbb{E}[\zeta | D,X] = 0, \\ &D = m_0(X) + V, & &\mathbb{E}[V | X] = 0, \end{align*}\]
with \(Y\) the outcome variable, \(D\) the treatment (policy) variable of interest, \(X\) a vector of confounding covariates, \(\theta_0\) the causal parameter of interest, and \(g_0\), \(m_0\) (possibly nonlinear) nuisance functions; a small simulation sketch follows below
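To fix ideas, here is a minimal simulation sketch of this PLR model; the particular choices of \(g_0\), \(m_0\), \(\theta_0\) and the dimensions are illustrative and not taken from the slides. The later sketches reuse these simulated data.

```python
import numpy as np

# Minimal simulation of the PLR model above; g_0, m_0, theta_0 and the
# dimensions are arbitrary illustrative choices, not taken from the slides.
rng = np.random.default_rng(42)
n, p = 1000, 20
theta_0 = 0.5                       # true causal parameter

def g_0(X):                         # nonlinear confounding in the outcome
    return np.sin(X[:, 0]) + 0.5 * X[:, 1] ** 2

def m_0(X):                         # nonlinear confounding in the treatment
    return np.tanh(X[:, 0]) + 0.25 * X[:, 2]

X = rng.normal(size=(n, p))
V = rng.normal(size=n)              # E[V | X] = 0
zeta = rng.normal(size=n)           # E[zeta | D, X] = 0

D = m_0(X) + V
Y = D * theta_0 + g_0(X) + zeta
```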
Inference is based on a score function \(\psi(W; \theta, \eta)\) that satisfies the moment condition \[\mathbb{E}[\psi(W; \theta_0, \eta_0)] = 0,\]
where \(W:=(Y,D,X,Z)\), \(\theta_0\) is the unique solution to this moment condition, and the score obeys the Neyman orthogonality condition \[\left.\partial_\eta \mathbb{E}[\psi(W; \theta_0, \eta)] \right|_{\eta=\eta_0} = 0.\]
\(\partial_{\eta}\) denotes the pathwise (Gateaux) derivative operator
Neyman orthogonality ensures that the moment condition identifying \(\theta_0\) is insensitive to small perturbations of the nuisance function \(\eta\) around \(\eta_0\)
Using a Neyman-orthogonal score eliminates the first-order bias arising from replacing \(\eta_0\) with an ML estimator \(\hat{\eta}_0\)
PLR example: Partialling-out score function (cf. Section 5.3, Appendix) \[\psi(\cdot)= (Y-\mathbb{E}[Y|X]-\theta (D - \mathbb{E}[D|X]))(D-\mathbb{E}[D|X])\]
The nuisance parameters are estimated with high-quality (fast-enough converging) machine learning methods.
Different structural assumptions on \(\eta_0\) lead to the use of different machine-learning tools for estimating \(\eta_0\) (see Chernozhukov et al. (2018), Section 3)
Rate requirements depend on the causal model and the orthogonal score; e.g., for the PLR model, nuisance estimators converging at a rate faster than \(N^{-1/4}\) are sufficient (see Chernozhukov et al. (2018))
To avoid the biases arising from overfitting, a form of sample splitting is used at the stage of producing the estimator of the main parameter \(\theta_0\).
Efficiency gains by using cross-fitting (swapping the roles of the train / hold-out samples)
Under regularity conditions, the DML estimator \(\tilde{\theta}_0\) concentrates in a \(1/\sqrt{N}\)-neighborhood of \(\theta_0\) and the sampling error is approximately normal, \[\sqrt{N}(\tilde{\theta}_0 - \theta_0) \sim N(0, \sigma^2),\] with \[\begin{align}\begin{aligned}\sigma^2 := J_0^{-2} \mathbb{E}(\psi^2(W; \theta_0, \eta_0)),\\J_0 = \mathbb{E}(\psi_a(W; \eta_0)).\end{aligned}\end{align}\] A minimal cross-fitting sketch for the PLR model follows below.
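The following is a minimal sketch of a cross-fitted, partialling-out DML estimator for the PLR model, reusing the simulated \(X\), \(D\), \(Y\) from the sketch above. The learner (scikit-learn's LassoCV), the number of folds, and the helper name dml_plr_partialling_out are illustrative choices, not the implementation referenced in the slides.

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LassoCV
from sklearn.model_selection import KFold

# Cross-fitted partialling-out DML for the PLR model; reuses the simulated
# X, D, Y from the sketch above. The learner (LassoCV) is an illustrative
# choice and can be swapped via make_learner.
def dml_plr_partialling_out(X, D, Y, make_learner=lambda: LassoCV(cv=5),
                            n_folds=5, seed=0):
    n = len(Y)
    y_hat = np.zeros(n)   # cross-fitted estimates of E[Y|X]
    d_hat = np.zeros(n)   # cross-fitted estimates of E[D|X]
    folds = KFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train, test in folds.split(X):
        y_hat[test] = make_learner().fit(X[train], Y[train]).predict(X[test])
        d_hat[test] = make_learner().fit(X[train], D[train]).predict(X[test])

    u, v = Y - y_hat, D - d_hat                  # residuals
    theta_hat = np.sum(v * u) / np.sum(v * v)    # solves the empirical moment condition

    # sigma^2 = J_0^{-2} E[psi^2] with psi_a = -v^2, hence J_0 = -E[v^2]
    psi = (u - v * theta_hat) * v
    J0 = -np.mean(v * v)
    sigma2 = np.mean(psi ** 2) / J0 ** 2
    se = np.sqrt(sigma2 / n)
    ci = theta_hat + np.array([-1.0, 1.0]) * stats.norm.ppf(0.975) * se
    return theta_hat, se, ci

theta_hat, se, ci = dml_plr_partialling_out(X, D, Y)
print(f"theta_hat = {theta_hat:.3f} (true {theta_0}), SE = {se:.3f}, 95% CI = {ci}")
```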
Which learner should be used for estimation of the nuisance parameter \(\eta_0\)?
How should the hyperparameters of the learners be tuned? Which sample (splitting) to use for tuning?
Can AutoML frameworks be used? How well do they perform in practice?
Which causal model should be used?
How do more/less accurate estimators for \(\eta_0\) affect the causal estimate, \(\hat{\theta}\)?
Relationship of nuisance fit and estimation quality
Simulated example
Lasso learner for \(\eta_0 = (\ell_0, m_0)\) evaluated over a grid of \(\lambda = (\lambda_{\ell}, \lambda_{m})\) values (l1-penalty)
scikit-learn's cross-validated Lasso (Pedregosa et al. 2011)
XGBoost, tuned (Chen et al. 2015)
XGBoost, default
flaml (Wang et al. 2021)
Careful choice of learners and hyperparameter tuning is important for the estimation quality of the causal parameter \(\theta_0\)
Full-sample and on-fold tuning seem to outperform the split-data approach
On-fold tuning is computationally more expensive
Lower (combined) first-stage error is found to be associated with better performance (see the sketch below)
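A rough sketch of how the combined first-stage error and the accuracy of \(\hat{\theta}\) can be compared across learners, reusing the simulated data and the dml_plr_partialling_out() helper from the sketch above. The learner set is illustrative (RandomForestRegressor stands in for the boosted-tree learners), not the slides' actual benchmark.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LassoCV
from sklearn.model_selection import cross_val_predict

# Relates first-stage (nuisance) fit to the accuracy of theta_hat; reuses the
# simulated X, D, Y, theta_0 and dml_plr_partialling_out() from above.
# RandomForestRegressor is a stand-in for the boosted-tree learners.
learners = {
    "lasso_cv": lambda: LassoCV(cv=5),
    "random_forest_default": lambda: RandomForestRegressor(random_state=0),
}

for name, make_learner in learners.items():
    # combined first-stage error: RMSE of cross-fitted nuisance predictions
    y_hat = cross_val_predict(make_learner(), X, Y, cv=5)
    d_hat = cross_val_predict(make_learner(), X, D, cv=5)
    rmse = np.sqrt(np.mean((Y - y_hat) ** 2)) + np.sqrt(np.mean((D - d_hat) ** 2))
    # causal estimate with the same learner plugged into the DML routine above
    theta_hat, se, _ = dml_plr_partialling_out(X, D, Y, make_learner=make_learner)
    print(f"{name}: combined first-stage RMSE = {rmse:.2f}, "
          f"|theta_hat - theta_0| = {abs(theta_hat - theta_0):.3f}")
```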
Choice of causal model / score
Scale and summarize results, more benchmarks
Evaluation of additional strategies:
Refined hyperparameter tuning
Solution to regularization bias: Orthogonalization
Remember the Frisch-Waugh-Lovell (FWL) Theorem in a linear regression model
\[Y = D \theta_0 + X'\beta + \varepsilon\]
\(\theta_0\) can be consistently estimated by partialling out \(X\), i.e.,
OLS regression of \(Y\) on \(X\): \(\tilde{\beta} = (X'X)^{-1} X'Y\) \(\rightarrow\) Residuals \(\hat{\varepsilon}\)
OLS regression of \(D\) on \(X\): \(\tilde{\gamma} = (X'X)^{-1} X'D\) \(\rightarrow\) Residuals \(\hat{\zeta}\)
Final OLS regression of \(\hat{\varepsilon}\) on \(\hat{\zeta}\) recovers the coefficient on \(D\) from the full regression (a minimal numpy sketch follows below)
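A minimal numpy illustration of these three steps on simulated linear data (coefficients, dimensions, and variable names are arbitrary illustrative choices), checking that the partialled-out coefficient coincides with the \(D\)-coefficient of the full OLS regression:

```python
import numpy as np

# Minimal numpy illustration of the FWL steps above on simulated linear data;
# the coefficients and dimensions are arbitrary illustrative choices.
rng = np.random.default_rng(0)
n, p = 500, 5
theta_lin, beta = 0.5, rng.normal(size=p)

X_lin = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # incl. intercept
D_lin = X_lin @ rng.normal(size=p) + rng.normal(size=n)
Y_lin = D_lin * theta_lin + X_lin @ beta + rng.normal(size=n)

def resid(a, B):
    """Residuals of an OLS regression of a on the columns of B."""
    return a - B @ np.linalg.lstsq(B, a, rcond=None)[0]

eps_hat = resid(Y_lin, X_lin)     # step 1: residuals of Y on X
zeta_hat = resid(D_lin, X_lin)    # step 2: residuals of D on X
theta_fwl = zeta_hat @ eps_hat / (zeta_hat @ zeta_hat)  # step 3: final OLS

# FWL: identical to the D-coefficient of the full OLS regression of Y on (D, X)
theta_full = np.linalg.lstsq(np.column_stack([D_lin, X_lin]), Y_lin, rcond=None)[0][0]
print(theta_fwl, theta_full)      # numerically equal
```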
Orthogonalization: The idea of the FWL Theorem can be generalized to using ML estimators instead of OLS
Regression adjustment score
\[\begin{align} \psi (W; \theta_0, \eta) = & (Y - D\theta_0 - g(X))D \end{align}\]
with nuisance part
\[\begin{align} \eta &= g(X), \\ \eta_0 &= g_0(X) \end{align}\]
Neyman-orthogonal score (Frisch-Waugh-Lovell)
\[\begin{align} \psi (W; \theta_0, \eta_0) = & ((Y- \mathbb{E}[Y|X])-(D-\mathbb{E}[D|X])\theta_0)\\ & (D-\mathbb{E}[D|X]) \end{align}\]
with nuisance part
\[\begin{align} \eta &= (\ell(X), m(X)), \\ \eta_0 &= ( \ell_0(X), m_0(X)) = ( \mathbb{E} [Y \mid X], \mathbb{E}[D \mid X]) \end{align}\]
A numerical illustration of the orthogonality of the second score (and the lack thereof for the first) follows below.
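As a quick numerical check of Neyman orthogonality, the sketch below perturbs the nuisance functions of both scores by \(\varepsilon \cdot h(X)\) on the simulated PLR data from the first sketch (reusing \(X\), \(D\), \(Y\), \(\theta_0\), \(g_0\), \(m_0\); the direction \(h\) is an arbitrary choice). The empirical moment of the regression adjustment score moves roughly linearly in \(\varepsilon\), while that of the partialling-out score is flat to first order.

```python
import numpy as np

# Numerical check of Neyman orthogonality on the simulated PLR data from the
# first sketch (reuses X, D, Y, theta_0, g_0, m_0); the perturbation direction
# h is an arbitrary illustrative choice.
h = X[:, 0]                              # direction of the nuisance perturbation
ell_0 = g_0(X) + theta_0 * m_0(X)        # E[Y|X] implied by the PLR model

for eps in (-0.1, -0.05, 0.0, 0.05, 0.1):
    # (a) regression adjustment score, g_0 perturbed by eps * h
    psi_ra = (Y - D * theta_0 - (g_0(X) + eps * h)) * D
    # (b) partialling-out score, (ell_0, m_0) perturbed by eps * h
    u = Y - (ell_0 + eps * h)
    v = D - (m_0(X) + eps * h)
    psi_po = (u - v * theta_0) * v
    # psi_RA moves roughly linearly in eps; psi_PO stays flat to first order
    print(f"eps = {eps:+.2f}: E_n[psi_RA] = {psi_ra.mean():+.4f}, "
          f"E_n[psi_PO] = {psi_po.mean():+.4f}")
```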
For a nontechnical introduction to DML: Bach et al. (2021)
Software implementation:
Paper draft to be uploaded to arXiv soon
Practical Aspects of DML