Hippocampus's Garden

Under the sea, in the hippocampus's garden...

From Direct Method to Doubly Robust

July 29, 2020  |  4 min read  |  71 views


Counterfactual machine learning is drawing more and more attention these days. This research field is now often associated with applications to advertising and reinforcement learning, but most of its fundamentals were developed in the field of epidemiology as causal inference.

In this post, I briefly review three popular techniques developed in causal inference and formulate them with the simple notation used in bandit problems. These techniques are useful for adjusting for confounding factors, off-policy evaluation (OPE), completing missing-not-at-random (MNAR) data, and so on.

Notation

Following the conventions of contextual bandits, I use the following notation and terms throughout this post.

| Notation | Meaning |
| --- | --- |
| $\mu$ | Behavior (logging) policy |
| $\pi$ | Evaluation policy |
| $x_i$ | Context |
| $a_i$ | Selected arm |
| $r_i$ | Reward |
| $n$ | The number of records in the log |
| $\{(x_i, a_i, r_i)\}_{i=1}^n$ | Log (data) from the behavior policy |
| $\mathcal{A}$ | Set of possible arms |
| $\hat{V}(\pi)$ | Estimated value of policy $\pi$ |

In recommender systems, you can interpret the notation as follows: $x_i$ is a user feature, $a_i$ is a recommended item, $r_i$ is whether the user buys it or not, and $\hat{V}(\pi)$ is the estimated conversion rate (CVR).

Direct Method

The direct method (DM) simply predicts the counterfactual outcomes. One can train a model on the log from the behavior policy $\mu$ to predict the reward given a context and a possible arm, and then use the predicted reward $\hat{r}(x_i, a)$ to estimate the value of the evaluation policy $\pi$.

$$\hat{V}_{\mathrm{DM}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} \sum_{a\in\mathcal{A}}\hat{r}(x_i,a)\,\pi(a|x_i)$$

DM has low variance, but its bias is large when the model is mis-specified.
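As a minimal sketch of the DM estimator, here is the formula above applied to synthetic data. The reward model `r_hat` and the evaluation policy `pi` are random stand-ins for a trained model and a real policy, purely for illustration.

```python
import random

random.seed(0)
n, n_arms = 1000, 3  # number of logged records and possible arms

def random_policy():
    """Draw a random probability distribution over the arms."""
    w = [random.random() for _ in range(n_arms)]
    s = sum(w)
    return [x / s for x in w]

# pi[i][a] is the evaluation policy's probability of arm a in context i;
# r_hat[i][a] is the reward model's prediction for that (context, arm) pair.
pi = [random_policy() for _ in range(n)]
r_hat = [[random.random() for _ in range(n_arms)] for _ in range(n)]

# Direct method: average the model-predicted reward under pi.
v_dm = sum(
    sum(r_hat[i][a] * pi[i][a] for a in range(n_arms))
    for i in range(n)
) / n
```

Note that the observed rewards never appear here; DM relies entirely on the reward model, which is exactly why a mis-specified model biases the estimate.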

Importance Sampling

Importance sampling (IS) averages the observed rewards $\{r_i\}_{i=1}^n$, but weights each by a probability ratio. That is, it emphasizes the rewards of arms that the evaluation policy $\pi$ selects often but the behavior policy $\mu$ selects rarely.

$$\hat{V}_{\mathrm{IS}}(\pi) = \frac{1}{n}\sum_{i=1}^{n} r_i\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)}$$

IS is provably unbiased when the behavior policy is known, but it has high variance when the two policies differ a lot, because the weight term becomes unstable. When the behavior policy is unknown (e.g., for observational data), you need to estimate it, and IS is no longer unbiased.

IS is also referred to as inverse propensity scoring (IPS).
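The IS estimator can be sketched in the same synthetic setup: both policies are random placeholders, the logged arms are sampled from the behavior policy $\mu$, and the rewards are random binary outcomes.

```python
import random

random.seed(0)
n, n_arms = 1000, 3  # number of logged records and possible arms

def random_policy():
    """Draw a random probability distribution over the arms."""
    w = [random.random() for _ in range(n_arms)]
    s = sum(w)
    return [x / s for x in w]

mu = [random_policy() for _ in range(n)]  # behavior (logging) policy
pi = [random_policy() for _ in range(n)]  # evaluation policy
# Logged arms sampled from mu, with binary rewards.
a = [random.choices(range(n_arms), weights=mu[i])[0] for i in range(n)]
r = [random.randint(0, 1) for _ in range(n)]

# Importance sampling: reweight each logged reward by pi / mu.
v_is = sum(r[i] * pi[i][a[i]] / mu[i][a[i]] for i in range(n)) / n
```

When `mu[i][a[i]]` is small, a single record can receive a huge weight, which is the source of IS's high variance.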

Doubly Robust

The doubly robust (DR) estimator combines the above two methods. At first glance it has a scary-looking definition, but when you compare the first and second lines, you will see that DR is DM combined with IS applied to the residuals.

$$\begin{aligned} \hat{V}_{\mathrm{DR}}(\pi) &= \frac{1}{n}\sum_{i=1}^{n} \left[ \left(r_i-\hat{r}(x_i,a_i)\right)\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} + \sum_{a\in\mathcal{A}}\hat{r}(x_i,a)\,\pi(a|x_i) \right]\\ &= \hat{V}_{\mathrm{IS}}(\pi) - \frac{1}{n}\sum_{i=1}^{n}\hat{r}(x_i,a_i)\frac{\pi(a_i|x_i)}{\mu(a_i|x_i)} + \hat{V}_{\mathrm{DM}}(\pi) \end{aligned}$$

DR is unbiased when either the DM reward model is well-specified or the behavior policy is known. In this sense, the method is "doubly robust". Its variance is lower than that of IS.
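Putting the two pieces together, the first line of the DR definition can be sketched on the same synthetic data (random placeholder policies, logged arms sampled from $\mu$, and a random stand-in reward model):

```python
import random

random.seed(0)
n, n_arms = 1000, 3  # number of logged records and possible arms

def random_policy():
    """Draw a random probability distribution over the arms."""
    w = [random.random() for _ in range(n_arms)]
    s = sum(w)
    return [x / s for x in w]

mu = [random_policy() for _ in range(n)]  # behavior (logging) policy
pi = [random_policy() for _ in range(n)]  # evaluation policy
a = [random.choices(range(n_arms), weights=mu[i])[0] for i in range(n)]
r = [random.randint(0, 1) for _ in range(n)]
r_hat = [[random.random() for _ in range(n_arms)] for _ in range(n)]

# Doubly robust: DM baseline over all arms, plus the importance-weighted
# residual (r - r_hat) for the logged arm.
v_dr = sum(
    (r[i] - r_hat[i][a[i]]) * pi[i][a[i]] / mu[i][a[i]]
    + sum(r_hat[i][arm] * pi[i][arm] for arm in range(n_arms))
    for i in range(n)
) / n
```

The importance weight now multiplies only the residual `r - r_hat`, not the raw reward, so a good reward model shrinks the term that makes IS unstable.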

There are further advances, such as the more robust doubly robust estimator of [2].

References

[1] Maria Dimakopoulou. Slate Bandit Learning & Evaluation. 2020.
[2] Mehrdad Farajtabar, Yinlam Chow, Mohammad Ghavamzadeh. More Robust Doubly Robust Off-policy Evaluation. ICML. 2018.



Written by Shion Honda. If you like this, please share!