In semiparametric inference you are often interested in an estimand (a.k.a. target parameter) of the form \[
\psi \;=\; \mathbb{E}\!\left[\,m(\mu)(Z,Y)\,\right]
\] where \(Z = (A, X)\) are the observed predictors (\(A\) is treatment or any other intervention; \(X\) are covariates), \(Y\) is the outcome, and \(\mu(a,x) = \mathbb{E}[Y \mid A = a, X = x]\) is the unknown outcome regression. The operator \(m\) takes the function \(\mu\) as input and produces another function \(m(\mu)\), evaluated at \(Z,Y\) inside the expectation. “Operator” is just a fancy word for a function that takes functions as input and produces functions as output.
Take the average treatment effect (ATE) as an example: \[
\begin{aligned}
\psi \;&=\; \mathbb{E}\!\left[\,\mathbb{E}[Y \mid A = 1, X] - \mathbb{E}[Y \mid A = 0, X]\,\right] \\
&=\; \mathbb{E}\!\left[\,\mu(1, X) - \mu(0, X)\,\right].
\end{aligned}
\] To match \(\psi = \mathbb{E}[m(\mu)(Z,Y)]\), \(m\) has to be the operator that takes the function \(\mu\) and returns the function \((A, X) \mapsto \mu(1, X) - \mu(0, X)\). That is, \(m(\mu)\) is a function (just like \(\mu\) is), and we evaluate it at \(Z = (A, X)\) inside the expectation.
```python
def mu(a, x):
    # The unknown outcome regression. Could be anything; here a + x.
    return a + x

def m(mu):
    # An operator: takes a function mu and returns another function.
    # In this case, the ATE operator.
    def m_of_mu(a, x, y=None):
        return mu(1, x) - mu(0, x)
    return m_of_mu

m(mu)(a=1, x=3)
```
1
```r
mu <- function(a, x) {
  # The unknown outcome regression. Could be anything; here a + x.
  a + x
}

m <- function(mu) {
  # An operator: takes a function mu and returns another function.
  # In this case, the ATE operator.
  function(a, x, y = NULL) mu(1, x) - mu(0, x)
}

m(mu)(a = 1, x = 3)
```
[1] 1
A few standard examples — each is a different choice of \(m\):

- Average treatment effect (ATE): \(m(\mu)(Z,Y) = \mu(1, X) - \mu(0, X)\).
- Treatment-specific mean (TSM): \(m(\mu)(Z,Y) = \mu(a^\star, X)\) for a fixed treatment level \(a^\star\).
- Additive shift: \(m(\mu)(Z,Y) = \mu(A + \delta, X)\) for a fixed shift \(\delta\) of a continuous treatment.
The Riesz representation theorem says: for any estimand \(\psi = \mathbb{E}[m(\mu)(Z,Y)]\) that is linear and continuous in \(\mu\) (we’ll unpack both words below), there exists a function \(\alpha\) (the Riesz representer) such that \[
\mathbb{E}\!\left[\,m(\mu)(Z,Y)\,\right] \;=\; \mathbb{E}\!\left[\,\alpha(Z)\,Y\,\right]
\qquad \text{for every plausible } \mu.
\] This is just an outcome-weighted average — the weights are \(\alpha(Z)\). For the ATE, \(\alpha(A, X) = A/\pi(X) - (1 - A)/(1 - \pi(X))\) is the inverse-propensity weight. For the treatment-specific mean, \(\alpha(A, X) = \mathbb{1}\{A = a^\star\} / \pi(a^\star \mid X)\) is a density ratio. In general \(\alpha\) has a closed form involving something like a propensity score. You can think of \(\alpha\) as the right weight for an outcome-weighted estimator of \(\psi\) — though strictly speaking it’s the theorem that motivates the weighting estimator, not the other way around.
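To see the identity concretely, here is a small simulation sketch, with a made-up data-generating process where both \(\mu\) and \(\pi\) are known, checking that the plug-in form and the outcome-weighted form of the ATE agree:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# A toy data-generating process with known nuisances.
X = rng.uniform(size=n)
pi = 0.2 + 0.6 * X                       # propensity P(A = 1 | X)
A = rng.binomial(1, pi)
mu = lambda a, x: a * (1 + x) + x        # true outcome regression
Y = mu(A, X) + rng.normal(size=n)

# Two routes to the ATE, which the theorem says must agree:
plug_in = np.mean(mu(1, X) - mu(0, X))   # E[m(mu)(Z, Y)]
alpha = A / pi - (1 - A) / (1 - pi)      # Riesz representer (IPW weight)
weighted = np.mean(alpha * Y)            # E[alpha(Z) Y]

print(plug_in, weighted)  # both approximately 1.5 = E[1 + X]
```

The two estimates differ only by sampling noise; in the population the theorem makes them exactly equal.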
What does “linear in \(\mu\)” mean?
The estimand \(\psi\) depends on the outcome regression \(\mu\). Linear means that this dependence behaves like a linear function: scale the input \(\mu\) by a constant and \(\psi\) scales by the same constant; add two regressions \(\mu_1\) and \(\mu_2\) and \(\psi\) becomes the sum \(\psi_1 + \psi_2\). There is no curvature in how \(\psi\) depends on \(\mu\).
Take the ATE as a worked example. Suppose two researchers have hypothesized different outcome regressions \(\mu_1\) and \(\mu_2\), and they each compute an ATE estimate \(\psi_1\) and \(\psi_2\). A third researcher decides to average them and uses \(\mu = \tfrac{1}{2}\mu_1 + \tfrac{1}{2}\mu_2\). What is the resulting ATE? Plug into the formula: \[
\psi \;=\; \mathbb{E}\!\left[\,\mu(1, X) - \mu(0, X)\,\right] \;=\; \tfrac{1}{2}\,\psi_1 \;+\; \tfrac{1}{2}\,\psi_2.
\] The estimand at the averaged regression equals the average of the two estimands. That’s linearity. It worked because the ATE formula is itself a difference of two expectations — and differences and expectations are both linear, so the whole thing inherits linearity.
You can check linearity directly from the operator \(m\). Plug \(c_1 \mu_1 + c_2 \mu_2\) into \(m\) and verify that \[
m\!\big(c_1 \mu_1 + c_2 \mu_2\big)(Z,Y) \;=\; c_1\,m(\mu_1)(Z,Y) + c_2\,m(\mu_2)(Z,Y)
\] holds for all constants \(c_1, c_2\) and all input regressions \(\mu_1, \mu_2\). Every estimand in the estimands page — averages, differences, integrals against fixed densities — is built from operations that preserve sums and constants, so they all pass this check.
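A quick numerical version of this check for the ATE operator, using two arbitrary candidate regressions (`mu1`, `mu2` and the constants are made up for illustration):

```python
import numpy as np

def m(mu):
    # The ATE operator: mu -> (a, x) -> mu(1, x) - mu(0, x).
    return lambda a, x: mu(1, x) - mu(0, x)

mu1 = lambda a, x: a + x
mu2 = lambda a, x: a * x ** 2
c1, c2 = 2.0, -3.0
combo = lambda a, x: c1 * mu1(a, x) + c2 * mu2(a, x)

xs = np.linspace(0, 1, 5)
lhs = m(combo)(1, xs)                           # operator applied to the combination
rhs = c1 * m(mu1)(1, xs) + c2 * m(mu2)(1, xs)   # combination of operator outputs
print(np.allclose(lhs, rhs))  # True: m is linear in mu
```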
There is one further technical condition called continuity (or boundedness). In plain terms: tiny changes in \(\mu\) should produce tiny changes in \(\psi\), with no abrupt jumps. This rules out pathological constructions where \(\psi\) blows up the moment \(\mu\) moves slightly. It holds automatically for every estimand the package ships, and for custom estimands it is straightforward to verify by hand.
Why should we expect this representation to exist?
The theorem is short to state, but the why is best understood by analogy with finite-dimensional vectors. Start there.
Take any linear function \(T\) that consumes a vector \(v \in \mathbb{R}^d\) and returns a single real number. Linearity forces a very specific form: \(T\) has to be a weighted sum of the entries of \(v\), \[
T(v) \;=\; w_1 v_1 + w_2 v_2 + \cdots + w_d v_d \;=\; w^\top v,
\] for some coefficients \(w_1, \ldots, w_d\). There’s no other choice. Try writing down a linear \(T\) that isn’t a weighted sum of the entries — you can’t. The coefficients \(w\) are determined by \(T\): the first entry \(w_1\) is what you get by feeding \(T\) the vector \((1, 0, \ldots, 0)\), \(w_2\) is what you get from \((0, 1, 0, \ldots, 0)\), and so on.
The vector \(w\) is called the representer of \(T\): it’s the unique vector that, when you take the inner product \(w^\top v\), reproduces the value \(T(v)\) for every \(v\). Linearity alone forces this representation to exist.
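The recipe for recovering \(w\) is mechanical enough to run, with a hypothetical linear functional `T` standing in for the general case:

```python
import numpy as np

# Any linear functional T on R^d; this particular one is made up.
def T(v):
    return 3 * v[0] - v[1] + 0.5 * v[2]

d = 3
# Recover the representer by feeding T the standard basis vectors.
w = np.array([T(e) for e in np.eye(d)])

v = np.array([1.0, 2.0, 4.0])
print(w, T(v), w @ v)  # w = [3, -1, 0.5]; T(v) equals w @ v
```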
The Riesz representation theorem says exactly the same thing in a setting where the input is a function instead of a vector. Replace each entry \(v_i\) with the value of \(\mu\) at a point — there are now infinitely many “entries”, one for each possible \(z\). Replace the weighted sum \(\sum_i w_i v_i\) with an expectation \(\mathbb{E}[\alpha(Z)\,\mu(Z)]\) — the expectation is the continuous analogue of summing over entries, and \(\alpha(Z)\,\mu(Z)\) plays the role of the per-entry contribution. By iterated expectations \(\mathbb{E}[\alpha(Z)\,\mu(Z)] = \mathbb{E}[\alpha(Z)\,Y]\), which gives the outcome-weighted form stated above. The continuity condition above is what makes that expectation well-defined.
Then for any linear, continuous \(\psi = \mathbb{E}[m(\mu)(Z,Y)]\), exactly the same thing happens: there’s a unique \(\alpha\), the Riesz representer, that reproduces \(\psi\) as \(\mathbb{E}[\alpha(Z)\,Y]\). You don’t have to construct it by hand — its existence is automatic from linearity plus continuity, just as \(w\)’s existence is automatic for \(T\) in finite dimensions.
Estimation
The natural way to estimate \(\psi = \mathbb{E}[m(\mu)(Z,Y)]\) is to plug in: train an outcome regression \(\hat\mu\) on the data, then average, \[
\hat\psi_{\text{plug-in}} \;=\; \frac{1}{n} \sum_i m(\hat\mu)(Z_i,Y_i).
\] This is biased: \(\hat\mu\) is itself estimated and converges slowly toward the true \(\mu\) when the underlying class is flexible (random forests, gradient boosting, neural nets), and the plug-in inherits that slow convergence.
Riesz representation also motivates a different estimator: since \(\psi = \mathbb{E}[\alpha(Z)\,Y]\), the obvious empirical analog is the outcome-weighted mean, \[
\hat\psi_{\text{weighted}} \;=\; \frac{1}{n} \sum_i \hat\alpha(Z_i)\,Y_i.
\] This is the standard IPW estimator when \(\alpha\) is the inverse propensity. It’s also biased: \(\hat\alpha\) is itself estimated, and weights can blow up in regions of low overlap.
Modern semiparametric estimators (one-step, targeted maximum likelihood [TMLE], double machine learning [DML]) combine both routes and correct the leading-order bias. The DML correction is \[
\hat\psi \;=\; \frac{1}{n} \sum_i \Big[\,m(\hat\mu)(Z_i, Y_i) \;+\; \hat\alpha(Z_i)\,\big(Y_i - \hat\mu(Z_i)\big)\,\Big].
\] The added term has mean zero whenever either \(\hat\alpha\) or \(\hat\mu\) is a good estimate, which is what makes the estimator doubly robust: only one of the two nuisances has to converge fast for the bias to vanish. Under mild conditions, \(\hat\psi\) is also efficient — it has the smallest possible large-sample variance among all reasonable estimators, so you get the tightest confidence intervals and most powerful tests achievable without bias or strong parametric assumptions.
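A toy simulation sketch of how the correction term repairs the plug-in: the outcome model \(\hat\mu\) below is deliberately misspecified while the representer is exact, and the data-generating process is invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
X = rng.uniform(size=n)
pi = 0.2 + 0.6 * X
A = rng.binomial(1, pi)
Y = A * (1 + X) + X + rng.normal(size=n)   # true ATE = E[1 + X] = 1.5

# Deliberately biased outcome model (misses the a*x interaction),
# paired with the true representer.
mu_hat = lambda a, x: a * 1.0 + x
alpha_hat = A / pi - (1 - A) / (1 - pi)

plug_in = np.mean(mu_hat(1, X) - mu_hat(0, X))            # biased: 1.0
dml = plug_in + np.mean(alpha_hat * (Y - mu_hat(A, X)))   # correction recovers ~1.5

print(plug_in, dml)
```

Swapping the roles, a misspecified \(\hat\alpha\) paired with the true \(\hat\mu\), gives the same rescue in the other direction; that symmetry is the double robustness.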
Whichever route you reach for — a plain weighted estimator, or the doubly-robust DML / TMLE machinery — you need an estimate \(\hat\alpha\) of the Riesz representer. That is exactly what Riesz regression provides.
Why not estimate the propensity directly?
For the ATE, \(\alpha\) has a closed form in the propensity \(\pi\), and you could fit \(\hat\pi\) and plug in: \(\hat\alpha(A, X) = A/\hat\pi(X) - (1-A)/(1-\hat\pi(X))\). This breaks down in three places.
- **Positivity is borderline.** If \(\hat\pi(X)\) approaches 0 or 1 for some rows, \(\hat\alpha\) explodes. Truncation patches the symptom but introduces its own bias.
- **The estimand is non-standard.** For continuous shifts, stochastic interventions, and longitudinal modified treatment policies, the closed form involves derivatives of the conditional density of \(A\) given \(X\). Density derivatives are harder to estimate than \(\alpha\) itself.
- **Misspecification compounds.** Plug-in error in \(\hat\pi\) propagates nonlinearly through \(1/\hat\pi\) — small mistakes in \(\hat\pi\) become large mistakes in \(\hat\alpha\).
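The compounding is easy to see numerically: the same absolute error in \(\hat\pi\) is harmless where overlap is good and severe where it is not (the specific numbers below are illustrative):

```python
# A 0.02 absolute error in the propensity barely moves the weight
# at pi = 0.5, but halves it at pi = 0.02.
for pi_true in (0.5, 0.02):
    pi_hat = pi_true + 0.02
    w_true, w_hat = 1 / pi_true, 1 / pi_hat
    print(pi_true, w_true, w_hat, abs(w_hat - w_true))
```

At \(\pi = 0.5\) the weight moves from 2 to about 1.92; at \(\pi = 0.02\) it moves from 50 to 25.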
Riesz regression
The trick that makes direct estimation of \(\alpha\) possible is a small algebraic identity. The natural loss is mean squared error against the (unobserved) truth \(\alpha_0\): \[
\begin{aligned}
\mathbb{E}\!\left[\,(\alpha(Z) - \alpha_0(Z))^2\,\right]
&= \mathbb{E}[\alpha(Z)^2] \;-\; 2\,\mathbb{E}[\alpha(Z)\,\alpha_0(Z)] \;+\; \mathbb{E}[\alpha_0(Z)^2] \\
&= \mathbb{E}[\alpha(Z)^2] \;-\; 2\,\mathbb{E}\!\left[m(\alpha)(Z,Y)\right] \;+\; \text{const}
\end{aligned}
\] The middle term simplifies via the Riesz identity \(\mathbb{E}[\alpha(Z)\,\alpha_0(Z)] = \mathbb{E}[m(\alpha)(Z,Y)]\) — apply the representation theorem with \(\mu = \alpha\), treating \(\alpha\) now as a hypothetical outcome regression. The last term doesn’t depend on \(\alpha\), so we can drop it. What’s left is the Riesz loss, \[
L(\alpha) \;=\; \mathbb{E}\!\left[\,\alpha(Z)^2 \;-\; 2\,m(\alpha)(Z,Y)\,\right],
\] and crucially, every term is computable from data — you only need \(m\) (which you supplied as part of defining the estimand) and the candidate \(\alpha\). The unobserved \(\alpha_0\) has dropped out. Minimizing \(L\) over a flexible function class gives \(\hat\alpha\). Chernozhukov et al. (2021) introduced this construction.
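As an illustration (not the package's implementation), here is a minimal Riesz regression for the ATE over a linear-in-features class: because \(m\) is linear, the empirical Riesz loss is quadratic in the coefficients and has a closed-form minimizer. The basis `phi` and the simulation are invented for the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50_000
X = rng.uniform(size=n)
pi = 0.2 + 0.6 * X
A = rng.binomial(1, pi)

# Feature basis for alpha(a, x).
def phi(a, x):
    return np.column_stack([np.ones_like(x), a, x, a * x, x**2, a * x**2])

Phi = phi(A, X)
# For the ATE, m applied to each basis function: phi(1, x) - phi(0, x).
m_Phi = phi(np.ones(n), X) - phi(np.zeros(n), X)

# Empirical Riesz loss: mean(alpha^2) - 2 * mean(m(alpha)).
# With alpha = Phi @ beta it is quadratic, so the minimizer is closed form:
beta = np.linalg.solve(Phi.T @ Phi / n, m_Phi.mean(axis=0))
alpha_hat = Phi @ beta

alpha_true = A / pi - (1 - A) / (1 - pi)
print(np.mean((alpha_hat - alpha_true) ** 2))  # small when the basis is rich enough
```

Note the loss never touches \(\pi\): the only estimand-specific ingredient is `m_Phi`, which is exactly the role \(m\) plays in the derivation above.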
The minimizer is well-defined regardless of how badly the closed-form weights blow up: in low-overlap regions where \(\pi(X)\) is small, the squared-loss penalty shrinks \(\hat\alpha\) toward zero, which controls downstream variance.
Hines & Miles (2025) extended the same trick beyond the squared loss to the Bregman family — KL divergence, Bernoulli, bounded-squared, and others. The derivation is the same shape: each Bregman loss can be rewritten purely in terms of \(m\) and \(\alpha\), no \(\alpha_0\) needed. The practical value is that different losses give different guarantees: KL produces strictly positive \(\hat\alpha\) (right for density-ratio estimands like TSM), bounded-squared keeps \(\hat\alpha\) inside a prior box (right when you have an a-priori bound on IPW magnitudes), and so on.
Pick a learner
Riesz regression with the squared loss is a standard regression problem. Different learners trade off the same way they do for outcome regression:
- **Gradient boosting** — the standard tool for tabular data. rieszboost plugs the per-row gradient and Hessian into xgboost’s custom-objective interface. Hyperparameters that work for outcome models transfer.
- **Kernel ridge regression** — closed form on small / medium data. krrr solves the linear system per \(\lambda\); no boosting iterations.
- **Random forests** — non-parametric and well-suited to many covariates. forestriesz ships two flavors: an augmentation-style fit that works on every estimand (including custom ones) without per-estimand configuration, and a moment-style fit with honest-split confidence intervals on ATE, ATT, and TSM.
- **Neural networks** — riesznet trains the Riesz loss with PyTorch autograd; the same MLP / custom nn.Module story you get from any neural regressor.
The four learner families produce different \(\hat\alpha\), but all minimize the same population objective and converge to the true representer \(\alpha_0\) under their respective consistency conditions.
What rieszreg provides
A scikit-learn-style learner (and an R6 mirror) that:
- Fits \(\hat\alpha\) for ATE, ATT, TSM, AdditiveShift, LocalShift, or your own estimand.
- Composes with cross_val_predict, GridSearchCV, Pipeline.
- Surfaces near-positivity and outlier-extrapolation warnings via .diagnose(Z).
For a feature-by-feature comparison with similar tools (genriesz, EconML, DoubleML, tlverse), see Related packages.