An estimand is a scalar summary of the data-generating distribution that you want to estimate. The estimands rieszreg handles all share the form \[
\psi = \mathbb{E}\!\left[\,m(\mu)(Z, Y)\,\right]
\] where \(\mu(z) = \mathbb{E}[Y \mid Z = z]\) is the outcome regression and \(m\) is an operator — a rule that maps the function \(\mu\) to another function \(m(\mu)\) that can also depend on the row \((Z, Y)\). For the average treatment effect (ATE) with binary treatment \(A\), \(m(\mu)(Z, Y) = \mu(1, X) - \mu(0, X)\): here \(m\) ignores \(Y\) and produces a contrast of two evaluations of \(\mu\).
An estimand object packages two things: the column names \(\alpha\) is indexed by (feature_keys) and the operator itself (m). m is curried: it takes a candidate \(\alpha\) and returns a function of one row \((z, y)\). The five built-in factories (ATE, ATT, TSM, AdditiveShift, LocalShift) cover most causal-inference targets; section Custom estimands shows how to write your own.
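To make the curried shape concrete, here is a minimal sketch of what an ATE-style operator looks like, written against the alpha(...) keyword-call convention used in the custom-estimands section below; it is an illustration, not the built-in factory's internals, and it assumes a single covariate column named x.

def m_ate(alpha):
    # m(alpha) is itself a function of one row (z, y); the ATE operator ignores y
    def inner(z, y=None):
        # contrast of two point evaluations of alpha at treatment levels 1 and 0
        return alpha(a=1, x=z["x"]) - alpha(a=0, x=z["x"])
    return inner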
The factories live in rieszreg; rieszboost, krrr, forestriesz, and riesznet all re-export them so from rieszboost import ATE, from krrr import ATE, from forestriesz import ATE, and from riesznet import ATE resolve to the same object.
Quickstart: ATE on the Lalonde NSW data
The NSW (National Supported Work) demonstration is the canonical Lalonde dataset: a job-training experiment for disadvantaged workers, plus a CPS observational comparison group. The goal here is the average treatment effect on 1978 earnings.
The estimand says exactly which dataframe columns feed \(\alpha\): the treatment column and the eight covariates, in the order they appear in feature_keys.
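The cell that constructs the estimand is sketched below. The column names follow the usual Lalonde NSW variable names and are assumptions here, as is the keyword signature (it mirrors the TSM call shown just after); substitute whatever names your dataframe actually uses.

from rieszreg import ATE

# Assumed Lalonde NSW column names; adjust to your dataframe.
est = ATE(
    treatment="treat",
    covariates=("age", "education", "black", "hispanic",
                "married", "nodegree", "re74", "re75"),
)
est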
<rieszreg.estimands.base.ATE object at 0x7fa48dcfacf0>
signature: (alpha)
TSM(level=1, treatment="a", covariates="x")
<rieszreg.estimands.base.TSM object at 0x7fa48dd68f20>
signature: (alpha)
LocalShift(delta=0.3, threshold=0.7)
<rieszreg.estimands.base.LocalShift object at 0x7fa48dd69190>
signature: (alpha)
Mapping your dataframe to an estimand
An estimand object carries two pieces: the column names \(\alpha\) is indexed by (feature_keys) and the operator m(alpha)(z, y).
feature_keys declares the input schema of \(\alpha\) — the columns the underlying learner consumes when scoring a row. The orchestrator slices these columns out of Z per row to build the feature matrix the learner sees.
The outcome Y is not part of feature_keys. It flows in the sklearn-style way: a separate vector passed as the second argument to .fit(Z, y). The orchestrator threads it through to m(alpha)(z, y). Built-in estimands ignore y; pass it anyway when following sklearn convention.
For the Lalonde quickstart above, the wiring looks like this:
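(A sketch rather than a verbatim cell: nsw stands in for the NSW dataframe, est is the ATE estimand from the quickstart, and the backend class, its estimand= keyword, and the outcome column name re78 are assumptions.)

from rieszboost import RieszBooster   # any backend's estimator follows the same pattern

Z = nsw[list(est.feature_keys)]       # treatment + the 8 covariates, in feature_keys order
y = nsw["re78"]                       # 1978 earnings, passed separately sklearn-style

model = RieszBooster(estimand=est)    # the estimand= keyword is an assumption
model.fit(Z, y)                       # y is threaded through to m(alpha)(z, y)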
The Z dataframe must contain every name in feature_keys; extra columns are ignored. With an ndarray, columns are interpreted in feature_keys order.
Theory: how the augmented dataset is derived
This section explains the math that lets rieszreg evaluate the Riesz loss without ever materializing \(\alpha_0\). Skip on a first read.
While the Riesz representer theorem applies to any continuous linear functional, the estimands rieszreg supports are those whose \(m\) reduces to a finite linear combination of point evaluations of \(\mu\). For example, the ATE is \(\mathbb{E}[\mu(1, X) - \mu(0, X)]\), which reduces to two point evaluations \(\mu(1, x)\) and \(\mu(0, x)\) for each row’s covariates \(x\). We call these finite-evaluation estimands.
Formally, an estimand \(\psi = \mathbb{E}[m(\mu)(Z, Y)]\) is finite-evaluation if there exist fixed functions \(a: \mathcal Z \times \mathcal Y \to \mathcal Z^k\) (the point generator) and \(c: \mathcal Z \times \mathcal Y \to \mathbb R^k\) (the weight function) such that \[
m(\mu)(z, y) = c(z, y)^\top \vec\mu(a(z, y)),
\] where \(\vec\mu\) denotes elementwise application of \(\mu\) to the vector-valued input. For ATE, \(k = 2\), \(a(z, y) = ((1, z_x), (0, z_x))\), and \(c(z, y) = (1, -1)\).
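As an illustration only (not library internals), the ATE instance of this decomposition in code, again assuming a single covariate column x:

def a_ate(z, y=None):
    # point generator: k = 2 evaluation points sharing the row's covariates
    return ((1, z["x"]), (0, z["x"]))

def c_ate(z, y=None):
    # weight function: fixed coefficients, independent of the row
    return (1.0, -1.0)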
For these estimands the empirical Bregman-Riesz loss simplifies. Define the \(k\) pseudo-datapoints \(Z_{ij} = a_j(Z_i, Y_i)\) with coefficients \(C_{ij} = c_j(Z_i, Y_i)\) for each row \(i\). Then \[
\begin{aligned}
\mathcal L_{h,n}(\alpha)
&= \sum_{i=1}^n \left[\tilde h(\alpha(Z_i)) - m(h' \circ \alpha)(Z_i, Y_i)\right] \\
&= \sum_{i=1}^n \left[\tilde h(\alpha(Z_i)) - \sum_{j=1}^k C_{ij} \cdot h'(\alpha(Z_{ij}))\right].
\end{aligned}
\]
The loss depends on \(\alpha\) only through its values at \(\{Z_i\} \cup \{Z_{ij}\}\), at most \(n(k+1)\) points. The coefficients \(C_{ij}\) depend on data but not on \(\alpha\), so they can be computed once as a preprocessing step.
The augmented dataset
The cleanest way to keep track of the loss, derivatives, etc. during training is to materialize an augmented dataset of the points \(\{Z_i\} \cup \{Z_{ij}\}\) along with the appropriate coefficients \(C_{ij}\).
Some pseudo-datapoints coincide with the original observation. For ATE, \(Z_{i1} = (1, x_i)\) and \(Z_{i2} = (0, x_i)\). A treated unit (\(a_i = 1\)) has \(Z_{i1} = Z_i\), so two of the three evaluation points are the same. Let \(D_{ij} = \mathbf{1}[Z_{ij} = Z_i]\) indicate overlap. Splitting the \(h'\) sum into overlapping and non-overlapping terms: \[
\mathcal L_{h,n}(\alpha) = \sum_{i=1}^n \left[\tilde h(\alpha(Z_i)) - \sum_{j:\, D_{ij}=1} C_{ij}\, h'(\alpha(Z_i))\right] - \sum_{i=1}^n \sum_{j:\, D_{ij}=0} C_{ij}\, h'(\alpha(Z_{ij})).
\]
At each distinct evaluation point, the loss combines a \(\tilde h\)-weight and an \(h'\)-weight into one term. Estimand.augment collects these distinct points for each row and stores their coefficients. Let \(r\) index an enumeration of the augmented data points. Each augmented row \(r\) corresponds to either a \(Z_{ij}\) or a \(Z_i\) for some original row \(i\); write \(i_r\) for that origin row. The two weights are \[
D_r = \mathbf{1}[Z_r = Z_{i_r}],
\qquad
C_r = -\sum_{j \,:\, Z_{i_r, j} = Z_r} C_{i_r, j}.
\]
The \(\tilde h\)-weight \(D_r\) is \(1\) exactly when \(Z_r\) is the original observation; the \(h'\)-weight \(C_r\) collects \(-C_{ij}\) from every pseudo-datapoint of row \(i_r\) that lands at \(Z_r\), and is \(0\) when \(Z_r\) is the original observation and no pseudo-datapoint coincides with it. The contribution of augmented row \(r\) to the empirical loss is \(L_r = D_r\, \tilde h(\alpha(Z_r)) + C_r\, h'(\alpha(Z_r))\), and \(\mathcal L_{h,n}(\alpha) = \sum_r L_r\).
For ATE with an untreated unit (\(a_i = 0\)): before merging there are three evaluation points — \(Z_i = (0, x_i)\) and pseudo-datapoints \((1, x_i)\), \((0, x_i)\). The pseudo-datapoint \((0, x_i)\) coincides with \(Z_i\) (\(D_{i2} = 1\)), so they merge into one entry with \(\tilde h\)-weight \(1\) and \(h'\)-weight \(-C_{i2} = +1\). The other pseudo-datapoint \((1, x_i)\) stays separate with \(\tilde h\)-weight \(0\) and \(h'\)-weight \(-C_{i1} = -1\). The result is two augmented rows instead of three. A treated unit merges the other pseudo-datapoint instead, also producing two rows.
It can also happen that none of the pseudo-datapoints coincide with \(Z_i\) — the union \(\{Z_i\} \cup \{Z_{ij}\}\) then has \(k+1\) distinct entries and no merging occurs. For example, TSM(level=1) on an untreated row (\(a_i = 0\)) has \(Z_i = (0, x_i)\) and a single pseudo-datapoint \((1, x_i)\); they are distinct, so \(Z_i\) enters the augmented dataset as its own row with \(\tilde h\)-weight \(1\) and \(h'\)-weight \(0\) (no pseudo-datapoint lands there), and \((1, x_i)\) enters as a separate row with \(\tilde h\)-weight \(0\) and \(h'\)-weight \(-C_{i1} = -1\). The general rule is the same in every case: include \(Z_i\) once, then add each pseudo-datapoint \(Z_{ij}\) — merging into an existing entry when \(Z_{ij}\) matches a point already in the dataset for row \(i\), otherwise as a new row.
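As a pure-pandas illustration of this rule for the binary-treatment ATE (not the library's implementation), the merging can be written in a few lines; applied to the three-row example that follows, it reproduces the six-row table worked out below.

import pandas as pd

def augment_ate_by_hand(df, treatment="a", covariate="x"):
    # ATE pseudo-datapoint coefficients: C_{i1} = +1 at level 1, C_{i2} = -1 at level 0
    c = {1: 1.0, 0: -1.0}
    rows = []
    for i, row in df.reset_index(drop=True).iterrows():
        a, x = int(row[treatment]), row[covariate]
        # the pseudo-datapoint at the observed level merges with the original
        # observation: tilde-h weight D = 1, h'-weight C = -C_ij
        rows.append({"i": i + 1, "A": a, "X": x, "D": 1, "C": -c[a]})
        # the counterfactual level stays as its own row: D = 0
        rows.append({"i": i + 1, "A": 1 - a, "X": x, "D": 0, "C": -c[1 - a]})
    return pd.DataFrame(rows)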
To make this concrete, take the ATE estimand on \(n=3\) rows:
\(i\)    \(A_i\)    \(X_i\)
1        0          0.2
2        1          0.5
3        1          0.8
For each row, the three evaluation points \(\{Z_i,\, Z_{i1} = (1, X_i),\, Z_{i2} = (0, X_i)\}\) collapse to two distinct entries after merging. The augmented dataset has \(2n = 6\) rows; each row \(r\) carries the evaluation point \((A_r, X_r)\), the \(\tilde h\)-weight \(D_r\), the \(h'\)-weight \(C_r\), and contributes a term \(L_r = D_r \cdot \tilde h(\alpha(A_r, X_r)) + C_r \cdot h'(\alpha(A_r, X_r))\) to the empirical loss:
from \(i\)    \(A_r\)    \(X_r\)    \(D_r\)    \(C_r\)    \(L_r\)
1             0          0.2        \(1\)      \(+1\)     \(\tilde h(\alpha(0, 0.2)) + h'(\alpha(0, 0.2))\)
1             1          0.2        \(0\)      \(-1\)     \(-h'(\alpha(1, 0.2))\)
2             1          0.5        \(1\)      \(-1\)     \(\tilde h(\alpha(1, 0.5)) - h'(\alpha(1, 0.5))\)
2             0          0.5        \(0\)      \(+1\)     \(+h'(\alpha(0, 0.5))\)
3             1          0.8        \(1\)      \(-1\)     \(\tilde h(\alpha(1, 0.8)) - h'(\alpha(1, 0.8))\)
3             0          0.8        \(0\)      \(+1\)     \(+h'(\alpha(0, 0.8))\)
Rows \(i=2\) and \(i=3\) are treated, so the merging happens on the \((1, X_i)\) pseudo-datapoint instead of \((0, X_i)\), flipping which row carries the \(\tilde h\)-weight. Summing the \(L_r\) column recovers the empirical loss: \[
\mathcal L_{h,n}(\alpha) = \sum_{r=1}^{6} L_r = \sum_{i=1}^{3} \big[\tilde h(\alpha(Z_i)) - h'(\alpha(1, X_i)) + h'(\alpha(0, X_i))\big].
\]
Estimand.augment produces this table directly: it stores the two coefficients as is_original (the \(\tilde h\)-weight \(D_r\)) and potential_deriv_coef (the \(h'\)-weight \(C_r\)) on the returned AugmentedDataset.
import numpy as np
import pandas as pd
from rieszreg import ATE

df = pd.DataFrame({"a": [0, 1, 1], "x": [0.2, 0.5, 0.8]})
df
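The augmentation cell itself is not reproduced here. A minimal sketch of the missing step, assuming ATE takes the same treatment=/covariates= keywords as TSM above and that the AugmentedDataset returned by Estimand.augment can be viewed as the dataframe aug_df used below (to_frame() is a hypothetical accessor name):

est = ATE(treatment="a", covariates="x")   # assumed keywords, mirroring TSM above
aug = est.augment(df)                      # AugmentedDataset carrying is_original / potential_deriv_coef
aug_df = aug.to_frame()                    # hypothetical accessor: a dataframe view of the augmented points
aug_df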
i A X D C
0 1 1.0 0.2 0.0 -1.0
1 2 1.0 0.5 1.0 -1.0
2 3 1.0 0.8 1.0 -1.0
3 1 0.0 0.2 1.0 1.0
4 2 0.0 0.5 0.0 1.0
5 3 0.0 0.8 0.0 1.0
Pick a candidate \(\alpha(a, x) = a + x\) and the squared loss (\(\tilde h(t) = t^2\), \(h'(t) = 2t\)). The per-row contribution becomes \(L_r = D \cdot \alpha^2 + 2 C \cdot \alpha\):
F = aug_df["A"] + aug_df["X"]  # α evaluated at each augmented point
aug_df["L_r"] = aug.is_original * F**2 + 2.0 * aug.potential_deriv_coef * F
aug_df
i A X D C L_r
0 1 1.0 0.2 0.0 -1.0 -2.40
1 2 1.0 0.5 1.0 -1.0 -0.75
2 3 1.0 0.8 1.0 -1.0 -0.36
3 1 0.0 0.2 1.0 1.0 0.44
4 2 0.0 0.5 0.0 1.0 1.00
5 3 0.0 0.8 0.0 1.0 1.60
aug_df["L_r"].sum() # empirical loss
np.float64(-0.46999999999999975)
The same number drops out of the original row-wise form \(\sum_i [\alpha(Z_i)^2 - 2\alpha(1, X_i) + 2\alpha(0, X_i)]\):
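(The verifying cell is sketched here; it uses only df and the candidate α defined above, nothing from the library.)

alpha = lambda a, x: a + x                 # same candidate as above
rowwise = sum(
    alpha(a, x) ** 2 - 2 * alpha(1, x) + 2 * alpha(0, x)
    for a, x in zip(df["a"], df["x"])
)
rowwise                                    # -0.47 up to float rounding, matching aug_df["L_r"].sum()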
Custom estimands
To fit a finite-evaluation functional not covered by the built-ins, write the operator m(alpha)(z, y) and wrap it in a FiniteEvalEstimand. The tracer extracts the linear-form structure automatically.
m must be a finite linear combination of point evaluations of \(\alpha\), built with +, -, scalar *, and calls to alpha(...). Squaring, adding constants, integrating, and other operations outside this algebra raise a TypeError.
Note: Python-only
Custom m() is Python-only. From R, define m in Python via reticulate::py_run_string and pass the resulting FiniteEvalEstimand to your estimator. See the R interface page for the recipe.
from rieszreg import FiniteEvalEstimand
from rieszboost import RieszBooster
import numpy as np, pandas as pd

# A weighted contrast: 2x the ATE at level a=2 minus the level at a=1.
def m_weighted_contrast(alpha):
    def inner(z, y=None):
        return 2.0 * (alpha(a=2, x=z["x"]) - alpha(a=0, x=z["x"])) \
            - alpha(a=1, x=z["x"])
    return inner

est = FiniteEvalEstimand(
    feature_keys=("a", "x"),
    m=m_weighted_contrast,
    name="WeightedContrast",
)
print(est)
<rieszreg.estimands.base.FiniteEvalEstimand object at 0x7fa48dd5b320>
The inner function takes (z, y) even when y is unused — that signature is the contract every custom m follows. Pass the estimand to any backend’s estimator class:
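(A sketch; the estimand= keyword and the commented fit call are assumptions, and est is the WeightedContrast estimand defined above.)

booster = RieszBooster(estimand=est)   # assumed constructor keyword
# booster.fit(Z, y) then trains alpha for the custom target, sklearn-style

The tracer's guard-rail is easy to trip. A hypothetical m that adds a constant offset falls outside the linear-form algebra and produces the error shown next:

def m_with_offset(alpha):
    def inner(z, y=None):
        return alpha(a=1, x=z["x"]) + 1.0   # constant offset: not a linear form in alpha
    return inner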
TypeError: Cannot add a scalar to a LinearForm — m must be linear in alpha (no constant offsets).
For estimands whose \(m\) involves integrals or derivatives of \(\alpha\) (outside the finite-linear-form algebra), approximate them as a finite sum of point evaluations — Monte-Carlo samples for integrals, finite differences for derivatives; see the taxonomy section below.
Initialization
Every RieszEstimator subclass takes an init= constructor arg that controls the α-space starting value of the model. The orchestrator converts init to an η-space base_score via loss.alpha_to_eta; backends use base_score as a prediction offset.
The default — init=None — sets α to the empirical loss-minimizing constant. For any Bregman loss with strictly convex \(h\), the constant \(a^*\) that minimizes the per-row Riesz loss \[
P_n L(a) \;=\; \tilde h(a) \;-\; h'(a) \cdot \bar m, \qquad \bar m = P_n[m(\alpha = 1)(Z, Y)]
\] satisfies the FOC \(\tilde h'(a) = h''(a) \bar m\). Using \(\tilde h(t) = t h'(t) - h(t)\) we get \(\tilde h'(t) = t h''(t)\), so the FOC collapses to \(a^* = \bar m\). The result is loss-independent — the only loss-specific step is projecting \(\bar m\) into the loss’s α-domain when its codomain is restricted. Each Loss exposes best_constant_init(m_bar) for that projection.
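For instance, with the squared loss used in the worked example above (\(\tilde h(t) = t^2\), \(h'(t) = 2t\)), \[
P_n L(a) = a^2 - 2a\,\bar m,
\] which is minimized at \(a^* = \bar m\), as claimed.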
For the built-in estimands \(\bar m\) is structurally constant (doesn’t depend on the data realization):
Estimand                                \(m(\alpha = 1)(Z, Y)\) per row    \(\bar m\)
ATE, ATT, AdditiveShift, LocalShift     \(1 + (-1) = 0\)                   \(0\)
TSM(v)                                  \(1\)                              \(1\)
For custom estimands \(\bar m\) can depend on the data; the orchestrator computes it from the training rows after the validation split (no peeking) by tracing each row and averaging the sum of trace coefficients.
For BernoulliLoss + TSM (or any density-ratio estimand with \(\bar m = 1\)) the loss-minimizing constant sits on the upper boundary, and the projection clips to \(1 - \varepsilon\). The sigmoid is saturated at that init and gradients are tiny, so iterative training may stall. Pass init=0.5 (or another interior value) when this matters. The KLLoss / BoundedSquaredLoss boundaries can hit the same issue when \(\bar m\) lands exactly on a bound.
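(A sketch of the override; init= is documented above, while the estimand= keyword and the TSM arguments are carried over from earlier examples and are assumptions about the constructor.)

from rieszboost import RieszBooster
from rieszreg import TSM

model = RieszBooster(
    estimand=TSM(level=1, treatment="a", covariates="x"),
    init=0.5,   # interior starting value, so the sigmoid is not saturated at initialization
)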
Estimand subclass taxonomy and future work
Today every estimand rieszreg handles is a FiniteEvalEstimand: \(m\) reduces to a finite linear combination of point evaluations of \(\mu\). The taxonomy of \(m\)-operators outside this class — integrals, derivatives, and others — is still being designed. Likely subclasses:
Integrals. \(m(\mu)(z, y) = \int \mu(a', x)\,h(a' \mid z)\,da'\) for a known density \(h\). The classic example is a stochastic-intervention policy. Approximate via Monte Carlo to reduce to the finite-evaluation case (with \(K\) samples per row; see the sketch after this list), or support natively via a backend that integrates \(\alpha\) analytically.
Derivatives. \(m(\mu)(z, y) = \partial_a \mu(a, x)\big|_{a = z_a}\). Finite-difference approximations reduce to the finite-evaluation case; backends with autodiff can support it directly.
Other linear operators on \(\mu\). Any \(m\) that satisfies linearity in \(\mu\) (additivity and scalar homogeneity) but doesn't fit the patterns above.
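As referenced in the integrals item above, here is a sketch of the Monte-Carlo reduction for a stochastic-intervention-style integral. The policy density, the number of draws K, and the keyword-call convention for alpha are illustrative assumptions; the point is that an equal-weight finite sum stays inside the +, scalar *, alpha(...) algebra the tracer accepts.

import numpy as np
from rieszreg import FiniteEvalEstimand

K = 25
rng = np.random.default_rng(0)

def m_mc_policy(alpha):
    # approximate \int alpha(a', x) h(a' | z) da' by an equal-weight finite sum
    def inner(z, y=None):
        draws = rng.normal(loc=z["a"], scale=0.1, size=K)  # assumed policy density h(a' | z)
        terms = [alpha(a=float(a_prime), x=z["x"]) for a_prime in draws]
        acc = terms[0]
        for t in terms[1:]:
            acc = acc + t          # only +, scalar *, and alpha(...) calls are used
        return (1.0 / K) * acc
    return inner

est = FiniteEvalEstimand(feature_keys=("a", "x"), m=m_mc_policy, name="MCPolicy")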
A previously released StochasticIntervention factory is currently stubbed (raises NotImplementedError). A reintroduction will land once the integral-vs-finite-evaluation distinction is wired into the type hierarchy.