Tree backend (riesztree)

riesztree fits a single decision tree to estimate the Riesz representer \(\alpha_0\) of a linear functional, by greedy splits on the augmented Bregman-Riesz loss. Each leaf stores a closed-form value \(\alpha_\ell^* = -C_\ell / D_\ell\), projected to the loss’s \(\alpha\)-domain. Loss-aware splits work for every built-in Bregman loss; the per-leaf \(\alpha^*\) formula is universal.

The method is new — there is no published reference at this writing. The package is conceptually the simplest of the four learners, and a useful starting point for understanding what every other learner is doing.

Note

For headline accuracy, reach for forestriesz (random forest) or rieszboost (gradient boosting). A single tree’s value is interpretability: a depth-3 or depth-4 tree is a readable rule, and each leaf’s \(\alpha_\ell^*\) is a subgroup-specific IPW-like weight you can stare at directly.

1 How it works

1.1 The augmented loss

For a finite-evaluation estimand \(m\), the empirical Bregman-Riesz loss reduces over the augmented dataset to

\[ \hat L(\alpha) \;=\; \frac{1}{n}\sum_{r=1}^{M} \Big[\, D_r\, \tilde h(\alpha(z_r)) \;+\; C_r\, h'(\alpha(z_r)) \,\Big], \]

with one row per evaluation point \(z_r\), weight \(D_r \in \{0, 1\}\) (\(1\) for original observations, \(0\) for counterfactual evaluation points), and linear coefficient \(C_r\) accumulating the (negated) trace coefficients of \(m\) at \(z_r\), so that for ATE the treated original row gets \(C_r = -1\) (see the table below). See Losses for the derivation.

For ATE on \(Z_i = (a_i, x_i)\) the trace expands to two evaluation points \((1, x_i)\) and \((0, x_i)\) with coefficients \(+1\) and \(-1\):

| augmented row \(z_r\) | \(D_r\) | \(C_r\) |
|---|---|---|
| original-merged \((a_i, x_i)\) | \(1\) | \(-1\) if \(a_i = 1\), else \(+1\) |
| counterfactual at the other arm \((1 - a_i, x_i)\) | \(0\) | \(+1\) if \(a_i = 1\), else \(-1\) |
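
As a concrete illustration, here is a minimal NumPy/pandas sketch (not the package's internal augmentation code; the helper name build_ate_augmented is made up for this example) that materialises the ATE table above from a data frame with columns a and x:

import numpy as np, pandas as pd

def build_ate_augmented(df):
    # One original row (D = 1) plus one counterfactual row (D = 0) per observation.
    a, x = df["a"].to_numpy(), df["x"].to_numpy()
    original = pd.DataFrame({
        "a": a, "x": x, "D": 1.0,
        "C": np.where(a == 1.0, -1.0, 1.0),      # -1 for treated originals, +1 for controls
    })
    counterfactual = pd.DataFrame({
        "a": 1.0 - a, "x": x, "D": 0.0,           # evaluation point at the other arm
        "C": np.where(a == 1.0, 1.0, -1.0),
    })
    return pd.concat([original, counterfactual], ignore_index=True)

# aug = build_ate_augmented(df)
# aug["D"].sum() equals n (the original rows); aug["C"].sum() equals 0 (coefficients cancel pairwise).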

1.2 Per-leaf decoupling

Consider a tree with leaves \(\ell = 1, \ldots, K\) assigning a constant \(\alpha_\ell\) in leaf \(\ell\). Let \(D_\ell = \sum_{r \in \ell} D_r\) and \(C_\ell = \sum_{r \in \ell} C_r\). The empirical loss becomes

\[ \hat L(\alpha_1, \ldots, \alpha_K) = \frac{1}{n}\sum_\ell \big[\, D_\ell\, \tilde h(\alpha_\ell) + C_\ell\, h'(\alpha_\ell) \,\big]. \]

The summand for leaf \(\ell\) depends only on \(\alpha_\ell, D_\ell, C_\ell\) — no cross-leaf coupling. Setting the derivative to zero and using the identity \(\tilde h'(t) = t \cdot h''(t)\) (immediate from \(\tilde h = t\, h' - h\)):

\[ h''(\alpha_\ell) \cdot (D_\ell\, \alpha_\ell + C_\ell) = 0. \]

Since \(h\) is strictly convex, \(h''(\alpha_\ell) > 0\), so

\[ \alpha_\ell^* \;=\; -\,\frac{C_\ell}{D_\ell} \]

(when \(-C_\ell / D_\ell\) lies in the loss’s \(\alpha\)-domain; otherwise project to the boundary). This formula is universal across SquaredLoss, KLLoss, BernoulliLoss, and BoundedSquaredLoss — only the link’s \(\alpha\)-domain changes.
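
A hedged sketch of the universal leaf rule (illustrative helper, not the package API): compute \(-C_\ell / D_\ell\) and project to the \(\alpha\)-domain when the loss has one, e.g. clip to \([lo, hi]\) for BoundedSquaredLoss.

def leaf_value(D_leaf, C_leaf, lo=None, hi=None):
    # Closed-form per-leaf optimum; projection only matters for bounded alpha-domains.
    # Degenerate leaves (D_leaf == 0) are assumed to be ruled out by min-samples guards.
    alpha = -C_leaf / D_leaf
    if lo is not None:
        alpha = max(alpha, lo)
    if hi is not None:
        alpha = min(alpha, hi)
    return alpha

# leaf_value(40.0, 10.0)          -> -0.25  (SquaredLoss: no projection needed)
# leaf_value(40.0, 10.0, lo=0.0)  ->  0.0   (projected to the domain boundary)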

1.3 Splitting criterion

For a candidate split of parent leaf \(p\) into children \(\ell, r\), the gain is the reduction in empirical loss at the per-leaf optima:

\[ \Delta(\text{split}) \;=\; \hat L_p(\alpha_p^*) - \hat L_\ell(\alpha_\ell^*) - \hat L_r(\alpha_r^*). \]

The gain expression depends on the loss. For squared loss, the per-leaf objective at the optimum is

\[ \hat L_\ell(\alpha_\ell^*) \;=\; D_\ell\, (\alpha_\ell^*)^2 + 2 C_\ell\, \alpha_\ell^* \;=\; D_\ell \!\left(\!-\frac{C_\ell}{D_\ell}\!\right)^{\!2} + 2 C_\ell\!\left(\!-\frac{C_\ell}{D_\ell}\!\right) \;=\; -\frac{C_\ell^2}{D_\ell}, \]

so

\[ \Delta_{\text{sq}} \;=\; \frac{C_\ell^2}{D_\ell} + \frac{C_r^2}{D_r} - \frac{C_p^2}{D_p}. \]

Because the augmented rows include counterfactual points with \(D = 0\), which contribute only to \(C_\ell\) and therefore add a linear shift in \(\alpha\) to the within-leaf objective rather than a quadratic term, the gain cannot be encoded as a vanilla weighted regression target. A custom Cython criterion plugged into sklearn's tree builder would work; instead the package ships its own splitter (Cython, with a legacy pure-NumPy fallback; see Splitter modes below).
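
For intuition, a minimal sketch of the exact squared-loss sweep for a single feature (pure NumPy, mirroring the splitter's logic rather than its actual code): sort the augmented rows by the feature, accumulate prefix sums of \(D\) and \(C\), and score every boundary between distinct feature values with \(\Delta_{\text{sq}}\).

import numpy as np

def best_split_sq(feature, D, C, min_samples_leaf=1):
    # Return (best_threshold, best_gain) under the gain C_l^2/D_l + C_r^2/D_r - C_p^2/D_p.
    order = np.argsort(feature, kind="mergesort")
    f, d, c = feature[order], D[order], C[order]
    D_cum, C_cum = np.cumsum(d), np.cumsum(c)
    D_tot, C_tot = D_cum[-1], C_cum[-1]
    parent = C_tot**2 / D_tot
    best_gain, best_thr = -np.inf, None
    for k in range(len(f) - 1):
        if f[k] == f[k + 1]:                      # only split between distinct feature values
            continue
        D_l, C_l = D_cum[k], C_cum[k]
        D_r, C_r = D_tot - D_l, C_tot - C_l
        if D_l < min_samples_leaf or D_r < min_samples_leaf:
            continue                              # each child needs original (D > 0) rows
        gain = C_l**2 / D_l + C_r**2 / D_r - parent
        if gain > best_gain:
            best_gain, best_thr = gain, 0.5 * (f[k] + f[k + 1])
    return best_thr, best_gain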

For other losses the gain has its own analytic form. KL (\(h(t) = t \log t - t\)), for instance, gives \(\hat L_\ell(\alpha_\ell^*) = -C_\ell + C_\ell \log(-C_\ell / D_\ell)\) when \(C_\ell < 0\), and the gain is

\[ \Delta_{\text{KL}} \;=\; -C_\ell \log\!\frac{-C_\ell}{D_\ell} - C_r \log\!\frac{-C_r}{D_r} + C_p \log\!\frac{-C_p}{D_p}. \]

The orderings of candidate splits under \(\Delta_{\text{sq}}\) and \(\Delta_{\text{KL}}\) can differ — the splits depend on the loss, even though the leaf values given a partition don’t.

2 Quickstart

import numpy as np, pandas as pd
from riesztree import RieszTreeRegressor, ATE

rng = np.random.default_rng(0)
n = 1500
x = rng.uniform(0, 1, n)
pi = 1 / (1 + np.exp(-(-0.02*x - x**2 + 4*np.log(x + 0.3) + 1.5)))
a = rng.binomial(1, pi).astype(float)
df = pd.DataFrame({"a": a, "x": x})

est = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("x",)),
    max_depth=4,
    random_state=0,
)
est.fit(df)
RieszTreeRegressor(estimand=<rieszreg.estimands.base.ATE object at 0x7f44f10e7bc0>,
                   max_depth=4)
alpha_hat  = est.predict(df)
true_alpha = a / pi - (1 - a) / (1 - pi)
print(f"corr(α̂, α₀) = {np.corrcoef(alpha_hat, true_alpha)[0, 1]:.3f}")
corr(α̂, α₀) = 0.973
print(f"RMSE         = {np.sqrt(np.mean((alpha_hat - true_alpha)**2)):.3f}")
RMSE         = 0.601
pkgload::load_all("../packages/riesztree/r/riesztree")
use_python_riesztree(file.path(getwd(), "../.venv/bin/python"))

set.seed(0)
n  <- 1500
x  <- runif(n)
pi <- 1 / (1 + exp(-(-0.02 * x - x^2 + 4 * log(x + 0.3) + 1.5)))
a  <- rbinom(n, 1, pi)
df <- data.frame(a = as.numeric(a), x = x)

rt <- RieszTreeRegressor$new(
  estimand = ATE(treatment = "a", covariates = "x"),
  max_depth = 4L, random_state = 0L
)
rt$fit(df)
alpha_hat <- as.numeric(rt$predict(df))

3 Growth strategies

Two growth policies are supported:

  • Depthwise (growth_policy="depthwise", default). Standard recursive depth-first growth. Stops at max_depth or when min_samples_split blocks the node.
  • Leafwise (growth_policy="leafwise"). Best-first growth: at each step split the leaf whose best candidate has the highest gain anywhere in the tree. Termination on max_leaf_nodes (the natural cap for this policy) or when no positive-gain split remains.
from riesztree import RieszTreeRegressor, ATE, n_leaves

est_dw = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("x",)),
    growth_policy="depthwise", max_depth=4,
).fit(df)

est_lw = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("x",)),
    growth_policy="leafwise", max_leaf_nodes=16, max_depth=20,
).fit(df)

print(f"depthwise leaves: {n_leaves(est_dw.predictor_.tree)}")
depthwise leaves: 14
print(f"leafwise  leaves: {n_leaves(est_lw.predictor_.tree)}")
leafwise  leaves: 16

4 Pruning

Two regularisation paths are available:

  • Cost-complexity pruning via ccp_alpha > 0. Standard Breiman et al. weakest-link algorithm: walk the tree bottom-up, computing the penalised per-subtree risk \(R_\lambda(T) = R(T) + \lambda\, |\text{leaves}(T)|\) with \(\lambda =\) ccp_alpha, and collapse the weakest link until the minimum link strength satisfies \(g_{\min} \geq \lambda\). Here \(R(T)\) is the augmented Bregman training loss summed over leaves (\(\lambda\) is written in place of the usual \(\alpha\) to avoid a clash with the Riesz representer).
  • Early stopping via early_stopping_rounds + validation_fraction. Stops growing when the held-out augmented loss has not improved for that many consecutive accepted splits, mirroring rieszboost / riesznet.

Both are off by default. Combine freely with either growth policy.

est_pruned = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("x",)),
    max_depth=10, ccp_alpha=0.5,
).fit(df)

est_es = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("x",)),
    max_depth=10, early_stopping_rounds=3, validation_fraction=0.2,
).fit(df)

print(f"unpruned: {n_leaves(RieszTreeRegressor(estimand=ATE(treatment='a', covariates=('x',)), max_depth=10).fit(df).predictor_.tree)} leaves")
unpruned: 83 leaves
print(f"pruned  : {n_leaves(est_pruned.predictor_.tree)} leaves")
pruned  : 64 leaves
print(f"early-stop: {n_leaves(est_es.predictor_.tree)} leaves")
early-stop: 9 leaves

5 Categorical predictors

Declare integer-coded categorical columns via categorical_features=(col_idx, ...). The splitter uses the standard CART trick: order levels by within-level \(\alpha_{\text{level}}^* = -C_{\text{level}} / D_{\text{level}}\), then sweep contiguous splits in that order. The ordering theorem carries over because the augmented Bregman per-leaf objective is convex.
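
For concreteness, a hedged sketch of that sweep under squared loss (illustrative only, not the package's splitter): compute per-level \(D\) and \(C\), sort levels by \(-C/D\), and score contiguous prefixes of the sorted order. The end-to-end example below then exercises the same mechanism through categorical_features.

import numpy as np

def best_categorical_split_sq(levels, D, C):
    # Order levels by their leaf value -C/D, then score each contiguous prefix with the
    # squared-loss gain. Assumes every level contains at least one original (D > 0) row.
    uniq = np.unique(levels)
    D_lvl = np.array([D[levels == u].sum() for u in uniq])
    C_lvl = np.array([C[levels == u].sum() for u in uniq])
    order = np.argsort(-C_lvl / D_lvl)                 # CART trick: sort by per-level alpha*
    D_cum, C_cum = np.cumsum(D_lvl[order]), np.cumsum(C_lvl[order])
    D_tot, C_tot = D_cum[-1], C_cum[-1]
    best_gain, best_subset = -np.inf, None
    for k in range(len(uniq) - 1):                     # left child = first k+1 levels in alpha* order
        D_l, C_l = D_cum[k], C_cum[k]
        D_r, C_r = D_tot - D_l, C_tot - C_l
        if D_l == 0 or D_r == 0:
            continue
        gain = C_l**2 / D_l + C_r**2 / D_r - C_tot**2 / D_tot
        if gain > best_gain:
            best_gain, best_subset = gain, set(uniq[order[:k + 1]])
    return best_subset, best_gain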

rng = np.random.default_rng(0)
n_cat = 2000
pi_per_level = rng.uniform(0.1, 0.9, size=8)
cat = rng.integers(0, 8, size=n_cat).astype(float)
pi_cat = pi_per_level[cat.astype(int)]
a_cat = (rng.uniform(0, 1, size=n_cat) < pi_cat).astype(float)
df_cat = pd.DataFrame({"a": a_cat, "cat": cat, "x": rng.normal(size=n_cat)})

est_cat = RieszTreeRegressor(
    estimand=ATE(treatment="a", covariates=("cat", "x")),
    max_depth=4,
    categorical_features=(1,),    # column 1 in feature_keys = ("a","cat","x")
).fit(df_cat)

alpha_cat = est_cat.predict(df_cat)
print(f"alpha range: [{alpha_cat.min():.3f}, {alpha_cat.max():.3f}]")
alpha range: [-7.684, 15.769]

6 Save / load

import tempfile, os
with tempfile.TemporaryDirectory() as tmp:
    save_dir = os.path.join(tmp, "fitted")
    est.save(save_dir)

    loaded = RieszTreeRegressor.load(save_dir)
    pred_loaded = loaded.predict(df)
    print(f"max abs diff after round-trip: {np.max(np.abs(pred_loaded - alpha_hat)):.2e}")
max abs diff after round-trip: 0.00e+00

For built-in estimands the metadata fully reconstructs the estimand on load. For custom estimands pass estimand= to RieszTreeRegressor.load(path, estimand=...).

7 Sklearn integration

RieszTreeRegressor is a BaseEstimator, so it composes with clone, GridSearchCV, cross_val_predict, Pipeline. See Tuning and Cross-Fitting.

from sklearn.model_selection import GridSearchCV
gs = GridSearchCV(
    RieszTreeRegressor(estimand=ATE(treatment="a", covariates=("x",))),
    {"max_depth": [2, 4, 6, 8]},
    cv=3, n_jobs=1,
)
gs.fit(df)
GridSearchCV(cv=3,
             estimator=RieszTreeRegressor(estimand=<rieszreg.estimands.base.ATE object at 0x7f44f11157c0>),
             n_jobs=1, param_grid={'max_depth': [2, 4, 6, 8]})
print(f"best max_depth = {gs.best_params_['max_depth']}")
best max_depth = 4

8 Hyperparameters

Mirrors sklearn.tree.DecisionTreeRegressor where the augmented Bregman-Riesz setting allows.

| Knob | Default | Notes |
|---|---|---|
| max_depth | 8 | Cap on tree depth. |
| min_samples_split | 20 | Minimum count of original (\(D > 0\)) augmented rows in a node before considering a split. |
| min_samples_leaf | 10 | Minimum count of original rows in each child of a candidate split. |
| min_weight_fraction_leaf | 0.0 | Sklearn parity. With unit weights, leaves must hold \(\geq \lceil \text{frac} \cdot n_{\text{orig}} \rceil\) original rows (combined with min_samples_leaf via max(...)). |
| max_leaf_nodes | 31 | Cap for leafwise growth. Ignored when growth_policy="depthwise". |
| max_features | None | Per-split feature subsample. None, "sqrt", "log2", an int, or a float in \((0, 1]\). Critical for forest decorrelation. |
| growth_policy | "depthwise" | Or "leafwise". |
| min_impurity_decrease | 0.0 | Reject splits with gain \(\leq\) this threshold. |
| ccp_alpha | 0.0 | Cost-complexity pruning penalty. Sklearn name. |
| early_stopping_rounds | None | Stop when the held-out augmented loss has not improved for that many accepted splits. |
| validation_fraction | 0.1 | Held-out fraction for early stopping. Ignored when early stopping is off. |
| categorical_features | None | Sequence of column indices treated as integer category labels. |
| loss | SquaredLoss() | Bregman-Riesz loss. |
| random_state | 0 | Seeds the per-split feature subsample (max_features) and the per-feature random threshold (splitter="random"). |
| splitter | "exact" | One of "exact" (Cython per-feature sweep), "hist" (Cython quantile-binned histogram, fastest at large \(n\)), "random" (sklearn ExtraTrees-style), "python" (legacy; deprecated, removed in v0.0.3). |
| max_bins | 255 | Quantile bins per feature when splitter="hist". Sklearn HGB convention; fits in uint8. |

The v0.0.1 names max_leaves and pruning_alpha are accepted as deprecated aliases for max_leaf_nodes and ccp_alpha; passing them emits a FutureWarning.

9 Splitter modes

riesztree ships three Cython splitter paths plus a legacy Python fallback:

# Compare splitter modes on the same data — predictions and timings.
import time
from riesztree import RieszTreeRegressor, ATE, n_leaves

estimand = ATE(treatment="a", covariates=("x",))

for sp in ["exact", "hist", "random"]:
    t0 = time.perf_counter()
    est = RieszTreeRegressor(estimand=estimand, max_depth=4, splitter=sp).fit(df)
    fit_ms = (time.perf_counter() - t0) * 1000
    print(f"splitter={sp:6s}  fit={fit_ms:6.1f} ms  leaves={n_leaves(est.predictor_.tree)}")
splitter=exact   fit=   2.0 ms  leaves=14
splitter=hist    fit=   3.6 ms  leaves=14
splitter=random  fit=   1.5 ms  leaves=12
  • "exact" (default): per-feature linear scan over distinct values. Best partition; same byte-for-byte as the legacy Python splitter on the four built-in losses.
  • "hist": pre-binned (default 255 quantile bins per feature). O(n_{\text{leaf}} + \text{max\_bins}) per feature per leaf instead of O(n_{\text{leaf}} \log n_{\text{leaf}}). Fastest at large \(n\); threshold lands on a bin boundary so split-ranking is approximate but the leaf \(\alpha^\star\) itself is exact.
  • "random": sklearn ExtraTrees-style. One uniform threshold per feature per leaf; useful for forest decorrelation when combined with max_features="sqrt".

For losses outside the four built-ins (SquaredLoss, KLLoss, BernoulliLoss, BoundedSquaredLoss), the Cython splitter routes through a Numba @cfunc if one has been registered via riesztree.fast.register_fast_leaf_solver; otherwise it falls back to the Python path with a one-time UserWarning.

10 Custom-loss extension hook

import numba
from rieszreg import Loss
from riesztree.fast import register_fast_leaf_solver

class MyLoss(Loss):
    ...   # implement the Loss interface

@numba.cfunc("float64(float64, float64)", cache=True, nopython=True)
def my_leaf_loss(D, C):
    if D <= 0.0:
        return 0.0
    return -C * C / D       # SquaredLoss-equivalent here

def my_alpha(D, C):
    return 0.0 if D <= 0 else -C / D

register_fast_leaf_solver(MyLoss, my_leaf_loss, my_alpha)

# Subsequent fits with loss=MyLoss() use the C-speed Cython splitter.

The cfunc is called from the Cython splitter’s tight loop without the GIL — same per-evaluation cost as the four built-in losses. See riesztree’s BENCH_BASELINE.md for the locked baselines and a comparison vs sklearn / LightGBM / XGBoost.

11 Loss-aware splits

The splitter uses the analytic per-leaf-loss-at-optimum for each Bregman loss:

| Loss | \(\alpha_\ell^*\) in-domain when | \(\hat L_\ell(\alpha_\ell^*)\) |
|---|---|---|
| SquaredLoss | always | \(-C_\ell^2 / D_\ell\) |
| KLLoss | \(C_\ell < 0\) (boundary at \(C_\ell = 0\)) | \(-C_\ell + C_\ell \log(-C_\ell / D_\ell)\) |
| BernoulliLoss | \(-D_\ell < C_\ell < 0\) | \(D_\ell \log D_\ell - (D_\ell + C_\ell) \log(D_\ell + C_\ell) + C_\ell \log(-C_\ell)\) |
| BoundedSquaredLoss(lo, hi) | always (project \(\alpha_\ell^*\) to \([lo, hi]\)) | \(D_\ell (\alpha_\ell^*)^2 + 2 C_\ell \alpha_\ell^*\) |

Out-of-domain children are disqualified (+∞ leaf loss). Pair KLLoss / BernoulliLoss with density-ratio-style estimands (TSM); difference-of-evaluations estimands (ATE, ATT) frequently produce out-of-domain leaves under those losses.
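
The table reads directly as code. A hedged sketch of the per-leaf loss-at-optimum with domain checks (hypothetical helper, mirroring the formulas above rather than the package's Cython kernels); the package-level example below then exercises KLLoss end-to-end with a TSM estimand.

import numpy as np

def leaf_loss_at_optimum(D, C, loss="squared", lo=-np.inf, hi=np.inf):
    # Per-leaf loss at the closed-form optimum; +inf when -C/D falls outside the alpha-domain.
    if loss == "squared":
        return -C**2 / D
    if loss == "kl":                                   # alpha* = -C/D must be strictly positive
        if C >= 0:
            return np.inf
        return -C + C * np.log(-C / D)
    if loss == "bernoulli":                            # alpha* must lie in (0, 1)
        if not (-D < C < 0):
            return np.inf
        return D * np.log(D) - (D + C) * np.log(D + C) + C * np.log(-C)
    if loss == "bounded_squared":                      # project alpha* to [lo, hi], then evaluate
        alpha = min(max(-C / D, lo), hi)
        return D * alpha**2 + 2 * C * alpha
    raise ValueError(loss)

# A candidate split is disqualified when either child's loss is +inf.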

from riesztree import RieszTreeRegressor, TSM
from rieszreg import KLLoss

est_kl = RieszTreeRegressor(
    estimand=TSM(level=1.0, treatment="a", covariates=("x",)),
    loss=KLLoss(),
    max_depth=4, random_state=0,
).fit(df)
alpha_kl = est_kl.predict(df)
print(f"all alpha_hat ≥ 0: {bool((alpha_kl >= 0).all())}")
all alpha_hat ≥ 0: True
print(f"alpha range      : [{alpha_kl.min():.3f}, {alpha_kl.max():.3f}]")
alpha range      : [0.000, 14.167]

12 What you reach for instead

riesztree is intentionally a single tree. For higher-accuracy variants:

  • Single-tree augmentation in a forest: see forestriesz/AugForestRieszRegressor — the same augmentation-style splitter, ensembled.
  • Linear-in-\(X\) leaves (model trees): not implemented in v1. Within-leaf the augmented Bregman loss is still a small convex problem in the leaf’s \(X\)-coefficients, so the decoupling property survives. Planned.
  • Treatment-dimension sieve (à la forestriesz’s riesz_feature_fns="auto", which uses \([\mathbf 1\{T = 0\}, \mathbf 1\{T = 1\}]\) as basis): not implemented in v1. The tree code generalises naturally.

13 Custom estimands

Because riesztree implements the augmentation-style Backend.fit_augmented, any user-supplied FiniteEvalEstimand works without configuration: \(D_r\) and \(C_r\) vary across augmented rows even with no sieve. (Contrast forestriesz’s moment-style backend, which raises a row-constant degeneracy error on AdditiveShift / LocalShift / custom estimands.) Pass your custom estimand and fit normally:

from rieszreg import FiniteEvalEstimand, LinearForm

def m_custom(alpha):
    def inner(z, y=None):
        return LinearForm.point(1.0, {"a": z["a"] + 1.0, "x": z["x"]}) \
             - LinearForm.point(1.0, {"a": z["a"], "x": z["x"]})
    return inner

custom = FiniteEvalEstimand(feature_keys=("a", "x"), m=m_custom)
est = RieszTreeRegressor(estimand=custom, max_depth=4).fit(df)

14 Consistency

For the squared Bregman-Riesz loss the population objective satisfies \[ L(\alpha) \;-\; L(\alpha_0) \;=\; \|\alpha - \alpha_0\|_{L^2(P)}^2, \] so minimising \(L\) is \(L^2\)-regression onto the (unobserved) Riesz representer \(\alpha_0\). Theorem 1 below makes this rigorous for the empirical risk minimiser over piecewise-constant functions on tree partitions.
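
A quick Monte Carlo sanity check of this identity on the ATE functional (illustrative only, with a hand-picked propensity and an arbitrary candidate \(\alpha\)): \(L(\alpha) - L(\alpha_0)\) and \(\|\alpha - \alpha_0\|_{L^2}^2\) agree up to simulation error.

import numpy as np

rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0, 1, n)
pi = 0.2 + 0.6 * x                                    # known propensity
a = (rng.uniform(0, 1, n) < pi).astype(float)

def alpha_true(a_, x_):                               # Riesz representer for ATE
    p = 0.2 + 0.6 * x_
    return a_ / p - (1 - a_) / (1 - p)

def alpha_cand(a_, x_):                               # an arbitrary candidate alpha
    return (2 * a_ - 1) * (1.5 + x_)

def L_hat(alpha_fn):
    # Empirical squared Bregman-Riesz loss: E[alpha(Z)^2 - 2 m(alpha)(Z)], m(g)(z) = g(1,x) - g(0,x).
    return np.mean(alpha_fn(a, x) ** 2 - 2 * (alpha_fn(1.0, x) - alpha_fn(0.0, x)))

lhs = L_hat(alpha_cand) - L_hat(alpha_true)
rhs = np.mean((alpha_cand(a, x) - alpha_true(a, x)) ** 2)
print(f"L(alpha) - L(alpha0) = {lhs:.3f}   ||alpha - alpha0||^2 = {rhs:.3f}")   # agree up to MC error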

14.1 Setting

Let \(Z_1, \ldots, Z_n \overset{\text{iid}}{\sim} P\) on a measurable space \(\mathcal{Z} \subseteq \mathbb{R}^{d+1}\) (treatment plus \(d\) covariates). Let \(m\) be a finite-evaluation linear functional with trace \[ m(g)(z) \;=\; \sum_{r=1}^{R(z)} c_r(z)\, g(z_r(z)) \] and Riesz representer \(\alpha_0 \in L^2(P)\) defined by \(\mathbb{E}[m(g)(Z)] = \mathbb{E}[\alpha_0(Z)\, g(Z)]\) for all \(g \in L^2(P)\). The squared Bregman-Riesz loss and its empirical analog are \[ L(\alpha) \;=\; \mathbb{E}\!\left[\alpha(Z)^2 - 2\, m(\alpha)(Z)\right], \qquad \hat L_n(\alpha) \;=\; \frac{1}{n}\sum_{i=1}^n \left[\alpha(Z_i)^2 - 2\, m(\alpha)(Z_i)\right]. \]

Let \(\mathcal{T}_K\) be the class of binary tree partitions of \(\mathcal{Z}\) into at most \(K\) leaves with axis-aligned splits, and let \[ \mathcal{F}_{K,M} \;=\; \big\{\alpha : \mathcal{Z} \to [-M, M] \;\big|\; \alpha \text{ is constant on each leaf of some } \Pi \in \mathcal{T}_K\big\}. \] Define the global empirical risk minimiser at complexity \(K_n\), with leaf values clipped to \([-M, M]\): \[ \hat\alpha_n \;\in\; \arg\min_{\alpha \in \mathcal{F}_{K_n, M}} \hat L_n(\alpha). \]

14.2 Assumptions

  • (A1) Bounded representer. \(\|\alpha_0\|_\infty \le M < \infty\).
  • (A2) Bounded trace. There exist constants \(B, R < \infty\) such that \(R(z) \le R\) and \(\sum_{r=1}^{R(z)} |c_r(z)| \le B\) for \(P\)-a.e. \(z\).
  • (A3) Approximation. \(\inf_{\alpha \in \mathcal{F}_{K_n, M}} \|\alpha - \alpha_0\|_{L^2(P)} \to 0\) as \(n \to \infty\).
  • (A4) Capacity. \(K_n \to \infty\) and \(K_n \log n / n \to 0\) as \(n \to \infty\), with \(d\) fixed.

14.3 Theorem

Theorem 1. Under (A1)–(A4), \[ \|\hat\alpha_n - \alpha_0\|_{L^2(P)} \;\xrightarrow{P}\; 0. \]

Write \(\Delta_n(\alpha) := \hat L_n(\alpha) - L(\alpha)\).

Step 1 (loss identity). For any \(\alpha \in L^2(P)\), \[ L(\alpha) - L(\alpha_0) \;=\; \mathbb{E}[\alpha^2] - \mathbb{E}[\alpha_0^2] - 2\, \mathbb{E}[m(\alpha) - m(\alpha_0)] \;=\; \mathbb{E}[\alpha^2] - \mathbb{E}[\alpha_0^2] - 2\, \mathbb{E}[\alpha_0(\alpha - \alpha_0)] \;=\; \|\alpha - \alpha_0\|_{L^2(P)}^2, \] using the Riesz identity \(\mathbb{E}[m(g)(Z)] = \mathbb{E}[\alpha_0(Z) g(Z)]\) in the second equality, applied to \(g = \alpha - \alpha_0 \in L^2(P)\) (which is in \(L^2\) by (A1) and the assumption \(\alpha \in \mathcal{F}_{K_n, M}\) so \(\|\alpha\|_\infty \le M\)).

Step 2 (oracle decomposition). Let \(\alpha_n^\star \in \arg\min_{\alpha \in \mathcal{F}_{K_n, M}} L(\alpha)\). By optimality of \(\hat\alpha_n\) for \(\hat L_n\), \(\hat L_n(\hat\alpha_n) - \hat L_n(\alpha_n^\star) \le 0\), so \[ \begin{aligned} L(\hat\alpha_n) - L(\alpha_0) &= \big[L(\hat\alpha_n) - \hat L_n(\hat\alpha_n)\big] + \big[\hat L_n(\hat\alpha_n) - \hat L_n(\alpha_n^\star)\big] \\ &\quad + \big[\hat L_n(\alpha_n^\star) - L(\alpha_n^\star)\big] + \big[L(\alpha_n^\star) - L(\alpha_0)\big] \\ &\le -\Delta_n(\hat\alpha_n) + 0 + \Delta_n(\alpha_n^\star) + \inf_{\alpha \in \mathcal{F}_{K_n, M}} \|\alpha - \alpha_0\|_{L^2(P)}^2 \\ &\le 2 \sup_{\alpha \in \mathcal{F}_{K_n, M}} |\Delta_n(\alpha)| + \inf_{\alpha \in \mathcal{F}_{K_n, M}} \|\alpha - \alpha_0\|_{L^2(P)}^2. \end{aligned} \] The substitution in the third line uses Step 1 to write \(L(\alpha_n^\star) - L(\alpha_0) = \inf_{\alpha} \|\alpha - \alpha_0\|_{L^2(P)}^2\) (since \(\alpha_n^\star\) minimises \(L\) over \(\mathcal{F}_{K_n, M}\)). The infimum vanishes by (A3); it remains to control the supremum.

Step 3 (uniform LLN). Decompose \[ \Delta_n(\alpha) \;=\; \underbrace{\big(\hat E_n \alpha^2 - \mathbb{E}[\alpha^2]\big)}_{=:\, U_n(\alpha)} \;-\; 2\, \underbrace{\big(\hat E_n F_\alpha - \mathbb{E}[F_\alpha]\big)}_{=:\, V_n(\alpha)}, \qquad F_\alpha(z) \,:=\, \sum_{r=1}^{R(z)} c_r(z)\, \alpha(z_r(z)), \] where \(\hat E_n f := \frac{1}{n}\sum_i f(Z_i)\). By (A1)-(A2), every \(\alpha \in \mathcal{F}_{K_n, M}\) satisfies \(\|\alpha\|_\infty \le M\), \(\|\alpha^2\|_\infty \le M^2\), and \(\|F_\alpha\|_\infty \le BM\).

We bound the empirical \(L^1\)-covering number of \(\mathcal{F}_{K_n, M}\), \(\mathcal{Q}_n := \{\alpha^2 : \alpha \in \mathcal{F}_{K_n, M}\}\), and \(\mathcal{M}_n := \{F_\alpha : \alpha \in \mathcal{F}_{K_n, M}\}\). For a class \(\mathcal{F}\) and a measure \(\mu\) on \(\mathcal{Z}\), write \(N_1(\epsilon, \mathcal{F}, \mu)\) for the smallest \(N\) such that \(\mathcal{F}\) admits an \(\epsilon\)-cover in \(L^1(\mu)\).

Each \(\alpha \in \mathcal{F}_{K_n, M}\) is determined by a partition \(\Pi \in \mathcal{T}_{K_n}\) and a value vector \((\alpha_\ell)_{\ell=1}^{K_n} \in [-M, M]^{K_n}\). Two functions with the same partition and value vectors \((\alpha_\ell), (\alpha'_\ell)\) satisfy \[ \|\alpha - \alpha'\|_{L^1(P_n)} \;\le\; \max_\ell |\alpha_\ell - \alpha'_\ell|. \] The number of distinct partitions in \(\mathcal{T}_{K_n}\) that yield distinct restrictions to the data points \(Z_1, \ldots, Z_n\) together with all augmented evaluation points \(\{z_r(Z_i) : i \le n,\, r \le R\}\) (a set of size at most \(nR\)) is at most \((d \cdot nR)^{K_n - 1} \le (dnR)^{K_n}\): each of the \(\le K_n - 1\) splits picks one of \(d\) axes and one of at most \(nR\) achievable thresholds along that axis. Combined with an \(\epsilon\)-cover of \([-M, M]^{K_n}\) in \(L^\infty\) of size \(\lceil 2M/\epsilon \rceil^{K_n}\), \[ N_1\!\big(\epsilon,\, \mathcal{F}_{K_n, M},\, P_n\big) \;\le\; (dnR)^{K_n}\, (2M/\epsilon)^{K_n}. \] Composition: since \(|a^2 - b^2| \le 2M\, |a - b|\) on \([-M, M]\) and \(|F_\alpha(z) - F_{\alpha'}(z)| \le T(z)\, \max_\ell |\alpha_\ell - \alpha'_\ell| \le B\, \max_\ell |\alpha_\ell - \alpha'_\ell|\) by (A2), where \(T(z) := \sum_{r=1}^{R(z)} |c_r(z)|\) is the trace coefficient sum, \[ N_1\!\big(\epsilon,\, \mathcal{Q}_n,\, P_n\big) \;\le\; (dnR)^{K_n}\, (4M^2/\epsilon)^{K_n}, \qquad N_1\!\big(\epsilon,\, \mathcal{M}_n,\, P_n\big) \;\le\; (dnR)^{K_n}\, (2BM/\epsilon)^{K_n}. \]

Pollard’s uniform Vapnik-Chervonenkis inequality (Pollard 1984, Theorem II.24; equivalently Devroye, Györfi and Lugosi 1996, Theorem 12.6) states that for any class \(\mathcal{F}\) of \([-B', B']\)-valued measurable functions and any \(\epsilon > 0\), \[ P\!\left(\sup_{f \in \mathcal{F}} \big|\hat E_n f - \mathbb{E}[f]\big| > \epsilon\right) \;\le\; 8\, \mathbb{E}\!\big[N_1(\epsilon/8, \mathcal{F}, P_n)\big]\, \exp\!\big(-n \epsilon^2 / (128\, B'^2)\big). \] Apply this to \(\mathcal{Q}_n\) with \(B' = M^2\) and to \(\mathcal{M}_n\) with \(B' = BM\): \[ \begin{aligned} P\!\left(\sup_{\alpha} |U_n(\alpha)| > \epsilon\right) \;&\le\; 8\, (dnR)^{K_n}\, (32 M^2/\epsilon)^{K_n}\, \exp\!\big(-n \epsilon^2 / (128\, M^4)\big), \\ P\!\left(\sup_{\alpha} |V_n(\alpha)| > \epsilon\right) \;&\le\; 8\, (dnR)^{K_n}\, (16 BM/\epsilon)^{K_n}\, \exp\!\big(-n \epsilon^2 / (128\, B^2 M^2)\big). \end{aligned} \] Taking logarithms, the exponents are \[ K_n \log(dnR) + K_n \log(C/\epsilon) - n \epsilon^2 / C', \] which tend to \(-\infty\) for any fixed \(\epsilon > 0\) provided \(K_n \log n / n \to 0\), i.e., (A4). Hence \[ \sup_{\alpha \in \mathcal{F}_{K_n, M}} |\Delta_n(\alpha)| \;\le\; \sup_\alpha |U_n(\alpha)| + 2 \sup_\alpha |V_n(\alpha)| \;\xrightarrow{P}\; 0. \]

Step 4 (conclusion). Combining Steps 2 and 3, \(L(\hat\alpha_n) - L(\alpha_0) \xrightarrow{P} 0\). By Step 1, \(L(\hat\alpha_n) - L(\alpha_0) = \|\hat\alpha_n - \alpha_0\|_{L^2(P)}^2\), so \(\|\hat\alpha_n - \alpha_0\|_{L^2(P)} \xrightarrow{P} 0\). \(\blacksquare\)

14.4 Remarks on the assumptions

  • (A1) and (A3) are the standard assumptions in non-parametric regression-tree consistency proofs, transported to the Riesz target \(\alpha_0\) in place of the regression function. (A3) is an assumption on the class of partitions and on the smoothness of \(\alpha_0\). It holds, for example, when \(\mathcal{Z}\) is bounded, \(\alpha_0\) is continuous, \(P\) has no atoms, and \(\mathcal{T}_{K_n}\) contains all axis-aligned tree partitions with \(\le K_n\) leaves; the rate at which the infimum vanishes is governed by the smoothness of \(\alpha_0\) and the dimension \(d\) via standard approximation theory.
  • (A4) is the standard tree-complexity assumption: trees may grow with \(n\), but slowly enough that the log-covering complexity \(K_n \log n\) of the partition class stays \(o(n)\).
  • (A2) is the only Riesz-specific assumption. The trace coefficient sum \(T(z) := \sum_r |c_r(z)|\) controls the magnitude of the augmented loss residuals, and is the analog of “bounded response” in regression CART. For ATE the trace sum equals \(2\) identically. For ATT, ATC, and weighted estimands, \(T\) inherits inverse-propensity factors, and (A2) reduces to the standard positivity (overlap) assumption \(\pi(x) \in [\eta, 1 - \eta]\) for some \(\eta > 0\).

14.5 Greedy versus global ERM

Theorem 1 applies to the global minimiser of \(\hat L_n\) over \(\mathcal{F}_{K_n, M}\). The riesztree splitter is greedy: it accepts a sequence of best-gain splits and need not return the global optimum. Greedy CART consistency for ordinary regression is established only under additional structural conditions on the partitions the rule produces (Nobel 1996; Scornet, Biau and Vert 2015 for additive regression). Corollary 1 imports those conditions into the Riesz setting.

Let \(\hat\Pi_n \in \mathcal{T}_{K_n}\) denote any data-dependent partition (in particular, the partition produced by greedy riesztree under any growth or pruning policy) and let \[ \hat\alpha_n(z) \;=\; \mathrm{clip}_{[-M, M]}\!\left( \frac{\sum_{i=1}^n m(\mathbf{1}_{\hat\Pi_n(z)})(Z_i)}{\#\{i : Z_i \in \hat\Pi_n(z)\}} \right) \] be the riesztree estimator (with the convention that empty cells take value \(0\), which is then clipped harmlessly). Equivalently in the doc’s notation, the leaf value is \(\mathrm{clip}_{[-M, M]}(-\hat C_\ell / \hat D_\ell)\).

Corollary 1 (consistency under shrinking-cell conditions). Suppose (A1), (A2), and (A4) of Theorem 1 hold, together with

  • (B1) Shrinking cells. \(\mathrm{diam}\big(\hat\Pi_n(Z_*)\big) \xrightarrow{P} 0\), where \(Z_* \sim P\) is independent of the training sample.
  • (B2) Uniformly continuous representer. \(\alpha_0\) is uniformly continuous on \(\mathcal{Z}\).

Then \(\|\hat\alpha_n - \alpha_0\|_{L^2(P)} \xrightarrow{P} 0\).

Let \(\bar\alpha_n(z) := \mu_{\hat\Pi_n(z)}\) where \(\mu_A := \mathbb{E}[\alpha_0(Z) \mid Z \in A,\, \hat\Pi_n]\) when \(P(A) > 0\) and \(\mu_A := 0\) otherwise. By (A1), \(|\mu_A| \le M\), so \(\bar\alpha_n \in \mathcal{F}_{\hat\Pi_n, M} \subseteq \mathcal{F}_{K_n, M}\).

Step 1 (oracle decomposition restricted to \(\hat\Pi_n\)). The riesztree leaf-value formula \(\hat\alpha_\ell^* = -\hat C_\ell / \hat D_\ell\) is the unrestricted minimiser of \(\hat L_n\) over piecewise-constant functions on \(\hat\Pi_n\) (per-leaf decoupling, derived earlier in this page). Clipping to \([-M, M]\) is the projection onto \(\mathcal{F}_{\hat\Pi_n, M}\), and since the squared Bregman-Riesz loss is convex in each leaf value, the clipped optimum minimises \(\hat L_n\) over \(\mathcal{F}_{\hat\Pi_n, M}\). In particular \(\hat L_n(\hat\alpha_n) \le \hat L_n(\bar\alpha_n)\), so as in Step 2 of the proof of Theorem 1, \[ L(\hat\alpha_n) - L(\bar\alpha_n) \;\le\; 2 \sup_{\alpha \in \mathcal{F}_{K_n, M}} \big|\hat L_n(\alpha) - L(\alpha)\big|. \] The supremum is over the full class \(\mathcal{F}_{K_n, M}\) (which contains \(\hat\Pi_n\)’s class as a subset, regardless of how \(\hat\Pi_n\) was chosen). Step 3 of the proof of Theorem 1 shows this supremum vanishes in probability under (A4): \[ L(\hat\alpha_n) - L(\bar\alpha_n) \;\xrightarrow{P}\; 0. \]

Step 2 (bias from cell shrinkage). Let \(\omega(\delta) := \sup\{|\alpha_0(z) - \alpha_0(z')| : z, z' \in \mathcal{Z},\, \|z - z'\| \le \delta\}\) be the modulus of continuity of \(\alpha_0\), with \(\omega(\delta) \to 0\) as \(\delta \to 0\) by (B2). For any \(z\) with \(P(\hat\Pi_n(z)) > 0\), \[ |\bar\alpha_n(z) - \alpha_0(z)| \;=\; \big|\mathbb{E}[\alpha_0(Z) - \alpha_0(z) \mid Z \in \hat\Pi_n(z)]\big| \;\le\; \omega\!\big(\mathrm{diam}(\hat\Pi_n(z))\big). \] For any \(\delta > 0\) and any \(z \in \mathcal{Z}\), \[ |\bar\alpha_n(z) - \alpha_0(z)| \;\le\; \omega(\delta) + 2M\, \mathbf{1}\{\mathrm{diam}(\hat\Pi_n(z)) > \delta\}, \] using (A1) for the bound \(2M\) on the second term and the convention \(|\bar\alpha_n - \alpha_0| \le 2M\) when \(P(\hat\Pi_n(z)) = 0\) (a \(P\)-null set). Squaring and integrating against \(P\): \[ \|\bar\alpha_n - \alpha_0\|_{L^2(P)}^2 \;\le\; 2\omega(\delta)^2 + 8 M^2\, P\!\big(\mathrm{diam}(\hat\Pi_n(Z_*)) > \delta\big). \] By (B1), \(P(\mathrm{diam}(\hat\Pi_n(Z_*)) > \delta) \to 0\) for any fixed \(\delta > 0\). Letting \(\delta \to 0\) along a sequence \(\delta_n \downarrow 0\) slowly enough that the second term vanishes, \[ \|\bar\alpha_n - \alpha_0\|_{L^2(P)}^2 \;\xrightarrow{P}\; 0. \]

Step 3 (combine). By Step 1 of the proof of Theorem 1, \(L(\bar\alpha_n) - L(\alpha_0) = \|\bar\alpha_n - \alpha_0\|_{L^2(P)}^2\). Adding the bounds from Steps 1 and 2 above, \[ \|\hat\alpha_n - \alpha_0\|_{L^2(P)}^2 \;=\; L(\hat\alpha_n) - L(\alpha_0) \;\le\; \big[L(\hat\alpha_n) - L(\bar\alpha_n)\big] + \|\bar\alpha_n - \alpha_0\|_{L^2(P)}^2 \;\xrightarrow{P}\; 0. \qquad \blacksquare \]

On (B1). Whether greedy CART produces shrinking cells is a separate question that depends on the splitting rule and the DGP. Sufficient conditions in the regression-tree literature (Wager and Walther 2016; Klusowski 2021; Scornet, Biau and Vert 2015 for additive regression) carry over to the augmented Bregman setting under analogous structural conditions on the signal carried by \(\alpha_0\). (B1) holds automatically for “honest” splitters that allocate fresh data to leaf-value estimation, and for purely random forests.

Why Nobel’s (1996) growing-cell-count condition is not needed here. Nobel’s framework requires growing cell counts (\(\hat n_{\hat\Pi_n(Z_*)} \xrightarrow{P} \infty\)) to control per-leaf variance of the histogram regression estimator. Here the leaf values are clipped to \([-M, M]\) (since \(\alpha_0 \in [-M, M]\)), and the uniform empirical-process bound from Step 3 of the proof of Theorem 1 controls the variance for the whole class \(\mathcal{F}_{K_n, M}\) at once. Clipping plus a uniform LLN replaces growing leaf counts.

14.6 In practice

RMSE on the canonical DGPs in rieszreg.testing.dgps shrinks as \(n\) grows, with the noise floor close to what a comparable-depth CART achieves on the analogous \(L^2\) regression problem. Single-tree variance dominates; use forestriesz or rieszboost for tighter rates.

15 Simulations: how does this stack up against IPW?

A common alternative for ATE-style estimands is two-step IPW: fit \(\hat\pi(x)\) (a propensity model), then plug into

\[ \hat\alpha(a, x) \;=\; \frac{2a - 1}{a\,\hat\pi(x) + (1 - a)\,(1 - \hat\pi(x))}. \]

Two natural choices for \(\hat\pi\):

  • Logistic regression (well-specified on the linear-Gaussian DGP below).
  • Classification tree of \(A\) on \(X\) (sklearn DecisionTreeClassifier; mis-specified for this smooth propensity).

Compare both against riesztree, sweeping max_depth over the same grid for the two tree-based methods. All methods share a training set; oracle RMSE is measured on a fresh test set of size \(5n\). Results are faceted across training sizes \(n \in \{100, 200, 500, 2000, 8000\}\).

import numpy as np, pandas as pd, matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from riesztree import RieszTreeRegressor, ATE


def sample(n, seed):
    rng = np.random.default_rng(seed)
    x = rng.normal(0.0, 1.0, size=(n, 5))
    logit = 0.7 * x[:, 0] - 0.3 * x[:, 1]
    pi = 1.0 / (1.0 + np.exp(-logit))
    a = (rng.uniform(0, 1, n) < pi).astype(float)
    df = pd.DataFrame(x, columns=[f"x{i}" for i in range(5)])
    df.insert(0, "a", a)
    df["_pi"] = pi
    return df


def truth_from(df):
    pi = df["_pi"].values
    a = df["a"].values
    prob_a = a * pi + (1.0 - a) * (1.0 - pi)
    return (2.0 * a - 1.0) / prob_a


def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))


def ipw_with_estimator(clf, train, test):
    feats = [f"x{i}" for i in range(5)]
    clf.fit(train[feats].values, train["a"].values.astype(int))
    pi_hat = np.clip(clf.predict_proba(test[feats].values)[:, 1], 1e-3, 1 - 1e-3)
    a = test["a"].values
    prob_a = a * pi_hat + (1.0 - a) * (1.0 - pi_hat)
    return (2.0 * a - 1.0) / prob_a


depths = [2, 3, 4, 6, 8, 12]
n_grid = [100, 200, 500, 2000, 8000]
covariates = ("x0", "x1", "x2", "x3", "x4")

records = []
for n in n_grid:
    train = sample(n, seed=0)
    test = sample(5 * n, seed=1)
    truth = truth_from(test)

    # Logistic-regression IPW: a single horizontal reference (no depth knob).
    log_ipw_alpha = ipw_with_estimator(LogisticRegression(max_iter=2000), train, test)
    log_ipw_rmse = rmse(log_ipw_alpha, truth)

    for d in depths:
        # riesztree at this depth.
        est = RieszTreeRegressor(
            estimand=ATE(treatment="a", covariates=covariates),
            max_depth=d, random_state=0,
        ).fit(train)
        records.append({
            "n_train": n, "method": "riesztree", "depth": d,
            "rmse": rmse(est.predict(test), truth),
        })
        # IPW-tree at this depth.
        records.append({
            "n_train": n, "method": "IPW (regression tree)", "depth": d,
            "rmse": rmse(
                ipw_with_estimator(
                    DecisionTreeClassifier(max_depth=d, min_samples_leaf=10, random_state=0),
                    train, test,
                ),
                truth,
            ),
        })
    records.append({
        "n_train": n, "method": "IPW (logistic regression)", "depth": None,
        "rmse": log_ipw_rmse,
    })

res = pd.DataFrame(records)

fig, axes = plt.subplots(1, len(n_grid), figsize=(3.0 * len(n_grid), 4), sharey=False)
colors = {"riesztree": "#005f73", "IPW (regression tree)": "#bb3e03"}
for ax, n in zip(axes, n_grid):
    sub = res[res["n_train"] == n]
    for method, color in colors.items():
        m = sub[sub["method"] == method].sort_values("depth")
        ax.plot(m["depth"], m["rmse"], "-o", color=color, label=method, linewidth=2)
    log_rmse = sub[sub["method"] == "IPW (logistic regression)"]["rmse"].iloc[0]
    ax.axhline(log_rmse, color="black", linestyle="--", label="IPW (logistic, well-spec.)")
    ax.set_title(f"n_train = {n}")
    ax.set_xlabel("max_depth")
    ax.set_ylabel("oracle RMSE" if n == n_grid[0] else "")
    ax.set_yscale("log")
    ax.grid(alpha=0.3)
axes[-1].legend(loc="best", fontsize=8)
fig.tight_layout()
plt.show()

Oracle RMSE vs depth, faceted by n_train. Lower is better. Logistic-regression IPW is well-specified on this DGP and gives the best achievable two-step baseline; riesztree directly optimises the Bregman-Riesz loss and beats both IPW variants once n exceeds a few hundred.

Three things to read off the chart:

  1. At very small \(n\) (100, 200), all three methods are competitive — there isn’t enough signal to separate them.
  2. From \(n \approx 500\) onward, riesztree consistently beats IPW-with-CART, and the gap grows with \(n\).
  3. The horizontal logistic-regression line is the floor for IPW with a correctly-specified propensity model. riesztree reaches and crosses that floor at moderate \(n\), despite not knowing the propensity is logistic. Direct loss optimisation on \(\alpha\) avoids paying the model-mis-specification tax that the tree-IPW route incurs.

The held-out depth that minimises RMSE for riesztree typically sits in the 4–8 range; deeper trees overfit. Tune via GridSearchCV in practice.

16 Sharp edges

  • High variance. A single tree’s RMSE is materially higher than a tuned forest or booster on the same DGP. Use this learner for interpretability or when the dependency footprint matters; otherwise pick forestriesz or rieszboost.
  • No confidence intervals in v1. Honest splits are not implemented.
  • Constant per-leaf \(\alpha\) only. Linear-in-\(X\) and treatment-sieve leaves are documented as future work.
  • Loss-domain disqualifications. KL / Bernoulli with ATE-style estimands will see many split candidates ruled out; the resulting tree may be shallower than max_depth. This is the splitter respecting the link’s \(\alpha\)-domain, not a bug.