Losses

A Loss specifies the Bregman-Riesz objective the learner minimizes. rieszreg ships four built-in losses (SquaredLoss, KLLoss, BernoulliLoss, BoundedSquaredLoss) plus a Loss base class for defining your own.

Bregman-Riesz Losses

Hines & Miles (2025) show that for a convex, differentiable function \(h: \mathbb R \to \mathbb R\) (called \(F\) in their paper), the minimizer of the population loss based on \(h\), namely the average Bregman divergence between \(\alpha\) and \(\alpha_0\) with potential \(h\), can be expressed in a form that does not involve the true Riesz representer \(\alpha_0\):

\[ \begin{aligned} \arg\min_\alpha \mathcal L_h(\alpha) &= \arg\min_\alpha \mathbb E[h(\alpha_0(Z)) - h(\alpha(Z)) - h'(\alpha(Z))(\alpha_0(Z) - \alpha(Z))] \\ &= \arg\min_\alpha \mathbb E[\tilde h(\alpha(Z)) - m(h' \circ \alpha)(Z,Y)] \end{aligned} \]

where \(\tilde h(t) = t h'(t) - h(t)\). Kato (2026) and Hines & Miles (2025) both give intuition for why this loss makes sense and which choices of \(h\) suit different settings. rieszreg implements several Bregman-Riesz losses and also lets users define their own by supplying the function \(h\). When \(h(t) = t^2\) the loss reduces to the standard Riesz loss of Chernozhukov, Newey, Quintas-Martínez, and Syrgkanis (2021).
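
To see this reduction explicitly: with \(h(t) = t^2\) we have \(h'(t) = 2t\) and \(\tilde h(t) = t \cdot 2t - t^2 = t^2\), and since the moment functional \(m\) is linear in its functional argument,

\[ \arg\min_\alpha \mathcal L_{t^2}(\alpha) = \arg\min_\alpha \mathbb E\big[\alpha(Z)^2 - 2\,m(\alpha)(Z,Y)\big], \]

which is the familiar Riesz loss.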

Loss table

The “Potential \(h\)” column gives \(h\) in \(\alpha\)-space, the function the user passes as potential= (or overrides in a subclass). At fit time the learner produces an unconstrained \(\eta\), the link maps it to \(\alpha = g(\eta)\) in the loss's domain, and \(h\) is evaluated at \(\alpha\). The “Effective \(h(g(\eta))\)” column shows what is actually evaluated in \(\eta\)-space, which is useful when comparing these losses to gradient-boosting objectives stated in terms of \(\eta\).

| Loss | Potential \(h(\alpha)\) | Effective \(h(g(\eta))\) | \(\alpha\) domain | Use for |
|------|-------------------------|--------------------------|-------------------|---------|
| SquaredLoss | \(\alpha^2\) | \(\eta^2\) | \(\mathbb{R}\) | Default. |
| KLLoss | \(\alpha \log \alpha\) | \(\eta \cdot e^\eta\) | \((0, \infty)\) | Density-ratio estimands (TSM, IPSI). Guarantees \(\hat\alpha > 0\). |
| BernoulliLoss | \(\alpha \log \alpha + (1-\alpha)\log(1-\alpha)\) | \(\eta\,\sigma(\eta) - \log(1+e^\eta)\) | \((0, 1)\) | \(\alpha_0\) known to lie in \((0, 1)\). |
| BoundedSquaredLoss(lo, hi) | \(\alpha^2\) | \(\big(\mathrm{lo}+(\mathrm{hi}-\mathrm{lo})\,\sigma(\eta)\big)^2\) | \((\mathrm{lo}, \mathrm{hi})\) | Squared-loss fit with prior bounds (e.g. trimmed-propensity ratios). |
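
The two potential columns are easy to check numerically by composing the potential with the link. A minimal sketch for KLLoss, assuming the built-in exposes the same link_to_alpha method that custom losses override below:

import numpy as np
from rieszreg import KLLoss

loss = KLLoss()
eta = np.array([-1.0, 0.0, 2.0])
alpha = loss.link_to_alpha(eta)                         # g(eta) = exp(eta) for KLLoss
np.allclose(loss.potential(alpha), eta * np.exp(eta))   # h(g(eta)) = eta * e^eta -> True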

Pass the loss instance to any learner via loss=:

import numpy as np, pandas as pd
from rieszboost import RieszBooster
from rieszreg import ATE, TSM, KLLoss, BoundedSquaredLoss

rng = np.random.default_rng(0)
n = 500
x = rng.uniform(0, 1, n)
pi = 1 / (1 + np.exp(-(4 * x - 2)))
a = rng.binomial(1, pi)
df = pd.DataFrame({"a": a.astype(float), "x": x})

# Squared loss (default)
RieszBooster(estimand=ATE(), n_estimators=100).fit(df).predict(df)[:3]
array([1.5734262, 2.5232608, 3.595529 ], dtype=float32)
# KL loss for a density-ratio estimand
RieszBooster(estimand=TSM(level=1), loss=KLLoss(), n_estimators=100).fit(df).predict(df)[:3]
array([1.550644, 2.55281 , 4.471114], dtype=float32)
# Bounded squared with prior bounds
RieszBooster(estimand=ATE(),
             loss=BoundedSquaredLoss(lo=-5.0, hi=5.0),
             n_estimators=100).fit(df).predict(df)[:3]
array([1.5943003, 2.5617194, 4.412609 ], dtype=float32)
set.seed(0)
n  <- 500
x  <- runif(n)
pi <- 1 / (1 + exp(-(4 * x - 2)))
a  <- rbinom(n, 1, pi)
df <- data.frame(a = as.numeric(a), x = x)

# Squared loss (default)
RieszBooster$new(estimand = ATE(), n_estimators = 100L)$fit(df)$predict(df)[1:3]
[1]  1.169958 -1.336773  1.965481
# KL loss for a density-ratio estimand
RieszBooster$new(estimand = TSM(level = 1), loss = KLLoss(),
                 n_estimators = 100L)$fit(df)$predict(df)[1:3]
[1] 1.1560670 0.2848236 1.8298105
# Bounded squared with prior bounds
RieszBooster$new(estimand = ATE(),
                 loss = BoundedSquaredLoss(lo = -5.0, hi = 5.0),
                 n_estimators = 100L)$fit(df)$predict(df)[1:3]
[1]  1.187630 -1.338935  2.180158
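
Because KLLoss fits an unconstrained \(\eta\) and maps \(\alpha = e^\eta\), its predictions are strictly positive by construction. A quick check (Python, reusing df from the Python block above):

# The exp link keeps every prediction positive, as the table's domain column promises.
alpha_hat = RieszBooster(estimand=TSM(level=1), loss=KLLoss(),
                         n_estimators=100).fit(df).predict(df)
bool((alpha_hat > 0).all())  # True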

Custom losses

Note: Python-only

Custom losses are Python-only. From R, define the loss in Python via reticulate::py_run_string and pass the instance via loss=.

Define your own loss the same way the built-ins are defined: subclass Loss and override potential(alpha) (the potential \(h\)), potential_deriv(alpha) (\(h'\)), and, if the loss has a non-trivial domain, the link methods. The example below is a pseudo-Huber potential \(h(\alpha) = \sqrt{1 + \alpha^2} - 1\) with the identity link: strictly convex, bounded gradient, no domain restriction.

import numpy as np
from rieszreg import Loss, ATE
from rieszboost import RieszBooster

class PseudoHuberLoss(Loss):
    name = "pseudo-huber"

    def potential(self, alpha):
        return np.sqrt(1.0 + alpha ** 2) - 1.0

    def potential_deriv(self, alpha):
        return alpha / np.sqrt(1.0 + alpha ** 2)

booster = RieszBooster(
    estimand=ATE(),
    loss=PseudoHuberLoss(),
    n_estimators=200, learning_rate=0.05, max_depth=3,
).fit(df)
booster.predict(df)[:5]
array([ 1.0391645,  1.1831129,  1.2119455, -0.9240942,  0.9052777],
      dtype=float32)

For a quick experiment without writing a class, instantiate Loss directly with the potential and link as constructor arguments:

loss = Loss(
    potential=lambda a: np.sqrt(1.0 + a ** 2) - 1.0,
    potential_deriv=lambda a: a / np.sqrt(1.0 + a ** 2),
    link="identity",
)

If you skip potential_deriv, the framework will try to derive it via jax.grad when potential is written using jax.numpy ops. If jax isn’t available or your potential uses stock numpy, it falls back to a central-difference numerical derivative. Subclassing with an analytic potential_deriv is fastest; jax autograd is the next-best option.

import jax.numpy as jnp

# potential_deriv resolved automatically by jax.grad
loss = Loss(potential=lambda a: jnp.sqrt(1.0 + a ** 2) - 1.0, link="identity")
loss.potential_deriv(np.array([0.5, 1.0, 2.0]))
array([0.4472136 , 0.70710677, 0.8944272 ], dtype=float32)
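
Writing the same potential with stock numpy exercises the numerical fallback instead; the central-difference estimate should agree with the analytic derivative to several decimal places:

# No jax ops to trace here, so potential_deriv falls back to central differences.
loss_np = Loss(potential=lambda a: np.sqrt(1.0 + a ** 2) - 1.0, link="identity")
loss_np.potential_deriv(np.array([0.5, 1.0, 2.0]))  # approximately [0.447, 0.707, 0.894]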

When the loss has a restricted \(\alpha\)-domain (say you want \(\alpha > 0\)), override the link methods to constrain \(\hat\alpha\):

class KLLikeLoss(Loss):
    name = "kl-like"

    def potential(self, alpha):
        return alpha * np.log(alpha)

    def potential_deriv(self, alpha):
        return np.log(alpha) + 1.0

    def link_to_alpha(self, eta):
        return np.exp(np.clip(eta, -50.0, 50.0))

    def alpha_to_eta(self, alpha):
        return np.log(alpha)
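
A quick sanity check that the two link methods are mutually inverse (using the class just defined):

loss = KLLikeLoss()
eta = np.linspace(-3.0, 3.0, 5)
alpha = loss.link_to_alpha(eta)             # exp(eta): strictly positive
np.allclose(loss.alpha_to_eta(alpha), eta)  # True: alpha_to_eta inverts link_to_alpha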

A Loss built from the constructor or a one-off subclass is not save/load round-trippable: the user's potential callable does not survive the JSON metadata path. Pass the loss explicitly when reloading the estimator. The four built-ins do round-trip.

For reference, the built-in SquaredLoss and KLLoss are themselves short implementations of this same pattern.
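
As a rough guide, a squared-loss implementation under this interface fits in a few lines. A sketch only, assuming the identity link is the default (as the pseudo-Huber example above suggests); the shipped SquaredLoss may differ in details:

class MySquaredLoss(Loss):
    name = "my-squared"

    def potential(self, alpha):
        return alpha ** 2        # h(alpha) = alpha^2

    def potential_deriv(self, alpha):
        return 2.0 * alpha       # h'(alpha) = 2 * alpha

    # Identity link (alpha = eta), so no link overrides are needed.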

Backend support

Note: Which backends accept which losses
  • Boosting (rieszboost): all four losses.
  • Kernel ridge (krrr): SquaredLoss only; non-quadratic losses would require a Newton iteration on the kernel system.
  • Random forest, augmentation-style (forestriesz.AugForestRieszRegressor): all four losses. The tree structure is grown with the usual squared-error (MSE) criterion; a per-leaf Newton step then replaces each leaf's stored θ with the Bregman optimum (see the sketch after this list).
  • Random forest, moment-style (forestriesz.ForestRieszRegressor): SquaredLoss only.
  • Neural network (riesznet): all four losses. Each loss's per-row Bregman form is implemented in autograd-friendly torch ops, and gradient parity with the analytic loss is checked to a tolerance of 1e-8.
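
For intuition, the per-leaf refit in the augmentation-style forest amounts to a one-dimensional Newton iteration per leaf. A schematic sketch, where grad and hess are hypothetical callables standing in for the mean per-row Bregman gradient and curvature over the rows falling in the leaf:

def newton_leaf_value(grad, hess, theta0=0.0, n_steps=5, min_curv=1e-12):
    # Refine a leaf's scalar value toward the per-leaf Bregman optimum.
    theta = theta0
    for _ in range(n_steps):
        theta -= grad(theta) / max(hess(theta), min_curv)  # guard tiny curvature
    return theta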

Pass an unsupported loss and the backend raises at fit time with a clear error.