Performance at Scale¶

A run with 1,000 policies × 100 scenarios produces 100,000 rows; a stochastic study with 10,000 policies × 10,000 scenarios produces 100 million. Running those without a memory budget fails on a laptop. The scenario loop is built around three things that keep memory bounded regardless of run size: batched scenarios, mergeable aggregators, and an auto-sized batch.

This page covers when each knob matters and how to read the run-time signals.

`batch_size` — the primary memory control¶

for_each_scenario(..., batch_size=N) runs at most N scenarios at a time. Aggregator state lives in memory between batches and merges across them; the per-batch projection (the cross-joined policy × scenario frame) is released as soon as the batch's reductions land.

import polars as pl
from gaspatchio_core.frame import ActuarialFrame
from gaspatchio_core.scenarios import Sum, Mean, for_each_scenario


def policies():
    return ActuarialFrame({
        "policy_id": list(range(1, 101)),
        "premium": [100.0 + i for i in range(100)],
    })


def model(af, *, tables=None, drivers=None):
    return af.with_columns(af["premium"].alias("loss"))


scenarios = [f"S{i:04d}" for i in range(50)]

result = for_each_scenario(
    policies(),
    scenarios=scenarios,
    model_fn=model,
    aggregations=(
        Sum("loss").alias("total"),
        Mean("loss").alias("mean"),
    ),
    batch_size=4,
)

print(result.aggregations["total"])  # 747500.0
print(result.aggregations["mean"])   # 14950.0
print(result.batch_size)             # 4
print(result.batch_size_resolution)  # 'manual'

Peak in-memory size scales with batch_size × policies × periods, not with the full scenario count. A 10,000-scenario run at batch_size=16 holds 16 scenarios worth of projection in memory at any time — the other 9,984 are scheduled batches or aggregator state.

Bit-equivalence across batch sizes¶

Every built-in aggregator merges deterministically across batches. For most aggregators, the result is bit-identical regardless of batch_size:

results = {}
for bs in (1, 4, 16):
    r = for_each_scenario(
        policies(), scenarios=scenarios, model_fn=model,
        aggregations=(Sum("loss").alias("total"),),
        batch_size=bs,
    )
    results[bs] = r.aggregations["total"]

print(results[1])   # 747500.0
print(results[4])   # 747500.0
print(results[16])  # 747500.0

Bit-exact across batch_size (10 of 13 aggregators):

Aggregator	Why it's bit-exact
`Sum`	Neumaier-compensated summation -- error O(ε), order-stable
`Count`, `Min`, `Max`, `ArgMin`, `ArgMax`	Integer add or pick semantics -- exact
`Quantile`, `Median`, `CTE`, `QuantileRank`	DDSketch buckets are integer counters; merge is commutative addition; quantile lookup is deterministic

Numerically stable but not bit-exact (Mean, Variance, Std):

These three use the Welford-Chan online algorithm, whose parallel merge formula divides by n_total at intermediate steps. The result is associative algebraically but the floating-point rounding of the intermediate divisions is sensitive to how scenarios were split into batches. Drift is O(ε · log N) — for actuarial cashflow scales this is well below 1 ULP relative and well below any meaningful threshold for variance-of-loss work, but it isn't zero.

The choice is deliberate: Welford-Chan stays numerically stable when computing variance of large near-equal numbers (where the textbook Σx² − (Σx)²/n form loses all precision). Trading O(ε·log N) batch-size drift for that stability is the right call for actuarial use.

What this gets you in practice: pick batch_size to fit your memory budget. Sum and friends are auditably reproducible across batch sizes; Mean/Variance/Std are reproducible within their stated bound, which is far inside any actuarially meaningful precision.

The bit-equivalence guarantees here are pinned by bindings/python/tests/scenarios/test_batch_equivalence.py, which exercises every aggregator across batch_size ∈ {1, 2, 4, 16, 64}.

`batch_size="auto"` — let the loop pick¶

If you don't want to tune by hand, "auto" probes memory at run time and picks a size targeting roughly half your available RAM:

result = for_each_scenario(
    policies(),
    scenarios=scenarios,
    model_fn=model,
    aggregations=(Sum("loss").alias("total"),),
    batch_size="auto",
)

print(result.batch_size)             # 256 (or whatever the probe chose)
print(result.batch_size_resolution)  # 'auto_probe'
print(result.peak_rss_mb)            # 0.7 (delta over baseline)

target_memory_fraction (default 0.5) controls how aggressively the loop sizes itself. The probe runs two throwaway warm-up batches — one at batch_size=1 and one at batch_size=4 — measures the RSS delta of each, and fits a linear model delta(size) = fixed_overhead + per_cell_cost · size. The fixed term (base tables, encoder caches, Polars warmup) is paid once out of the memory budget; the remaining budget is divided by per_cell_cost to pick the run-time batch_size. The two-point fit is what stops the picker over-shooting on high-policy runs where fixed overhead dominates a single-point measurement. The result is stamped on result.batch_size so you can verify what it picked.

For most runs, "auto" is the right default. Override only when you have a specific reason — e.g., a CI runner with a 2 GB memory ceiling needs batch_size=64 to stay under it deterministically.

DDSketch memory¶

Most aggregators (Sum, Count, Mean, Variance) compute from running totals — each scenario adds to a small fixed accumulator and the answer doesn't depend on how many scenarios you've seen. Quantile, Median, CTE, and QuantileRank can't work that way. They need the shape of the distribution, not just running sums. The naive answer is to hold every value in memory and sort at the end; at 10 million scenarios × multiple partitions that's a lot of floats.

These four sidestep the problem with DDSketch — a streaming data structure that estimates quantiles to a bounded relative error using a bounded amount of memory, regardless of how many values you feed it. The bound is exact and mergeable: two sketches built from disjoint batches combine into one whose accuracy guarantee still holds. That's what lets the batched scenario loop produce stable, bit-equivalent quantile answers across batch sizes.

DDSketch is a published, peer-reviewed algorithm, not a homegrown approximation — Masson, Rim & Lee, DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees, VLDB 2019 (doi:10.14778/3352063.3352135). It is the same sketch Datadog runs across its metrics platform, and the relative-error and mergeability properties quoted here are the ones the paper proves. The quantile aggregators are backed by the authors' reference ddsketch implementation directly, so what you reconcile against is the published algorithm rather than a re-derivation of it.

The tradeoff is exact quantiles for bounded ones. Two consequences for sizing:

Scaling with partitions. A CTE.over(("region", "peril")) with 50 region/peril combinations holds 50 sketches per CTE, totalling ~60 MB just for that one aggregator. Multiple CTEs in the same plan multiply.
Bounded by accuracy choice. Tightening to relative_accuracy=1e-3 drops per-sketch memory to ~125 KB (10×) at the cost of widening tail error to ~100 bp (10×).

For Solvency II SCR work the default 10 bp tail error sits well inside actuarial assumption uncertainty. Tighten only if you genuinely need it; loosen if you're running many partitioned sketches and seeing memory pressure.

The scalar aggregators (Sum, Count, Min, Max, Mean, Variance, Std, ArgMin, ArgMax) carry trivial state (4 floats or fewer per partition slot). They aren't the variable that drives the memory curve.

Scenario ID encoding¶

For runs with thousands of scenario IDs, encoding choice matters for the projection step (not the aggregator step):

scenarios = list(range(10_000))                # int IDs — UInt32 under the hood
scenarios = [f"STOCH_{i:05d}" for i in range(10_000)]   # string IDs — slower

ID type	Memory at 100M rows	Cross-join speed
`int` (UInt32)	~400 MB	Fastest
Categorical-encoded string	~400 MB	Fast
Plain string	~1.2 GB	Slowest

For stochastic studies, integer scenario IDs are the cheapest. For a small named set (BASE / UP / DOWN), the readability of strings is worth the small cost.

What to monitor¶

ScenarioResult carries three observability fields:

result.batch_size — what the loop actually ran at
result.batch_size_resolution — 'manual', 'auto_probe', or 'auto_calibrated'
result.peak_rss_mb — peak resident memory delta over baseline during the run
result.wall_time_s — total wall time

If peak_rss_mb climbs unexpectedly between runs at the same batch_size, the projection has grown — usually because model_fn is producing wider frames than expected. Drop batch_size while you investigate.

When the audit chain pins memory¶

ScenarioRun.run carries the same batch_size argument. The recipe-level SHA does not include batch_size — two runs with different batch sizes share a source_sha because batch size is a memory choice, not a recipe choice. For bit-exact aggregators that means aggregator outputs are identical; for Welford-Chan aggregators (Mean/Variance/Std) outputs are reproducible within O(ε·log N), far inside any actuarially meaningful precision.

If you write an audit sidecar, run_metadata.batch_size records what the run actually ran at, so an auditor reproducing the run can match the memory profile if they want — but they're not forced to.