Streaming and Spill¶

Some questions only the full grid can answer. A C1/C2 capital charge reads every per-policy cashflow; an audit extract has to reproduce each cell a reviewer might query; a downstream system ingests the projection row by row, not a portfolio total. For these you need the complete per-policy output — every cashflow, every period, every policy — not an aggregate.

The problem is size. A portfolio of millions of policies, each projected over a few hundred periods, is a grid with hundreds of millions of rows. That grid does not fit in memory, and unlike a portfolio total it cannot be folded down as you go — every cell is part of the answer.

run_to_parquet is the tool for that case. It projects the portfolio in memory-safe batches and writes each batch straight to disk as a parquet file, so the full output never has to be resident all at once. You get back a manifest describing what was written.

This is the counterpart to aggregating at scale. When the answer you need is an aggregate — a sum of reserves, an SCR at the 99.5th percentile — the run folds each batch down to per-period figures and only the running aggregate stays in memory; the per-policy detail is discarded as it goes. run_to_parquet is the opposite contract: it keeps every per-policy row, paying for that with disk instead of RAM. Reach for it when you genuinely need the full grid, not a summary of it.

Spill a portfolio to disk¶

A short net-cashflow projection over a four-policy book. The model function builds one per-policy list column; run_to_parquet slices the book into batches of two and writes each batch to its own parquet file under policy_output/.

from pathlib import Path

import polars as pl
from gaspatchio_core import ActuarialFrame
from gaspatchio_core.scenarios import run_to_parquet


def project(af: ActuarialFrame) -> ActuarialFrame:
    """Net cashflow per period: premium less a growing claim cost."""
    months = 3
    af.net_cf = pl.concat_list(
        [pl.col("premium") - pl.col("claim_cost") * (t + 1) for t in range(months)]
    )
    return af


model_points = pl.DataFrame({
    "policy_id":  [1, 2, 3, 4],
    "premium":    [1200.0, 2400.0, 600.0, 1800.0],
    "claim_cost": [50.0, 120.0, 20.0, 90.0],
})

result = run_to_parquet(project, model_points, Path("policy_output"), batch_size=2)

print(result.n_policies)   # 4
print(result.n_batches)    # 2
print(result.output_dir)   # policy_output

Four policies, batched two at a time, give two parquet files. Each batch carries the model points it was sliced from plus every column the projection added — here the net_cf list column, one entry per projection period.

The files are named batch_NNNN.parquet, zero-padded to four digits and numbered from zero, so they sort in projection order on disk:

written = sorted(p.name for p in Path("policy_output").glob("*.parquet"))
print(written)
# ['batch_0000.parquet', 'batch_0001.parquet']

Read one back and the per-policy detail is exactly what the projection produced — nothing was aggregated away:

first_batch = pl.read_parquet("policy_output/batch_0000.parquet")
print(first_batch.select(["policy_id", "premium", "net_cf"]))
# policy_id 1: net_cf [1150.0, 1100.0, 1050.0]
# policy_id 2: net_cf [2280.0, 2160.0, 2040.0]

Point a lazy scan at the whole directory and you read the full grid back without loading it all at once — the same memory discipline that wrote it:

full_grid = pl.scan_parquet("policy_output/batch_*.parquet")
print(full_grid.select(pl.len()).collect().item())   # 4

The manifest¶

run_to_parquet returns a SpillResult — a small record describing the run, not the data itself. The data is on disk; the manifest tells you where and how the run behaved.

print(result.output_dir)     # PosixPath('policy_output') — where the batches landed
print(result.n_policies)     # 4   — rows projected, total
print(result.n_batches)      # 2   — batch_NNNN.parquet files written
print(round(result.wall_time_s, 3))   # seconds of wall time
print(result.peak_rss_mb)    # peak resident memory above baseline, in MB

n_policies and n_batches reconcile the run: every policy went into exactly one batch, and n_batches is the file count you can expect to find under output_dir. peak_rss_mb is the headroom the run actually used above its starting footprint — the figure that confirms the batching kept the job inside its memory budget rather than the figure you hoped for.

Sizing the batches¶

The batch_size argument controls how many policies go into each file.

Pass an integer to fix the batch size yourself, as the example above does with batch_size=2. This is the predictable choice when you already know what your hardware will hold, or when you want a fixed number of policies per file for a downstream consumer.

Leave it at the default — batch_size="auto" — and the run sizes each batch to the available memory budget. It projects a small leading slice to measure how much memory a single policy's output consumes, checks the target directory has the disk room for the full output, then picks the largest batch that stays within budget. Batching here exists for one reason: memory safety. The full per-policy output cannot fold to a running total, so the run holds one batch at a time and lets the rest live on disk.

# Let the run size each batch to the memory budget.
auto = run_to_parquet(project, model_points, Path("policy_output_auto"))
print(auto.n_policies)   # 4
print(auto.n_batches)    # 1 — this tiny book fits in a single batch

A small book like this fits in one batch; a portfolio of millions resolves to many. Either way the contract is the same — every per-policy row is written, and no more than one batch is resident at a time.

Where the batches go¶

output_dir is a directory, created if it does not exist, and each batch is written there atomically: the run writes to a temporary file first and renames it into place only once the full batch is on disk, so an interrupted run never leaves a half-written batch_NNNN.parquet for a reader to trip over.

One constraint is worth knowing. The point of spilling is to move the full output out of memory and onto a disk, so the target must be real disk. A RAM-backed filesystem defeats the exercise — the batches would count against the same memory the run is trying to protect — and run_to_parquet refuses such a target with a clear error rather than spilling into the memory it was asked to spare. Before the first batch is written, the run also checks the filesystem has room for the estimated full output and stops up front if it does not, rather than failing partway through a long projection.

When to reach for it¶

You need...	Use
A portfolio total, a percentile, an SCR — an aggregate of the run	aggregating at scale
Every per-policy cashflow, at a scale that won't fit in memory	`run_to_parquet`
Quick exploration on a book that fits in RAM	a plain projection on an `ActuarialFrame`

The dividing line is the answer, not the portfolio size. If the answer is a summary, fold it down and keep the run in memory. If the answer is the grid itself — a capital extract, an audit trail, a feed to a downstream system — spill it to disk and read it back batch by batch.