Assumptions in Gaspatchio

Overview

Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically.

One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those.

Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest.

The Table API

Gaspatchio's assumption system revolves around the dimension-based Table API:

  • Table() - Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run.
  • table.lookup() - Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT.

import gaspatchio_core as gs
import polars as pl

# Load any assumption table with the Table API
mortality_table = gs.Table(
    name="mortality_rates",
    source="mortality_table.csv",  # or DataFrame
    dimensions={
        "age": "age"  # Simple string shorthand for data dimensions
    },
    value="mortality_rate"
)

# Use in projections with vector lookups
af.mort_rate = mortality_table.lookup(age=af.age_last)

Key Advantages

1. Dimension-Based Design

The API uses explicit dimensions for clarity and flexibility:

# Simple curve (1D table)
lapse_table = gs.Table(
    name="lapse_rates", 
    source=lapse_df,
    dimensions={
        "duration": "duration"  # Map duration column to duration dimension
    },
    value="lapse_rate"
)

# Wide table with melt dimension (age × duration grid)
mortality_table = gs.Table(
    name="mortality_rates",
    source=mortality_df,
    dimensions={
        "age": "age",
        "duration": gs.assumptions.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ultimate"],
            name="duration",
            overflow=gs.assumptions.ExtendOverflow("Ultimate", to_value=120)
        )
    },
    value="qx"
)

# Multi-dimensional table
multi_dim_table = gs.Table(
    name="vbt_2015",
    source=vbt_df,
    dimensions={
        "age": "age",
        "sex": "sex",
        "smoker": "smoker_status"  # Can map different column names
    },
    value="mortality_rate"
)

2. Automatic Format Detection with Analysis

Use the analyze_table() function to get insights and configuration suggestions:

# Analyze any table to understand its structure
schema = gs.assumptions.analyze_table(df)
print(schema.suggest_table_config())

# Output:
# Table(
#     name="your_table_name",
#     source=df,
#     dimensions={
#         "age": "age",
#         "duration": MeltDimension(
#             columns=["1", "2", "3", "4", "5", "Ultimate"],
#             overflow=ExtendOverflow("Ultimate", to_value=120)
#         )
#     },
#     value="rate"
# )

3. Smart Overflow Handling

Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly:

# Table with columns: Age, 1, 2, 3, 4, 5, "Ult."
mortality_table = gs.Table(
    name="mortality_table",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.assumptions.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ult."],
            overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120),  # Expands to duration 120
            fill=gs.assumptions.LinearInterpolate()  # Optional: interpolate gaps
        )
    },
    value="rate"
)

# Lookups work seamlessly for any duration
af.mort_rate = mortality_table.lookup(age=af.age, duration=af.duration)

4. Vector-Native Performance

Handle entire projection vectors without loops or exploding data:

# Create projection timeline - each policy gets a list of months
af.month = af.date.create_projection_timeline(af.issue_date, af.maturity_date)

# Age progresses as a vector per policy (list column)
af.attained_age = af.issue_age + af.month // 12  # e.g., [30, 30, 30, ..., 31, 31, ...]

# Single lookup returns vector of rates for all ages
af.mort_rate = mortality_table.lookup(age=af.attained_age)
# Result: [0.0011, 0.0011, 0.0011, ..., 0.0012, 0.0012, ...]

Rust-Powered Multi-Core Performance

Gaspatchio's assumption system is implemented in Rust and uses an adaptive storage strategy that automatically selects the optimal backend for each table:

Array Storage (default for dense tables):

  • ~3ns per lookup via direct array indexing
  • Dictionary-encoded keys enable O(1) index computation
  • Perfect cache locality with contiguous memory access
  • GPU-ready: same arrays work seamlessly on CPU and GPU

Hash Storage (fallback for sparse tables):

  • O(1) average-case lookups via Rust AHashMap
  • ~20ns per lookup (hash computation + bucket probe)
  • Memory-efficient for tables with many missing combinations

Gaspatchio automatically chooses array storage when tables are >30% dense (most actuarial tables are 70%+ dense). A typical mortality table (3 table_ids × 101 ages × 25 durations) holds just 7,575 values - roughly 60KB as 64-bit floats, trivially small!

# This operation uses array indexing on ALL CPU cores
af.mort_rate = mortality_table.lookup(age=af.attained_age)
# 324M lookups completed in ~1 second, not 27 seconds

Tidy Data Principles

Following Tidy Data Best Practices

Gaspatchio's assumption system is built around the tidy data principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10).

Tidy datasets follow three fundamental rules:

  1. Each variable is a column - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns
  2. Each observation is a row - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023)
  3. Each type of observational unit is a table - mortality assumptions, lapse assumptions, etc. are separate tables

Why Tidy Assumptions Matter

Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation:

Wide Format (Human-Readable)

┌─────┬──────┬──────┬──────┬──────┐
│ Age │  1   │  2   │  3   │ Ult. │
├─────┼──────┼──────┼──────┼──────┤
│ 30  │0.001 │0.002 │0.003 │0.005 │
│ 31  │0.001 │0.002 │0.003 │0.005 │
└─────┴──────┴──────┴──────┴──────┘

Tidy Format (Machine-Optimized)

┌─────┬──────────┬───────┐
│ Age │ Duration │  Rate │
├─────┼──────────┼───────┤
│ 30  │    1     │ 0.001 │
│ 30  │    2     │ 0.002 │
│ 30  │    3     │ 0.003 │
│ 30  │   120    │ 0.005 │
│ 31  │    1     │ 0.001 │
└─────┴──────────┴───────┘

Automatic Tidy Transformation

The Table class with MeltDimension automatically converts wide tables to tidy format:

# Input: Wide mortality table
wide_table = pl.DataFrame({
    "age": [30, 31, 32],
    "1": [0.0011, 0.0012, 0.0013],
    "2": [0.0012, 0.0013, 0.0014],
    "3": [0.0013, 0.0014, 0.0015],
    "Ult.": [0.0050, 0.0051, 0.0052]
})

# Automatic tidy transformation
mortality_table = gs.Table(
    name="mortality",
    source=wide_table,
    dimensions={
        "age": "age",
        "duration": gs.assumptions.MeltDimension(
            columns=["1", "2", "3", "Ult."],
            name="duration"
        )
    },
    value="rate"
)

# Result: Tidy table ready for high-performance lookups
# Each age/duration combination becomes a separate row

The tidy format enables:

  • Vectorized lookups: Query millions of age/duration combinations in microseconds
  • Flexible filtering: Add conditions like gender, smoking status, or product type as additional columns
  • Consistent API: Same lookup pattern works for all assumption types
  • Memory efficiency: No duplicate storage of rates across multiple table formats

Loading Different Table Types

Curve Tables (1-Dimensional)

For simple tables with one key and one value:

# Lapse rates by policy duration
lapse_df = pl.DataFrame({
    "policy_duration": [1, 2, 3, 4, 5],
    "lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01]
})

lapse_table = gs.Table(
    name="lapse_rates",
    source=lapse_df,
    dimensions={
        "policy_duration": "policy_duration"
    },
    value="lapse_rate"
)

Wide Tables (Age × Duration Grids)

For mortality tables and similar multi-dimensional assumptions:

# Mortality table with multiple gender/smoking combinations
mortality_table = gs.Table(
    name="mortality_vbt_2015",
    source="mortality.parquet",
    dimensions={
        "age-last": "age-last",
        "variable": gs.assumptions.MeltDimension(
            columns=["MNS", "FNS", "MS", "FS"],  # Male/Female, Non-Smoker/Smoker
            name="variable"
        )
    },
    value="mortality_rate"
)

Input DataFrame:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ age-last │ MNS      │ FNS      │ MS       │ FS       │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ 30       │ 0.0011   │ 0.0010   │ 0.0021   │ 0.0019   │
│ 31       │ 0.0012   │ 0.0011   │ 0.0022   │ 0.0020   │
└──────────┴──────────┴──────────┴──────────┴──────────┘

Automatic transformation to tidy format:

┌──────────┬──────────┬────────────────┐
│ age_last │ variable │ mortality_rate │
├──────────┼──────────┼────────────────┤
│ 30       │ MNS      │ 0.0011         │
│ 30       │ FNS      │ 0.0010         │
│ 30       │ MS       │ 0.0021         │
│ 30       │ FS       │ 0.0019         │
│ 31       │ MNS      │ 0.0012         │
│ 31       │ FNS      │ 0.0011         │
└──────────┴──────────┴────────────────┘

Tables with Overflow Columns

For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range:

# VBT 2015 table with durations 1-25 plus "Ult." column
vbt_table = gs.Table(
    name="vbt_2015_female_smoker",
    source="2015-VBT-FSM-ANB.csv",
    dimensions={
        "issue_age": "issue_age",
        "duration": gs.assumptions.MeltDimension(
            columns=[str(i) for i in range(1, 26)] + ["Ult."],
            name="duration",
            overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="qx"
)

This automatically creates lookup entries for durations 26, 27, 28, ... 120, all using the "Ultimate" rate from the original table.
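
Conceptually, the expansion just replicates the overflow value across the missing durations at load time. Here is a minimal sketch of the idea in plain Polars - illustrative only, since Gaspatchio performs this in Rust when the table is registered:

import polars as pl

# Long-form table after the melt: string durations "1".."25" plus "Ult."
tidy = pl.DataFrame({
    "issue_age": [30, 30],
    "duration": ["25", "Ult."],
    "qx": [0.004, 0.005],
})

# Replicate each "Ult." row once per overflow duration 26..120
ult_rows = tidy.filter(pl.col("duration") == "Ult.")
expanded = pl.concat(
    [ult_rows.with_columns(pl.lit(str(d)).alias("duration")) for d in range(26, 121)]
)

# Final table: the explicit durations plus the expanded overflow rows
result = pl.concat([tidy.filter(pl.col("duration") != "Ult."), expanded])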

Performing Lookups

Single-Key Lookups

# Simple lapse rate lookup
af.lapse_rate = lapse_table.lookup(policy_duration=af.policy_duration)

Multi-Key Lookups

# Mortality lookup with age and gender/smoking status
af.mort_rate = mortality_table.lookup(age_last=af.age_last, variable=af.gender_smoking)

Vector Lookups

The most powerful feature - handle entire projection vectors:

# Create projection timeline (list of months per policy)
af.month = af.date.create_projection_timeline(af.issue_date, af.maturity_date)

# Calculate age and duration as projection vectors
af.attained_age = af.issue_age + af.month // 12
af.duration = af.month // 12

# Single lookup returns mortality rates for all timesteps at once
af.mort_rate = mortality_table.lookup(age=af.attained_age, duration=af.duration)

Note the idiomatic patterns:

  • Attribute notation (af.column) instead of bracket notation
  • Direct assignment rather than intermediate with_columns calls
  • Inline expressions in lookups when simple enough
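
For example, the third pattern lets you compute a key inline, since lookup() accepts expressions as well as columns:

# No intermediate attained-age column needed - pass the expression directly
af.mort_rate = mortality_table.lookup(age=af.issue_age + af.month // 12)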

Inspecting Tables with describe()

The describe() method provides a human-readable summary of a table's structure. This is useful for debugging lookup issues, generating documentation, and understanding table configuration during model development.

Simple Table Example

import polars as pl
from gaspatchio_core.assumptions import Table

# Create a simple 1D lapse rate table
lapse_data = pl.DataFrame({
    "duration": [1, 2, 3, 4, 5],
    "lapse_rate": [0.08, 0.06, 0.04, 0.03, 0.02]
})

lapse_table = Table(
    name="lapse_rates",
    source=lapse_data,
    dimensions={"duration": "duration"},
    value="lapse_rate"
)

print(lapse_table.describe())

Output:

Table: lapse_rates
Rows: 5
Storage mode: array
Value column: lapse_rate
Key columns (1): duration
Dimensions (1): duration
  - duration: DataDimension

Multi-Dimensional Table Example

import polars as pl
from gaspatchio_core.assumptions import Table, MeltDimension, ExtendOverflow

# Create a wide mortality table (age × duration grid)
mortality_data = pl.DataFrame({
    "age": [30, 31, 32, 33],
    "1": [0.0010, 0.0011, 0.0012, 0.0013],
    "2": [0.0011, 0.0012, 0.0013, 0.0014],
    "3": [0.0012, 0.0013, 0.0014, 0.0015],
    "Ult.": [0.0050, 0.0051, 0.0052, 0.0053]
})

mortality_table = Table(
    name="mortality_select",
    source=mortality_data,
    dimensions={
        "age": "age",
        "duration": MeltDimension(
            columns=["1", "2", "3", "Ult."],
            name="duration",
            overflow=ExtendOverflow("Ult.", to_value=10)
        )
    },
    value="qx"
)

print(mortality_table.describe())

Output:

Table: mortality_select
Rows: 40
Storage mode: array
Value column: qx
Key columns (2): age, duration
Dimensions (2): age, duration
  - age: DataDimension
  - duration: MeltDimension

Notice the row count is 40 (4 ages × 10 durations) because ExtendOverflow expanded the "Ult." column to durations 4 through 10.

What describe() Tells You

  • Table - The table name used for identification
  • Rows - Total row count after tidy transformation and overflow expansion
  • Storage mode - array (fast, dense) or hash (flexible, sparse)
  • Value column - The column containing lookup values
  • Key columns - Columns used for indexing (excludes the value column)
  • Dimensions - Configured dimensions with their types

The dimension types help you understand how each dimension was configured:

  • DataDimension - Direct column mapping
  • MeltDimension - Wide-to-long transformation (unpivoted columns)
  • CategoricalDimension - Constant categorical value
  • ComputedDimension - Derived from an expression
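
DataDimension and MeltDimension appear throughout this page. The sketch below shows how the other two might be configured - note that the constructor arguments here are assumptions inferred from the TableBuilder methods shown later on this page, not a definitive signature, and smoker_df is a hypothetical source:

# Hypothetical sketch - CategoricalDimension / ComputedDimension argument
# shapes are assumed by analogy with with_computed_dimension() below.
smoker_table = gs.Table(
    name="smoker_mortality",
    source=smoker_df,  # a source that only covers smokers
    dimensions={
        "issue_age": "issue_age",
        # Stamp every row with a constant category so this table can share
        # a lookup pattern with tables keyed by smoker status
        "smoker_status": gs.assumptions.CategoricalDimension("S"),
        # Derive attained age from columns already in the source
        "attained_age": gs.assumptions.ComputedDimension(
            pl.col("issue_age") + pl.col("policy_year") - 1
        ),
    },
    value="qx",
)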

Complete Model Example

Here's how assumption tables integrate into a complete actuarial model:

import gaspatchio_core as gs
import polars as pl
from gaspatchio_core import ActuarialFrame

def setup_assumptions():
    """Load all assumption tables for the model"""

    # Load mortality table (wide format with MNS, FNS, MS, FS columns)
    mortality_df = pl.read_parquet("assumptions/mortality.parquet")
    mortality_table = gs.Table(
        name="mortality_rates",
        source=mortality_df,
        dimensions={
            "age_last": "age_last",
            "variable": gs.assumptions.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="mortality_rate"
    )

    # Load lapse curve (simple 1D table)
    lapse_df = pl.read_parquet("assumptions/lapse.parquet")
    lapse_table = gs.Table(
        name="lapse_rates",
        source=lapse_df,
        dimensions={
            "policy_duration": "policy_duration"
        },
        value="lapse_rate"
    )

    # Load premium rates (wide format)
    premium_df = pl.read_parquet("assumptions/premium_rates.parquet")
    premium_table = gs.Table(
        name="premium_rates",
        source=premium_df,
        dimensions={
            "age_last": "age_last",
            "variable": gs.assumptions.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="premium_rate"
    )

    return mortality_table, lapse_table, premium_table

def life_model(policies_df):
    """Complete life insurance projection model"""

    # Setup assumption tables
    mortality_table, lapse_table, premium_table = setup_assumptions()

    # Create ActuarialFrame
    af = ActuarialFrame(policies_df)

    # Create projection timeline using fill_series
    # Calculate projection length based on policy term (project until age 70)
    max_age = 70
    af.num_proj_months = (max_age - af.issue_age) * 12
    af.month = af.fill_series(af.num_proj_months, start=0, increment=1)

    # Calculate indexing columns as projection vectors
    af.attained_age = af.issue_age + af.month // 12
    af.duration = af.month // 12

    # Create gender/smoking variable for lookups
    af.variable = af.gender + af.smoking_status

    # Vector lookups - get rates for all timesteps at once
    af.mort_rate = mortality_table.lookup(age_last=af.attained_age, variable=af.variable)
    af.lapse_rate = lapse_table.lookup(policy_duration=af.duration)
    af.premium_rate = premium_table.lookup(age_last=af.attained_age, variable=af.variable)

    # Calculate monthly persistence probability
    af.monthly_persist = (1 - af.mort_rate / 12) * (1 - af.lapse_rate / 12)

    # Probability in force using projection accessor
    af.pols_if = af.monthly_persist.projection.cumulative_survival()

    # Cash flows
    af.premium_cf = af.premium_rate / 12 * af.pols_if * af.sum_assured / 1000
    af.claims_cf = af.pols_if * af.mort_rate / 12 * af.sum_assured
    af.profit_cf = af.premium_cf - af.claims_cf

    return af

# Run the model
policies = pl.read_csv("model_points.csv")
results = life_model(policies)

Using the TableBuilder Pattern

For complex table configurations, use the TableBuilder pattern:

# Build a complex table step by step
table = (
    gs.TableBuilder("complex_mortality")
    .from_source("mortality_data.csv")
    .with_data_dimension("issue_age", "issue_age")
    .with_data_dimension("policy_year", "policy_year")
    .with_computed_dimension(
        "attained_age",
        pl.col("issue_age") + pl.col("policy_year") - 1,
        "attained_age"
    )
    .with_melt_dimension(
        "duration",
        columns=[str(i) for i in range(1, 26)] + ["Ultimate"],
        overflow=gs.assumptions.ExtendOverflow("Ultimate", to_value=100)
    )
    .with_value_column("mortality_rate")
    .build()
)

Performance Benefits

The assumption system provides significant performance improvements through intelligent storage selection and optimized data structures.

1. Adaptive Storage: Array vs Hash

Gaspatchio automatically selects the optimal storage backend based on table density:

┌─────────┬─────────────┬─────────────────────────────┬───────────────────────────┐
│ Storage │ Lookup Time │ Best For                    │ Memory                    │
├─────────┼─────────────┼─────────────────────────────┼───────────────────────────┤
│ Array   │ ~3ns        │ Dense tables (>30% filled)  │ Proportional to key range │
│ Hash    │ ~20ns       │ Sparse tables (<30% filled) │ Proportional to entries   │
└─────────┴─────────────┴─────────────────────────────┴───────────────────────────┘

How Array Storage Works:

String keys (like table_id: "A", "B", "C") are dictionary-encoded to integers (0, 1, 2). Integer keys (age 18-100, duration 0-24) are used directly. The linear index is computed as:

index = table_idx * 101 * 25 + age * 25 + duration
value = data[index]  # Direct O(1) array access

This eliminates hash computation entirely - just multiplication, addition, and array indexing.
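
A minimal Python/NumPy sketch of the same mechanics - illustrative only, since the real storage and lookup loop live in Rust:

import numpy as np

# Dictionary-encode string keys once at load time
table_ids = {"A": 0, "B": 1, "C": 2}
N_AGES, N_DURS = 101, 25                            # ages 0-100, durations 0-24
data = np.zeros(len(table_ids) * N_AGES * N_DURS)   # one flat, contiguous array

def lookup(table_id: str, age: int, duration: int) -> float:
    # Pure arithmetic plus one memory read - no hashing of the full key
    idx = table_ids[table_id] * N_AGES * N_DURS + age * N_DURS + duration
    return data[idx]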

Real-World Table Analysis:

┌──────────────────┬────────────────────────────┬─────────┬─────────────────┐
│ Lookup           │ Keys                       │ Density │ Array Suitable? │
├──────────────────┼────────────────────────────┼─────────┼─────────────────┤
│ mortality_select │ table_id, age, duration    │ ~30%    │ ✅ Yes          │
│ lapse_rates      │ lapse_id, duration         │ ~80%    │ ✅ Yes          │
│ inv_returns      │ scenario_id, t, fund_index │ ~95%    │ ✅ Yes          │
└──────────────────┴────────────────────────────┴─────────┴─────────────────┘

Most actuarial tables benefit from array storage automatically.
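
If you want to estimate which backend a table will get, density is roughly the row count divided by the size of the full key space. A quick back-of-the-envelope check in Polars - an approximation of the heuristic, not Gaspatchio's actual selection code, using the key columns from the table above:

import polars as pl

df = pl.read_parquet("assumptions/mortality.parquet")  # any tidy assumption table

# Rows present vs. all possible key combinations
key_space = (
    df["table_id"].n_unique() * df["age"].n_unique() * df["duration"].n_unique()
)
density = df.height / key_space  # array storage is chosen above ~30% density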

2. Pre-Computed Expansion

Overflow columns are expanded once at load time, not during every lookup:

# Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately
table = gs.Table(
    name="mortality",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.assumptions.MeltDimension(
            columns=duration_cols,
            overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="rate"
)

# All lookups are now O(1) array operations - no overflow logic needed
af.mort_rate = table.lookup(age=af.age, duration=100)  # duration=100 works instantly

3. Vector-Native Operations

No exploding, joining, or reaggregating required:

# Traditional approach: explode 1M policies × 480 months = 480M rows
# Gaspatchio: 1M policies with 480-element vectors = 1M rows

# Single operation handles entire projection
af.mort_rate = mortality_table.lookup(age=af.attained_age)

4. Benchmark: 100k Policies × 3 Scenarios

A real-world benchmark with 100,000 model points, 3 scenarios, and 180 timesteps (324 million total lookups):

┌───────────────┬─────────────┬──────────────────┬──────────────┐
│ Approach      │ Lookup Time │ Total Model Time │ Speedup      │
├───────────────┼─────────────┼──────────────────┼──────────────┤
│ Hash storage  │ ~27s        │ ~43s             │ 1x           │
│ Array storage │ ~1s         │ ~16s             │ 2.7x overall │
└───────────────┴─────────────┴──────────────────┴──────────────┘
The lookup portion alone drops from ~27 seconds to ~1 second, and the overall model runs 2.7x faster end to end.

5. GPU-Ready Architecture

The array storage strategy is designed with GPU acceleration in mind. Once tables are stored as dense arrays, GPU execution becomes trivial:

# Same arrays work on both CPU and GPU
# JAX/XLA automatically parallelizes across thousands of GPU threads
mort_rate = mort_array[table_idx, age, duration]  # Direct array indexing

No GPU hash tables required - the array structure enables massive parallelism automatically.
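
A sketch of what that looks like with JAX - illustrative only, not a Gaspatchio API:

import jax
import jax.numpy as jnp

mort_array = jnp.zeros((3, 101, 25))  # [table_id, age, duration]

@jax.jit
def batched_lookup(table_idx, age, duration):
    # Advanced indexing with vector arguments compiles to a single gather,
    # which XLA fans out across thousands of GPU threads
    return mort_array[table_idx, age, duration]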

API Reference

Table Class

gs.Table(
    name: str,                          # Table name for lookups
    source: str | pl.DataFrame,         # File path or DataFrame
    dimensions: dict[str, str | Dimension], # Dimension configuration
    value: str = "rate",                # Name for value column
    metadata: dict | None = None,       # Optional metadata storage
    validate: bool = True               # Enable validation
) -> Table

Parameters:

  • name: Unique identifier for the table in the lookup registry
  • source: Either a file path (.csv/.parquet) or a Polars DataFrame
  • dimensions: Dictionary mapping dimension names to columns or Dimension objects
  • value: Name of the value column in the final tidy table
  • metadata: Optional dictionary stored with the table
  • validate: Whether to validate dimension configuration

table.lookup()

table.lookup(
    **dimensions: str | pl.Expr         # Dimension names mapped to columns/expressions
) -> pl.Expr

Returns a Polars expression that performs the lookup. Use with attribute assignment (e.g., af.rate = table.lookup(...)).

Dimension Types

  • DataDimension: Maps a column directly to a dimension
  • MeltDimension: Transforms wide columns into long format
  • CategoricalDimension: Adds a constant categorical value
  • ComputedDimension: Creates a dimension from an expression

Strategy Types

  • ExtendOverflow: Extends a specific column value to higher indices
  • AutoDetectOverflow: Automatically detects overflow columns
  • LinearInterpolate: Fills gaps with linear interpolation
  • FillConstant: Fills gaps with a constant value
  • FillForward: Forward fills missing values
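
LinearInterpolate and ExtendOverflow appear in the examples above. Below is a hedged sketch of the gap-fill strategies - the constructor arguments are assumptions by analogy with LinearInterpolate(), and expense_df is a hypothetical source:

# Hypothetical sketch: a sparse duration grid filled by carrying rates forward
expense_table = gs.Table(
    name="expense_rates",
    source=expense_df,
    dimensions={
        "duration": gs.assumptions.MeltDimension(
            columns=["1", "5", "10", "20"],     # rates only at these durations
            name="duration",
            fill=gs.assumptions.FillForward(),  # carry each rate forward to the next
            # or: fill=gs.assumptions.FillConstant(0.0)
        )
    },
    value="expense_rate",
)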