Assumptions in Gaspatchio

Overview

Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically.

One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those.

Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest.

The Table API

Gaspatchio's assumption system revolves around the dimension-based Table API:

  • Table() - Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run.
  • table.lookup() - Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT.

import gaspatchio_core as gs
import polars as pl

# Load any assumption table with the Table API
mortality_table = gs.Table(
    name="mortality_rates",
    source="mortality_table.csv",  # or DataFrame
    dimensions={
        "age": "age"  # Simple string shorthand for data dimensions
    },
    value="mortality_rate"
)

# Use in projections with vector lookups
af = af.with_columns(
    mortality_table.lookup(age=af["age_last"])
)

Key Advantages

1. Dimension-Based Design

The API uses explicit dimensions for clarity and flexibility:

# Simple curve (1D table)
lapse_table = gs.Table(
    name="lapse_rates", 
    source=lapse_df,
    dimensions={
        "duration": "duration"  # Map duration column to duration dimension
    },
    value="lapse_rate"
)

# Wide table with melt dimension (age × duration grid)
mortality_table = gs.Table(
    name="mortality_rates",
    source=mortality_df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ultimate"],
            name="duration",
            overflow=gs.ExtendOverflow("Ultimate", to_value=120)
        )
    },
    value="qx"
)

# Multi-dimensional table
multi_dim_table = gs.Table(
    name="vbt_2015",
    source=vbt_df,
    dimensions={
        "age": "age",
        "sex": "sex",
        "smoker": "smoker_status"  # Can map different column names
    },
    value="mortality_rate"
)

2. Automatic Format Detection with Analysis

Use the analyze_table() function to get insights and configuration suggestions:

# Analyze any table to understand its structure
schema = gs.analyze_table(df)
print(schema.suggest_table_config())

# Output:
# Table(
#     name="your_table_name",
#     source=df,
#     dimensions={
#         "age": "age",
#         "duration": MeltDimension(
#             columns=["1", "2", "3", "4", "5", "Ultimate"],
#             overflow=ExtendOverflow("Ultimate", to_value=120)
#         )
#     },
#     value="rate"
# )

3. Smart Overflow Handling

Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly:

# Table with columns: Age, 1, 2, 3, 4, 5, "Ult."
mortality_table = gs.Table(
    name="mortality_table",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ult."],
            overflow=gs.ExtendOverflow("Ult.", to_value=120),  # Expands to duration 120
            fill=gs.LinearInterpolate()  # Optional: interpolate gaps
        )
    },
    value="rate"
)

# Lookups work seamlessly for any duration
af = af.with_columns(
    mortality_table.lookup(age=af["age"], duration=af["duration"])
)

4. Vector-Native Performance

Handle entire projection vectors without loops or exploding data:

# Age progresses as a vector per policy
df = df.with_columns(
    age_vector=[[30, 31, 32, 33, ...]]  # 480 months of ages
)

# Single lookup returns vector of rates for all ages
df = df.with_columns(
    mortality_table.lookup(age=pl.col("age_vector"))
)
# Result: [0.0011, 0.0012, 0.0013, ...]
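
The age_vector literal above is illustrative pseudocode. As a sketch in plain Polars (the pl.int_ranges construction is ours, not part of the Gaspatchio API), one way to build a real per-policy age vector is:

import polars as pl

# Hypothetical model points: one row per policy
df = pl.DataFrame({"policy_id": [1, 2], "age": [30, 45]})

# 480 monthly attained ages per policy: work in months,
# then floor-divide by 12 to recover whole ages
df = df.with_columns(
    age_vector=pl.int_ranges(
        pl.col("age") * 12,
        pl.col("age") * 12 + 480,
    ).list.eval(pl.element() // 12)
)
# Row 1: [30, 30, ..., 31, 31, ...]  (480 elements)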

Rust-Powered Multi-Core Performance

Gaspatchio's assumption system is implemented in Rust and leverages all available CPU cores automatically. The core registry (PyAssumptionTableRegistry) stores lookup indices as optimized Rust HashMap structures, providing:

  • O(1) hash-based lookups regardless of table size
  • Zero-copy memory access through Rust's ownership system
  • Automatic parallelization via Polars' multi-threaded query engine
  • SIMD vectorization for mathematical operations on assumption vectors

When you perform a lookup on 1 million policies with 480-month projections (480M total lookups), Gaspatchio distributes the work across all CPU cores simultaneously. On a 16-core machine, that can mean up to a 16x speedup over a traditional single-threaded approach.

# This single operation uses ALL your CPU cores
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
# 480M lookups completed in seconds, not minutes
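
To check the scaling on your own hardware, a minimal timing harness needs only the standard library (this assumes the mortality_table and af from the example above):

import time

start = time.perf_counter()

# Single vectorized lookup across every policy and month
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
# If your frame evaluates lazily, materialize the results here
# before stopping the clock

elapsed = time.perf_counter() - start
print(f"Lookup wall time: {elapsed:.3f}s")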

Tidy Data Principles

Following Tidy Data Best Practices

Gaspatchio's assumption system is built around the tidy data principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10).

Tidy datasets follow three fundamental rules:

  1. Each variable is a column - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns
  2. Each observation is a row - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023)
  3. Each type of observational unit is a table - mortality assumptions, lapse assumptions, etc. are separate tables

Why Tidy Assumptions Matter

Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation:

Wide Format (Human-Readable)

┌─────┬──────┬──────┬──────┬──────┐
│ Age │  1   │  2   │  3   │ Ult. │
├─────┼──────┼──────┼──────┼──────┤
│ 30  │0.001 │0.002 │0.003 │0.005 │
│ 31  │0.001 │0.002 │0.003 │0.005 │
└─────┴──────┴──────┴──────┴──────┘

Tidy Format (Machine-Optimized)

┌─────┬──────────┬───────┐
│ Age │ Duration │  Rate │
├─────┼──────────┼───────┤
│ 30  │    1     │ 0.001 │
│ 30  │    2     │ 0.002 │
│ 30  │    3     │ 0.003 │
│ 30  │   ...    │ 0.005 │
│ 30  │   120    │ 0.005 │
│ 31  │    1     │ 0.001 │
└─────┴──────────┴───────┘

Automatic Tidy Transformation

The Table class with MeltDimension automatically converts wide tables to tidy format:

# Input: Wide mortality table
wide_table = pl.DataFrame({
    "age": [30, 31, 32],
    "1": [0.0011, 0.0012, 0.0013],
    "2": [0.0012, 0.0013, 0.0014],
    "3": [0.0013, 0.0014, 0.0015],
    "Ult.": [0.0050, 0.0051, 0.0052]
})

# Automatic tidy transformation
mortality_table = gs.Table(
    name="mortality",
    source=wide_table,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "Ult."],
            name="duration"
        )
    },
    value="rate"
)

# Result: Tidy table ready for high-performance lookups
# Each age/duration combination becomes a separate row

The tidy format enables:

  • Vectorized lookups: Query millions of age/duration combinations in one vectorized pass, no Python loops required
  • Flexible filtering: Add conditions like gender, smoking status, or product type as additional columns
  • Consistent API: Same lookup pattern works for all assumption types
  • Memory efficiency: No duplicate storage of rates across multiple table formats
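
For intuition, the wide-to-tidy step that MeltDimension performs is conceptually the same as a plain Polars unpivot (melt in older Polars versions). The sketch below is illustrative only; Gaspatchio applies this transformation for you at load time:

import polars as pl

wide = pl.DataFrame({
    "age": [30, 31],
    "1": [0.001, 0.001],
    "2": [0.002, 0.002],
    "3": [0.003, 0.003],
    "Ult.": [0.005, 0.005],
})

# Melt the duration columns into (age, duration, rate) rows
tidy = wide.unpivot(
    on=["1", "2", "3", "Ult."],
    index="age",
    variable_name="duration",
    value_name="rate",
)
print(tidy)  # one row per age/duration combination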

Loading Different Table Types

Curve Tables (1-Dimensional)

For simple tables with one key and one value:

# Lapse rates by policy duration
lapse_df = pl.DataFrame({
    "policy_duration": [1, 2, 3, 4, 5],
    "lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01]
})

lapse_table = gs.Table(
    name="lapse_rates",
    source=lapse_df,
    dimensions={
        "policy_duration": "policy_duration"
    },
    value="lapse_rate"
)

Wide Tables (Age × Category Grids)

For mortality tables and similar multi-dimensional assumptions:

# Mortality table with multiple gender/smoking combinations
mortality_table = gs.Table(
    name="mortality_vbt_2015",
    source="mortality.parquet",
    dimensions={
        "age-last": "age-last",
        "variable": gs.MeltDimension(
            columns=["MNS", "FNS", "MS", "FS"],  # Male/Female, Non-Smoker/Smoker
            name="variable"
        )
    },
    value="mortality_rate"
)

Input DataFrame:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ age-last │ MNS      │ FNS      │ MS       │ FS       │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ 30       │ 0.0011   │ 0.0010   │ 0.0021   │ 0.0019   │
│ 31       │ 0.0012   │ 0.0011   │ 0.0022   │ 0.0020   │
└──────────┴──────────┴──────────┴──────────┴──────────┘

Automatic transformation to tidy format:

┌──────────┬──────────┬───────────────┐
│ age_last │ variable │ mortality_rate│
├──────────┼──────────┼───────────────┤
│ 30       │ MNS      │ 0.0011        │
│ 30       │ FNS      │ 0.0010        │
│ 30       │ MS       │ 0.0021        │
│ 30       │ FS       │ 0.0019        │
│ 31       │ MNS      │ 0.0012        │
│ 31       │ FNS      │ 0.0011        │
└──────────┴──────────┴───────────────┘

Tables with Overflow Columns

For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range:

# VBT 2015 table with durations 1-25 plus "Ult." column
vbt_table = gs.Table(
    name="vbt_2015_female_smoker",
    source="2015-VBT-FSM-ANB.csv",
    dimensions={
        "issue_age": "issue_age",
        "duration": gs.MeltDimension(
            columns=[str(i) for i in range(1, 26)] + ["Ult."],
            name="duration",
            overflow=gs.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="qx"
)

This automatically creates lookup entries for durations 26, 27, 28, ... 120, all using the "Ultimate" rate from the original table.
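
Conceptually, that expansion is equivalent to cross-joining the "Ult." rows with the missing duration range in plain Polars. The sketch below is for intuition only (Gaspatchio does this in Rust at load time) and assumes a melted frame tidy with columns issue_age, duration, and qx:

import polars as pl

# Split off the "Ult." rows
ult = tidy.filter(pl.col("duration") == "Ult.").drop("duration")

# Replicate each "Ult." rate for durations 26..120
expanded = ult.join(
    pl.DataFrame({"duration": [str(d) for d in range(26, 121)]}),
    how="cross",
)

# Recombine with the explicit durations 1..25
tidy = pl.concat([
    tidy.filter(pl.col("duration") != "Ult."),
    expanded.select(tidy.columns),  # align column order
])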

Performing Lookups

Single-Key Lookups

# Simple lapse rate lookup
af = af.with_columns(
    lapse_table.lookup(policy_duration=af["policy_duration"])
)

Multi-Key Lookups

# Mortality lookup with age and gender/smoking status
af = af.with_columns(
    mortality_table.lookup(
        age_last=af["age_last"],
        variable=af["gender_smoking"]
    )
)

Vector Lookups

The most powerful feature - handle entire projection vectors:

# Project 480 months for each policy
af = af.with_columns(
    monthly_ages=(af["issue_age"] + (af["projection_months"] / 12)).floor(),  # whole ages for exact key matches
    monthly_durations=(af["policy_duration"] + (af["projection_months"] / 12)).floor()
)

# Single lookup returns 480 mortality rates per policy
af = af.with_columns(
    mortality_table.lookup(
        age=af["monthly_ages"],
        duration=af["monthly_durations"]
    )
)

Complete Model Example

Here's how assumption tables integrate into a complete actuarial model:

import gaspatchio_core as gs
import polars as pl
from gaspatchio_core import ActuarialFrame

def setup_assumptions():
    """Load all assumption tables for the model"""

    # Load mortality table (wide format with overflow)
    mortality_df = pl.read_parquet("assumptions/mortality.parquet")
    mortality_table = gs.Table(
        name="mortality_rates",
        source=mortality_df,
        dimensions={
            "age-last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="mortality_rate"
    )

    # Load lapse curve (simple 1D table)
    lapse_df = pl.read_parquet("assumptions/lapse.parquet")
    lapse_table = gs.Table(
        name="lapse_rates",
        source=lapse_df,
        dimensions={
            "policy_duration": "policy_duration"
        },
        value="lapse_rate"
    )

    # Load premium rates (wide format)
    premium_df = pl.read_parquet("assumptions/premium_rates.parquet")
    premium_table = gs.Table(
        name="premium_rates",
        source=premium_df,
        dimensions={
            "age-last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="premium_rate"
    )

    return mortality_table, lapse_table, premium_table

def life_model(policies_df):
    """Complete life insurance projection model"""

    # Setup assumption tables
    mortality_table, lapse_table, premium_table = setup_assumptions()

    # Create ActuarialFrame
    af = ActuarialFrame(policies_df)

    # Setup projection vectors (480 months per policy)
    max_age = 101
    af["num_proj_months"] = (max_age - af["age"]) * 12
    af["proj_months"] = af.fill_series(af["num_proj_months"], 0, 1)

    # Calculate age and duration vectors
    af["age_last"] = (af["age"] + (af["proj_months"] / 12)).floor()
    af["policy_duration"] = (af["policy_duration"] + (af["proj_months"] / 12)).floor()

    # Create gender/smoking variable for lookups
    af["variable"] = af["gender"] + af["smoking_status"]

    # Vector lookups - get rates for all 480 months at once
    af["mortality_rate"] = mortality_table.lookup({
        "age_last": af["age_last"],
        "variable": af["variable"]
    })

    af["lapse_rate"] = lapse_table.lookup({
        "policy_duration": af["policy_duration"]
    })

    af["premium_rate"] = premium_table.lookup({
        "age_last": af["age_last"],
        "variable": af["variable"]
    })

    # Calculate probabilities and cash flows
    af["monthly_persist_prob"] = (1 - af["mortality_rate"] / 12) * (1 - af["lapse_rate"] / 12)

    # Probability in force (cumulative product with shift)
    af["prob_in_force"] = af["monthly_persist_prob"].list.eval(
        pl.element().cum_prod().shift(1).fill_null(1.0)
    )

    # Cash flows
    af["premium_cf"] = af["premium_rate"] / 12 * af["prob_in_force"] * af["sum_assured"] / 1000
    af["claims_cf"] = af["prob_in_force"] * af["mortality_rate"] / 12 * af["sum_assured"]
    af["profit_cf"] = af["premium_cf"] - af["claims_cf"]

    return af

# Run the model
policies = pl.read_csv("model_points.csv")
results = life_model(policies)

Using the TableBuilder Pattern

For complex table configurations, use the TableBuilder pattern:

# Build a complex table step by step
table = (
    gs.TableBuilder("complex_mortality")
    .from_source("mortality_data.csv")
    .with_data_dimension("issue_age", "issue_age")
    .with_data_dimension("policy_year", "policy_year")
    .with_computed_dimension(
        "attained_age",
        pl.col("issue_age") + pl.col("policy_year") - 1,
        "attained_age"
    )
    .with_melt_dimension(
        "duration",
        columns=[str(i) for i in range(1, 26)] + ["Ultimate"],
        overflow=gs.ExtendOverflow("Ultimate", to_value=100)
    )
    .with_value_column("mortality_rate")
    .build()
)
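
Once built, the table is used exactly like one created with gs.Table directly. The call below is a sketch: it assumes the data and melt dimensions are supplied as lookup keys, while the computed attained_age dimension is derived from the table at load time:

af = af.with_columns(
    table.lookup(
        issue_age=af["issue_age"],
        policy_year=af["policy_year"],
        duration=af["duration"],
    )
)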

Performance Benefits

The assumption system provides significant performance improvements:

1. Pre-Computed Expansion

Overflow columns are expanded once at load time, not during every lookup:

# Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately
table = gs.Table(
    name="mortality",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=duration_cols,
            overflow=gs.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="rate"
)

# All lookups are now O(1) hash operations - no overflow logic needed
af = af.with_columns(
    table.lookup({"age": af["age"], "duration": 100})  # duration=100 works instantly
)

2. Vector-Native Operations

No exploding, joining, or reaggregating required:

# Traditional approach: explode 1M policies × 480 months = 480M rows
# Gaspatchio: 1M policies with 480-element vectors = 1M rows

# Single operation handles entire projection
af = af.with_columns(
    mortality_table.lookup({"age": af["age_vector"]})
)

3. Optimized Hash-Based Lookups

Built on Rust HashMaps for maximum performance:

  • O(1) lookup time regardless of table size
  • Efficient memory usage with pre-indexed structures
  • Integrates with Polars' lazy evaluation for optimal query planning
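
Conceptually, the registry behaves like a hash map keyed on dimension tuples. The pure-Python toy below ignores parallelism and vectorization and is not the actual Rust implementation; it only illustrates why lookup cost is independent of table size:

# Toy lookup index: dimension tuple -> rate
index: dict[tuple[int, str], float] = {
    (30, "MNS"): 0.0011,
    (30, "FNS"): 0.0010,
    (31, "MNS"): 0.0012,
}

def lookup(ages: list[int], variables: list[str]) -> list[float | None]:
    """O(1) per element, so total cost scales with the number
    of lookups, never with the size of the table."""
    return [index.get(key) for key in zip(ages, variables)]

print(lookup([30, 31], ["MNS", "MNS"]))  # [0.0011, 0.0012]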

API Reference

Table Class

gs.Table(
    name: str,                          # Table name for lookups
    source: str | pl.DataFrame,         # File path or DataFrame
    dimensions: dict[str, str | Dimension], # Dimension configuration
    value: str = "rate",                # Name for value column
    metadata: dict | None = None,       # Optional metadata storage
    validate: bool = True               # Enable validation
) -> Table

Parameters:

  • name: Unique identifier for the table in the lookup registry
  • source: Either a file path (.csv/.parquet) or a Polars DataFrame
  • dimensions: Dictionary mapping dimension names to columns or Dimension objects
  • value: Name of the value column in the final tidy table
  • metadata: Optional dictionary stored with the table
  • validate: Whether to validate dimension configuration

table.lookup()

table.lookup(
    **dimensions: str | pl.Expr         # Dimension names mapped to columns/expressions
) -> pl.Expr

Returns a Polars expression that performs the lookup. Use within .with_columns() or similar Polars operations.

Dimension Types

  • DataDimension: Maps a column directly to a dimension
  • MeltDimension: Transforms wide columns into long format
  • CategoricalDimension: Adds a constant categorical value
  • ComputedDimension: Creates a dimension from an expression

Strategy Types

  • ExtendOverflow: Extends a specific column value to higher indices
  • AutoDetectOverflow: Automatically detects overflow columns
  • LinearInterpolate: Fills gaps with linear interpolation
  • FillConstant: Fills gaps with a constant value
  • FillForward: Forward fills missing values