Assumptions in Gaspatchio

Overview

Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically.

One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those.

Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest.

The Table API

Gaspatchio's assumption system revolves around the dimension-based Table API:

  • Table() - Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run.
  • table.lookup() - Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT.

import gaspatchio_core as gs
import polars as pl

# Load any assumption table with the Table API
mortality_table = gs.Table(
    name="mortality_rates",
    source="mortality_table.csv",  # or DataFrame
    dimensions={
        "age": "age"  # Simple string shorthand for data dimensions
    },
    value="mortality_rate"
)

# Use in projections with vector lookups
af = af.with_columns(
    mortality_table.lookup(age=af["age_last"])
)

Key Advantages

1. Dimension-Based Design

The API uses explicit dimensions for clarity and flexibility:

# Simple curve (1D table)
lapse_table = gs.Table(
    name="lapse_rates", 
    source=lapse_df,
    dimensions={
        "duration": "duration"  # Map duration column to duration dimension
    },
    value="lapse_rate"
)

# Wide table with melt dimension (age × duration grid)
mortality_table = gs.Table(
    name="mortality_rates",
    source=mortality_df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ultimate"],
            name="duration",
            overflow=gs.ExtendOverflow("Ultimate", to_value=120)
        )
    },
    value="qx"
)

# Multi-dimensional table
multi_dim_table = gs.Table(
    name="vbt_2015",
    source=vbt_df,
    dimensions={
        "age": "age",
        "sex": "sex",
        "smoker": "smoker_status"  # Can map different column names
    },
    value="mortality_rate"
)

2. Automatic Format Detection with Analysis

Use the analyze_table() function to get insights and configuration suggestions:

# Analyze any table to understand its structure
schema = gs.analyze_table(df)
print(schema.suggest_table_config())

# Output:
# Table(
#     name="your_table_name",
#     source=df,
#     dimensions={
#         "age": "age",
#         "duration": MeltDimension(
#             columns=["1", "2", "3", "4", "5", "Ultimate"],
#             overflow=ExtendOverflow("Ultimate", to_value=120)
#         )
#     },
#     value="rate"
# )

3. Smart Overflow Handling

Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly:

# Table with columns: Age, 1, 2, 3, 4, 5, "Ult."
mortality_table = gs.Table(
    name="mortality_table",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "4", "5", "Ult."],
            overflow=gs.ExtendOverflow("Ult.", to_value=120),  # Expands to duration 120
            fill=gs.LinearInterpolate()  # Optional: interpolate gaps
        )
    },
    value="rate"
)

# Lookups work seamlessly for any duration
af = af.with_columns(
    mortality_table.lookup(age=af["age"], duration=af["duration"])
)

4. Vector-Native Performance

Handle entire projection vectors without loops or exploding data:

# Age progresses as a vector per policy
df = df.with_columns(
    age_vector=[[30, 31, 32, 33, ...]]  # 480 months of ages
)

# Single lookup returns vector of rates for all ages
df = df.with_columns(
    mortality_table.lookup(age=pl.col("age_vector"))
)
# Result: [0.0011, 0.0012, 0.0013, ...]
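
The age_vector literal above is illustrative pseudocode. As a sketch in plain Polars (the pl.int_ranges construction is ours, not part of the Gaspatchio API), one way to build a real per-policy age vector is:

import polars as pl

# Hypothetical model points: one row per policy
df = pl.DataFrame({"policy_id": [1, 2], "age": [30, 45]})

# 480 monthly attained ages per policy: work in months,
# then floor-divide by 12 to recover whole ages
df = df.with_columns(
    age_vector=pl.int_ranges(
        pl.col("age") * 12,
        pl.col("age") * 12 + 480,
    ).list.eval(pl.element() // 12)
)
# Row 1: [30, 30, ..., 31, 31, ...]  (480 elements)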

Rust-Powered Multi-Core Performance

Gaspatchio's assumption system is implemented in Rust and leverages all available CPU cores automatically. The core registry (PyAssumptionTableRegistry) stores lookup indices as optimized Rust HashMap structures, providing:

  • O(1) hash-based lookups regardless of table size
  • Zero-copy memory access through Rust's ownership system
  • Automatic parallelization via Polars' multi-threaded query engine
  • SIMD vectorization for mathematical operations on assumption vectors

When you perform a lookup on 1 million policies with 480-month projections (480M total lookups), Gaspatchio distributes the work across all CPU cores simultaneously. On a 16-core machine, that can mean up to a 16x speedup over a traditional single-threaded approach.

# This single operation uses ALL your CPU cores
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
# 480M lookups completed in seconds, not minutes
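
To check the scaling on your own hardware, a minimal timing harness needs only the standard library (this assumes the mortality_table and af from the example above):

import time

start = time.perf_counter()

# Single vectorized lookup across every policy and month
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
# If your frame evaluates lazily, materialize the results here
# before stopping the clock

elapsed = time.perf_counter() - start
print(f"Lookup wall time: {elapsed:.3f}s")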

Tidy Data Principles

Following Tidy Data Best Practices

Gaspatchio's assumption system is built around the tidy data principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10).

Tidy datasets follow three fundamental rules:

  1. Each variable is a column - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns
  2. Each observation is a row - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023)
  3. Each type of observational unit is a table - mortality assumptions, lapse assumptions, etc. are separate tables

Why Tidy Assumptions Matter

Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation:

Wide Format (Human-Readable)

┌─────┬──────┬──────┬──────┬──────┐
│ Age │  1   │  2   │  3   │ Ult. │
├─────┼──────┼──────┼──────┼──────┤
│ 30  │0.001 │0.002 │0.003 │0.005 │
│ 31  │0.001 │0.002 │0.003 │0.005 │
└─────┴──────┴──────┴──────┴──────┘

Tidy Format (Machine-Optimized)

┌─────┬──────────┬───────┐
│ Age │ Duration │  Rate │
├─────┼──────────┼───────┤
│ 30  │    1     │ 0.001 │
│ 30  │    2     │ 0.002 │
│ 30  │    3     │ 0.003 │
│ 30  │   ...    │ 0.005 │
│ 30  │   120    │ 0.005 │
│ 31  │    1     │ 0.001 │
└─────┴──────────┴───────┘

Automatic Tidy Transformation

The Table class with MeltDimension automatically converts wide tables to tidy format:

# Input: Wide mortality table
wide_table = pl.DataFrame({
    "age": [30, 31, 32],
    "1": [0.0011, 0.0012, 0.0013],
    "2": [0.0012, 0.0013, 0.0014],
    "3": [0.0013, 0.0014, 0.0015],
    "Ult.": [0.0050, 0.0051, 0.0052]
})

# Automatic tidy transformation
mortality_table = gs.Table(
    name="mortality",
    source=wide_table,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=["1", "2", "3", "Ult."],
            name="duration"
        )
    },
    value="rate"
)

# Result: Tidy table ready for high-performance lookups
# Each age/duration combination becomes a separate row

The tidy format enables:

  • Vectorized lookups: Query millions of age/duration combinations in one vectorized pass, no Python loops required
  • Flexible filtering: Add conditions like gender, smoking status, or product type as additional columns
  • Consistent API: Same lookup pattern works for all assumption types
  • Memory efficiency: No duplicate storage of rates across multiple table formats
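
For intuition, the wide-to-tidy step that MeltDimension performs is conceptually the same as a plain Polars unpivot (melt in older Polars versions). The sketch below is illustrative only; Gaspatchio applies this transformation for you at load time:

import polars as pl

wide = pl.DataFrame({
    "age": [30, 31],
    "1": [0.001, 0.001],
    "2": [0.002, 0.002],
    "3": [0.003, 0.003],
    "Ult.": [0.005, 0.005],
})

# Melt the duration columns into (age, duration, rate) rows
tidy = wide.unpivot(
    on=["1", "2", "3", "Ult."],
    index="age",
    variable_name="duration",
    value_name="rate",
)
print(tidy)  # one row per age/duration combination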

Loading Different Table Types

Curve Tables (1-Dimensional)

For simple tables with one key and one value:

# Lapse rates by policy duration
lapse_df = pl.DataFrame({
    "policy_duration": [1, 2, 3, 4, 5],
    "lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01]
})

lapse_table = gs.Table(
    name="lapse_rates",
    source=lapse_df,
    dimensions={
        "policy_duration": "policy_duration"
    },
    value="lapse_rate"
)

Wide Tables (Age × Category Grids)

For mortality tables and similar multi-dimensional assumptions:

# Mortality table with multiple gender/smoking combinations
mortality_table = gs.Table(
    name="mortality_vbt_2015",
    source="mortality.parquet",
    dimensions={
        "age-last": "age-last",
        "variable": gs.MeltDimension(
            columns=["MNS", "FNS", "MS", "FS"],  # Male/Female, Non-Smoker/Smoker
            name="variable"
        )
    },
    value="mortality_rate"
)

Input DataFrame:

┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ age-last │ MNS      │ FNS      │ MS       │ FS       │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ 30       │ 0.0011   │ 0.0010   │ 0.0021   │ 0.0019   │
│ 31       │ 0.0012   │ 0.0011   │ 0.0022   │ 0.0020   │
└──────────┴──────────┴──────────┴──────────┴──────────┘

Automatic transformation to tidy format:

┌──────────┬──────────┬───────────────┐
│ age_last │ variable │ mortality_rate│
├──────────┼──────────┼───────────────┤
│ 30       │ MNS      │ 0.0011        │
│ 30       │ FNS      │ 0.0010        │
│ 30       │ MS       │ 0.0021        │
│ 30       │ FS       │ 0.0019        │
│ 31       │ MNS      │ 0.0012        │
│ 31       │ FNS      │ 0.0011        │
└──────────┴──────────┴───────────────┘

Tables with Overflow Columns

For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range:

# VBT 2015 table with durations 1-25 plus "Ult." column
vbt_table = gs.Table(
    name="vbt_2015_female_smoker",
    source="2015-VBT-FSM-ANB.csv",
    dimensions={
        "issue_age": "issue_age",
        "duration": gs.MeltDimension(
            columns=[str(i) for i in range(1, 26)] + ["Ult."],
            name="duration",
            overflow=gs.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="qx"
)

This automatically creates lookup entries for durations 26, 27, 28, ... 120, all using the "Ultimate" rate from the original table.
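
Conceptually, that expansion is equivalent to cross-joining the "Ult." rows with the missing duration range in plain Polars. The sketch below is for intuition only (Gaspatchio does this in Rust at load time) and assumes a melted frame tidy with columns issue_age, duration, and qx:

import polars as pl

# Split off the "Ult." rows
ult = tidy.filter(pl.col("duration") == "Ult.").drop("duration")

# Replicate each "Ult." rate for durations 26..120
expanded = ult.join(
    pl.DataFrame({"duration": [str(d) for d in range(26, 121)]}),
    how="cross",
)

# Recombine with the explicit durations 1..25
tidy = pl.concat([
    tidy.filter(pl.col("duration") != "Ult."),
    expanded.select(tidy.columns),  # align column order
])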

Performing Lookups

Single-Key Lookups

# Simple lapse rate lookup
af = af.with_columns(
    lapse_table.lookup(policy_duration=af["policy_duration"])
)

Multi-Key Lookups

# Mortality lookup with age and gender/smoking status
af = af.with_columns(
    mortality_table.lookup(
        age_last=af["age_last"],
        variable=af["gender_smoking"]
    )
)

Vector Lookups

The most powerful feature - handle entire projection vectors:

# Project 480 months for each policy
af = af.with_columns(
    monthly_ages=(af["issue_age"] + (af["projection_months"] / 12)).floor(),  # whole ages for exact key matches
    monthly_durations=(af["policy_duration"] + (af["projection_months"] / 12)).floor()
)

# Single lookup returns 480 mortality rates per policy
af = af.with_columns(
    mortality_table.lookup(
        age=af["monthly_ages"],
        duration=af["monthly_durations"]
    )
)

Complete Model Example

Here's how assumption tables integrate into a complete actuarial model:

import gaspatchio_core as gs
import polars as pl
from gaspatchio_core import ActuarialFrame

def setup_assumptions():
    """Load all assumption tables for the model"""

    # Load mortality table (wide format with overflow)
    mortality_df = pl.read_parquet("assumptions/mortality.parquet")
    mortality_table = gs.Table(
        name="mortality_rates",
        source=mortality_df,
        dimensions={
            "age-last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="mortality_rate"
    )

    # Load lapse curve (simple 1D table)
    lapse_df = pl.read_parquet("assumptions/lapse.parquet")
    lapse_table = gs.Table(
        name="lapse_rates",
        source=lapse_df,
        dimensions={
            "policy_duration": "policy_duration"
        },
        value="lapse_rate"
    )

    # Load premium rates (wide format)
    premium_df = pl.read_parquet("assumptions/premium_rates.parquet")
    premium_table = gs.Table(
        name="premium_rates",
        source=premium_df,
        dimensions={
            "age-last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="premium_rate"
    )

    return mortality_table, lapse_table, premium_table

def life_model(policies_df):
    """Complete life insurance projection model"""

    # Setup assumption tables
    mortality_table, lapse_table, premium_table = setup_assumptions()

    # Create ActuarialFrame
    af = ActuarialFrame(policies_df)

    # Setup projection vectors (480 months per policy)
    max_age = 101
    af["num_proj_months"] = (max_age - af["age"]) * 12
    af["proj_months"] = af.fill_series(af["num_proj_months"], 0, 1)

    # Calculate age and duration vectors
    af["age_last"] = (af["age"] + (af["proj_months"] / 12)).floor()
    af["policy_duration"] = (af["policy_duration"] + (af["proj_months"] / 12)).floor()

    # Create gender/smoking variable for lookups
    af["variable"] = af["gender"] + af["smoking_status"]

    # Vector lookups - get rates for all 480 months at once
    af["mortality_rate"] = mortality_table.lookup({
        "age_last": af["age_last"],
        "variable": af["variable"]
    })

    af["lapse_rate"] = lapse_table.lookup({
        "policy_duration": af["policy_duration"]
    })

    af["premium_rate"] = premium_table.lookup({
        "age_last": af["age_last"],
        "variable": af["variable"]
    })

    # Calculate probabilities and cash flows
    af["monthly_persist_prob"] = (1 - af["mortality_rate"] / 12) * (1 - af["lapse_rate"] / 12)

    # Probability in force (cumulative product with shift)
    af["prob_in_force"] = af["monthly_persist_prob"].list.eval(
        pl.element().cum_prod().shift(1).fill_null(1.0)
    )

    # Cash flows
    af["premium_cf"] = af["premium_rate"] / 12 * af["prob_in_force"] * af["sum_assured"] / 1000
    af["claims_cf"] = af["prob_in_force"] * af["mortality_rate"] / 12 * af["sum_assured"]
    af["profit_cf"] = af["premium_cf"] - af["claims_cf"]

    return af

# Run the model
policies = pl.read_csv("model_points.csv")
results = life_model(policies)

Using the TableBuilder Pattern

For complex table configurations, use the TableBuilder pattern:

# Build a complex table step by step
table = (
    gs.TableBuilder("complex_mortality")
    .from_source("mortality_data.csv")
    .with_data_dimension("issue_age", "issue_age")
    .with_data_dimension("policy_year", "policy_year")
    .with_computed_dimension(
        "attained_age",
        pl.col("issue_age") + pl.col("policy_year") - 1,
        "attained_age"
    )
    .with_melt_dimension(
        "duration",
        columns=[str(i) for i in range(1, 26)] + ["Ultimate"],
        overflow=gs.ExtendOverflow("Ultimate", to_value=100)
    )
    .with_value_column("mortality_rate")
    .build()
)
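
Once built, the table is used exactly like one created with gs.Table directly. The call below is a sketch: it assumes the data and melt dimensions are supplied as lookup keys, while the computed attained_age dimension is derived from the table at load time:

af = af.with_columns(
    table.lookup(
        issue_age=af["issue_age"],
        policy_year=af["policy_year"],
        duration=af["duration"],
    )
)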

Performance Benefits

The assumption system provides significant performance improvements:

1. Pre-Computed Expansion

Overflow columns are expanded once at load time, not during every lookup:

# Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately
table = gs.Table(
    name="mortality",
    source=df,
    dimensions={
        "age": "age",
        "duration": gs.MeltDimension(
            columns=duration_cols,
            overflow=gs.ExtendOverflow("Ult.", to_value=120)
        )
    },
    value="rate"
)

# All lookups are now O(1) hash operations - no overflow logic needed
af = af.with_columns(
    table.lookup({"age": af["age"], "duration": 100})  # duration=100 works instantly
)

2. Vector-Native Operations

No exploding, joining, or reaggregating required:

# Traditional approach: explode 1M policies × 480 months = 480M rows
# Gaspatchio: 1M policies with 480-element vectors = 1M rows

# Single operation handles entire projection
af = af.with_columns(
    mortality_table.lookup({"age": af["age_vector"]})
)

3. Optimized Hash-Based Lookups

Built on Rust HashMaps for maximum performance:

  • O(1) lookup time regardless of table size
  • Efficient memory usage with pre-indexed structures
  • Integrates with Polars' lazy evaluation for optimal query planning
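
Conceptually, the registry behaves like a hash map keyed on dimension tuples. The pure-Python toy below ignores parallelism and vectorization and is not the actual Rust implementation; it only illustrates why lookup cost is independent of table size:

# Toy lookup index: dimension tuple -> rate
index: dict[tuple[int, str], float] = {
    (30, "MNS"): 0.0011,
    (30, "FNS"): 0.0010,
    (31, "MNS"): 0.0012,
}

def lookup(ages: list[int], variables: list[str]) -> list[float | None]:
    """O(1) per element, so total cost scales with the number
    of lookups, never with the size of the table."""
    return [index.get(key) for key in zip(ages, variables)]

print(lookup([30, 31], ["MNS", "MNS"]))  # [0.0011, 0.0012]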

API Reference

Table Class

gs.Table(
    name: str,                          # Table name for lookups
    source: str | pl.DataFrame,         # File path or DataFrame
    dimensions: dict[str, str | Dimension], # Dimension configuration
    value: str = "rate",                # Name for value column
    metadata: dict | None = None,       # Optional metadata storage
    validate: bool = True               # Enable validation
) -> Table

Parameters:

  • name: Unique identifier for the table in the lookup registry
  • source: Either a file path (.csv/.parquet) or a Polars DataFrame
  • dimensions: Dictionary mapping dimension names to columns or Dimension objects
  • value: Name of the value column in the final tidy table
  • metadata: Optional dictionary stored with the table
  • validate: Whether to validate dimension configuration

table.lookup()

table.lookup(
    **dimensions: str | pl.Expr         # Dimension names mapped to columns/expressions
) -> pl.Expr

Returns a Polars expression that performs the lookup. Use within .with_columns() or similar Polars operations.

Dimension Types

  • DataDimension: Maps a column directly to a dimension
  • MeltDimension: Transforms wide columns into long format
  • CategoricalDimension: Adds a constant categorical value
  • ComputedDimension: Creates a dimension from an expression

Strategy Types

  • ExtendOverflow: Extends a specific column value to higher indices
  • AutoDetectOverflow: Automatically detects overflow columns
  • LinearInterpolate: Fills gaps with linear interpolation
  • FillConstant: Fills gaps with a constant value
  • FillForward: Forward fills missing values