Assumptions in Gaspatchio¶
Overview¶
Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically.
One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those.
Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest.
The Table API¶
Gaspatchio's assumption system revolves around the dimension-based Table API:
Table()
- Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run.
table.lookup()
- Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT.
import gaspatchio_core as gs
import polars as pl
# Load any assumption table with the Table API
mortality_table = gs.Table(
name="mortality_rates",
source="mortality_table.csv", # or DataFrame
dimensions={
"age": "age" # Simple string shorthand for data dimensions
},
value="mortality_rate"
)
# Use in projections with vector lookups
af = af.with_columns(
mortality_table.lookup(age=af["age_last"])
)
Key Advantages¶
1. Dimension-Based Design¶
The API uses explicit dimensions for clarity and flexibility:
# Simple curve (1D table)
lapse_table = gs.Table(
name="lapse_rates",
source=lapse_df,
dimensions={
"duration": "duration" # Map duration column to duration dimension
},
value="lapse_rate"
)
# Wide table with melt dimension (age × duration grid)
mortality_table = gs.Table(
name="mortality_rates",
source=mortality_df,
dimensions={
"age": "age",
"duration": gs.MeltDimension(
columns=["1", "2", "3", "4", "5", "Ultimate"],
name="duration",
overflow=gs.ExtendOverflow("Ultimate", to_value=120)
)
},
value="qx"
)
# Multi-dimensional table
multi_dim_table = gs.Table(
name="vbt_2015",
source=vbt_df,
dimensions={
"age": "age",
"sex": "sex",
"smoker": "smoker_status" # Can map different column names
},
value="mortality_rate"
)
2. Automatic Format Detection with Analysis¶
Use the analyze_table() function to get insights and configuration suggestions:
# Analyze any table to understand its structure
schema = gs.analyze_table(df)
print(schema.suggest_table_config())
# Output:
# Table(
# name="your_table_name",
# source=df,
# dimensions={
# "age": "age",
# "duration": MeltDimension(
# columns=["1", "2", "3", "4", "5", "Ultimate"],
# overflow=ExtendOverflow("Ultimate", to_value=120)
# )
# },
# value="rate"
# )
3. Smart Overflow Handling¶
Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly:
# Table with columns: Age, 1, 2, 3, 4, 5, "Ult."
mortality_table = gs.Table(
name="mortality_table",
source=df,
dimensions={
"age": "age",
"duration": gs.MeltDimension(
columns=["1", "2", "3", "4", "5", "Ult."],
overflow=gs.ExtendOverflow("Ult.", to_value=120), # Expands to duration 120
fill=gs.LinearInterpolate() # Optional: interpolate gaps
)
},
value="rate"
)
# Lookups work seamlessly for any duration
af = af.with_columns(
mortality_table.lookup(age=af["age"], duration=af["duration"])
)
4. Vector-Native Performance¶
Handle entire projection vectors without loops or exploding data:
# Age progresses as a vector per policy (illustrative values)
df = df.with_columns(
    age_vector=pl.lit([30, 31, 32, 33])  # ..., 480 months of ages in practice
)
# Single lookup returns vector of rates for all ages
df = df.with_columns(
mortality_table.lookup(age=pl.col("age_vector"))
)
# Result: [0.0011, 0.0012, 0.0013, ...]
Rust-Powered Multi-Core Performance
Gaspatchio's assumption system is implemented in Rust and leverages all available CPU cores automatically. The core registry (PyAssumptionTableRegistry) stores lookup indices as optimized Rust HashMap structures, providing:
- O(1) hash-based lookups regardless of table size
- Zero-copy memory access through Rust's ownership system
- Automatic parallelization via Polars' multi-threaded query engine
- SIMD vectorization for mathematical operations on assumption vectors
When you perform a lookup on 1 million policies with 480-month projections (480M total lookups), Gaspatchio distributes the work across all CPU cores simultaneously. On a 16-core machine, assumption lookups can approach a 16x speedup over a single-threaded equivalent.
# This single operation uses ALL your CPU cores
af = af.with_columns(
mortality_table.lookup(age=af["age_vector"])
)
# 480M lookups completed in seconds, not minutes
Tidy Data Principles¶
Following Tidy Data Best Practices
Gaspatchio's assumption system is built around the tidy data principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10).
Tidy datasets follow three fundamental rules:
- Each variable is a column - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns
- Each observation is a row - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023)
- Each type of observational unit is a table - mortality assumptions, lapse assumptions, etc. are separate tables
Why Tidy Assumptions Matter¶
Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation:
Wide Format (Human-Readable)
┌─────┬──────┬──────┬──────┬──────┐
│ Age │ 1 │ 2 │ 3 │ Ult. │
├─────┼──────┼──────┼──────┼──────┤
│ 30 │0.001 │0.002 │0.003 │0.005 │
│ 31 │0.001 │0.002 │0.003 │0.005 │
└─────┴──────┴──────┴──────┴──────┘
Tidy Format (Machine-Optimized)
┌─────┬──────────┬───────┐
│ Age │ Duration │ Rate │
├─────┼──────────┼───────┤
│ 30 │ 1 │ 0.001 │
│ 30 │ 2 │ 0.002 │
│ 30 │ 3 │ 0.003 │
│ 30 │ 120 │ 0.005 │
│ 31 │ 1 │ 0.001 │
└─────┴──────────┴───────┘
Automatic Tidy Transformation¶
The Table class with MeltDimension automatically converts wide tables to tidy format:
# Input: Wide mortality table
wide_table = pl.DataFrame({
"age": [30, 31, 32],
"1": [0.0011, 0.0012, 0.0013],
"2": [0.0012, 0.0013, 0.0014],
"3": [0.0013, 0.0014, 0.0015],
"Ult.": [0.0050, 0.0051, 0.0052]
})
# Automatic tidy transformation
mortality_table = gs.Table(
name="mortality",
source=wide_table,
dimensions={
"age": "age",
"duration": gs.MeltDimension(
columns=["1", "2", "3", "Ult."],
name="duration"
)
},
value="rate"
)
# Result: Tidy table ready for high-performance lookups
# Each age/duration combination becomes a separate row
The tidy format enables:
- Vectorized lookups: Query millions of age/duration combinations in a single operation
- Flexible filtering: Add conditions like gender, smoking status, or product type as additional columns
- Consistent API: Same lookup pattern works for all assumption types
- Memory efficiency: No duplicate storage of rates across multiple table formats
Loading Different Table Types¶
Curve Tables (1-Dimensional)¶
For simple tables with one key and one value:
# Lapse rates by policy duration
lapse_df = pl.DataFrame({
"policy_duration": [1, 2, 3, 4, 5],
"lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01]
})
lapse_table = gs.Table(
name="lapse_rates",
source=lapse_df,
dimensions={
"policy_duration": "policy_duration"
},
value="lapse_rate"
)
Wide Tables (Melt Dimensions)¶
For mortality tables and similar multi-dimensional assumptions:
# Mortality table with multiple gender/smoking combinations
mortality_table = gs.Table(
name="mortality_vbt_2015",
source="mortality.parquet",
dimensions={
"age-last": "age-last",
"variable": gs.MeltDimension(
columns=["MNS", "FNS", "MS", "FS"], # Male/Female, Non-Smoker/Smoker
name="variable"
)
},
value="mortality_rate"
)
Input DataFrame:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ age-last │ MNS │ FNS │ MS │ FS │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ 30 │ 0.0011 │ 0.0010 │ 0.0021 │ 0.0019 │
│ 31 │ 0.0012 │ 0.0011 │ 0.0022 │ 0.0020 │
└──────────┴──────────┴──────────┴──────────┴──────────┘
Automatic transformation to tidy format:
┌──────────┬──────────┬───────────────┐
│ age_last │ variable │ mortality_rate│
├──────────┼──────────┼───────────────┤
│ 30 │ MNS │ 0.0011 │
│ 30 │ FNS │ 0.0010 │
│ 30 │ MS │ 0.0021 │
│ 30 │ FS │ 0.0019 │
│ 31 │ MNS │ 0.0012 │
│ 31 │ FNS │ 0.0011 │
└──────────┴──────────┴───────────────┘
Tables with Overflow Columns¶
For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range:
# VBT 2015 table with durations 1-25 plus "Ult." column
vbt_table = gs.Table(
name="vbt_2015_female_smoker",
source="2015-VBT-FSM-ANB.csv",
dimensions={
"issue_age": "issue_age",
"duration": gs.MeltDimension(
columns=[str(i) for i in range(1, 26)] + ["Ult."],
name="duration",
overflow=gs.ExtendOverflow("Ult.", to_value=120)
)
},
value="qx"
)
This automatically creates lookup entries for durations 26, 27, 28, ... 120, all using the "Ultimate" rate from the original table.
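Once expanded, an out-of-range duration behaves exactly like an in-range one. A minimal sketch reusing the table defined above:
# Durations 26-120 resolve to the pre-expanded "Ult." entries
af = af.with_columns(
    vbt_table.lookup(issue_age=af["issue_age"], duration=af["duration"])
)
# A row with duration 80 receives the "Ult." rate for its issue age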
Performing Lookups¶
Single-Key Lookups¶
# Simple lapse rate lookup
af = af.with_columns(
    lapse_table.lookup(policy_duration=af["policy_duration"])
)
Multi-Key Lookups¶
# Mortality lookup with age and gender/smoking status
af = af.with_columns(
    mortality_table.lookup(
        age_last=af["age_last"],
        variable=af["gender_smoking"]
    )
)
Vector Lookups¶
The most powerful feature - handle entire projection vectors:
# Project 480 months for each policy
af = af.with_columns(
monthly_ages=af["issue_age"] + (af["projection_months"] / 12),
monthly_durations=af["policy_duration"] + (af["projection_months"] / 12)
)
# Single lookup returns 480 mortality rates per policy
af = af.with_columns(
mortality_table.lookup(
age=af["monthly_ages"],
duration=af["monthly_durations"]
)
)
Complete Model Example¶
Here's how assumption tables integrate into a complete actuarial model:
import gaspatchio_core as gs
import polars as pl
from gaspatchio_core import ActuarialFrame
def setup_assumptions():
    """Load all assumption tables for the model"""
    # Load mortality table (wide format with gender/smoking columns)
    mortality_df = pl.read_parquet("assumptions/mortality.parquet")
    mortality_table = gs.Table(
        name="mortality_rates",
        source=mortality_df,
        dimensions={
            "age_last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="mortality_rate"
    )
    # Load lapse curve (simple 1D table)
    lapse_df = pl.read_parquet("assumptions/lapse.parquet")
    lapse_table = gs.Table(
        name="lapse_rates",
        source=lapse_df,
        dimensions={
            "policy_duration": "policy_duration"
        },
        value="lapse_rate"
    )
    # Load premium rates (wide format)
    premium_df = pl.read_parquet("assumptions/premium_rates.parquet")
    premium_table = gs.Table(
        name="premium_rates",
        source=premium_df,
        dimensions={
            "age_last": "age-last",
            "variable": gs.MeltDimension(
                columns=["MNS", "FNS", "MS", "FS"],
                name="variable"
            )
        },
        value="premium_rate"
    )
    return mortality_table, lapse_table, premium_table

def life_model(policies_df):
    """Complete life insurance projection model"""
    # Setup assumption tables
    mortality_table, lapse_table, premium_table = setup_assumptions()
    # Create ActuarialFrame
    af = ActuarialFrame(policies_df)
    # Setup monthly projection vectors out to the maximum age
    max_age = 101
    af["num_proj_months"] = (max_age - af["age"]) * 12
    af["proj_months"] = af.fill_series(af["num_proj_months"], 0, 1)
    # Calculate age and duration vectors
    af["age_last"] = (af["age"] + (af["proj_months"] / 12)).floor()
    af["policy_duration"] = (af["policy_duration"] + (af["proj_months"] / 12)).floor()
    # Create gender/smoking variable for lookups
    af["variable"] = af["gender"] + af["smoking_status"]
    # Vector lookups - get rates for every projection month at once
    af["mortality_rate"] = mortality_table.lookup(
        age_last=af["age_last"],
        variable=af["variable"]
    )
    af["lapse_rate"] = lapse_table.lookup(
        policy_duration=af["policy_duration"]
    )
    af["premium_rate"] = premium_table.lookup(
        age_last=af["age_last"],
        variable=af["variable"]
    )
    # Calculate probabilities and cash flows
    af["monthly_persist_prob"] = (1 - af["mortality_rate"] / 12) * (1 - af["lapse_rate"] / 12)
    # Probability in force (cumulative product with shift)
    af["prob_in_force"] = af["monthly_persist_prob"].list.eval(
        pl.element().cum_prod().shift(1).fill_null(1.0)
    )
    # Cash flows
    af["premium_cf"] = af["premium_rate"] / 12 * af["prob_in_force"] * af["sum_assured"] / 1000
    af["claims_cf"] = af["prob_in_force"] * af["mortality_rate"] / 12 * af["sum_assured"]
    af["profit_cf"] = af["premium_cf"] - af["claims_cf"]
    return af

# Run the model
policies = pl.read_csv("model_points.csv")
results = life_model(policies)
Using the TableBuilder Pattern¶
For complex table configurations, use the TableBuilder pattern:
# Build a complex table step by step
table = (
gs.TableBuilder("complex_mortality")
.from_source("mortality_data.csv")
.with_data_dimension("issue_age", "issue_age")
.with_data_dimension("policy_year", "policy_year")
.with_computed_dimension(
"attained_age",
pl.col("issue_age") + pl.col("policy_year") - 1,
"attained_age"
)
.with_melt_dimension(
"duration",
columns=[str(i) for i in range(1, 26)] + ["Ultimate"],
overflow=gs.ExtendOverflow("Ultimate", to_value=100)
)
.with_value_column("mortality_rate")
.build()
)
Performance Benefits¶
The assumption system provides significant performance improvements:
1. Pre-Computed Expansion¶
Overflow columns are expanded once at load time, not during every lookup:
# Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately
table = gs.Table(
name="mortality",
source=df,
dimensions={
"age": "age",
"duration": gs.MeltDimension(
columns=duration_cols,
overflow=gs.ExtendOverflow("Ult.", to_value=120)
)
},
value="rate"
)
# All lookups are now O(1) hash operations - no overflow logic needed
af = af.with_columns(
    table.lookup(age=af["age"], duration=100)  # duration=100 works instantly
)
2. Vector-Native Operations¶
No exploding, joining, or reaggregating required:
# Traditional approach: explode 1M policies × 480 months = 480M rows
# Gaspatchio: 1M policies with 480-element vectors = 1M rows
# Single operation handles entire projection
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
3. Optimized Hash-Based Lookups¶
Built on Rust HashMaps for maximum performance:
- O(1) lookup time regardless of table size
- Efficient memory usage with pre-indexed structures
- Integrates with Polars' lazy evaluation for optimal query planning
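To sanity-check throughput on your own hardware, a simple wall-clock measurement is enough. A sketch, reusing names from the examples above:
import time

start = time.perf_counter()
af = af.with_columns(
    mortality_table.lookup(age=af["age_vector"])
)
# If your frame evaluates lazily, materialize results before stopping the clock
print(f"Lookups completed in {time.perf_counter() - start:.3f}s")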
API Reference¶
Table Class¶
gs.Table(
name: str, # Table name for lookups
source: str | pl.DataFrame, # File path or DataFrame
dimensions: dict[str, str | Dimension], # Dimension configuration
value: str = "rate", # Name for value column
metadata: dict | None = None, # Optional metadata storage
validate: bool = True # Enable validation
) -> Table
Parameters:
- name: Unique identifier for the table in the lookup registry
- source: Either a file path (.csv/.parquet) or a Polars DataFrame
- dimensions: Dictionary mapping dimension names to columns or Dimension objects
- value: Name of the value column in the final tidy table
- metadata: Optional dictionary stored with the table
- validate: Whether to validate dimension configuration
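Putting the optional parameters together (a minimal sketch; the metadata keys are arbitrary illustrations, not a required schema):
table = gs.Table(
    name="mortality_rates",
    source="assumptions/mortality.parquet",
    dimensions={"age": "age"},
    value="qx",
    metadata={"study": "VBT 2015", "vintage": "2024-Q1"},  # arbitrary example keys
    validate=True
)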
table.lookup()¶
table.lookup(
**dimensions: str | pl.Expr # Dimension names mapped to columns/expressions
) -> pl.Expr
Returns a Polars expression that performs the lookup. Use it within .with_columns() or similar Polars operations.
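Because the return value is an ordinary pl.Expr, it composes with other expressions before evaluation. A minimal sketch (the expected_claims column name is illustrative):
# Combine a lookup with other expressions in a single with_columns call
af = af.with_columns(
    expected_claims=mortality_table.lookup(age=af["age"]) * af["sum_assured"]
)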
Dimension Types¶
- DataDimension: Maps a column directly to a dimension
- MeltDimension: Transforms wide columns into long format
- CategoricalDimension: Adds a constant categorical value
- ComputedDimension: Creates a dimension from an expression
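A sketch of where each type fits in a dimensions dict. The CategoricalDimension and ComputedDimension argument shapes below are assumptions inferred from the TableBuilder example above, not confirmed signatures:
table = gs.Table(
    name="smoker_mortality",
    source=df,
    dimensions={
        "age": "age",  # string shorthand for a DataDimension
        "smoker": gs.CategoricalDimension("S"),  # assumed: tags every row with a constant value
        "attained_age": gs.ComputedDimension(
            pl.col("age") + pl.col("policy_year") - 1  # assumed: expression-derived dimension
        )
    },
    value="qx"
)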
Strategy Types¶
- ExtendOverflow: Extends a specific column value to higher indices
- AutoDetectOverflow: Automatically detects overflow columns
- LinearInterpolate: Fills gaps with linear interpolation
- FillConstant: Fills gaps with a constant value
- FillForward: Forward fills missing values
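ExtendOverflow and LinearInterpolate appear in the examples above. As a sketch, the fill strategies slot into the same MeltDimension fill parameter (the FillConstant argument shape is an assumption):
duration_dim = gs.MeltDimension(
    columns=["1", "2", "3", "5", "Ult."],  # note the gap at duration 4
    overflow=gs.ExtendOverflow("Ult.", to_value=120),
    fill=gs.FillConstant(0.0)  # assumed: fills the duration-4 gap with a constant rate
)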