Assumptions in Gaspatchio¶
Overview¶
Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically.
One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those.
Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest.
The Table API¶
Gaspatchio's assumption system revolves around the dimension-based Table API:
- Table() - Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run.
- table.lookup() - Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT.
import gaspatchio_core as gs
import polars as pl
# Load any assumption table with the Table API
mortality_table = gs.Table(
name="mortality_rates",
source="mortality_table.csv", # or DataFrame
dimensions={
"age": "age" # Simple string shorthand for data dimensions
},
value="mortality_rate"
)
# Use in projections with vector lookups
af.mort_rate = mortality_table.lookup(age=af.age_last)
Key Advantages¶
1. Dimension-Based Design¶
The API uses explicit dimensions for clarity and flexibility:
# Simple curve (1D table)
lapse_table = gs.Table(
name="lapse_rates",
source=lapse_df,
dimensions={
"duration": "duration" # Map duration column to duration dimension
},
value="lapse_rate"
)
# Wide table with melt dimension (age × duration grid)
mortality_table = gs.Table(
name="mortality_rates",
source=mortality_df,
dimensions={
"age": "age",
"duration": gs.assumptions.MeltDimension(
columns=["1", "2", "3", "4", "5", "Ultimate"],
name="duration",
overflow=gs.assumptions.ExtendOverflow("Ultimate", to_value=120)
)
},
value="qx"
)
# Multi-dimensional table
multi_dim_table = gs.Table(
name="vbt_2015",
source=vbt_df,
dimensions={
"age": "age",
"sex": "sex",
"smoker": "smoker_status" # Can map different column names
},
value="mortality_rate"
)
2. Automatic Format Detection with Analysis¶
Use the analyze_table() function to get insights and configuration suggestions:
# Analyze any table to understand its structure
schema = gs.assumptions.analyze_table(df)
print(schema.suggest_table_config())
# Output:
# Table(
# name="your_table_name",
# source=df,
# dimensions={
# "age": "age",
# "duration": MeltDimension(
# columns=["1", "2", "3", "4", "5", "Ultimate"],
# overflow=ExtendOverflow("Ultimate", to_value=120)
# )
# },
# value="rate"
# )
3. Smart Overflow Handling¶
Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly:
# Table with columns: Age, 1, 2, 3, 4, 5, "Ult."
mortality_table = gs.Table(
name="mortality_table",
source=df,
dimensions={
"age": "age",
"duration": gs.assumptions.MeltDimension(
columns=["1", "2", "3", "4", "5", "Ult."],
overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120), # Expands to duration 120
fill=gs.assumptions.LinearInterpolate() # Optional: interpolate gaps
)
},
value="rate"
)
# Lookups work seamlessly for any duration
af.mort_rate = mortality_table.lookup(age=af.age, duration=af.duration)
4. Vector-Native Performance¶
Handle entire projection vectors without loops or exploding data:
# Create projection timeline - each policy gets a list of months
af.month = af.date.create_projection_timeline(af.issue_date, af.maturity_date)
# Age progresses as a vector per policy (list column)
af.attained_age = af.issue_age + af.month // 12 # e.g., [30, 30, 30, ..., 31, 31, ...]
# Single lookup returns vector of rates for all ages
af.mort_rate = mortality_table.lookup(age=af.attained_age)
# Result: [0.0011, 0.0011, 0.0011, ..., 0.0012, 0.0012, ...]
Rust-Powered Multi-Core Performance
Gaspatchio's assumption system is implemented in Rust and uses an adaptive storage strategy that automatically selects the optimal backend for each table:
Array Storage (default for dense tables):
- ~3ns per lookup via direct array indexing
- Dictionary-encoded keys enable O(1) index computation
- Perfect cache locality with contiguous memory access
- GPU-ready: same arrays work seamlessly on CPU and GPU
Hash Storage (fallback for sparse tables):
- O(1) average-case lookups via Rust's AHashMap
- ~20ns per lookup (hash computation + bucket probe)
- Memory-efficient for tables with many missing combinations
Gaspatchio automatically chooses array storage when tables are >30% dense (most actuarial tables are 70%+ dense). A typical mortality table [3 table_ids × 101 ages × 25 durations] = 7,575 values = 60KB - trivially small!
# This operation uses array indexing on ALL CPU cores
af.mort_rate = mortality_table.lookup(age=af.attained_age)
# 324M lookups completed in ~1 second, not 27 seconds
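If you want intuition for the density heuristic, here's a minimal sketch of the calculation, assuming a Polars DataFrame. The function name and threshold argument are illustrative, not Gaspatchio's internal API:

```python
import math

import polars as pl

# Hypothetical density check - illustrative only, the real selection
# logic lives in Gaspatchio's Rust core.
def is_array_suitable(df: pl.DataFrame, key_cols: list[str], threshold: float = 0.30) -> bool:
    # Dense key space = product of each key column's cardinality
    key_space = math.prod(df[c].n_unique() for c in key_cols)
    return df.height / key_space > threshold

# A fully populated 3 x 101 x 25 mortality table has density 1.0 -> array storage
```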
Tidy Data Principles¶
Following Tidy Data Best Practices
Gaspatchio's assumption system is built around the tidy data principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10).
Tidy datasets follow three fundamental rules:
- Each variable is a column - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns
- Each observation is a row - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023)
- Each type of observational unit is a table - mortality assumptions, lapse assumptions, etc. are separate tables
Why Tidy Assumptions Matter¶
Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation:
Wide Format (Human-Readable)
┌─────┬──────┬──────┬──────┬──────┐
│ Age │ 1 │ 2 │ 3 │ Ult. │
├─────┼──────┼──────┼──────┼──────┤
│ 30 │0.001 │0.002 │0.003 │0.005 │
│ 31 │0.001 │0.002 │0.003 │0.005 │
└─────┴──────┴──────┴──────┴──────┘
Tidy Format (Machine-Optimized)
┌─────┬──────────┬───────┐
│ Age │ Duration │ Rate │
├─────┼──────────┼───────┤
│ 30 │ 1 │ 0.001 │
│ 30 │ 2 │ 0.002 │
│ 30 │ 3 │ 0.003 │
│ 30 │ 120 │ 0.005 │
│ 31 │ 1 │ 0.001 │
└─────┴──────────┴───────┘
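For reference, this wide-to-tidy step is a standard unpivot. Here's a minimal sketch of doing it by hand in Polars (older Polars versions call this melt) - exactly the chore that MeltDimension, shown in the next section, automates:

```python
import polars as pl

wide = pl.DataFrame({
    "Age": [30, 31],
    "1": [0.001, 0.001],
    "2": [0.002, 0.002],
    "3": [0.003, 0.003],
    "Ult.": [0.005, 0.005],
})

# Unpivot the duration columns into (Age, Duration, Rate) rows
tidy = wide.unpivot(
    index="Age",
    on=["1", "2", "3", "Ult."],
    variable_name="Duration",
    value_name="Rate",
)
```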
Automatic Tidy Transformation¶
The Table class with MeltDimension automatically converts wide tables to tidy format:
# Input: Wide mortality table
wide_table = pl.DataFrame({
"age": [30, 31, 32],
"1": [0.0011, 0.0012, 0.0013],
"2": [0.0012, 0.0013, 0.0014],
"3": [0.0013, 0.0014, 0.0015],
"Ult.": [0.0050, 0.0051, 0.0052]
})
# Automatic tidy transformation
mortality_table = gs.Table(
name="mortality",
source=wide_table,
dimensions={
"age": "age",
"duration": gs.assumptions.MeltDimension(
columns=["1", "2", "3", "Ult."],
name="duration"
)
},
value="rate"
)
# Result: Tidy table ready for high-performance lookups
# Each age/duration combination becomes a separate row
The tidy format enables:
- Vectorized lookups: Query millions of age/duration combinations in microseconds
- Flexible filtering: Add conditions like gender, smoking status, or product type as additional columns
- Consistent API: Same lookup pattern works for all assumption types
- Memory efficiency: No duplicate storage of rates across multiple table formats
Loading Different Table Types¶
Curve Tables (1-Dimensional)¶
For simple tables with one key and one value:
# Lapse rates by policy duration
lapse_df = pl.DataFrame({
"policy_duration": [1, 2, 3, 4, 5],
"lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01]
})
lapse_table = gs.Table(
name="lapse_rates",
source=lapse_df,
dimensions={
"policy_duration": "policy_duration"
},
value="lapse_rate"
)
Wide Tables (Age × Duration Grids)¶
For mortality tables and similar multi-dimensional assumptions:
# Mortality table with multiple gender/smoking combinations
mortality_table = gs.Table(
name="mortality_vbt_2015",
source="mortality.parquet",
dimensions={
"age-last": "age-last",
"variable": gs.assumptions.MeltDimension(
columns=["MNS", "FNS", "MS", "FS"], # Male/Female, Non-Smoker/Smoker
name="variable"
)
},
value="mortality_rate"
)
Input DataFrame:
┌──────────┬──────────┬──────────┬──────────┬──────────┐
│ age-last │ MNS │ FNS │ MS │ FS │
├──────────┼──────────┼──────────┼──────────┼──────────┤
│ 30 │ 0.0011 │ 0.0010 │ 0.0021 │ 0.0019 │
│ 31 │ 0.0012 │ 0.0011 │ 0.0022 │ 0.0020 │
└──────────┴──────────┴──────────┴──────────┴──────────┘
Automatic transformation to tidy format:
┌──────────┬──────────┬───────────────┐
│ age-last │ variable │ mortality_rate│
├──────────┼──────────┼───────────────┤
│ 30 │ MNS │ 0.0011 │
│ 30 │ FNS │ 0.0010 │
│ 30 │ MS │ 0.0021 │
│ 30 │ FS │ 0.0019 │
│ 31 │ MNS │ 0.0012 │
│ 31 │ FNS │ 0.0011 │
└──────────┴──────────┴───────────────┘
Tables with Overflow Columns¶
For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range:
# VBT 2015 table with durations 1-25 plus "Ult." column
vbt_table = gs.Table(
name="vbt_2015_female_smoker",
source="2015-VBT-FSM-ANB.csv",
dimensions={
"issue_age": "issue_age",
"duration": gs.assumptions.MeltDimension(
columns=[str(i) for i in range(1, 26)] + ["Ult."],
name="duration",
overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120)
)
},
value="qx"
)
This automatically creates lookup entries for durations 26, 27, 28, ... 120, all using the "Ultimate" rate from the original table.
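Conceptually, the expansion is equivalent to this Polars sketch, assuming a melted tidy frame named `tidy`; the real work happens once inside Table at load time:

```python
import polars as pl

# `tidy` stands for the melted VBT frame: issue_age / duration / qx,
# where duration holds the strings "1".."25" plus "Ult."
ult_rows = tidy.filter(pl.col("duration") == "Ult.")

# Replicate the ultimate qx once per duration from 26 through 120
expanded = pl.concat(
    [ult_rows.with_columns(pl.lit(str(d)).alias("duration")) for d in range(26, 121)]
)
tidy_extended = pl.concat([tidy.filter(pl.col("duration") != "Ult."), expanded])
```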
Performing Lookups¶
Single-Key Lookups¶
# Simple lapse rate lookup
af.lapse_rate = lapse_table.lookup(policy_duration=af.policy_duration)
Multi-Key Lookups¶
# Mortality lookup with age and gender/smoking status
af.mort_rate = mortality_table.lookup(age_last=af.age_last, variable=af.gender_smoking)
Vector Lookups¶
The most powerful feature - handle entire projection vectors:
# Create projection timeline (list of months per policy)
af.month = af.date.create_projection_timeline(af.issue_date, af.maturity_date)
# Calculate age and duration as projection vectors
af.attained_age = af.issue_age + af.month // 12
af.duration = af.month // 12
# Single lookup returns mortality rates for all timesteps at once
af.mort_rate = mortality_table.lookup(age=af.attained_age, duration=af.duration)
Note the idiomatic patterns:
- Attribute notation (af.column) instead of bracket notation
- Direct assignment rather than intermediate with_columns calls
- Inline expressions in lookups when simple enough
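For comparison, here's the same lookup written both ways; the with_columns form is the assumed verbose equivalent:

```python
# Idiomatic Gaspatchio: attribute access plus direct assignment
af.mort_rate = mortality_table.lookup(age=af.attained_age, duration=af.duration)

# Assumed verbose equivalent in plain Polars style - works, but noisier:
# af = af.with_columns(
#     mortality_table.lookup(age=af.attained_age, duration=af.duration).alias("mort_rate")
# )
```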
Inspecting Tables with describe()¶
The describe() method provides a human-readable summary of a table's structure. This is useful for debugging lookup issues, generating documentation, and understanding table configuration during model development.
Simple Table Example¶
import polars as pl
from gaspatchio_core.assumptions import Table
# Create a simple 1D lapse rate table
lapse_data = pl.DataFrame({
"duration": [1, 2, 3, 4, 5],
"lapse_rate": [0.08, 0.06, 0.04, 0.03, 0.02]
})
lapse_table = Table(
name="lapse_rates",
source=lapse_data,
dimensions={"duration": "duration"},
value="lapse_rate"
)
print(lapse_table.describe())
Output:
Table: lapse_rates
Rows: 5
Storage mode: array
Value column: lapse_rate
Key columns (1): duration
Dimensions (1): duration
- duration: DataDimension
Multi-Dimensional Table Example¶
import polars as pl
from gaspatchio_core.assumptions import Table, MeltDimension, ExtendOverflow
# Create a wide mortality table (age × duration grid)
mortality_data = pl.DataFrame({
"age": [30, 31, 32, 33],
"1": [0.0010, 0.0011, 0.0012, 0.0013],
"2": [0.0011, 0.0012, 0.0013, 0.0014],
"3": [0.0012, 0.0013, 0.0014, 0.0015],
"Ult.": [0.0050, 0.0051, 0.0052, 0.0053]
})
mortality_table = Table(
name="mortality_select",
source=mortality_data,
dimensions={
"age": "age",
"duration": MeltDimension(
columns=["1", "2", "3", "Ult."],
name="duration",
overflow=ExtendOverflow("Ult.", to_value=10)
)
},
value="qx"
)
print(mortality_table.describe())
Output:
Table: mortality_select
Rows: 40
Storage mode: array
Value column: qx
Key columns (2): age, duration
Dimensions (2): age, duration
- age: DataDimension
- duration: MeltDimension
Notice the row count is 40 (4 ages × 10 durations) because ExtendOverflow expanded the "Ult." column to durations 4 through 10.
What describe() Tells You¶
| Field | Description |
|---|---|
| Table | The table name used for identification |
| Rows | Total row count after tidy transformation and overflow expansion |
| Storage mode | array (fast, dense) or hash (flexible, sparse) |
| Value column | The column containing lookup values |
| Key columns | Columns used for indexing (excludes value column) |
| Dimensions | Configured dimensions with their types |
The dimension types help you understand how each dimension was configured:
- DataDimension - Direct column mapping
- MeltDimension - Wide-to-long transformation (unpivoted columns)
- CategoricalDimension - Constant categorical value
- ComputedDimension - Derived from an expression
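Only DataDimension and MeltDimension appear in the examples above, so here's a hedged sketch of the other two. The constructor arguments are assumptions inferred from the TableBuilder example later on this page - verify them against your installed version:

```python
import polars as pl
import gaspatchio_core as gs

smoker_df = pl.DataFrame({
    "age": [30, 31],
    "issue_age": [30, 31],
    "policy_year": [1, 1],
    "qx": [0.0021, 0.0022],
})

# Constructor shapes below are assumptions - check before relying on them.
smoker_table = gs.Table(
    name="mortality_smoker_only",
    source=smoker_df,
    dimensions={
        "age": "age",
        # Tag every row with the constant category "S"
        "smoker": gs.assumptions.CategoricalDimension("S", name="smoker"),
        # Derive a lookup key from an expression over source columns
        "attained_age": gs.assumptions.ComputedDimension(
            pl.col("issue_age") + pl.col("policy_year") - 1, name="attained_age"
        ),
    },
    value="qx",
)
```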
Complete Model Example¶
Here's how assumption tables integrate into a complete actuarial model:
import gaspatchio_core as gs
import polars as pl
from gaspatchio_core import ActuarialFrame
def setup_assumptions():
"""Load all assumption tables for the model"""
# Load mortality table (wide format with MNS, FNS, MS, FS columns)
mortality_df = pl.read_parquet("assumptions/mortality.parquet")
mortality_table = gs.Table(
name="mortality_rates",
source=mortality_df,
dimensions={
"age_last": "age_last",
"variable": gs.assumptions.MeltDimension(
columns=["MNS", "FNS", "MS", "FS"],
name="variable"
)
},
value="mortality_rate"
)
# Load lapse curve (simple 1D table)
lapse_df = pl.read_parquet("assumptions/lapse.parquet")
lapse_table = gs.Table(
name="lapse_rates",
source=lapse_df,
dimensions={
"policy_duration": "policy_duration"
},
value="lapse_rate"
)
# Load premium rates (wide format)
premium_df = pl.read_parquet("assumptions/premium_rates.parquet")
premium_table = gs.Table(
name="premium_rates",
source=premium_df,
dimensions={
"age_last": "age_last",
"variable": gs.assumptions.MeltDimension(
columns=["MNS", "FNS", "MS", "FS"],
name="variable"
)
},
value="premium_rate"
)
return mortality_table, lapse_table, premium_table
def life_model(policies_df):
"""Complete life insurance projection model"""
# Setup assumption tables
mortality_table, lapse_table, premium_table = setup_assumptions()
# Create ActuarialFrame
af = ActuarialFrame(policies_df)
# Create projection timeline using fill_series
# Calculate projection length based on policy term (project until age 70)
max_age = 70
af.num_proj_months = (max_age - af.issue_age) * 12
af.month = af.fill_series(af.num_proj_months, start=0, increment=1)
# Calculate indexing columns as projection vectors
af.attained_age = af.issue_age + af.month // 12
af.duration = af.month // 12
# Create gender/smoking variable for lookups
af.variable = af.gender + af.smoking_status
# Vector lookups - get rates for all timesteps at once
af.mort_rate = mortality_table.lookup(age_last=af.attained_age, variable=af.variable)
af.lapse_rate = lapse_table.lookup(policy_duration=af.duration)
af.premium_rate = premium_table.lookup(age_last=af.attained_age, variable=af.variable)
# Calculate monthly persistence probability
af.monthly_persist = (1 - af.mort_rate / 12) * (1 - af.lapse_rate / 12)
# Probability in force using projection accessor
af.pols_if = af.monthly_persist.projection.cumulative_survival()
# Cash flows
af.premium_cf = af.premium_rate / 12 * af.pols_if * af.sum_assured / 1000
af.claims_cf = af.pols_if * af.mort_rate / 12 * af.sum_assured
af.profit_cf = af.premium_cf - af.claims_cf
return af
# Run the model
policies = pl.read_csv("model_points.csv")
results = life_model(policies)
Using the TableBuilder Pattern¶
For complex table configurations, use the TableBuilder pattern:
# Build a complex table step by step
table = (
gs.TableBuilder("complex_mortality")
.from_source("mortality_data.csv")
.with_data_dimension("issue_age", "issue_age")
.with_data_dimension("policy_year", "policy_year")
.with_computed_dimension(
"attained_age",
pl.col("issue_age") + pl.col("policy_year") - 1,
"attained_age"
)
.with_melt_dimension(
"duration",
columns=[str(i) for i in range(1, 26)] + ["Ultimate"],
overflow=gs.assumptions.ExtendOverflow("Ultimate", to_value=100)
)
.with_value_column("mortality_rate")
.build()
)
Performance Benefits¶
The assumption system provides significant performance improvements through intelligent storage selection and optimized data structures.
1. Adaptive Storage: Array vs Hash¶
Gaspatchio automatically selects the optimal storage backend based on table density:
| Storage | Lookup Time | Best For | Memory |
|---|---|---|---|
| Array | ~3ns | Dense tables (>30% filled) | Proportional to key range |
| Hash | ~20ns | Sparse tables (<30% filled) | Proportional to entries |
How Array Storage Works:
String keys (like table_id: "A", "B", "C") are dictionary-encoded to integers (0, 1, 2). Integer keys (age 0-100, duration 0-24) are used directly. The linear index is computed as:
index = table_idx × 101 × 25 + age × 25 + duration
value = data[index] # Direct O(1) array access
This eliminates hash computation entirely - just multiplication, addition, and array indexing.
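A quick worked example using the 3 × 101 × 25 table from earlier:

```python
# Strides for [3 table_ids x 101 ages x 25 durations]:
# table stride = 101 * 25, age stride = 25, duration stride = 1
table_idx, age, duration = 1, 30, 4
index = table_idx * 101 * 25 + age * 25 + duration  # 2525 + 750 + 4 = 3279
# data[3279] holds the rate: two multiplies, two adds, one load - no hashing
```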
Real-World Table Analysis:
| Lookup | Keys | Density | Array Suitable? |
|---|---|---|---|
| `mortality_select` | table_id, age, duration | ~30% | ✅ Yes |
| `lapse_rates` | lapse_id, duration | ~80% | ✅ Yes |
| `inv_returns` | scenario_id, t, fund_index | ~95% | ✅ Yes |
Most actuarial tables benefit from array storage automatically.
2. Pre-Computed Expansion¶
Overflow columns are expanded once at load time, not during every lookup:
# Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately
table = gs.Table(
name="mortality",
source=df,
dimensions={
"age": "age",
"duration": gs.assumptions.MeltDimension(
columns=duration_cols,
overflow=gs.assumptions.ExtendOverflow("Ult.", to_value=120)
)
},
value="rate"
)
# All lookups are now O(1) array operations - no overflow logic needed
af.mort_rate = table.lookup(age=af.age, duration=100) # duration=100 works instantly
3. Vector-Native Operations¶
No exploding, joining, or reaggregating required:
# Traditional approach: explode 1M policies × 480 months = 480M rows
# Gaspatchio: 1M policies with 480-element vectors = 1M rows
# Single operation handles entire projection
af.mort_rate = mortality_table.lookup(age=af.attained_age)
4. Benchmark: 100k Policies × 3 Scenarios¶
A real-world benchmark with 100,000 model points, 3 scenarios, and 180 timesteps (324 million total lookups):
| Approach | Lookup Time | Total Model Time | Speedup |
|---|---|---|---|
| Hash storage | ~27s | ~43s | 1x |
| Array storage | ~1s | ~16s | 2.7x overall |
Per the table above, the lookup portion alone is ~27x faster (from ~27s down to ~1s), with the overall model running 2.7x faster.
5. GPU-Ready Architecture¶
The array storage strategy is designed with GPU acceleration in mind. Once tables are stored as dense arrays, GPU execution becomes trivial:
# Same arrays work on both CPU and GPU
# JAX/XLA automatically parallelizes across thousands of GPU threads
mort_rate = mort_array[table_idx, age, duration] # Direct array indexing
No GPU hash tables required - the array structure enables massive parallelism automatically.
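As an illustration (not Gaspatchio's actual GPU code path), the same gather expressed in JAX runs unchanged on CPU or GPU:

```python
import jax
import jax.numpy as jnp

# Dense rate cube: (table_id, age, duration)
mort_array = jnp.zeros((3, 101, 25))

table_idx = jnp.array([0, 1, 2])
ages = jnp.array([30, 45, 60])
durations = jnp.array([0, 5, 10])

# XLA compiles the fancy indexing into a single gather kernel,
# parallelized across GPU threads when a GPU is present
lookup = jax.jit(lambda t, a, d: mort_array[t, a, d])
rates = lookup(table_idx, ages, durations)
```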
API Reference¶
Table Class¶
gs.Table(
name: str, # Table name for lookups
source: str | pl.DataFrame, # File path or DataFrame
dimensions: dict[str, str | Dimension], # Dimension configuration
value: str = "rate", # Name for value column
metadata: dict | None = None, # Optional metadata storage
validate: bool = True # Enable validation
) -> Table
Parameters:
- name: Unique identifier for the table in the lookup registry
- source: Either a file path (.csv/.parquet) or a Polars DataFrame
- dimensions: Dictionary mapping dimension names to columns or Dimension objects
- value: Name of the value column in the final tidy table
- metadata: Optional dictionary stored with the table
- validate: Whether to validate dimension configuration
table.lookup()¶
table.lookup(
**dimensions: str | pl.Expr # Dimension names mapped to columns/expressions
) -> pl.Expr
Returns a Polars expression that performs the lookup. Use with attribute assignment (e.g., af.rate = table.lookup(...)).
Dimension Types¶
- DataDimension: Maps a column directly to a dimension
- MeltDimension: Transforms wide columns into long format
- CategoricalDimension: Adds a constant categorical value
- ComputedDimension: Creates a dimension from an expression
Strategy Types¶
- ExtendOverflow: Extends a specific column value to higher indices
- AutoDetectOverflow: Automatically detects overflow columns
- LinearInterpolate: Fills gaps with linear interpolation
- FillConstant: Fills gaps with a constant value
- FillForward: Forward fills missing values