# Gaspatchio Gaspatchio is an actuarial modelling framework that allows you to build and run actuarial models in pure Python but built for use with LLMs from the ground up. # Concepts documentation # Assumptions in Gaspatchio ## Overview Actuarial models rely heavily on assumption tables - mortality rates, lapse rates, expense assumptions, and other factors that drive projections. Gaspatchio provides a high-performance vector-based lookup system with a simple, intuitive API that handles the complexities of assumption table loading and transformation automatically. One core principle of Gaspatchio is to "meet people where they are". With regard to assumption tables, this means recognizing that you've likely already got a table in a format that you like. That might come from Excel, another system, regulatory requirements, or a combination of all of those. Gaspatchio's assumption system is designed to work with any table format and will automatically transform it into a format that is optimized for performance. That's what "meeting people where they are" means to us. Keep your data as it is, and let Gaspatchio do the rest. ## The Table API Gaspatchio's assumption system revolves around the dimension-based Table API: - **`Table()`** - Load and register assumption tables with automatic format detection and dimension configuration. This happens ONCE before you start your projection/run. - **`table.lookup()`** - Perform high-performance vector lookups. This happens for each policy/projection period. Which is A LOT. ```python import gaspatchio_core as gs import polars as pl # Load any assumption table with the Table API mortality_table = gs.Table( name="mortality_rates", source="mortality_table.csv", # or DataFrame dimensions={ "age": "age" # Simple string shorthand for data dimensions }, value="mortality_rate" ) # Use in projections with vector lookups af = af.with_columns( mortality_table.lookup(age=af["age_last"]) ) ``` ## Key Advantages ### 1. **Dimension-Based Design** The API uses explicit dimensions for clarity and flexibility: ```python # Simple curve (1D table) lapse_table = gs.Table( name="lapse_rates", source=lapse_df, dimensions={ "duration": "duration" # Map duration column to duration dimension }, value="lapse_rate" ) # Wide table with melt dimension (age × duration grid) mortality_table = gs.Table( name="mortality_rates", source=mortality_df, dimensions={ "age": "age", "duration": gs.MeltDimension( columns=["1", "2", "3", "4", "5", "Ultimate"], name="duration", overflow=gs.ExtendOverflow("Ultimate", to_value=120) ) }, value="qx" ) # Multi-dimensional table multi_dim_table = gs.Table( name="vbt_2015", source=vbt_df, dimensions={ "age": "age", "sex": "sex", "smoker": "smoker_status" # Can map different column names }, value="mortality_rate" ) ``` ### 2. **Automatic Format Detection with Analysis** Use the `analyze_table()` function to get insights and configuration suggestions: ```python # Analyze any table to understand its structure schema = gs.analyze_table(df) print(schema.suggest_table_config()) # Output: # Table( # name="your_table_name", # source=df, # dimensions={ # "age": "age", # "duration": MeltDimension( # columns=["1", "2", "3", "4", "5", "Ultimate"], # overflow=ExtendOverflow("Ultimate", to_value=120) # ) # }, # value="rate" # ) ``` ### 3. **Smart Overflow Handling** Wide tables often have "Ultimate" or overflow columns for durations beyond the explicit range. The API handles this explicitly: ```python # Table with columns: Age, 1, 2, 3, 4, 5, "Ult." 
mortality_table = gs.Table( name="mortality_table", source=df, dimensions={ "age": "age", "duration": gs.MeltDimension( columns=["1", "2", "3", "4", "5", "Ult."], overflow=gs.ExtendOverflow("Ult.", to_value=120), # Expands to duration 120 fill=gs.LinearInterpolate() # Optional: interpolate gaps ) }, value="rate" ) # Lookups work seamlessly for any duration af = af.with_columns( mortality_table.lookup(age=af["age"], duration=af["duration"]) ) ``` ### 4. **Vector-Native Performance** Handle entire projection vectors without loops or exploding data: ```python # Age progresses as a vector per policy df = df.with_columns( age_vector=[[30, 31, 32, 33, ...]] # 480 months of ages ) # Single lookup returns vector of rates for all ages df = df.with_columns( mortality_table.lookup(age=pl.col("age_vector")) ) # Result: [0.0011, 0.0012, 0.0013, ...] ``` Rust-Powered Multi-Core Performance Gaspatchio's assumption system is **implemented in Rust** and leverages **all available CPU cores** automatically. The core registry (`PyAssumptionTableRegistry`) stores lookup indices as optimized Rust `HashMap` structures, providing: - **O(1) hash-based lookups** regardless of table size - **Zero-copy memory access** through Rust's ownership system - **Automatic parallelization** via Polars' multi-threaded query engine - **SIMD vectorization** for mathematical operations on assumption vectors When you perform a lookup on 1 million policies with 480-month projections (480M total lookups), Gaspatchio distributes the work across all CPU cores simultaneously. A 16-core machine can process assumption lookups **16x faster** than traditional single-threaded approaches. ```python # This single operation uses ALL your CPU cores af = af.with_columns( mortality_table.lookup(age=af["age_vector"]) ) # 480M lookups completed in seconds, not minutes ``` ## Tidy Data Principles Following Tidy Data Best Practices Gaspatchio's assumption system is built around the **tidy data** principles outlined by Hadley Wickham in his seminal 2014 paper "Tidy Data" (Journal of Statistical Software, 59(10), 1–23. https://doi.org/10.18637/jss.v059.i10). Tidy datasets follow three fundamental rules: 1. **Each variable is a column** - keys (age, duration, gender) and values (mortality rates, lapse rates) are separate columns 1. **Each observation is a row** - each row represents one lookup combination (e.g., age 30 + duration 5 = rate 0.0023) 1. **Each type of observational unit is a table** - mortality assumptions, lapse assumptions, etc. are separate tables ### Why Tidy Assumptions Matter Traditional actuarial tables are often stored in "wide" format - convenient for human reading but inefficient for computation: **Wide Format (Human-Readable)** ```text ┌─────┬──────┬──────┬──────┬──────┐ │ Age │ 1 │ 2 │ 3 │ Ult. 
│ ├─────┼──────┼──────┼──────┼──────┤ │ 30 │0.001 │0.002 │0.003 │0.005 │ │ 31 │0.001 │0.002 │0.003 │0.005 │ └─────┴──────┴──────┴──────┴──────┘ ``` **Tidy Format (Machine-Optimized)** ```text ┌─────┬──────────┬───────┐ │ Age │ Duration │ Rate │ ├─────┼──────────┼───────┤ │ 30 │ 1 │ 0.001 │ │ 30 │ 2 │ 0.002 │ │ 30 │ 3 │ 0.003 │ │ 30 │ 120 │ 0.005 │ │ 31 │ 1 │ 0.001 │ └─────┴──────────┴───────┘ ``` ### Automatic Tidy Transformation The `Table` class with `MeltDimension` automatically converts wide tables to tidy format: ```python # Input: Wide mortality table wide_table = pl.DataFrame({ "age": [30, 31, 32], "1": [0.0011, 0.0012, 0.0013], "2": [0.0012, 0.0013, 0.0014], "3": [0.0013, 0.0014, 0.0015], "Ult.": [0.0050, 0.0051, 0.0052] }) # Automatic tidy transformation mortality_table = gs.Table( name="mortality", source=wide_table, dimensions={ "age": "age", "duration": gs.MeltDimension( columns=["1", "2", "3", "Ult."], name="duration" ) }, value="rate" ) # Result: Tidy table ready for high-performance lookups # Each age/duration combination becomes a separate row ``` The tidy format enables: - **Vectorized lookups**: Query millions of age/duration combinations in microseconds - **Flexible filtering**: Add conditions like gender, smoking status, or product type as additional columns - **Consistent API**: Same lookup pattern works for all assumption types - **Memory efficiency**: No duplicate storage of rates across multiple table formats ## Loading Different Table Types ### Curve Tables (1-Dimensional) For simple tables with one key and one value: ```python # Lapse rates by policy duration lapse_df = pl.DataFrame({ "policy_duration": [1, 2, 3, 4, 5], "lapse_rate": [0.05, 0.04, 0.03, 0.02, 0.01] }) lapse_table = gs.Table( name="lapse_rates", source=lapse_df, dimensions={ "policy_duration": "policy_duration" }, value="lapse_rate" ) ``` ### Wide Tables (Age × Duration Grids) For mortality tables and similar multi-dimensional assumptions: ```python # Mortality table with multiple gender/smoking combinations mortality_table = gs.Table( name="mortality_vbt_2015", source="mortality.parquet", dimensions={ "age-last": "age-last", "variable": gs.MeltDimension( columns=["MNS", "FNS", "MS", "FS"], # Male/Female, Non-Smoker/Smoker name="variable" ) }, value="mortality_rate" ) ``` **Input DataFrame:** ```text ┌──────────┬──────────┬──────────┬──────────┬──────────┐ │ age-last │ MNS │ FNS │ MS │ FS │ ├──────────┼──────────┼──────────┼──────────┼──────────┤ │ 30 │ 0.0011 │ 0.0010 │ 0.0021 │ 0.0019 │ │ 31 │ 0.0012 │ 0.0011 │ 0.0022 │ 0.0020 │ └──────────┴──────────┴──────────┴──────────┴──────────┘ ``` **Automatic transformation to tidy format:** ```text ┌──────────┬──────────┬───────────────┐ │ age-last │ variable │ mortality_rate│ ├──────────┼──────────┼───────────────┤ │ 30 │ MNS │ 0.0011 │ │ 30 │ FNS │ 0.0010 │ │ 30 │ MS │ 0.0021 │ │ 30 │ FS │ 0.0019 │ │ 31 │ MNS │ 0.0012 │ │ 31 │ FNS │ 0.0011 │ └──────────┴──────────┴───────────────┘ ``` ### Tables with Overflow Columns For tables with "Ultimate" or "Term" columns representing rates beyond the explicit duration range: ```python # VBT 2015 table with durations 1-25 plus "Ult." column vbt_table = gs.Table( name="vbt_2015_female_smoker", source="2015-VBT-FSM-ANB.csv", dimensions={ "issue_age": "issue_age", "duration": gs.MeltDimension( columns=[str(i) for i in range(1, 26)] + ["Ult."], name="duration", overflow=gs.ExtendOverflow("Ult.", to_value=120) ) }, value="qx" ) ``` This automatically creates lookup entries for durations 26, 27, 28, ... 
120, all using the "Ultimate" rate from the original table. ## Performing Lookups ### Single-Key Lookups ```python # Simple lapse rate lookup af = af.with_columns( lapse_table.lookup({"policy_duration": af["policy_duration"]}) ) ``` ### Multi-Key Lookups ```python # Mortality lookup with age and gender/smoking status af = af.with_columns( mortality_table.lookup({ "age_last": af["age_last"], "variable": af["gender_smoking"] }) ) ``` ### Vector Lookups The most powerful feature - handle entire projection vectors: ```python # Project 480 months for each policy af = af.with_columns( monthly_ages=af["issue_age"] + (af["projection_months"] / 12), monthly_durations=af["policy_duration"] + (af["projection_months"] / 12) ) # Single lookup returns 480 mortality rates per policy af = af.with_columns( mortality_table.lookup( age=af["monthly_ages"], duration=af["monthly_durations"] ) ) ``` ## Complete Model Example Here's how assumption tables integrate into a complete actuarial model: ```python import gaspatchio_core as gs import polars as pl from gaspatchio_core import ActuarialFrame def setup_assumptions(): """Load all assumption tables for the model""" # Load mortality table (wide format with overflow) mortality_df = pl.read_parquet("assumptions/mortality.parquet") mortality_table = gs.Table( name="mortality_rates", source=mortality_df, dimensions={ "age-last": "age-last", "variable": gs.MeltDimension( columns=["MNS", "FNS", "MS", "FS"], name="variable" ) }, value="mortality_rate" ) # Load lapse curve (simple 1D table) lapse_df = pl.read_parquet("assumptions/lapse.parquet") lapse_table = gs.Table( name="lapse_rates", source=lapse_df, dimensions={ "policy_duration": "policy_duration" }, value="lapse_rate" ) # Load premium rates (wide format) premium_df = pl.read_parquet("assumptions/premium_rates.parquet") premium_table = gs.Table( name="premium_rates", source=premium_df, dimensions={ "age-last": "age-last", "variable": gs.MeltDimension( columns=["MNS", "FNS", "MS", "FS"], name="variable" ) }, value="premium_rate" ) return mortality_table, lapse_table, premium_table def life_model(policies_df): """Complete life insurance projection model""" # Setup assumption tables mortality_table, lapse_table, premium_table = setup_assumptions() # Create ActuarialFrame af = ActuarialFrame(policies_df) # Setup projection vectors (480 months per policy) max_age = 101 af["num_proj_months"] = (max_age - af["age"]) * 12 af["proj_months"] = af.fill_series(af["num_proj_months"], 0, 1) # Calculate age and duration vectors af["age_last"] = (af["age"] + (af["proj_months"] / 12)).floor() af["policy_duration"] = (af["policy_duration"] + (af["proj_months"] / 12)).floor() # Create gender/smoking variable for lookups af["variable"] = af["gender"] + af["smoking_status"] # Vector lookups - get rates for all 480 months at once af["mortality_rate"] = mortality_table.lookup({ "age_last": af["age_last"], "variable": af["variable"] }) af["lapse_rate"] = lapse_table.lookup({ "policy_duration": af["policy_duration"] }) af["premium_rate"] = premium_table.lookup({ "age_last": af["age_last"], "variable": af["variable"] }) # Calculate probabilities and cash flows af["monthly_persist_prob"] = (1 - af["mortality_rate"] / 12) * (1 - af["lapse_rate"] / 12) # Probability in force (cumulative product with shift) af["prob_in_force"] = af["monthly_persist_prob"].list.eval( pl.element().cum_prod().shift(1).fill_null(1.0) ) # Cash flows af["premium_cf"] = af["premium_rate"] / 12 * af["prob_in_force"] * af["sum_assured"] / 1000 af["claims_cf"] = 
af["prob_in_force"] * af["mortality_rate"] / 12 * af["sum_assured"] af["profit_cf"] = af["premium_cf"] - af["claims_cf"] return af # Run the model policies = pl.read_csv("model_points.csv") results = life_model(policies) ``` ## Using the TableBuilder Pattern For complex table configurations, use the `TableBuilder` pattern: ```python # Build a complex table step by step table = ( gs.TableBuilder("complex_mortality") .from_source("mortality_data.csv") .with_data_dimension("issue_age", "issue_age") .with_data_dimension("policy_year", "policy_year") .with_computed_dimension( "attained_age", pl.col("issue_age") + pl.col("policy_year") - 1, "attained_age" ) .with_melt_dimension( "duration", columns=[str(i) for i in range(1, 26)] + ["Ultimate"], overflow=gs.ExtendOverflow("Ultimate", to_value=100) ) .with_value_column("mortality_rate") .build() ) ``` ## Performance Benefits The assumption system provides significant performance improvements: ### 1. **Pre-Computed Expansion** Overflow columns are expanded once at load time, not during every lookup: ```python # Table with durations 1-25 + "Ult." gets expanded to 1-120 immediately table = gs.Table( name="mortality", source=df, dimensions={ "age": "age", "duration": gs.MeltDimension( columns=duration_cols, overflow=gs.ExtendOverflow("Ult.", to_value=120) ) }, value="rate" ) # All lookups are now O(1) hash operations - no overflow logic needed af = af.with_columns( table.lookup({"age": af["age"], "duration": 100}) # duration=100 works instantly ) ``` ### 2. **Vector-Native Operations** No exploding, joining, or reaggregating required: ```python # Traditional approach: explode 1M policies × 480 months = 480M rows # Gaspatchio: 1M policies with 480-element vectors = 1M rows # Single operation handles entire projection af = af.with_columns( mortality_table.lookup({"age": af["age_vector"]}) ) ``` ### 3. **Optimized Hash-Based Lookups** Built on Rust HashMaps for maximum performance: - O(1) lookup time regardless of table size - Efficient memory usage with pre-indexed structures - Integrates with Polars' lazy evaluation for optimal query planning ## API Reference ### `Table` Class ```python gs.Table( name: str, # Table name for lookups source: str | pl.DataFrame, # File path or DataFrame dimensions: dict[str, str | Dimension], # Dimension configuration value: str = "rate", # Name for value column metadata: dict | None = None, # Optional metadata storage validate: bool = True # Enable validation ) -> Table ``` **Parameters:** - **`name`**: Unique identifier for the table in the lookup registry - **`source`**: Either a file path (.csv/.parquet) or a Polars DataFrame - **`dimensions`**: Dictionary mapping dimension names to columns or Dimension objects - **`value`**: Name of the value column in the final tidy table - **`metadata`**: Optional dictionary stored with the table - **`validate`**: Whether to validate dimension configuration ### `table.lookup()` ```python table.lookup( **dimensions: str | pl.Expr # Dimension names mapped to columns/expressions ) -> pl.Expr ``` Returns a Polars expression that performs the lookup. Use within `.with_columns()` or similar Polars operations. 
### Dimension Types - **`DataDimension`**: Maps a column directly to a dimension - **`MeltDimension`**: Transforms wide columns into long format - **`CategoricalDimension`**: Adds a constant categorical value - **`ComputedDimension`**: Creates a dimension from an expression ### Strategy Types - **`ExtendOverflow`**: Extends a specific column value to higher indices - **`AutoDetectOverflow`**: Automatically detects overflow columns - **`LinearInterpolate`**: Fills gaps with linear interpolation - **`FillConstant`**: Fills gaps with a constant value - **`FillForward`**: Forward fills missing values # Assumption Table Examples in Gaspatchio ## Working with Mortality Tables This guide walks through using the 2015 VBT Female Smoker Mortality Table (ANB) as an example to demonstrate how to set up and use assumption tables in Gaspatchio. ### Understanding the Table Structure The 2015 VBT table is structured as follows: - Rows represent issue ages (18-95) - Columns represent policy durations (1-25 plus "Ultimate") - Values represent mortality rates per 1,000 Here's a small sample from the table: | Issue Age | Duration 1 | Duration 2 | Duration 3 | Duration 4 | Duration 5 | Ultimate | Attained Age | | --- | --- | --- | --- | --- | --- | --- | --- | | 30 | 0.20 | 0.25 | 0.31 | 0.38 | 0.45 | 4.84 | 55 | | 31 | 0.21 | 0.26 | 0.34 | 0.42 | 0.51 | 5.35 | 56 | | 32 | 0.22 | 0.28 | 0.37 | 0.47 | 0.58 | 5.93 | 57 | | 33 | 0.23 | 0.31 | 0.42 | 0.53 | 0.65 | 6.59 | 58 | | 34 | 0.25 | 0.35 | 0.48 | 0.61 | 0.73 | 7.31 | 59 | ### Loading the Assumption Table Loading assumption tables is straightforward with the dimension-based API. Gaspatchio provides tools to analyze table structure and configure dimensions: ```python import gaspatchio_core as gs # First, analyze the table structure (optional but helpful) df = pl.read_csv("2015-VBT-FSM-ANB.csv") schema = gs.analyze_table(df) print(schema.suggest_table_config()) # Load the mortality table with dimension configuration vbt_table = gs.Table( name="vbt_2015_female_smoker", source="2015-VBT-FSM-ANB.csv", dimensions={ "Issue Age": "Issue Age", # Simple data dimension "duration": gs.MeltDimension( columns=[str(i) for i in range(1, 26)] + ["Ultimate"], name="duration", overflow=gs.ExtendOverflow("Ultimate", to_value=200) ) }, value="mortality_rate" ) ``` The API explicitly configures: - Data dimensions (like Issue Age) that map directly from columns - Melt dimensions that transform wide columns (1-25, Ultimate) into long format - Overflow strategies that expand "Ultimate" values to higher durations - The value column name for the melted rates After loading, the internal data looks like this: | Issue Age | duration | mortality_rate | | --- | --- | --- | | 30 | 1 | 0.20 | | 30 | 2 | 0.25 | | 30 | 3 | 0.31 | | 30 | 4 | 0.38 | | 30 | 5 | 0.45 | | 30 | 26 | 4.84 | | 30 | 27 | 4.84 | | 30 | 150 | 4.84 | | ... | ... | ... 
| ### Using the Assumption Table in ActuarialFrame Now we can use this table for lightning-fast lookups: ```python # Create a simple policy dataset policy_data = pl.DataFrame({ "policy_id": ["A001", "A002", "A003", "A004"], "issue_age": [30, 35, 40, 45], "duration": [1, 3, 5, 10] }) # Convert to ActuarialFrame af = gs.ActuarialFrame(policy_data) # Look up mortality rates using the table's lookup method af = af.with_columns( vbt_table.lookup({ "Issue Age": af["issue_age"], "duration": af["duration"] }).alias("mortality_rate") ) print(af) ``` Result: ```text shape: (4, 4) ┌──────────┬───────────┬──────────┬───────────────┐ │ policy_id ┆ issue_age ┆ duration ┆ mortality_rate │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ f64 │ ╞══════════╪═══════════╪══════════╪═══════════════╡ │ A001 ┆ 30 ┆ 1 ┆ 0.20 │ │ A002 ┆ 35 ┆ 3 ┆ 0.54 │ │ A003 ┆ 40 ┆ 5 ┆ 1.15 │ │ A004 ┆ 45 ┆ 10 ┆ 4.10 │ └──────────┴───────────┴──────────┴───────────────┘ ``` ### Working with Overflow Durations The beauty of the API is that overflow handling is completely transparent. Even extreme durations work instantly: ```python # Test with durations beyond the table (> 25) extreme_data = pl.DataFrame({ "policy_id": ["X001", "X002"], "issue_age": [30, 40], "duration": [50, 100] # Way beyond table max of 25! }) af_extreme = gs.ActuarialFrame(extreme_data) af_extreme = af_extreme.with_columns( vbt_table.lookup({ "Issue Age": af_extreme["issue_age"], "duration": af_extreme["duration"] }).alias("mortality_rate") ) print(af_extreme) ``` Result: ```text shape: (2, 4) ┌──────────┬───────────┬──────────┬────────────────┐ │ policy_id ┆ issue_age ┆ duration ┆ mortality_rate │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ f64 │ ╞══════════╪═══════════╪══════════╪════════════════╡ │ X001 ┆ 30 ┆ 50 ┆ 4.84 │ │ X002 ┆ 40 ┆ 100 ┆ 9.32 │ └──────────┴───────────┴──────────┴────────────────┘ ``` Both policies get the "Ultimate" rate because the `ExtendOverflow` strategy pre-expanded the overflow during loading. 
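If you want to see what the `MeltDimension` plus `ExtendOverflow` combination does to your data, the sketch below reproduces the same wide-to-tidy expansion in plain Polars (assuming a recent Polars version with `unpivot`; older versions call it `melt`). Gaspatchio performs this transformation for you in Rust at load time; the snippet is only an illustration of the resulting tidy shape, with made-up ages and rates.

```python
import polars as pl

# A tiny wide table: two select durations plus an "Ult." overflow column
wide = pl.DataFrame({
    "age": [30, 31],
    "1": [0.0011, 0.0012],
    "2": [0.0012, 0.0013],
    "Ult.": [0.0050, 0.0051],
})

# Step 1 - melt the duration columns into tidy (long) format
long = wide.unpivot(
    on=["1", "2", "Ult."],
    index="age",
    variable_name="duration",
    value_name="rate",
)

# Step 2 - replicate each age's "Ult." rate for every duration from 3 to 120
ultimate = (
    long.filter(pl.col("duration") == "Ult.")
    .drop("duration")
    .join(pl.DataFrame({"duration": list(range(3, 121))}), how="cross")
)

# Step 3 - combine the explicit durations with the expanded ultimate rows
tidy = pl.concat([
    long.filter(pl.col("duration") != "Ult.")
        .with_columns(pl.col("duration").cast(pl.Int64)),
    ultimate.select("age", "duration", "rate"),
]).sort("age", "duration")

print(tidy)  # ages 30-31 x durations 1..120, one rate per row
```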
### Projecting Multiple Periods Gaspatchio's vector-based approach works seamlessly with the API: ```python # Create a policy with projection over multiple durations policy_projection = pl.DataFrame({ "policy_id": ["B001"], "issue_age": [30], "duration": [[1, 2, 3, 4, 5, 25, 26, 50, 100]] # Mix of regular and overflow }) af_proj = gs.ActuarialFrame(policy_projection) # Look up mortality rates for all durations at once af_proj = af_proj.with_columns( vbt_table.lookup({ "Issue Age": af_proj["issue_age"], "duration": af_proj["duration"] }).alias("mortality_rate") ) # Explode for visualization result = af_proj.explode(["duration", "mortality_rate"]) print(result) ``` Result: ```text shape: (9, 4) ┌──────────┬───────────┬──────────┬───────────────┐ │ policy_id ┆ issue_age ┆ duration ┆ mortality_rate │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ f64 │ ╞══════════╪═══════════╪══════════╪═══════════════╡ │ B001 ┆ 30 ┆ 1 ┆ 0.20 │ │ B001 ┆ 30 ┆ 2 ┆ 0.25 │ │ B001 ┆ 30 ┆ 3 ┆ 0.31 │ │ B001 ┆ 30 ┆ 4 ┆ 0.38 │ │ B001 ┆ 30 ┆ 5 ┆ 0.45 │ │ B001 ┆ 30 ┆ 25 ┆ 4.12 │ │ B001 ┆ 30 ┆ 26 ┆ 4.84 │ │ B001 ┆ 30 ┆ 50 ┆ 4.84 │ │ B001 ┆ 30 ┆ 100 ┆ 4.84 │ └──────────┴───────────┴──────────┴───────────────┘ ``` ### Loading Simple Curves For 1-dimensional tables (like lapse rates by age), the API is even simpler: ```python # Load a simple age → lapse rate curve lapse_table = gs.Table( name="lapse_2025", source="lapse_curve.csv", dimensions={ "age": "age" # Simple string shorthand }, value="lapse_rate" ) # Use it immediately af = af.with_columns( lapse_table.lookup({"age": af["age"]}).alias("lapse_rate") ) ``` ### Advanced Features For more complex scenarios, you have full control with the dimension-based API: ```python # Multi-dimensional table with selective column loading mortality_table = gs.Table( name="mortality_by_gender", source="mortality_m_f.csv", dimensions={ "age": "age", "gender": gs.MeltDimension( columns=["Male", "Female"], name="gender" ) }, value="mortality_rate" ) # Table with custom overflow limits salary_table = gs.Table( name="salary_scale", source="salary_by_service.csv", dimensions={ "grade": "grade", "service": gs.MeltDimension( columns=[str(i) for i in range(1, 21)] + ["20+"], name="service", overflow=gs.ExtendOverflow("20+", to_value=50) ) }, value="scale_factor" ) # Using computed dimensions complex_table = gs.Table( name="complex_assumptions", source=df, dimensions={ "issue_age": "issue_age", "policy_year": "policy_year", "attained_age": gs.ComputedDimension( pl.col("issue_age") + pl.col("policy_year") - 1, "attained_age" ) }, value="assumption_value" ) ``` ### Using the TableBuilder Pattern For step-by-step table construction, use the fluent `TableBuilder` API: ```python # Build a complex mortality table mortality_table = ( gs.TableBuilder("mortality_select_ultimate") .from_source("mortality_su.csv") .with_data_dimension("issue_age", "IssueAge") .with_data_dimension("gender", "Gender") .with_melt_dimension( "duration", columns=[f"Dur{i}" for i in range(1, 16)] + ["Ultimate"], overflow=gs.ExtendOverflow("Ultimate", to_value=100), fill=gs.LinearInterpolate() # Interpolate any gaps ) .with_value_column("qx_rate") .build() ) # The table is ready for lookups af = af.with_columns( mortality_table.lookup({ "issue_age": af["age"], "gender": af["sex"], "duration": af["policy_duration"] }).alias("mortality_rate") ) ``` ### Metadata and Table Discovery Tables can include metadata for documentation and discovery: ```python # Create table with rich metadata vbt_table = gs.Table( name="vbt_2015_complete", 
source="vbt_2015_all.csv", dimensions={ "age": "Age", "gender": "Gender", "smoking": "Smoker", "duration": gs.MeltDimension( columns=duration_columns, name="duration", overflow=gs.ExtendOverflow("Ultimate", to_value=120) ) }, value="mortality_rate", metadata={ "source": "2015 Valuation Basic Table", "basis": "ANB", "version": "2015", "effective_date": "2015-01-01", "description": "Industry standard mortality table", "tags": ["mortality", "vbt", "2015", "standard"] } ) # Discover tables all_tables = gs.list_tables() print(f"Available tables: {all_tables}") # Get metadata for a specific table metadata = gs.get_table_metadata("vbt_2015_complete") print(f"Table metadata: {metadata}") # List all tables with metadata tables_info = gs.list_tables_with_metadata() for name, meta in tables_info.items(): print(f"{name}: {meta.get('description', 'No description')}") ``` # Core Concepts in Gaspatchio ## Introduction Gaspatchio is a Python library designed specifically for actuarial modeling. It provides a domain-specific language (DSL) that makes it easier to express complex actuarial calculations while maintaining performance and readability. If you're a modeling actuary with Python experience, this library builds on concepts you might already know from pandas, but with specific optimizations and features for actuarial work. ## ActuarialFrame: The Foundation At the heart of Gaspatchio is the `ActuarialFrame`, a powerful alternative to pandas DataFrames. While pandas is excellent for general data manipulation, actuarial models often require: - Handling of projection periods across many time steps - Complex calculation dependencies - Performance optimization for large datasets - Vectorized operations on grouped data `ActuarialFrame` addresses these needs by wrapping [Polars](https://pola.rs), a lightning-fast DataFrame library, and adding actuarial-specific functionality. ```python from gaspatchio_core.dsl.core import ActuarialFrame # Create an ActuarialFrame from existing data af = ActuarialFrame(your_data) # Set calculation columns using natural Python syntax af["age-last"] = af.floor(af["age"]).cast(pl.Int64) af["premium_rate"] = af["base_rate"] * af["age_factor"] ``` ## Key Differences from pandas If you're familiar with pandas, here are the main differences: 1. **Lazy Evaluation**: Operations are captured and optimized before execution, rather than being executed immediately 1. **Expression Tracking**: The library tracks how columns are derived, enabling model auditing and optimization 1. **Actuarial Functions**: Built-in functions for common actuarial operations 1. **Performance Modes**: Debug and optimize modes to balance development speed with production performance ## Modeling Approach Gaspatchio encourages a functional, pipeline-based approach to model building: ```python def life_model(af): # Chain operations using pipe for cleaner code af = (setup_ages(af) .pipe(mortality_rate) .pipe(lapse_rate) .pipe(premium_rate)) # Define cashflows af["premium_cashflow"] = af["premium_rate"] * af["P[IF]"] * af["sum_assured"] / 1000 af["claims_cashflow"] = af["P[death]"] * af["sum_assured"] return af ``` This approach makes models more testable, maintainable and readable. - Testable: Each function can be tested independently. - Maintainable: Clear separation of concerns. - Readable: Operations flow naturally from inputs to outputs. ## Performance Optimization Gaspatchio provides two execution modes: 1. **Debug Mode**: Direct execution for easier debugging and development 1. 
**Optimize Mode**: Captures operations to optimize before execution, with optional Numba acceleration ```python # Set mode globally from gaspatchio_core.dsl.core import set_default_mode set_default_mode("optimize") # Or per frame af = ActuarialFrame(data, mode="optimize") ``` ## Table Lookups and Assumptions The library provides efficient ways to handle assumption tables common in actuarial work: ```python # Register a mortality table registry.register_table( name="mortality_rates", df=mortality_df, keys=["age-last", "gender_smoking"], value_column="mortality_rate" ) # Look up values in models af["mortality_rate"] = assumption_lookup( "age-last", "gender_smoking", table_name="mortality_rates" ) ``` ## Getting Started To start building your first model with Gaspatchio, you need to: 1. Define your model points (policy data) 1. Create an ActuarialFrame from your data 1. Define projection functions 1. Run your model with the `run_model` function For detailed examples and API documentation, see the subsequent sections of this guide. # Integrating Custom Python Code As an actuary using Gaspatchio, you might have existing Python functions or complex logic you want to integrate into your models. Perhaps you have a specific benefit calculation, a complex decrement logic, or a custom reserving method implemented in Python. Gaspatchio provides two primary ways to incorporate this custom logic into the `ActuarialFrame` workflow: 1. **Direct Application (`.apply`)**: For quick, one-off use cases or simple functions. 1. **Accessor Plugins**: For more complex, reusable logic that benefits from better organization and integration. ## 1. Direct Application with `.apply()` If you have a relatively simple Python function that operates on a single column's data element-wise, the quickest way to use it is via the `.apply()` method on a column proxy. Let's say you have a Python function to calculate a simple bonus amount based on the policy duration: ```python # Your existing Python function def calculate_bonus(duration: int) -> float: if duration <= 5: return 0.0 elif duration <= 10: return 50.0 else: return 100.0 + (duration - 10) * 5.0 ``` You can apply this directly within your model definition: ```python import polars as pl from gaspatchio_core.dsl.core import ActuarialFrame # Assume 'af' is your ActuarialFrame with a 'policy_duration' column # af = ActuarialFrame(...) # Apply the custom Python function # Note: We provide a return_dtype for better performance and type stability af["bonus_amount"] = af["policy_duration"].apply( calculate_bonus, return_dtype=pl.Float64 ) result = af.collect() print(result) ``` **Pros:** - **Simple:** Very straightforward for existing functions. - **Quick:** No extra setup required for one-off calculations. **Cons:** - **Performance:** Python function execution can be slower than native Polars/Gaspatchio operations, especially for large datasets. Providing `return_dtype` helps, but it won't be as fast as a pure expression. Gaspatchio might attempt Numba optimization in "optimize" mode if Numba is installed, but this isn't guaranteed. - **Readability:** Can clutter model logic if many complex `.apply` calls are used. - **Reusability:** Less discoverable and reusable across different models compared to plugins. - **Limited Scope:** Primarily designed for element-wise operations on single columns. 
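For comparison, here is the same tiered bonus written as a native Polars expression instead of a row-by-row Python function. This is a sketch of the rewrite that the performance note below recommends: it relies on the standard Polars `when/then/otherwise` pattern and on `with_columns`, which accepts raw Polars expressions just like the lookup examples earlier. The accessor plugin in the next section wraps exactly this kind of expression.

```python
import polars as pl
from gaspatchio_core.dsl.core import ActuarialFrame

af = ActuarialFrame({"policy_duration": [3, 7, 12]})

# Tiered bonus as a vectorized when/then/otherwise expression
# (no per-row Python calls, so Polars can evaluate it on whole columns)
bonus_expr = (
    pl.when(pl.col("policy_duration") <= 5).then(0.0)
    .when(pl.col("policy_duration") <= 10).then(50.0)
    .otherwise(100.0 + (pl.col("policy_duration") - 10) * 5.0)
    .alias("bonus_amount")
)

af = af.with_columns(bonus_expr)
print(af.collect())  # expected bonus amounts: 0.0, 50.0, 110.0
```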
Use `.apply()` when you need a quick integration and the performance impact is acceptable, or when prototyping logic before potentially converting it into a more optimized expression or plugin.

### Performance Considerations with `.apply()`

Using `.apply()` executes your Python function row by row. This involves overhead for each element (calling the Python interpreter, type checking, etc.) and prevents vectorized optimizations that operate on entire columns simultaneously. As a result, it can be **orders of magnitude slower** than equivalent logic written using native Polars/Gaspatchio expressions, especially on large datasets.

You might see a `PerformanceWarning` when using `.apply()` similar to this:

```text
PerformanceWarning: Applying a Python function 'your_function_name' using map_elements. This is potentially slow. For better performance, consider using Polars expressions directly.
```

While convenient for quick tests or simple logic, relying heavily on `.apply()` for core calculations will significantly impact your model's performance. It's strongly recommended to rewrite the logic using native Polars expressions or within an accessor plugin for production use, as shown in the next section.

## 2. Accessor Plugins (Recommended for Reusability)

If your custom logic is more complex, will be reused across different models, or involves multiple related calculations, creating an **accessor plugin** is the recommended approach.

Accessor plugins extend `ActuarialFrame` (or its column/expression proxies) with custom namespaces. Think of the built-in `.dt` (for dates) or `.str` (for strings) namespaces in Polars – plugins let you create your own, like `.mortality` or `.reserving`.

### Why Create a Plugin?

- **Organization:** Group related custom calculations under a single namespace (e.g., `af["premium"].finance.present_value(...)`).
- **Reusability:** Define logic once and use it across multiple models or share it with colleagues.
- **Readability:** Keeps model definitions cleaner by encapsulating complex logic within accessor methods.
- **Discoverability:** Makes custom functions easily discoverable via standard attribute access (and `dir()`).
- **Potential for Optimization:** Accessor methods can be written to leverage efficient Polars expressions internally.

### Creating a Simple Column Accessor

Let's adapt our `calculate_bonus` function into a reusable column accessor plugin.
**Step 1: Define the Accessor Class** Create a Python file (e.g., `my_company_accessors.py`) and define your class: ```python # my_company_accessors.py import polars as pl from gaspatchio_core.dsl.core import ActuarialFrame, ColumnProxy, ExpressionProxy from gaspatchio_core.dsl.plugins import register_accessor class BaseAccessor: """Optional base class for convenience.""" def __init__(self, obj): # obj will be the ColumnProxy or ExpressionProxy instance self._obj = obj @register_accessor("bonus", kind="column") # Register as .bonus for columns/expressions class BonusAccessor(BaseAccessor): def amount(self) -> ExpressionProxy: """Calculates the bonus amount based on the proxied duration column.""" # We use Polars expressions *inside* the accessor for performance duration_expr = self._obj # self._obj is the duration column/expression proxy bonus_expr = ( pl.when(duration_expr <= 5).then(0.0) .when(duration_expr <= 10).then(50.0) .otherwise(100.0 + (duration_expr - 10) * 5.0) .cast(pl.Float64) # Ensure consistent output type ) # Important: Return an ExpressionProxy # We assume self._obj has a ._parent attribute (true for Column/ExpressionProxy) return ExpressionProxy(bonus_expr, self._obj._parent) def is_eligible(self, threshold: int = 5) -> ExpressionProxy: """Checks if bonus is eligible based on duration.""" duration_expr = self._obj eligibility_expr = duration_expr > threshold return ExpressionProxy(eligibility_expr, self._obj._parent) # IMPORTANT: Ensure this module (my_company_accessors.py) is imported somewhere # in your application *after* gaspatchio_core.dsl.core is defined. # e.g., in __init__.py or main.py: # import my_company_accessors ``` **Key Points:** - `@register_accessor("bonus", kind="column")`: This decorator registers the `BonusAccessor` class. It will be available as `.bonus` on `ColumnProxy` and `ExpressionProxy` instances. - `__init__(self, obj)`: Stores the proxy object (`ColumnProxy` or `ExpressionProxy`) the accessor is attached to. - `amount(self)`: Implements the bonus logic using efficient Polars `when/then/otherwise` expressions instead of a Python function. It returns a new `ExpressionProxy`. - Returning `ExpressionProxy`: Accessor methods that perform calculations should generally return `ExpressionProxy` objects to keep the operations within the Gaspatchio/Polars expression system for optimal performance and lazy evaluation. **Step 2: Import Your Accessor Module** Somewhere in your project (e.g., your main script or a relevant `__init__.py`), make sure to import the module containing your accessor definition. This triggers the registration decorator. ```python # main_model.py import polars as pl from gaspatchio_core.dsl.core import ActuarialFrame import my_company_accessors # <--- Import to register .bonus accessor af = ActuarialFrame({ "policy_duration": [3, 7, 12] }) # Use the accessor! af["bonus_amount"] = af["policy_duration"].bonus.amount() af["is_bonus_eligible"] = af["policy_duration"].bonus.is_eligible() # You can chain accessors with other operations af["eligible_bonus"] = af["is_bonus_eligible"] * af["bonus_amount"] print(af.collect()) ``` ### Frame Accessors and Entry Points You can also create `frame` accessors (`kind="frame"`) that attach to the `ActuarialFrame` itself, useful for portfolio-level calculations. 
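As a sketch of what a frame accessor might look like, the example below registers a portfolio-level namespace with `kind="frame"`. The accessor name, the `__init__` contract (receiving the `ActuarialFrame` instance), and the `premium_share` method are assumptions made for illustration, not part of the library.

```python
# my_company_accessors.py (continued) - a sketch of a frame-level accessor.
import polars as pl
from gaspatchio_core.dsl.core import ActuarialFrame
from gaspatchio_core.dsl.plugins import register_accessor


@register_accessor("portfolio", kind="frame")  # available as af.portfolio
class PortfolioAccessor:
    def __init__(self, frame):
        # frame is assumed to be the ActuarialFrame the accessor is attached to
        self._frame = frame

    def premium_share(self, group_col: str = "policy_id") -> ActuarialFrame:
        """Each row's premium as a share of its group's total premium."""
        af = self._frame
        af["group_premium"] = af["premium"].sum().over(group_col)
        af["premium_share"] = af["premium"] / af["group_premium"]
        return af


# Usage (illustrative):
# af = af.portfolio.premium_share(group_col="policy_id")
```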
Furthermore, if you are developing a package of reusable actuarial components, you can use **entry points** to make your accessors automatically discoverable when someone installs your package, without requiring them to explicitly import your accessor module. These are more advanced topics covered in the technical reference documentation. For most users integrating their own project-specific code, the `@register_accessor` decorator provides the best balance of organization and ease of use. Choose the method that best suits the complexity and reusability needs of your custom Python code. For simple, infrequent use, `.apply()` is sufficient. For structured, reusable, and potentially performance-critical logic, invest the time to create an accessor plugin. - Why Polars? - Shimming polars - Column wise operations # API ## `gaspatchio_core.frame.base.ActuarialFrame` A lazy, chainable, and traceable DataFrame for actuarial modeling. The ActuarialFrame provides a high-level API for common actuarial calculations and data manipulations, leveraging Polars LazyFrames for performance. It supports tracing of operations for optimization and introspection, and provides convenient accessors for specialized functionality (e.g., date, finance, excel operations). Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `data` | `dict | DataFrame | LazyFrame | None` | Initial data to populate the frame. Can be a Python dictionary, a Polars DataFrame, or a Polars LazyFrame. If None, an empty frame is initialized. Defaults to None. | `None` | | `mode` | `str | None` | The operational mode: "run", "optimize", or "debug". - "run": Executes operations eagerly. - "optimize": Defers execution and builds a computation graph. - "debug": Provides more verbose output. Defaults to the global default mode (get_default_mode). | `None` | | `verbose` | `bool | None` | Enables or disables verbose logging. Defaults to the global default verbosity (get_default_verbose). | `None` | | `threads` | `int | None` | Number of threads for parallel operations. Defaults to a system-dependent value or \_DEFAULT_THREADS. | `None` | Attributes: | Name | Type | Description | | --- | --- | --- | | `date` | `DateFrameAccessor` | Accessor for date-related operations. | | `excel` | `ExcelFrameAccessor` | Accessor for Excel-like operations. | | `finance` | `FinanceFrameAccessor` | Accessor for financial calculations. | | `columns` | `list[str]` | A list of column names in their current order. | Examples: **Initialization and Basic Operations** ```pycon >>> from gaspatchio_core import ActuarialFrame >>> data = { ... "policy_id": [1, 1, 2, 2, 3], ... "inception_date": ["2020-01-01", "2020-01-01", "2021-05-10", "2021-05-10", "2022-02-20"], ... "premium": [100, 150, 200, 50, 300], ... "claims": [0, 50, 10, 0, 120] ... 
} >>> af = ActuarialFrame(data) >>> af["loss_ratio"] = af["claims"] / af["premium"] >>> result = af.collect() >>> print(result.head(3)) shape: (3, 5) ┌───────────┬────────────────┬─────────┬────────┬────────────┐ │ policy_id ┆ inception_date ┆ premium ┆ claims ┆ loss_ratio │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ i64 ┆ f64 │ ╞═══════════╪════════════════╪═════════╪════════╪════════════╡ │ 1 ┆ 2020-01-01 ┆ 100 ┆ 0 ┆ 0.0 │ │ 1 ┆ 2020-01-01 ┆ 150 ┆ 50 ┆ 0.333333 │ │ 2 ┆ 2021-05-10 ┆ 200 ┆ 10 ┆ 0.05 │ └───────────┴────────────────┴─────────┴────────┴────────────┘ ``` **Using `sum` over a group** ```pycon >>> af = ActuarialFrame(data) >>> af["total_premium_per_policy"] = af["premium"].sum().over("policy_id") >>> result_with_sum = af.collect() >>> print(result_with_sum) shape: (5, 5) ┌───────────┬────────────────┬─────────┬────────┬──────────────────────────┐ │ policy_id ┆ inception_date ┆ premium ┆ claims ┆ total_premium_per_policy │ │ --- ┆ --- ┆ --- ┆ --- ┆ --- │ │ i64 ┆ str ┆ i64 ┆ i64 ┆ i64 │ ╞═══════════╪════════════════╪═════════╪════════╪══════════════════════════╡ │ 1 ┆ 2020-01-01 ┆ 100 ┆ 0 ┆ 250 │ │ 1 ┆ 2020-01-01 ┆ 150 ┆ 50 ┆ 250 │ │ 2 ┆ 2021-05-10 ┆ 200 ┆ 10 ┆ 250 │ │ 2 ┆ 2021-05-10 ┆ 50 ┆ 0 ┆ 250 │ │ 3 ┆ 2022-02-20 ┆ 300 ┆ 120 ┆ 300 │ └───────────┴────────────────┴─────────┴────────┴──────────────────────────┘ ``` **Using an accessor (e.g., date accessor)** Assume 'inception_date' needs to be parsed to a date type first. For simplicity, let's imagine it's already a date type for this example. (Actual parsing would use `af["inception_date"].str.to_date("%Y-%m-%d")` or similar) ```pycon >>> # If 'inception_date' was a date type: >>> # af["inception_year"] = af.date.year("inception_date") >>> # af_with_year = af.collect() >>> # print(af_with_year.select(["policy_id", "inception_year"])) ``` ### `columns` Return the names of the columns in the current order. ### `date` Access date-related frame operations. ### `excel` Access excel-related frame operations. ### `finance` Access finance-related frame operations. ### `__dir__()` Enhance dir() output to include standard methods, df methods, and accessors. ### `__getattr__(name)` Dynamically instantiate and return registered frame accessors. ### `__getitem__(key)` Allow df['column'] access, returning a ColumnProxy. ### `__repr__()` Return a string representation of the ActuarialFrame. ### `__setitem__(key, value)` Handle column assignment using df['column'] = value. ### `collect()` Execute and materialize the dataframe. ### `count()` Count non-null values in each column. Returns a single-row frame containing the count of non-null values for each column. Essential for data quality assessment, completeness checks, and exposure calculations in actuarial analysis. When to use - **Data Quality:** Assess completeness of critical fields like policy ID, sum assured, or premium to identify missing data issues. - **Exposure Calculation:** Count policies, lives, or claims for exposure-based calculations in pricing and reserving. - **Cohort Analysis:** Determine size of different risk groups, age bands, or product segments for credibility assessment. - **Validation:** Verify record counts match expected values after data processing, joins, or filtering operations. ##### Returns pl.DataFrame A frame with one row containing non-null counts for each column. 
##### Examples **Scalar Example: Data Completeness Check** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002", "P003", "P004", None], "age": [25, 45, None, 35, 52], "sum_assured": [100000, 500000, 250000, None, 300000], "status": ["Active", "Active", "Lapsed", "Active", "Active"], } af = ActuarialFrame(data) counts = af.count() print(counts) print("Complete policies:", counts["policy_id"]) print("Complete ages:", counts["age"]) print("Data completeness %:", counts["age"] / 5 * 100) ``` ```text shape: (1, 4) ┌───────────┬─────┬─────────────┬────────┐ │ policy_id ┆ age ┆ sum_assured ┆ status │ │ --- ┆ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 ┆ u32 │ ╞═══════════╪═════╪═════════════╪════════╡ │ 4 ┆ 4 ┆ 4 ┆ 5 │ └───────────┴─────┴─────────────┴────────┘ Complete policies: 4 Complete ages: 4 Data completeness %: 80.0 ``` **Vector Example: Monthly Activity Counts** ```python from gaspatchio_core import ActuarialFrame data = { "month": ["Jan", "Feb"], "daily_claims": [ [5, 3, 0, 4, None, 2, 1, 0, 3, None, 4, 2, 0, 1, 5], [2, None, 3, 1, 0, 4, None, 2, 0, 3, 1, None, 4, 2, 0] ], "daily_lapses": [ [1, 0, 0, 2, 1, 0, 0, 1, 0, 0, 1, 0, 2, 0, 1], [0, 1, 0, 0, 2, 0, 1, 0, 1, 0, 0, 1, 0, 2, 0] ] } af = ActuarialFrame(data) # Count valid daily observations counts = af.count() print(counts) ``` ```text shape: (1, 3) ┌───────┬──────────────┬──────────────┐ │ month ┆ daily_claims ┆ daily_lapses │ │ --- ┆ --- ┆ --- │ │ u32 ┆ u32 ┆ u32 │ ╞═══════╪══════════════╪══════════════╡ │ 2 ┆ 2 ┆ 2 │ └───────┴──────────────┴──────────────┘ ``` ### `fill_series(column, start=0, increment=1)` Apply fill_series using the core function. ### `get_column_order()` Return the tracked order of columns. ### `max()` Calculate maximum values across all numeric columns. Returns a single-row frame containing the maximum value for each column. Essential for identifying outliers, validating data ranges, and determining upper bounds in actuarial calculations. When to use - **Data Validation:** Identify outliers in premium amounts, sum assured, or claim values that may require investigation. - **Experience Analysis:** Find maximum claim amounts, policy sizes, or ages in a portfolio for risk assessment. - **Regulatory Reporting:** Determine maximum exposure amounts for solvency calculations and stress testing. - **Pricing Boundaries:** Identify upper limits for age bands, benefit amounts, or policy terms in product design. ##### Returns pl.DataFrame A frame with one row containing maximum values for each column. 
##### Examples **Scalar Example: Portfolio Maximum Values** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002", "P003", "P004"], "age": [25, 45, 67, 35], "sum_assured": [100000, 500000, 250000, 1000000], "annual_premium": [1200, 6000, 8500, 15000], } af = ActuarialFrame(data) max_values = af.max() print(max_values) print("Max age:", max_values["age"][0]) print("Max sum assured:", max_values["sum_assured"][0]) ``` ```text shape: (1, 4) ┌───────────┬─────┬─────────────┬────────────────┐ │ policy_id ┆ age ┆ sum_assured ┆ annual_premium │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═══════════╪═════╪═════════════╪════════════════╡ │ P004 ┆ 67 ┆ 1000000 ┆ 15000 │ └───────────┴─────┴─────────────┴────────────────┘ Max age: 67 Max sum assured: 1000000 ``` **Vector Example: Maximum Monthly Claims** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002"], "policy_year": [1, 2], "monthly_claims": [ [0, 500, 0, 1200, 0, 0, 800, 0, 0, 0, 0, 2500], [0, 0, 3000, 0, 0, 1500, 0, 0, 0, 4000, 0, 0] ], "monthly_premiums": [ [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000], [1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500, 1500] ] } af = ActuarialFrame(data) # Get maximum values to understand worst-case scenarios max_values = af.max() print(max_values) print("Max policy year:", max_values["policy_year"][0]) ``` ```text shape: (1, 4) ┌───────────┬─────────────┬─────────────────────────────────────┬─────────────────────────────────────┐ │ policy_id ┆ policy_year ┆ monthly_claims ┆ monthly_premiums │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ list[i64] ┆ list[i64] │ ╞═══════════╪═════════════╪═════════════════════════════════════╪═════════════════════════════════════╡ │ P002 ┆ 2 ┆ [0, 500, 3000, 1200, … 4000, 0, 0] ┆ [1500, 1500, 1500, 1500, … 1500] │ └───────────┴─────────────┴─────────────────────────────────────┴─────────────────────────────────────┘ Max policy year: 2 ``` ### `mean()` Calculate mean values across all numeric columns. Returns a single-row frame containing the mean value for each numeric column. Essential for portfolio analysis, experience studies, and establishing benchmarks in actuarial calculations. When to use - **Experience Analysis:** Calculate average claim amounts, policy sizes, or premium levels for portfolio segmentation and pricing. - **Trend Analysis:** Determine average lapse rates, mortality rates, or expense ratios over observation periods. - **Benchmarking:** Establish portfolio averages for age, sum assured, or duration to compare against industry standards. - **Reserve Calculations:** Compute average policy values, benefit amounts, or reserve factors for grouped calculations. ##### Returns pl.DataFrame A frame with one row containing mean values for numeric columns. 
##### Examples

**Scalar Example: Portfolio Averages**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["P001", "P002", "P003", "P004"],
    "age": [25, 45, 67, 35],
    "sum_assured": [100000, 500000, 250000, 1000000],
    "annual_premium": [1200, 6000, 8500, 15000],
}

af = ActuarialFrame(data)
mean_values = af.mean()
print(mean_values)
print("Average age:", mean_values["age"])
print("Average sum assured:", mean_values["sum_assured"])
```

```text
shape: (1, 3)
┌──────┬─────────────┬────────────────┐
│ age  ┆ sum_assured ┆ annual_premium │
│ ---  ┆ ---         ┆ ---            │
│ f64  ┆ f64         ┆ f64            │
╞══════╪═════════════╪════════════════╡
│ 43.0 ┆ 462500.0    ┆ 7675.0         │
└──────┴─────────────┴────────────────┘
Average age: 43.0
Average sum assured: 462500.0
```

**Vector Example: Average Monthly Experience**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["P001", "P002"],
    "policy_year": [1, 2],
    "monthly_claims": [
        [0, 500, 0, 1200, 0, 0, 800, 0, 0, 0, 0, 2500],
        [0, 0, 3000, 0, 0, 1500, 0, 0, 0, 4000, 0, 0]
    ],
    "monthly_lapses": [
        [2, 1, 3, 0, 1, 2, 1, 0, 1, 0, 2, 1],
        [1, 0, 2, 1, 0, 1, 0, 1, 0, 2, 1, 0]
    ]
}

af = ActuarialFrame(data)

# Get average monthly experience
mean_values = af.mean()
print(mean_values)
```

```text
shape: (1, 3)
┌─────────────┬─────────────────────────────┬────────────────────────┐
│ policy_year ┆ monthly_claims              ┆ monthly_lapses         │
│ ---         ┆ ---                         ┆ ---                    │
│ f64         ┆ list[f64]                   ┆ list[f64]              │
╞═════════════╪═════════════════════════════╪════════════════════════╡
│ 1.5         ┆ [0.0, 250.0, 1500.0, … 0.0] ┆ [1.5, 0.5, 2.5, … 0.5] │
└─────────────┴─────────────────────────────┴────────────────────────┘
```

### `median()`

Calculate median values across all numeric columns.

Returns a single-row frame containing the median value for each numeric column. Useful for robust central tendency measures that are less affected by outliers in actuarial data.

When to use

- **Robust Analysis:** Use median instead of mean when data contains outliers, such as large claims or extreme ages in the portfolio.
- **Income Analysis:** Analyze median policyholder income or premium levels for market segmentation and product design.
- **Experience Studies:** Calculate median time to claim, policy duration, or age at lapse for more representative measures.
- **Pricing Benchmarks:** Determine median rates or factors when comparing across competitors or market segments.

##### Returns

pl.DataFrame

A frame with one row containing median values for numeric columns.
##### Examples **Scalar Example: Median Policy Metrics** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002", "P003", "P004", "P005"], "duration_years": [1, 3, 5, 7, 15], "annual_premium": [1200, 3500, 2800, 4200, 12000], "age": [25, 35, 42, 38, 65], } af = ActuarialFrame(data) median_values = af.median() print(median_values) print("Median duration:", median_values["duration_years"]) print("Median premium:", median_values["annual_premium"]) ``` ```text shape: (1, 3) ┌────────────────┬────────────────┬──────┐ │ duration_years ┆ annual_premium ┆ age │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞════════════════╪════════════════╪══════╡ │ 5.0 ┆ 3500.0 ┆ 38.0 │ └────────────────┴────────────────┴──────┘ Median duration: 5.0 Median premium: 3500.0 ``` **Vector Example: Median Monthly Performance** ```python from gaspatchio_core import ActuarialFrame data = { "agent": ["A001", "A002"], "monthly_sales": [ [3, 5, 2, 8, 4, 6, 3, 7, 5, 4, 6, 9], [12, 15, 10, 18, 14, 16, 11, 20, 13, 17, 15, 22] ], "monthly_commission": [ [450, 750, 300, 1200, 600, 900, 450, 1050, 750, 600, 900, 1350], [1800, 2250, 1500, 2700, 2100, 2400, 1650, 3000, 1950, 2550, 2250, 3300] ] } af = ActuarialFrame(data) # Calculate median for typical performance assessment median_values = af.median() print(median_values) print("Agent A001 median sales:", median_values["monthly_sales"][0]) print("Agent A002 median sales:", median_values["monthly_sales"][1]) ``` ```text shape: (1, 3) ┌────────────┬────────────────────┬──────────────────────┐ │ agent ┆ monthly_sales ┆ monthly_commission │ │ --- ┆ --- ┆ --- │ │ str ┆ list[f64] ┆ list[f64] │ ╞════════════╪════════════════════╪══════════════════════╡ │ null ┆ [5.0, 15.0] ┆ [750.0, 2250.0] │ └────────────┴────────────────────┴──────────────────────┘ Agent A001 median sales: 5.0 Agent A002 median sales: 15.0 ``` ### `min()` Calculate minimum values across all numeric columns. Returns a single-row frame containing the minimum value for each column. Essential for identifying baseline values, detecting anomalies, and establishing lower bounds in actuarial calculations. When to use - **Data Quality Checks:** Identify potential data errors like negative ages, zero premiums, or missing values coded as extreme minimums. - **Portfolio Analysis:** Find minimum entry ages, smallest policy sizes, or lowest premium amounts for market segmentation. - **Risk Assessment:** Determine minimum coverage levels, deductibles, or retention limits in reinsurance analysis. - **Product Design:** Establish minimum benefit guarantees, surrender values, or contribution limits for new products. ##### Returns pl.DataFrame A frame with one row containing minimum values for each column. 
##### Examples **Scalar Example: Portfolio Minimum Values** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002", "P003", "P004"], "age": [25, 45, 67, 35], "sum_assured": [100000, 500000, 250000, 1000000], "annual_premium": [1200, 6000, 8500, 15000], } af = ActuarialFrame(data) min_values = af.min() print(min_values) print("Min age:", min_values["age"]) print("Min sum assured:", min_values["sum_assured"]) ``` ```text shape: (1, 4) ┌───────────┬─────┬─────────────┬────────────────┐ │ policy_id ┆ age ┆ sum_assured ┆ annual_premium │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ i64 ┆ i64 │ ╞═══════════╪═════╪═════════════╪════════════════╡ │ P001 ┆ 25 ┆ 100000 ┆ 1200 │ └───────────┴─────┴─────────────┴────────────────┘ Min age: 25 Min sum assured: 100000 ``` **Vector Example: Minimum Monthly Claims** ```python from gaspatchio_core import ActuarialFrame data = { "policy_id": ["P001", "P002"], "policy_year": [1, 2], "monthly_claims": [ [0, 500, 0, 1200, 0, 0, 800, 0, 0, 0, 0, 2500], [0, 0, 3000, 0, 0, 1500, 0, 0, 0, 4000, 0, 0] ], "monthly_retention": [ [1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000, 1000], [500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500] ] } af = ActuarialFrame(data) # Get minimum values to understand retention levels min_values = af.min() print(min_values) print("Min retention level:", min_values["monthly_retention"]) ``` ```text shape: (1, 4) ┌───────────┬─────────────┬─────────────────────────────────────┬─────────────────────────────────────┐ │ policy_id ┆ policy_year ┆ monthly_claims ┆ monthly_retention │ │ --- ┆ --- ┆ --- ┆ --- │ │ str ┆ i64 ┆ list[i64] ┆ list[i64] │ ╞═══════════╪═════════════╪═════════════════════════════════════╪═════════════════════════════════════╡ │ P001 ┆ 1 ┆ [0, 0, 0, 0, … 0, 0, 0] ┆ [500, 500, 500, 500, … 500] │ └───────────┴─────────────┴─────────────────────────────────────┴─────────────────────────────────────┘ Min retention level: [500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500, 500] ``` ### `pipe(func, *args, **kwargs)` Apply a function that accepts and returns an ActuarialFrame. ### `product()` Calculate the product of values in each numeric column. Returns a single-row frame containing the product of all values for each numeric column. Useful for compound calculations, probability chains, and multiplicative factors in actuarial modeling. When to use - **Compound Interest:** Calculate accumulated values using multiple period growth factors or discount factors. - **Probability Chains:** Multiply survival probabilities, persistency rates, or success rates across multiple periods. - **Factor Application:** Apply multiple adjustment factors, loading factors, or credibility factors in sequence. - **Index Calculations:** Compute cumulative index values from period-to-period change factors. ##### Returns pl.DataFrame A frame with one row containing products for numeric columns. 
##### Examples

**Scalar Example: Survival Probability Chain**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "year": [1, 2, 3, 4, 5],
    "annual_survival": [0.999, 0.998, 0.997, 0.995, 0.993],
    "annual_persistency": [0.95, 0.92, 0.90, 0.88, 0.85],
}
af = ActuarialFrame(data)

products = af.product()
print(products)
print("5-year survival probability:", round(products["annual_survival"], 6))
print("5-year persistency:", round(products["annual_persistency"], 4))
```

```text
shape: (1, 3)
┌──────┬─────────────────┬────────────────────┐
│ year ┆ annual_survival ┆ annual_persistency │
│ ---  ┆ ---             ┆ ---                │
│ i64  ┆ f64             ┆ f64                │
╞══════╪═════════════════╪════════════════════╡
│ 120  ┆ 0.982118        ┆ 0.588377           │
└──────┴─────────────────┴────────────────────┘
5-year survival probability: 0.982118
5-year persistency: 0.5884
```

**Vector Example: Discount Factor Chains**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "scenario": ["Base", "Stressed"],
    "monthly_discount": [
        [0.9992, 0.9992, 0.9992, 0.9992, 0.9992, 0.9992],
        [0.9990, 0.9990, 0.9990, 0.9990, 0.9990, 0.9990]
    ],
    "monthly_survival": [
        [0.9999, 0.9999, 0.9999, 0.9999, 0.9999, 0.9999],
        [0.9998, 0.9998, 0.9998, 0.9998, 0.9998, 0.9998]
    ]
}
af = ActuarialFrame(data)

# Calculate cumulative factors
products = af.product()
print(products)
```

```text
shape: (1, 3)
┌──────────┬──────────────────┬──────────────────┐
│ scenario ┆ monthly_discount ┆ monthly_survival │
│ ---      ┆ ---              ┆ ---              │
│ str      ┆ list[f64]        ┆ list[f64]        │
╞══════════╪══════════════════╪══════════════════╡
│ null     ┆ [0.9952, 0.9940] ┆ [0.9994, 0.9988] │
└──────────┴──────────────────┴──────────────────┘
```

### `profile()`

Execute and materialize the dataframe with profiling, returning (result_df, profile_info).

### `quantile(quantile, interpolation='nearest')`

Calculate quantile values across all numeric columns.

Returns a single-row frame containing the specified quantile for each numeric column. Essential for risk assessment, percentile-based analysis, and regulatory reporting in actuarial applications.

When to use

- **Risk Assessment:** Calculate VaR (Value at Risk) at different confidence levels (e.g., 95th, 99th percentile) for solvency calculations.
- **Experience Analysis:** Determine percentile thresholds for large claims, high-risk ages, or outlier detection in portfolios.
- **Pricing Segmentation:** Identify quantile boundaries for premium bands, risk tiers, or underwriting categories.
- **Regulatory Reporting:** Calculate required percentiles for stress testing, capital requirements, or reserve adequacy testing.

##### Parameters

quantile : float
    Quantile value between 0 and 1 (e.g., 0.5 for median, 0.95 for 95th percentile).
interpolation : str, default "nearest"
    Interpolation method: "nearest", "higher", "lower", "midpoint", or "linear".

##### Returns

pl.DataFrame
    A frame with one row containing quantile values for numeric columns.
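`profile()` above returns a tuple rather than a frame. A minimal sketch of unpacking it; the exact contents of `profile_info` are not documented here, so it is simply printed:

```python
from gaspatchio_core import ActuarialFrame

data = {"duration": [1, 2, 3], "lapse_rate": [0.10, 0.08, 0.07]}
af = ActuarialFrame(data)

# Execute and materialize the frame, collecting profiling information
result_df, profile_info = af.profile()

print(result_df)
print(profile_info)  # timing/plan details; structure depends on the implementation
```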
##### Examples **Scalar Example: Claims Distribution Analysis** ```python from gaspatchio_core import ActuarialFrame data = { "claim_id": list(range(1, 101)), "claim_amount": [1000, 1500, 2000, 2500, 3000, 3500, 4000, 5000, 6000, 7500, 8000, 9000, 10000, 12000, 15000, 18000, 20000, 25000, 30000, 35000, 40000, 45000, 50000, 60000, 75000, 85000, 95000, 100000, 120000, 150000] + [2000] * 70, "processing_days": list(range(5, 35)) + list(range(10, 80)), } af = ActuarialFrame(data) # Calculate key percentiles p90 = af.quantile(0.90) p95 = af.quantile(0.95) p99 = af.quantile(0.99) print("90th percentile:") print(p90) print("\nClaim amount 90th percentile:", p90["claim_amount"]) print("Claim amount 95th percentile:", p95["claim_amount"]) print("Claim amount 99th percentile:", p99["claim_amount"]) ``` ```text 90th percentile: shape: (1, 3) ┌──────────┬──────────────┬─────────────────┐ │ claim_id ┆ claim_amount ┆ processing_days │ │ --- ┆ --- ┆ --- │ │ f64 ┆ f64 ┆ f64 │ ╞══════════╪══════════════╪═════════════════╡ │ 90.0 ┆ 85000.0 ┆ 71.0 │ └──────────┴──────────────┴─────────────────┘ Claim amount 90th percentile: 85000.0 Claim amount 95th percentile: 100000.0 Claim amount 99th percentile: 150000.0 ``` **Vector Example: Portfolio Risk Percentiles** ```python from gaspatchio_core import ActuarialFrame data = { "product": ["Term Life", "Whole Life"], "claim_amounts": [ [10000, 15000, 20000, 25000, 30000, 35000, 40000, 50000, 75000, 100000, 150000, 200000, 250000, 300000, 500000, 750000, 1000000, 1500000, 2000000, 3000000], [50000, 75000, 100000, 125000, 150000, 175000, 200000, 250000, 300000, 400000, 500000, 600000, 750000, 900000, 1000000, 1250000, 1500000, 2000000, 2500000, 5000000] ] } af = ActuarialFrame(data) # Calculate 95th percentile for risk assessment var_95 = af.quantile(0.95) print("95% VaR by product:") print(var_95) ``` ```text 95% VaR by product: shape: (1, 2) ┌────────────┬──────────────────────────────────┐ │ product ┆ claim_amounts │ │ --- ┆ --- │ │ str ┆ list[f64] │ ╞════════════╪══════════════════════════════════╡ │ null ┆ [2000000.0, 2500000.0] │ └────────────┴──────────────────────────────────┘ ``` ### `select(*exprs, **named_exprs)` Select columns from the DataFrame. Accepts positional expressions (column names, proxies, or expressions) and keyword arguments for renamed/new expressions. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `*exprs` | `IntoExprColumn` | Columns or expressions to select. | `()` | | `**named_exprs` | `IntoExprColumn` | Expressions to select with specific output names. | `{}` | Returns: | Type | Description | | --- | --- | | `Self` | The modified ActuarialFrame. | ### `show_query_plan(enabled=True)` Enable or disable query plan logging (basic implementation). ### `std(ddof=1)` Calculate standard deviation across all numeric columns. Returns a single-row frame containing the standard deviation for each numeric column. Essential for risk assessment, volatility analysis, and confidence interval calculations in actuarial modeling. When to use - **Risk Assessment:** Measure volatility in claim amounts, premium variations, or mortality experience for pricing and reserving. - **Experience Monitoring:** Quantify variability in lapse rates, expense ratios, or benefit utilization for assumption setting. - **Confidence Intervals:** Calculate standard errors for mortality estimates, reserve factors, or pricing assumptions. - **Portfolio Analysis:** Assess homogeneity of risk groups by comparing standard deviations across segments. 
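`select()` above accepts both positional column names and keyword expressions but has no example of its own. A minimal sketch, assuming column proxies support Polars-style arithmetic; the column names are illustrative:

```python
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["P001", "P002"],
    "age": [25, 45],
    "annual_premium": [1200.0, 6000.0],
}
af = ActuarialFrame(data)

# Positional column name plus a keyword expression with its own output name
selected = af.select(
    "policy_id",
    monthly_premium=af["annual_premium"] / 12,
)
print(selected.collect())
```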
##### Parameters

ddof : int, default 1
    Delta degrees of freedom. The divisor is N - ddof.

##### Returns

pl.DataFrame
    A frame with one row containing standard deviations for numeric columns.

##### Examples

**Scalar Example: Premium Volatility Analysis**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["P001", "P002", "P003", "P004", "P005"],
    "age_band": ["25-35", "25-35", "36-45", "36-45", "46-55"],
    "annual_premium": [1200, 1350, 3500, 3200, 8500],
    "sum_assured": [100000, 150000, 350000, 300000, 500000],
}
af = ActuarialFrame(data)

std_values = af.std()
print(std_values)
print("Premium volatility:", std_values["annual_premium"])
```

```text
shape: (1, 2)
┌────────────────┬─────────────┐
│ annual_premium ┆ sum_assured │
│ ---            ┆ ---         │
│ f64            ┆ f64         │
╞════════════════╪═════════════╡
│ 2957.6         ┆ 160468.1    │
└────────────────┴─────────────┘
Premium volatility: 2957.6
```

**Vector Example: Monthly Claims Volatility**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "product": ["Term Life", "Whole Life"],
    "monthly_claims": [
        [0, 1000, 500, 2000, 0, 3000, 1500, 0, 2500, 1000, 0, 4000],
        [5000, 6000, 4500, 7000, 5500, 8000, 6500, 5000, 7500, 6000, 9000, 10000]
    ],
    "monthly_premiums": [
        [50000, 50000, 52000, 51000, 50000, 49000, 50000, 51000, 50000, 50000, 51000, 50000],
        [120000, 125000, 122000, 128000, 124000, 130000, 126000, 123000, 127000, 125000, 129000, 132000]
    ]
}
af = ActuarialFrame(data)

# Calculate standard deviation for risk assessment
std_values = af.std()
print(std_values)
print("Term Life claims volatility:", round(std_values["monthly_claims"][0], 2))
print("Whole Life claims volatility:", round(std_values["monthly_claims"][1], 2))
```

```text
shape: (1, 3)
┌─────────┬────────────────────┬──────────────────┐
│ product ┆ monthly_claims     ┆ monthly_premiums │
│ ---     ┆ ---                ┆ ---              │
│ str     ┆ list[f64]          ┆ list[f64]        │
╞═════════╪════════════════════╪══════════════════╡
│ null    ┆ [1339.24, 1696.7]  ┆ [778.5, 3476.11] │
└─────────┴────────────────────┴──────────────────┘
Term Life claims volatility: 1339.24
Whole Life claims volatility: 1696.7
```

### `sum()`

Calculate sum totals across all numeric columns.

Returns a single-row frame containing the sum total for each numeric column. Critical for calculating portfolio totals, aggregate exposures, and overall metrics in actuarial reporting.

When to use

- **Portfolio Totals:** Calculate total sum assured, total premiums collected, or total claims paid for financial reporting.
- **Exposure Analysis:** Sum total lives covered, total benefits, or total risk amounts for reinsurance and capital calculations.
- **Revenue Reporting:** Aggregate premium income, fee revenue, or investment income across product lines or time periods.
- **Claims Analysis:** Total claim counts, amounts paid, or reserves across different claim types or cohorts.

##### Returns

pl.DataFrame
    A frame with one row containing sum totals for numeric columns.
##### Examples **Scalar Example: Portfolio Totals** ```python from gaspatchio_core import ActuarialFrame data = { "product": ["Term", "Whole Life", "Universal", "Term", "Endowment"], "policies_inforce": [1250, 890, 445, 2100, 325], "annual_premium": [1500000, 3200000, 2100000, 2800000, 1900000], "sum_assured": [125000000, 89000000, 67000000, 315000000, 48000000], } af = ActuarialFrame(data) sum_values = af.sum() print(sum_values) print("Total policies:", sum_values["policies_inforce"]) print("Total premium:", sum_values["annual_premium"]) print("Total exposure:", sum_values["sum_assured"]) ``` ```text shape: (1, 3) ┌──────────────────┬────────────────┬─────────────┐ │ policies_inforce ┆ annual_premium ┆ sum_assured │ │ --- ┆ --- ┆ --- │ │ i64 ┆ i64 ┆ i64 │ ╞══════════════════╪════════════════╪═════════════╡ │ 5010 ┆ 11500000 ┆ 644000000 │ └──────────────────┴────────────────┴─────────────┘ Total policies: 5010 Total premium: 11500000 Total exposure: 644000000 ``` **Vector Example: Monthly Totals** ```python from gaspatchio_core import ActuarialFrame data = { "branch": ["North", "South"], "monthly_new_business": [ [120, 135, 110, 145, 130, 125, 140, 155, 135, 140, 130, 160], [95, 100, 90, 105, 110, 95, 100, 115, 105, 100, 95, 120] ], "monthly_premium": [ [180000, 202500, 165000, 217500, 195000, 187500, 210000, 232500, 202500, 210000, 195000, 240000], [142500, 150000, 135000, 157500, 165000, 142500, 150000, 172500, 157500, 150000, 142500, 180000] ] } af = ActuarialFrame(data) # Get total new business and premiums sum_values = af.sum() print(sum_values) ``` ```text shape: (1, 2) ┌───────────────────────────────────────┬───────────────────────────────────────┐ │ monthly_new_business ┆ monthly_premium │ │ --- ┆ --- │ │ list[i64] ┆ list[i64] │ ╞═══════════════════════════════════════╪═══════════════════════════════════════╡ │ [215, 235, 200, 250, … 240, 225, 280] ┆ [322500, 352500, 300000, … 420000] │ └───────────────────────────────────────┴───────────────────────────────────────┘ ``` ### `trace(func)` Decorator to capture operations within a function call in optimize mode. ### `var(ddof=1)` Calculate variance across all numeric columns. Returns a single-row frame containing the variance for each numeric column. Used for risk metrics, ANOVA calculations, and statistical modeling in actuarial applications. When to use - **Risk Metrics:** Calculate variance in loss ratios, combined ratios, or expense ratios for enterprise risk management. - **Statistical Testing:** Perform ANOVA on mortality rates, lapse rates, or claim frequencies across different cohorts. - **Credibility Theory:** Calculate variance components for Bühlmann credibility factors in experience rating. - **Asset-Liability Modeling:** Measure variance in investment returns, liability cash flows, or surplus positions. ##### Parameters ddof : int, default 1 Delta degrees of freedom. The divisor is N - ddof. ##### Returns pl.DataFrame A frame with one row containing variances for numeric columns. 
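The `ddof` parameter shared by `std()` and `var()` controls the divisor N - ddof. A minimal sketch contrasting the sample default (ddof=1) with the population setting (ddof=0), using illustrative values:

```python
from gaspatchio_core import ActuarialFrame

data = {"claims_count": [45, 52, 38, 61, 43, 55]}
af = ActuarialFrame(data)

sample_var = af.var()            # ddof=1: divisor N - 1 (sample variance)
population_var = af.var(ddof=0)  # ddof=0: divisor N (population variance)

# The sum of squared deviations is 362, so the two divisors give 72.4 and ~60.33
print("Sample variance:", sample_var["claims_count"])
print("Population variance:", population_var["claims_count"])
```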
##### Examples

**Scalar Example: Claims Variance Analysis**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "month": [1, 2, 3, 4, 5, 6],
    "claims_count": [45, 52, 38, 61, 43, 55],
    "claims_amount": [125000, 145000, 95000, 185000, 120000, 165000],
}
af = ActuarialFrame(data)

var_values = af.var()
print(var_values)
print("Claims count variance:", var_values["claims_count"])
print("Claims amount variance:", var_values["claims_amount"])
```

```text
shape: (1, 3)
┌───────┬──────────────┬───────────────┐
│ month ┆ claims_count ┆ claims_amount │
│ ---   ┆ ---          ┆ ---           │
│ f64   ┆ f64          ┆ f64           │
╞═══════╪══════════════╪═══════════════╡
│ 3.5   ┆ 72.4         ┆ 1.0642e9      │
└───────┴──────────────┴───────────────┘
Claims count variance: 72.4
Claims amount variance: 1064166666.7
```

**Vector Example: Experience Variance Components**

```python
from gaspatchio_core import ActuarialFrame

data = {
    "region": ["North", "South"],
    "quarterly_lapse_rates": [
        [0.025, 0.028, 0.022, 0.026],
        [0.031, 0.029, 0.033, 0.030]
    ],
    "quarterly_mortality_rates": [
        [0.0010, 0.0011, 0.0009, 0.0010],
        [0.0012, 0.0013, 0.0011, 0.0014]
    ]
}
af = ActuarialFrame(data)

# Calculate variance for credibility analysis
var_values = af.var()
print(var_values)
print("North region lapse variance:", var_values["quarterly_lapse_rates"][0])
print("South region lapse variance:", var_values["quarterly_lapse_rates"][1])
```

```text
shape: (1, 3)
┌────────┬───────────────────────┬──────────────────────────────┐
│ region ┆ quarterly_lapse_rates ┆ quarterly_mortality_rates    │
│ ---    ┆ ---                   ┆ ---                          │
│ str    ┆ list[f64]             ┆ list[f64]                    │
╞════════╪═══════════════════════╪══════════════════════════════╡
│ null   ┆ [0.000006, 0.000003]  ┆ [0.0000000067, 0.0000000167] │
└────────┴───────────────────────┴──────────────────────────────┘
North region lapse variance: 0.000006
South region lapse variance: 0.000003
```

### `with_columns(*exprs)`

Add columns to the DataFrame.

## `gaspatchio_core.column.namespaces.dt_proxy.DtNamespaceProxy`

A proxy for Polars datetime (dt) namespace operations, enabling type-hinting and IDE intellisense for `ActuarialFrame` datetime manipulations.

This proxy intercepts calls to datetime methods, retrieves the underlying Polars expression from its parent proxy (either a `ColumnProxy` or `ExpressionProxy`), applies the datetime operation, and then wraps the resulting Polars expression back into an `ExpressionProxy`.

### `__getattr__(name)`

Dynamically handle any other methods available on Polars' dt namespace.

This provides a fallback for dt methods not explicitly defined on this proxy. It attempts to call the method via `_call_dt_method`.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `name` | `str` | The name of the dt method to access. | *required* |

Returns:

| Type | Description |
| --- | --- |
| `Callable[..., 'ExpressionProxy']` | A callable that, when invoked, will execute the corresponding |
| `Callable[..., 'ExpressionProxy']` | Polars dt method and return an ExpressionProxy. |

Raises:

| Type | Description |
| --- | --- |
| `AttributeError` | If the method does not exist on the Polars dt namespace (raised by `_call_dt_method` if the underlying Polars call fails). |

### `__init__(parent_proxy, parent_af)`

Initialize the DtNamespaceProxy.

This constructor is typically not called directly by users. It's used internally when accessing the `.dt` attribute of an `ActuarialFrame` column or expression proxy (e.g., `af["my_date_col"].dt`).
Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `parent_proxy` | `'ParentProxyType'` | The parent proxy (ColumnProxy or ExpressionProxy) from which this dt namespace is accessed. | *required* | | `parent_af` | `Optional['ActuarialFrame']` | The parent ActuarialFrame, if available, for context. | *required* | ### `day()` Extract the day number of the month (1-31) from a date/datetime expression. This function isolates the day component from a date or datetime, returning it as an integer (e.g., 15 for the 15th of the month). It works for both individual dates and lists of dates. When to use Extracting the day of the month can be useful in actuarial contexts for: * **Specific Date Checks:** Identifying events occurring on particular days (e.g., end-of-month processing). * **Intra-month Analysis:** Analyzing patterns within a month, though less common than month or year analysis. * **Data Validation:** Ensuring dates fall within expected day ranges for specific calculations. ##### Examples Scalar example:: ```python import polars as pl from gaspatchio_core import ActuarialFrame af = ActuarialFrame({"d": pl.Series(["2023-06-05", "2023-06-15"]).str.to_date()}) print(af.select(af["d"].dt.day().alias("day")).collect()) ``` ```text shape: (2, 1) ┌─────┐ │ day │ │ --- │ │ i8 │ ╞═════╡ │ 5 │ │ 15 │ └─────┘ ``` Vector (list) example – loss-event days:: ```python import datetime import polars as pl from gaspatchio_core import ActuarialFrame data = { "policy_id": ["E005", "F006"], "loss_event_dates": [ [datetime.date(2023, 6, 5), datetime.date(2023, 6, 15)], [datetime.date(2024, 2, 1), datetime.date(2024, 2, 29)], ], } af = ActuarialFrame(data).with_columns( pl.col("loss_event_dates").cast(pl.List(pl.Date)) ) days_expr = af["loss_event_dates"].dt.day() print(af.select("policy_id", days_expr.alias("event_days")).collect()) ``` ```text shape: (2, 2) ┌───────────┬────────────┐ │ literal ┆ event_days │ │ --- ┆ --- │ │ str ┆ list[i8] │ ╞═══════════╪════════════╡ │ policy_id ┆ [5, 15] │ │ policy_id ┆ [1, 29] │ └───────────┴────────────┘ ``` ### `month()` Extract the month number (1-12) from a date or datetime expression. This function allows you to isolate the month component from a series of dates or datetimes. The result is an integer representing the month, where January is 1 and December is 12. When to use In actuarial modeling, extracting the month from dates is crucial for various analyses. For instance, you might use this to: - Analyze seasonality in claims (e.g., identifying if certain types of claims are more frequent in specific months). - Group policies by their issue month for cohort analysis or to study underwriting patterns. - Determine premium due dates or benefit payment schedules that occur on a monthly basis. - Calculate fractional year components for financial calculations. 
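Because of the `__getattr__` fallback described above, dt methods that are not explicitly wrapped on the proxy should still dispatch to Polars. A minimal sketch, assuming Polars' `dt.quarter()` is reachable through that fallback:

```python
import polars as pl
from gaspatchio_core import ActuarialFrame

af = ActuarialFrame(
    {"valuation_date": pl.Series(["2023-03-31", "2023-11-30"]).str.to_date()}
)

# quarter() is not defined on the proxy itself; __getattr__ forwards the call
# to the underlying Polars dt namespace.
quarters = af.select(af["valuation_date"].dt.quarter().alias("quarter"))
print(quarters.collect())
```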
##### Examples Scalar example:: ```python import polars as pl from gaspatchio_core import ActuarialFrame af = ActuarialFrame({"d": pl.Series(["2022-01-01", "2022-02-01", "2022-03-01"]).str.to_date("%Y-%m-%d")}) print(af.select(af["d"].dt.month().alias("m")).collect()) ``` ```text shape: (3, 1) ┌─────┐ │ m │ │ --- │ │ i8 │ ╞═════╡ │ 1 │ │ 2 │ │ 3 │ └─────┘ ``` Vector (list) example – claim-lodgement months:: ```python import datetime import polars as pl from gaspatchio_core import ActuarialFrame data = { "policy_id": ["C003", "D004"], "claim_lodgement_dates": [ [datetime.date(2022, 3, 10), datetime.date(2022, 4, 5)], [datetime.date(2023, 1, 20), datetime.date(2023, 11, 30)], ], } af = ActuarialFrame(data).with_columns( pl.col("claim_lodgement_dates").cast(pl.List(pl.Date)) ) months_expr = af["claim_lodgement_dates"].dt.month() print(af.select(pl.col("policy_id"), months_expr.alias("lodgement_months")).collect()) ``` ```text shape: (2, 2) ┌───────────┬──────────────────┐ │ policy_id ┆ lodgement_months │ │ --- ┆ --- │ │ str ┆ list[i8] │ ╞═══════════╪══════════════════╡ │ C003 ┆ [3, 4] │ │ D004 ┆ [1, 11] │ └───────────┴──────────────────┘ ``` ### `year()` Extract the year from the underlying datetime expression. This function isolates the year component from a date or datetime, returning it as an integer (e.g., 2023). It is applicable to both single date values and lists of dates within your `ActuarialFrame`. When to use Extracting the year is fundamental in actuarial analysis for: * **Valuation and Reporting:** Determining the calendar year for financial reporting or regulatory submissions. * **Experience Studies:** Grouping data by calendar year of event (e.g., year of claim, year of lapse) to analyze trends. * **Cohort Analysis:** Defining cohorts based on the year of policy issue or birth year. * **Projection Models:** Calculating durations or projecting cash flows based on calendar years. ##### Examples Scalar example (single-date column):: ```python import polars as pl from gaspatchio_core import ActuarialFrame data = { "dates": pl.Series(["2020-01-15", "2021-07-20"]).str.to_date(format="%Y-%m-%d") } af = ActuarialFrame(data) year_expr = af["dates"].dt.year() print(af.select(year_expr.alias("year")).collect()) ``` ```text shape: (2, 1) ┌──────┐ │ year │ │ --- │ │ i32 │ ╞══════╡ │ 2020 │ │ 2021 │ └──────┘ ``` Vector example (list-of-dates per policy):: ```python import datetime import polars as pl from gaspatchio_core import ActuarialFrame data_vec = { "policy_id": ["A001", "B002"], "policy_event_dates": [ [datetime.date(2019, 12, 1), datetime.date(2020, 1, 20)], [datetime.date(2021, 5, 10), datetime.date(2021, 8, 15), datetime.date(2022, 2, 25)], ], } af_vec = ActuarialFrame(data_vec) af_vec = af_vec.with_columns(pl.col("policy_event_dates").cast(pl.List(pl.Date))) years_expr = af_vec["policy_event_dates"].dt.year() print(af_vec.select(pl.col("policy_id"), years_expr.alias("event_years")).collect()) ``` ```text shape: (2, 2) ┌───────────┬────────────────────┐ │ policy_id ┆ event_years │ │ --- ┆ --- │ │ str ┆ list[i32] │ ╞═══════════╪════════════════════╡ │ A001 ┆ [2019, 2020] │ │ B002 ┆ [2021, 2021, 2022] │ └───────────┴────────────────────┘ ``` ## `gaspatchio_core.accessors.excel.ExcelColumnAccessor` Bases: `BaseColumnAccessor` Provides Excel-related methods applicable to columns or expressions. Accessed via `.excel` on an ActuarialFrame column or expression proxy, e.g., `af["my_excel_col"].excel`. ### `__init__(proxy)` Initializes the accessor with the parent proxy. 
Internal initialization method for the Excel column accessor.

### `from_excel_serial(epoch='1900')`

Converts Excel serial numbers (integers or floats) to Polars Date. Follows logic similar to openpyxl for compatibility.

This method handles Excel's date serialization system, including the notorious Excel 1900 leap year bug where Excel incorrectly treats 1900 as a leap year.

When to use

- **Excel File Import:** When importing Excel files that contain date columns stored as serial numbers rather than proper date values.
- **Legacy Data Processing:** When working with older Excel files or systems that export dates as numeric serial values.
- **Cross-Platform Compatibility:** When handling Excel files that may have been created on different platforms (Windows vs Mac) with different epoch systems.
- **Data Validation:** When you need to convert and validate date serial numbers from external Excel-based data sources.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| `epoch` | `str` | The epoch system used by Excel ('1900' or '1904'). Defaults to '1900'. 1900 Epoch (WINDOWS_1900_EPOCH = 1899-12-30): Serial 1 is 1900-01-01. Excel's serial 60 (phantom 1900-02-29) is mapped to 1900-03-01. Serials > 60 are adjusted by -1 day before adding to epoch. 1904 Epoch (MAC_1904_EPOCH = 1904-01-01): Serial 1 is 1904-01-01. Days to add from epoch are serial - 1. | `'1900'` |

Returns:

| Type | Description |
| --- | --- |
| `ExpressionProxy` | An ExpressionProxy representing the converted date column. |

Raises:

| Type | Description |
| --- | --- |
| `ValueError` | If an invalid epoch is provided. |

Examples:

```python
from gaspatchio_core import ActuarialFrame

# Excel serial numbers for some dates
data = {
    "policy_id": ["P001", "P002", "P003"],
    "excel_date_serial": [44197, 44562, 44927],  # Excel serial numbers
}
af = ActuarialFrame(data)

# Convert Excel serial numbers to proper dates
af_with_dates = af.with_columns(
    actual_date=af["excel_date_serial"].excel.from_excel_serial(epoch="1900")
)
print(af_with_dates.collect())
```

```text
shape: (3, 3)
┌───────────┬───────────────────┬─────────────┐
│ policy_id ┆ excel_date_serial ┆ actual_date │
│ ---       ┆ ---               ┆ ---         │
│ str       ┆ i64               ┆ date        │
╞═══════════╪═══════════════════╪═════════════╡
│ P001      ┆ 44197             ┆ 2021-01-01  │
│ P002      ┆ 44562             ┆ 2022-01-01  │
│ P003      ┆ 44927             ┆ 2023-01-01  │
└───────────┴───────────────────┴─────────────┘
```

### `yearfrac(end_date_expr, basis='act/act')`

Calculate the year fraction between two dates, similar to Excel's YEARFRAC.

This function computes the fraction of a year represented by the number of whole days between a start date (the column/expression this accessor is on) and an end date. It uses a specified day count basis. The function can operate on individual dates (scalars or columns) and also handles scenarios where one of the date inputs is a list of dates within a column.

When to use

- **Premium Proration**: Calculate the portion of an annual premium that corresponds to a partial policy term, for example, if a policy starts or ends mid-year.
- **Exposure Calculation**: Determine fractional exposure periods for reserving or IBNR (Incurred But Not Reported) calculations, especially when dealing with policies that are not in force for a full year.
- **Investment Analysis**: Compute fractional year periods for accrued interest calculations or for annualizing returns on investments held for parts of a year.
- **Performance Metrics**: Analyze time-based metrics such as time-to-claim or duration of an event, expressed as a fraction of a year.

##### Parameters

end_date_expr : IntoExprColumn
    An expression or column representing the end dates. Can be a scalar date, a column of dates, or a column of `List[Date]` if the start date is a scalar/column of dates (and vice-versa).
basis : int or str, optional
    The day count basis to use. Can be an integer (0-4) or a string name. Defaults to "act/act" (which is basis 1).

```text
Supported bases:
- `0` or `'us_nasd_30_360'` (30/360 US NASD) - US (NASD) 30/360 convention
- `1` or `'act/act'` (Actual/Actual) - Simplified version (uses 365.25 days)
- `2` or `'actual_360'` (Actual/360) - Not Implemented
- `3` or `'actual_365'` (Actual/365 fixed) - Not Implemented
- `4` or `'european_30_360'` (30/360 European) - Not Implemented
```

##### Returns

ExpressionProxy
    An expression representing the calculated year fraction as a `Float64`. If one of the inputs was a `List[Date]`, the output will be a `List[Float64]`.

##### Raises

NotImplementedError
    If a `basis` other than the currently supported basis values is specified, or if both start and end date expressions resolve to `List[Date]` columns (which requires a more complex UDF or explode/aggregate pattern).
TypeError
    If the underlying proxy for the start date is not a `ColumnProxy` or `ExpressionProxy`.
RuntimeError
    If the operation requires an `ActuarialFrame` context that is not available.
ValueError
    If an invalid basis is provided.

##### Examples

**Calculating Policy Term as Year Fraction (Scalar/Column Operations)**

Scenario: You have policy start and end dates and want to calculate the policy term in years.

```python
import datetime
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["P001", "P002", "P003"],
    "start_date": [
        datetime.date(2020, 1, 1),
        datetime.date(2021, 6, 15),
        datetime.date(2022, 3, 1),
    ],
    "end_date": [
        datetime.date(2021, 1, 1),
        datetime.date(2022, 6, 15),
        datetime.date(2022, 9, 1),  # Partial year
    ],
}
af = ActuarialFrame(data)

# Calculate year fraction using 'act/act' (simplified)
af_with_term = af.with_columns(
    term_years=af["start_date"].excel.yearfrac(af["end_date"], basis="act/act")
)
print(af_with_term.collect())
```

```text
shape: (3, 4)
┌───────────┬────────────┬────────────┬────────────┐
│ policy_id ┆ start_date ┆ end_date   ┆ term_years │
│ ---       ┆ ---        ┆ ---        ┆ ---        │
│ str       ┆ date       ┆ date       ┆ f64        │
╞═══════════╪════════════╪════════════╪════════════╡
│ P001      ┆ 2020-01-01 ┆ 2021-01-01 ┆ 1.002053   │
│ P002      ┆ 2021-06-15 ┆ 2022-06-15 ┆ 0.999316   │
│ P003      ┆ 2022-03-01 ┆ 2022-09-01 ┆ 0.503765   │
└───────────┴────────────┴────────────┴────────────┘
```

**Fractional Exposure for Multiple Claim Events from a Single Policy Start (List Operation)**

Scenario: A policy has a single start date, but multiple claim event dates. Calculate the time from policy start to each claim event as a year fraction.

```python
import datetime
import polars as pl
from gaspatchio_core import ActuarialFrame

data = {
    "policy_id": ["PolicyA", "PolicyB"],
    "policy_start_date": [datetime.date(2020, 1, 1), datetime.date(2021, 1, 1)],
    "claim_event_dates": [
        [datetime.date(2020, 7, 1), datetime.date(2021, 3, 15)],  # Events for PolicyA
        [datetime.date(2021, 2, 1)],  # Event for PolicyB
    ],
}

# Ensure claim_event_dates is typed as List[Date]
af = ActuarialFrame(data, schema_overrides={"claim_event_dates": pl.List(pl.Date)})

af_with_frac = af.with_columns(
    time_to_event_years=af["policy_start_date"].excel.yearfrac(af["claim_event_dates"])
)
print(af_with_frac.collect())
```

```text
shape: (2, 4)
┌───────────┬───────────────────┬──────────────────────────┬──────────────────────┐
│ policy_id ┆ policy_start_date ┆ claim_event_dates        ┆ time_to_event_years  │
│ ---       ┆ ---               ┆ ---                      ┆ ---                  │
│ str       ┆ date              ┆ list[date]               ┆ list[f64]            │
╞═══════════╪═══════════════════╪══════════════════════════╪══════════════════════╡
│ PolicyA   ┆ 2020-01-01        ┆ [2020-07-01, 2021-03-15] ┆ [0.50016, 1.200046]  │
│ PolicyB   ┆ 2021-01-01        ┆ [2021-02-01]             ┆ [0.084873]           │
└───────────┴───────────────────┴──────────────────────────┴──────────────────────┘
```

## `gaspatchio_core.column.namespaces.string_proxy.StringNamespaceProxy`

A proxy for Polars expression string (str) namespace operations.

This proxy is typically accessed via the `.str` attribute of a `ColumnProxy` or `ExpressionProxy` that refers to a string or list-of-strings column within an `ActuarialFrame`. It allows for intuitive, Polars-like string manipulations while remaining integrated with the ActuarialFrame ecosystem.

It automatically handles shimming for `List[String]` columns, applying string methods element-wise to the contents of the lists.

Examples:

**Scalar Example: Uppercasing policyholder names**

This demonstrates applying a string operation to a scalar string column. We'll convert policyholder names to uppercase.

```python
from gaspatchio_core.frame.base import ActuarialFrame

data_for_class_doctest = {
    "policy_holder_name": ["John Doe", "Jane Smith", "Robert Jones"],
    "policy_type_codes": [["TERM", "WL"], ["UL"], ["TERM", "CI"]]
}
af_scalar = ActuarialFrame(data_for_class_doctest)
af_upper_names = af_scalar.select(
    af_scalar["policy_holder_name"].str.to_uppercase().alias("upper_name")
)
print(af_upper_names.collect())
```

```text
shape: (3, 1)
┌──────────────┐
│ upper_name   │
│ ---          │
│ str          │
╞══════════════╡
│ JOHN DOE     │
│ JANE SMITH   │
│ ROBERT JONES │
└──────────────┘
```

**Vector (List Shimming) Example: Lowercasing policy type codes**

This demonstrates applying a string operation to a list-of-strings column. We'll convert lists of policy type codes to lowercase.
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_for_class_doctest = { "policy_holder_name": ["John Doe", "Jane Smith", "Robert Jones"], "policy_type_codes": [["TERM", "WL"], ["UL"], ["TERM", "CI"]] } af_vector = ActuarialFrame(data_for_class_doctest).with_columns( pl.col("policy_type_codes").cast(pl.List(pl.String)) ) af_lower_codes = af_vector.select( af_vector["policy_type_codes"].str.to_lowercase().alias("lower_codes") ) print(af_lower_codes.collect()) ``` ```text shape: (3, 1) ┌────────────────┐ │ lower_codes │ │ --- │ │ list[str] │ ╞════════════════╡ │ ["term", "wl"] │ │ ["ul"] │ │ ["term", "ci"] │ └────────────────┘ ``` ### `__getattr__(name)` Dynamically handle calls to Polars string methods not explicitly defined. This allows the proxy to support any method available on Polars' str namespace without needing to define each one explicitly on this proxy class. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `name` | `str` | The name of the string method to call. | *required* | Returns: | Type | Description | | --- | --- | | `Callable[..., 'ExpressionProxy']` | A callable that, when invoked, will execute the corresponding Polars | | `Callable[..., 'ExpressionProxy']` | string method via \_call_string_method. | Raises: | Type | Description | | --- | --- | | `AttributeError` | If the method does not exist on the Polars string namespace (this is typically raised by \_call_string_method), or if a dunder method (e.g. __repr__) is accessed that isn't defined. | ### `__init__(parent_proxy, parent_af)` Initialize the StringNamespaceProxy. This constructor is not typically called directly by users. Instances are created by the dispatch mechanism when accessing `.str` on a ColumnProxy or ExpressionProxy. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `parent_proxy` | `'ProxyType'` | The parent ColumnProxy or ExpressionProxy from which .str was accessed. | *required* | | `parent_af` | `Optional['ActuarialFrame']` | The parent ActuarialFrame, providing context such as the underlying DataFrame/LazyFrame and schema. | *required* | ### `contains(pattern, literal=False, strict=False)` Checks if strings in a column contain a specified pattern. This method searches for a pattern within string values, returning a boolean indicating if the pattern exists in each string. It's useful for filtering, data categorization, and identifying records with specific text patterns. When to use - Identify policies with specific riders or endorsements from description fields - Find claims that mention particular medical conditions or causes - Filter customer feedback containing specific keywords for risk analysis - Segment policyholders based on address information (e.g., rural vs urban) - Flag policies or claims with special handling notes (e.g., "legal review") - Screen underwriting notes for high-risk indicators Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `pattern` | `str | Expr` | The substring or regex pattern to search for. Can be a literal string (e.g., "RiderX") or a Polars expression (e.g., pl.col("other_column_with_patterns")). | *required* | | `literal` | `bool` | If True, pattern is treated as a literal string. If False (default), pattern is treated as a regex. | `False` | | `strict` | `bool` | If True and pattern is a Polars expression, an error is raised if pattern is not a string type. If False (default), pattern is cast to string if possible. 
| `False` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy containing a boolean Series indicating for each input string whether the pattern was found. If the input was List[String], the output will be List[bool]. | Examples: **Scalar Example: Identifying policies with an Accidental Death Benefit (ADB) rider** Imagine you have a dataset of policy descriptions and you want to flag all policies that include an "ADB" rider. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id": ["POL001", "POL002", "POL003", "POL004"], "description": [ "Term Life Plan with ADB rider", "Whole Life - Standard", "Universal Life, includes ADB rider and Accidental Death Benefit (ADB)", "Term Life, no Accidental Death Benefit rider" ] } af = ActuarialFrame(data) af_with_adb_rider = af.select( af["description"].str.contains("ADB rider", literal=True).alias("has_adb_rider") ) print(af_with_adb_rider.collect()) ``` ```text shape: (4, 1) ┌───────────────┐ │ has_adb_rider │ │ --- │ │ bool │ ╞═══════════════╡ │ true │ │ false │ │ true │ │ false │ └───────────────┘ ``` **Vector Example: Checking underwriter notes for high-risk keywords** Suppose each policy has a list of notes from underwriters. We want to check if any note for a given policy contains keywords like "medical history" or "hazardous occupation", which might indicate higher risk. ```python from gaspatchio_core.frame.base import ActuarialFrame uw_notes_data = { "policy_id": ["UW001", "UW002", "UW003"], "underwriter_notes": [ "Standard risk. Family history clear.", "Applicant works in construction. Reviewed medical history: smoker.", "No concerning notes. Possible hazardous occupation mentioned." ] } af_notes = ActuarialFrame(uw_notes_data) af_results = af_notes.select( af_notes["underwriter_notes"].str.contains("medical history").alias("mentions_medical_history"), af_notes["underwriter_notes"].str.contains("(?i)hazardous occupation").alias("mentions_hazardous_occupation"), ) print(af_results.collect()) ``` ```text shape: (3, 2) ┌──────────────────────────┬───────────────────────────────┐ │ mentions_medical_history ┆ mentions_hazardous_occupation │ │ --- ┆ --- │ │ bool ┆ bool │ ╞══════════════════════════╪═══════════════════════════════╡ │ false ┆ false │ │ true ┆ false │ │ false ┆ true │ └──────────────────────────┴───────────────────────────────┘ ``` **Using `contains` with a list of patterns (regex and literal)** Suppose we want to check for multiple keywords in underwriter notes using both literal and regex matching. ```python from gaspatchio_core.frame.base import ActuarialFrame uw_notes_data_multi = { # Renamed to avoid conflict "policy_id": ["UW001", "UW002", "UW003"], "underwriter_notes": [ "Standard risk. Family history clear.", "Applicant works in construction. Reviewed medical history: smoker.", "No concerning notes. Possible hazardous occupation mentioned." 
] } af_multi = ActuarialFrame(uw_notes_data_multi) af_multi_processed = af_multi.select( # Literal check af_multi["underwriter_notes"].str.contains("medical history", literal=True).alias("mentions_medical_history_literal"), # Regex check (case insensitive) af_multi["underwriter_notes"].str.contains(r"(?i)hazardous occupation").alias("mentions_hazardous_occupation_regex"), # Another Regex check (case insensitive) for medical history af_multi["underwriter_notes"].str.contains(r"(?i)medical history").alias("mentions_medical_history_regex") ) print(af_multi_processed.collect()) ``` ```text shape: (3, 3) ┌──────────────────────────────────┬─────────────────────────────────────┬────────────────────────────────┐ │ mentions_medical_history_literal ┆ mentions_hazardous_occupation_regex ┆ mentions_medical_history_regex │ │ --- ┆ --- ┆ --- │ │ bool ┆ bool ┆ bool │ ╞══════════════════════════════════╪═════════════════════════════════════╪════════════════════════════════╡ │ false ┆ false ┆ false │ │ true ┆ false ┆ true │ │ false ┆ true ┆ false │ └──────────────────────────────────┴─────────────────────────────────────┴────────────────────────────────┘ ``` ### `ends_with(suffix)` Check if strings end with a specific substring. This method returns a boolean expression showing whether each string value ends with the provided suffix. For columns containing `List[String]`, the check is applied to every element within each list. When to use - Verify that policy identifiers end with region or product codes. - Flag claim or log entries that end with status markers like "OK" or "PENDING". - Validate strings against suffixes supplied in another column, such as checking payout account numbers. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `suffix` | `str | Expr` | The substring to test for at the end of each string. It can be a literal value or a Polars expression. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A boolean result indicating whether each string | | | `'ExpressionProxy'` | ends with suffix. For list columns, the result is a list of | | | `'ExpressionProxy'` | booleans. | Examples: **Scalar example – region codes** ```python from gaspatchio_core.frame.base import ActuarialFrame af = ActuarialFrame({ "policy_id": ["P100-US", "P101-CA", "P102-US", None, "P103-EU"] }) result = af.select( af["policy_id"].str.ends_with("-US").alias("is_us_policy") ) print(result.collect()) ``` ```text shape: (5, 1) ┌──────────────┐ │ is_us_policy │ │ --- │ │ bool │ ╞══════════════╡ │ true │ │ false │ │ true │ │ null │ │ false │ └──────────────┘ ``` **Vector (list) example – status flags** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl logs = { "policy_id": ["A100", "A101"], "update_notes_str": [ "Issued OK,Review PENDING", "None,Paid OK", ], } af_logs = ActuarialFrame(logs) af_logs = af_logs.with_columns( af_logs["update_notes_str"].str.split(",").alias("update_notes").map_elements( lambda x: [None if item == "None" else item for item in x], return_dtype=pl.List(pl.String) ) ) status_ok = af_logs.select( af_logs["update_notes"].str.ends_with("OK").alias("ends_with_ok") ) print(status_ok.collect()) ``` ```text shape: (2, 1) ┌───────────────┐ │ ends_with_ok │ │ --- │ │ list[bool] │ ╞═══════════════╡ │ [true, false] │ │ [null, true] │ └───────────────┘ ``` ### `extract(pattern, group_index=1)` Extract a capturing group from a regex pattern. 
This method returns the specified group from each string that matches `pattern`. It operates element-wise on list columns, making it ideal for pulling identifiers or amounts embedded in free-text fields. When to use - Retrieve policy or claim numbers from combined identifiers or descriptive text - Capture monetary amounts from claim notes for validation - Isolate classification codes embedded within longer strings Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `pattern` | `str` | The regex pattern with capturing groups. | *required* | | `group_index` | `int` | The 1-based index of the group to extract. | `1` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy containing the extracted group. | Examples: **Scalar Example: Extracting policy numbers from combined IDs** ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "full_id": ["POLICY-12345-AB", "CLAIM-67890-CD", "POLICY-ABCDE-FG"], } af = ActuarialFrame(data) af_extracted = af.select( af["full_id"].str.extract(r"POLICY-([A-Z0-9]+)-.*", group_index=1).alias("policy_num") ) print(af_extracted.collect()) ``` ```text shape: (3, 1) ┌────────────┐ │ policy_num │ │ --- │ │ str │ ╞════════════╡ │ 12345 │ │ null │ │ ABCDE │ └────────────┘ ``` **Vector Example: Extracting amounts from transaction descriptions** ```python from gaspatchio_core.frame.base import ActuarialFrame data_list = { "policy_id": ["P001"], "transactions": ["Premium paid: $100.50, Fee: $10.00, Adjustment: $-5.25"], } af_list = ActuarialFrame(data_list) af_list = af_list.with_columns( af_list["transactions"].str.split(", ").alias("transactions") ) af_list_extracted = af_list.select( af_list["transactions"].str.extract(r"\$?([-+]?[0-9]+\.[0-9]{2})", group_index=1).alias("amounts_str") ) print(af_list_extracted.collect()) ``` ```text shape: (1, 1) ┌──────────────────────────────┐ │ amounts_str │ │ --- │ │ list[str] │ ╞══════════════════════════════╡ │ ["100.50", "10.00", "-5.25"] │ └──────────────────────────────┘ ``` ### `extract_all(pattern)` Extract all non-overlapping regex matches as a list. Mirrors Polars' `Expr.str.extract_all`. For `List[String]` columns, the extraction is applied element-wise. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `pattern` | `str` | The regex pattern to search for. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy containing a list of all matches for each row. | When to use - Collect every monetary amount mentioned in claim notes for validation against the claim ledger. - Extract all policy reference numbers from free-text fields when reconciling cross-policy transactions. - Gather every ICD code from a medical report to determine claim triggers. - Capture all state abbreviations from an address string when assessing geographical concentration risk. 
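The first bullet above (validating extracted amounts against a ledger) usually needs the matches as numbers rather than strings. A minimal sketch that matches only the numeric part and then casts the resulting list column, assuming the expression proxy exposes `cast` in the same way the column proxies do elsewhere in these docs:

```python
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame

data = {
    "claim_id": ["C1", "C2"],
    "details": ["Paid $150.00 and $25.50 fee", "Refunded $10.00"],
}
af = ActuarialFrame(data)

# Match the numeric part only so the matches can be cast to floats
amounts = af.select(
    af["details"]
    .str.extract_all(r"[0-9]+\.[0-9]{2}")
    .cast(pl.List(pl.Float64))
    .alias("amounts")
)
print(amounts.collect())
```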
Examples: **Scalar example – Extracting amounts from claim descriptions** ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "claim_id": ["C1", "C2"], "details": ["Paid $150.00 and $25.50 fee", "Refunded $10.00"] } af = ActuarialFrame(data) af_amounts = af.select( af["details"].str.extract_all(r"\$([0-9]+\.[0-9]{2})").alias("amounts") ) print(af_amounts.collect()) ``` ```text shape: (2, 1) ┌───────────────────────┐ │ amounts │ │ --- │ │ list[str] │ ╞═══════════════════════╡ │ ["$150.00", "$25.50"] │ │ ["$10.00"] │ └───────────────────────┘ ``` **Vector example – Extracting policy numbers from lists of notes** ```python from gaspatchio_core.frame.base import ActuarialFrame notes = { "claim_id": ["C1"], "notes": ["Policy 12345 reported, Adjustment for policy 98765"] } af = ActuarialFrame(notes) af_list = af.with_columns( af["notes"].str.split(", ").alias("notes") ) result = af_list.select( af_list["notes"].str.extract_all(r"[0-9]+").alias("policy_numbers") ) print(result.collect()) ``` ```text shape: (1, 1) ┌────────────────────────┐ │ policy_numbers │ │ --- │ │ list[list[str]] │ ╞════════════════════════╡ │ [["12345"], ["98765"]] │ └────────────────────────┘ ``` ### `len_bytes()` Get the number of bytes in each string. Calculates the byte length of each string in a column. This is particularly useful when dealing with multi-byte character encodings (like UTF-8) where the number of characters may not equal the number of bytes. When to use - **Data Storage Estimation:** Accurately estimating storage requirements for datasets containing text fields, especially with international character sets (e.g., policyholder names, addresses from various regions). - **System Integration Limits:** Ensuring that string data, when exported or sent to other systems, conforms to byte-length restrictions imposed by those systems (e.g., fixed-width file formats or database field constraints defined in bytes). - **Performance Considerations:** Recognizing that operations on strings with many multi-byte characters might be more resource-intensive. - **Encoding Issue Detection:** While not a direct detection method, unexpected byte lengths compared to character lengths might hint at encoding problems or the presence of unusual characters. Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with the byte count (as UInt32) for each string. If the input was List[String], the output will be List[UInt32]. | Examples: **Scalar Example: Byte length of UTF-8 encoded client names** Scenario: You have client names that may include characters from various languages, and you need to understand their storage size in bytes. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "client_id": ["C001", "C002", "C003", "C004"], "client_name": ["René", "沐宸", "Zoë", "John Doe"] # French, Chinese, German, English names } af = ActuarialFrame(data) af_byte_len = af.select( af["client_name"].str.len_bytes().alias("name_byte_length") ) print(af_byte_len.collect()) ``` ```text shape: (4, 1) ┌──────────────────┐ │ name_byte_length │ │ --- │ │ u32 │ ╞══════════════════╡ │ 5 │ │ 6 │ │ 4 │ │ 8 │ └──────────────────┘ ``` **Vector Example: Byte length of free-text comments in a list** Scenario: A policy record contains a list of comments, potentially with special characters or different languages. You need to find the byte length of each comment. 
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_list_comments = { "policy_id": ["P501", "P502"], "comments_list": [ ["Test € symbol", "Standard comment.", None], # Euro symbol is multi-byte ["Résumé", "日本語のコメント"] # French with accent, Japanese comment ] } af_comments = ActuarialFrame(data_list_comments) # Ensure the list column has the correct Polars type af_comments = af_comments.with_columns( af_comments["comments_list"].cast(pl.List(pl.String)) ) af_comment_byte_len = af_comments.select( af_comments["comments_list"].str.len_bytes().alias("comment_byte_lengths") ) print(af_comment_byte_len.collect()) ``` ```text shape: (2, 1) ┌──────────────────────────┐ │ comment_byte_lengths │ │ --- │ │ list[u32] │ ╞══════════════════════════╡ │ [13, 17, null] │ │ [7, 21] │ └──────────────────────────┘ ``` ### `len_chars()` Alias for `n_chars`. Get the number of characters in each string. Calculates the length of each string in a column, returning an integer representing the number of characters. This is an alias for `n_chars()`. When to use - **Data Validation:** Ensuring identifiers like policy numbers, social security numbers, or postal codes adhere to expected length constraints, helping to identify data entry errors. - **System Integration:** Verifying that string data, such as client names or addresses, does not exceed length limitations of downstream systems or databases. - **Feature Engineering:** Using the length of free-text fields (e.g., claim descriptions, underwriter notes) as a potential feature in predictive models, where length might correlate with complexity or severity. - **Data Quality Assessment:** Identifying outliers or anomalies in string lengths that might indicate corrupted or incomplete data. Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with the character count (as UInt32) for each string. If the input was List[String], the output will be List[UInt32]. | Examples: **Scalar Example: Validating policy number length** Scenario: You need to check if policy numbers in your dataset conform to an expected length, say 7 characters. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id_raw": ["POL1234", "POL567", "POL89012", None, "POL3456"], "premium": [100.0, 150.0, 200.0, 50.0, 120.0] } af = ActuarialFrame(data) # Calculate the length of each policy_id_raw af_len_check = af.select( af["policy_id_raw"].str.len_chars().alias("policy_id_length") ) print(af_len_check.collect()) ``` ```text shape: (5, 1) ┌──────────────────┐ │ policy_id_length │ │ --- │ │ u32 │ ╞══════════════════╡ │ 7 │ │ 6 │ │ 8 │ │ null │ │ 7 │ └──────────────────┘ ``` **Vector Example: Character count of claim notes** Scenario: Each policy may have a list of associated claim notes. You want to find the character length of each note to understand the verbosity or for display purposes. 
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_list = { "policy_id": ["P7001", "P7002"], "claim_notes_list": [ ["Short note.", "This is a much longer note regarding the claim details.", None], ["Urgent review needed!", "All clear."] ] } af_list_notes = ActuarialFrame(data_list) # Ensure the list column has the correct Polars type af_list_notes = af_list_notes.with_columns( af_list_notes["claim_notes_list"].cast(pl.List(pl.String)) ) af_notes_len = af_list_notes.select( af_list_notes["claim_notes_list"].str.len_chars().alias("note_char_lengths") ) print(af_notes_len.collect()) ``` ```text shape: (2, 1) ┌───────────────────────────┐ │ note_char_lengths │ │ --- │ │ list[u32] │ ╞═══════════════════════════╡ │ [11, 53, null] │ │ [20, 9] │ └───────────────────────────┘ ``` ### `ljust(width, fill_char=' ')` Left-align strings by padding on the right. Strings shorter than `width` are padded on the right with `fill_char`. When the column contains `List[String]` values, each element is padded individually. When to use - Formatting account or policy identifiers for fixed-width exports. - Preparing ledger extracts where text fields must be left-aligned. - Normalizing rider or sub-account codes stored as lists so they compare consistently. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `width` | `int` | The desired total length of the string after padding. | *required* | | `fill_char` | `str` | The character to pad with. Defaults to a space. | `' '` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with strings padded at the end. | Examples: **Scalar example – fixed-width account codes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data = {"account_code": ["A1", "B123", None, "C"]} af = ActuarialFrame(data) af_ljust = af.select( af["account_code"].str.ljust(6, "-").alias("ljust_code") ) print(af_ljust.collect()) ``` ```text shape: (4, 1) ┌────────────┐ │ ljust_code │ │ --- │ │ str │ ╞════════════╡ │ A1---- │ │ B123-- │ │ null │ │ C----- │ └────────────┘ ``` **Vector example – padding elements in a list column** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data_list = { "batch_id": ["X01"], "sub_codes": [["S1", "LONGCODE", "S23"]], } af_list = ActuarialFrame(data_list) af_list = af_list.with_columns( af_list["sub_codes"].cast(pl.List(pl.String)) ) af_list_ljust = af_list.select( af_list["sub_codes"].str.ljust(8, "X").alias("ljust_sub_codes") ) print(af_list_ljust.collect()) ``` ```text shape: (1, 1) ┌──────────────────────────────────────┐ │ ljust_sub_codes │ │ --- │ │ list[str] │ ╞══════════════════════════════════════╡ │ ["S1XXXXXX", "LONGCODE", "S23XXXXX"] │ └──────────────────────────────────────┘ ``` ### `n_chars()` Get the number of characters in each string. This function calculates the length of each string in a column, returning an integer representing the number of characters. It's a fundamental operation for understanding string data characteristics. When to use - **Data Quality Checks:** Identifying unexpectedly short or long strings that might indicate data entry errors or truncation (e.g., validating the length of policy numbers, postal codes, or identification numbers). 
- **Feature Engineering:** Creating new features based on string length for predictive models (e.g., the length of a claim description might correlate with claim complexity). - **Data Cleaning & Transformation:** Deciding on padding or truncation strategies if string fields need to conform to a fixed length for system integration or reporting. - **Understanding Free-Text Fields:** Analyzing the distribution of lengths in fields like underwriter notes or medical descriptions to gauge the amount of detail typically provided. - **Filtering or Segmenting Data:** Selecting records based on the length of a specific string field (e.g., finding all policyholder names shorter than 3 characters for review). Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with the character count (as UInt32) for each string. | Examples: **Scalar Example: Length of product names** To understand the typical length of product names in your portfolio, or to identify names that might be too long for certain display formats. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "product_code": ["L-TERM-10", "L-WL-P", "ANN-SDA"], "product_name": ["Term Life 10 Year", "Whole Life Par", "Single Deferred Annuity"] } af = ActuarialFrame(data) af_len = af.select( af["product_name"].str.n_chars().alias("name_length") ) print(af_len.collect()) ``` ```text shape: (3, 1) ┌─────────────┐ │ name_length │ │ --- │ │ u32 │ ╞═════════════╡ │ 17 │ │ 14 │ │ 23 │ └─────────────┘ ``` **Vector Example: Length of beneficiary names in a list** For policies with multiple beneficiaries, you might want to check the length of each beneficiary's name, perhaps to ensure it fits within system limits or for data validation. ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_list = { "policy_id": ["P001", "P002"], "beneficiaries": [["John A. Doe", "Jane B. Smith"], ["Robert King", None, "Alice Wonderland"]] } af_list_initial = ActuarialFrame(data_list) af_list = af_list_initial.with_columns( af_list_initial["beneficiaries"].cast(pl.List(pl.String)) ) af_bene_len = af_list.select( af_list["beneficiaries"].str.n_chars().alias("beneficiary_name_lengths") ) print(af_bene_len.collect()) ``` ```text shape: (2, 1) ┌──────────────────────────┐ │ beneficiary_name_lengths │ │ --- │ │ list[u32] │ ╞══════════════════════════╡ │ [11, 13] │ │ [11, null, 16] │ └──────────────────────────┘ ``` ### `pad_end(width, fill_char=' ')` Left-align strings by padding on the right. Strings shorter than `width` are padded on the right with `fill_char`. If the column is `List[String]` the padding is applied to each element of the list. When to use - Format policy numbers or claim identifiers for extracts that require fixed-width fields. - Pad abbreviations in list columns (such as rider codes) so that they line up cleanly in cross-system feeds. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `width` | `int` | The desired total length of the string after padding. | *required* | | `fill_char` | `str` | The character to pad with. Defaults to a space. | `' '` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with strings padded at the end. 
| Examples: **Scalar example – fixed-width policy codes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data = {"policy_code": ["L101", "L20", None]} af = ActuarialFrame(data) result = af.select( af["policy_code"].str.pad_end(6, "0").alias("fixed_length_code") ) print(result.collect()) ``` ```text shape: (3, 1) ┌───────────────────┐ │ fixed_length_code │ │ --- │ │ str │ ╞═══════════════════╡ │ L10100 │ │ L20000 │ │ null │ └───────────────────┘ ``` **Vector example – padding claim codes in a list** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data_list = {"batch_id": ["B200"], "claim_codes": [["A1", "XYZ", "C1234"]]} af_list = ActuarialFrame(data_list).with_columns( pl.col("claim_codes").cast(pl.List(pl.String)) ) result = af_list.select( af_list["claim_codes"].str.pad_end(6, "_").alias("aligned_codes") ) print(result.collect()) ``` ```text shape: (1, 1) ┌────────────────────────────────┐ │ aligned_codes │ │ --- │ │ list[str] │ ╞════════════════════════════════╡ │ ["A1____", "XYZ___", "C1234_"] │ └────────────────────────────────┘ ``` ### `pad_start(width, fill_char=' ')` Alias for `rjust`. Pads the start of strings (right-aligns content). Adds characters to the beginning of each string until it reaches the given width. This is handy when preparing fixed-width extracts or aligning numeric text fields in actuarial reports. When to use - Preparing policy identifiers for legacy mainframe interfaces that expect fixed-width fields. - Aligning premium or reserve amounts in textual summaries generated for regulators or management. - Standardizing rider codes stored in lists so that they can be compared consistently across policies. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `width` | `int` | The desired minimum length of the string. | *required* | | `fill_char` | `str` | The character to pad with. Defaults to a space. | `' '` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with strings padded at the start.
| Examples: **Scalar Example: Align premium amounts in a report** ```python # Test with pl.Config to ensure consistent display import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data = { "premium_str": ["1200.5", "85.75", None] } af = ActuarialFrame(data) result = af.select( af["premium_str"].str.pad_start(8, " ").alias("padded_premium") ) print(result.collect()) ``` ```text shape: (3, 1) ┌────────────────┐ │ padded_premium │ │ --- │ │ str │ ╞════════════════╡ │ 1200.5 │ │ 85.75 │ │ null │ └────────────────┘ ``` **Vector Example: Pad rider codes stored as a list** ```python # Test with pl.Config to ensure consistent display import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data_list = { "policy_id": ["P01"], "rider_codes": [["RID1", "LONGRID", "R2"]] } af_list = ActuarialFrame(data_list).with_columns( pl.col("rider_codes").cast(pl.List(pl.String)) ) result = af_list.select( af_list["rider_codes"].str.pad_start(8, "0").alias("padded_rider_codes") ) print(result.collect()) ``` ```text shape: (1, 1) ┌──────────────────────────────────────────┐ │ padded_rider_codes │ │ --- │ │ list[str] │ ╞══════════════════════════════════════════╡ │ ["0000RID1", "0LONGRID", "000000R2"] │ └──────────────────────────────────────────┘ ``` ### `remove_prefix(prefix)` Alias for `strip_prefix`. Remove a prefix from each string. The prefix is removed from the beginning of every string. Strings without that prefix remain unchanged. `List[String]` columns are processed element by element. When to use - **Standardizing vendor codes** before mapping them to your base product dictionary. - **Cleaning temporary policy identifiers** created during data migrations. - **Dropping country prefixes** from location codes when you need only the state or province. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `prefix` | `str | Expr` | The substring to remove. May be a literal string or an expression resolving to one. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | The expression with the prefix removed. 
| Examples: **Scalar example – clean temporary policy IDs** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id_raw": ["TMP-001", "TMP-002", "003", None], "processing_prefix": ["TMP-", "TMP-", "TMP-", "TMP-"], } with pl.Config(set_tbl_width_chars=100): af_fixed = ActuarialFrame(data) fixed = af_fixed.select( af_fixed["policy_id_raw"].str.remove_prefix("TMP-").alias("policy_id") ).collect() print(fixed) af_dynamic = ActuarialFrame(data) dynamic = af_dynamic.select( af_dynamic["policy_id_raw"].str.remove_prefix( af_dynamic["processing_prefix"] ).alias("policy_id") ).collect() print() print("Dynamic prefix removal:") print(dynamic) ``` ```text shape: (4, 1) ┌───────────┐ │ policy_id │ │ --- │ │ str │ ╞═══════════╡ │ 001 │ │ 002 │ │ 003 │ │ null │ └───────────┘ Dynamic prefix removal: shape: (4, 1) ┌───────────┐ │ policy_id │ │ --- │ │ str │ ╞═══════════╡ │ 001 │ │ 002 │ │ 003 │ │ null │ └───────────┘ ``` **Vector example – remove `LEGACY-` from feature codes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame af_list = ActuarialFrame({ "policy_key": ["P1", "P2"], "feature_codes_raw": [ ["LEGACY-RIDER1", "BENEFIT_A"], [None, "LEGACY-OPTION_B"], ], }) af_list = af_list.with_columns( af_list["feature_codes_raw"].cast(pl.List(pl.String)) ) with pl.Config(set_tbl_width_chars=100, fmt_str_lengths=100): result = af_list.select( af_list["feature_codes_raw"].str.remove_prefix("LEGACY-").alias( "feature_codes" ) ).collect() print(result) ``` ```text shape: (2, 1) ┌─────────────────────────┐ │ feature_codes │ │ --- │ │ list[str] │ ╞═════════════════════════╡ │ ["RIDER1", "BENEFIT_A"] │ │ [null, "OPTION_B"] │ └─────────────────────────┘ ``` ### `remove_suffix(suffix)` Alias for `strip_suffix`. Remove a suffix from each string. This method behaves identically to `strip_suffix`, removing the specified trailing substring from each string value. If a string does not end with the provided suffix it is returned unchanged. When the column is a list of strings, the removal is applied element-wise. When to use - **Normalizing Product Names:** Stripping version tags like "-2024" or "\_NEW" from product identifiers so that experience can be grouped by the base product. - **Cleaning Import Data:** Eliminating temporary indicators such as "-DRAFT" that may be appended to policy numbers imported from administration systems. - **Simplifying Text Fields:** Removing trailing notes like "\*cancelled" from agent remarks prior to text analytics or matching. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `suffix` | `str | Expr` | The suffix to remove. Can be a literal string or a Polars expression that evaluates to a string. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy with the suffix removed. | Examples: **Scalar Example: Removing '-OLD' from policy codes** Scenario: Historical policy codes may include a trailing `-OLD` suffix that should be dropped for reporting.
```python from gaspatchio_core.frame.base import ActuarialFrame data = {"policy_code": ["TERM10-OLD", "WL-OLD", "ANN"]} af = ActuarialFrame(data) af_clean = af.select( af["policy_code"].str.remove_suffix("-OLD").alias("code_clean") ) print(af_clean.collect()) ``` ```text shape: (3, 1) ┌─────────────┐ │ code_clean │ │ --- │ │ str │ ╞═════════════╡ │ TERM10 │ │ WL │ │ ANN │ └─────────────┘ ``` **Vector (list) example: Removing trailing '\*exp' from lists of underwriting notes** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl notes_data = { "policy_id": [1, 2], "uw_notes": [ ["Declined*exp", "Check later*exp"], ["Approved", None], ], } af_notes = ActuarialFrame(notes_data) af_notes = af_notes.with_columns( af_notes["uw_notes"].cast(pl.List(pl.String)) ) af_notes_clean = af_notes.select( af_notes["uw_notes"].str.remove_suffix("*exp").alias("notes_clean") ) print(af_notes_clean.collect()) ``` ```text shape: (2, 1) ┌─────────────────────────────┐ │ notes_clean │ │ --- │ │ list[str] │ ╞═════════════════════════════╡ │ ["Declined", "Check later"] │ │ ["Approved", null] │ └─────────────────────────────┘ ``` ### `replace(pattern, value, literal=False, n=1)` Replace occurrences of a pattern in each string. This method searches every string in the column for a given substring or regular expression pattern and replaces the first `n` matches with the provided `value`. When `literal` is `True` the `pattern` is treated as a plain string; otherwise it is interpreted as a regex. When to use - **Updating Legacy Codes:** Converting outdated product or policy codes to a new standard so assumption tables align across systems. - **Cleaning Free-Text Fields:** Removing or altering specific phrases in underwriting or claim notes prior to text analysis. - **Normalizing Reference Data:** Adjusting naming conventions in data feeds before merging them with internal models. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `pattern` | `str | Expr` | Substring or regex pattern to search for. May also be a Polars expression yielding the pattern. | *required* | | `value` | `str | Expr` | Replacement text. Can be a string or a Polars expression. | *required* | | `literal` | `bool` | If True, pattern is treated as a literal string. | `False` | | `n` | `int` | Maximum number of replacements per string. Defaults to 1. | `1` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new expression with the specified replacements applied. | Examples: **Scalar Example: Normalizing policy status descriptions** Scenario: Some policy statuses contain the phrase `"IN FORCE"`. Replace it with `"INFORCE"` for consistency. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id": ["P1", "P2", "P3"], "status_raw": ["IN FORCE", "LAPSED", "IN FORCE"], } af = ActuarialFrame(data) af_clean = af.select( af["status_raw"].str.replace("IN FORCE", "INFORCE", literal=True).alias("status") ) print(af_clean.collect()) ``` ```text shape: (3, 1) ┌─────────┐ │ status │ │ --- │ │ str │ ╞═════════╡ │ INFORCE │ │ LAPSED │ │ INFORCE │ └─────────┘ ``` **Vector Example: Removing 'NOTE: ' from lists of claim notes** Scenario: Each policy has a list of claim notes and some entries start with `"NOTE: "`. Remove this prefix from each note.
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl notes_data = { "policy_id": ["A1", "A2"], "claim_notes_str": [ "NOTE: Initial review,Payment authorised", "None,NOTE: Follow up required", ], } af_notes = ActuarialFrame(notes_data) af_notes = af_notes.with_columns( af_notes["claim_notes_str"].str.split(",").alias("claim_notes").map_elements( lambda x: [None if item == "None" else item for item in x], return_dtype=pl.List(pl.String) ) ) af_clean_notes = af_notes.select( af_notes["claim_notes"].str.replace("NOTE: ", "", literal=True, n=1).alias("clean_notes") ) result = af_clean_notes.collect() print(result) ``` ```text shape: (2, 1) ┌──────────────────────────────────────────┐ │ clean_notes │ │ --- │ │ list[str] │ ╞══════════════════════════════════════════╡ │ ["Initial review", "Payment authorised"] │ │ [null, "Follow up required"] │ └──────────────────────────────────────────┘ ``` ### `rjust(width, fill_char=' ')` Right-align strings by padding on the left. Strings shorter than `width` are padded on the left with `fill_char`. If the column is `List[String]` the padding is applied to each element of the list. When to use - Aligning premium or claim amounts before exporting to legacy ledger systems. - Presenting policy identifiers or rider codes in uniformly padded columns for regulatory or management reports. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `width` | `int` | The desired total length of the string after padding. | *required* | | `fill_char` | `str` | The character to pad with. Defaults to a space. | `' '` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with strings padded at the start. | Examples: **Scalar example – formatting premium amounts** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data = {"premium_str": ["123.45", "7", None]} af = ActuarialFrame(data) af_rjust = af.select( af["premium_str"].str.rjust(8).alias("rjust_premium") ) with pl.Config(fmt_str_lengths=100, tbl_width_chars=100): print(af_rjust.collect()) ``` ```text shape: (3, 1) ┌───────────────┐ │ rjust_premium │ │ --- │ │ str │ ╞═══════════════╡ │ 123.45 │ │ 7 │ │ null │ └───────────────┘ ``` **Vector example – aligning claim references** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_list = { "batch_id": ["B100"], "claim_refs": [["C1", "C234", "C56789"]], } af_list = ActuarialFrame(data_list).with_columns( pl.col("claim_refs").cast(pl.List(pl.String)) ) result = af_list.select( af_list["claim_refs"].str.rjust(6, "0").alias("formatted_refs") ) with pl.Config(fmt_str_lengths=100, tbl_width_chars=100): print(result.collect()) ``` ```text shape: (1, 1) ┌────────────────────────────────┐ │ formatted_refs │ │ --- │ │ list[str] │ ╞════════════════════════════════╡ │ ["0000C1", "00C234", "C56789"] │ └────────────────────────────────┘ ``` ### `starts_with(prefix)` Check if strings in a column start with a given substring. This is useful for categorizing or flagging records based on prefixes in textual data. For example, identifying policies based on product code prefixes (e.g., "TERM-" for term life, "WL-" for whole life) or segmenting claims by a prefix in their claim ID (e.g., "AUTO-" for auto claims). 
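The prefix can also be supplied as an expression rather than a literal, which is handy when each record carries its own expected prefix. The following is a minimal sketch of that pattern; the data and the `expected_prefix` column are hypothetical and exist only to illustrate the expression form of the argument.

```python
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame

# Hypothetical data: each policy carries the prefix its number is expected to have
data = {
    "policy_no": ["TERM-1001", "WL-2002", "TERM-1003"],
    "expected_prefix": ["TERM-", "TERM-", "WL-"],
}
af = ActuarialFrame(data)

# Compare each policy number against the per-record expected prefix
af_check = af.select(
    af["policy_no"].str.starts_with(pl.col("expected_prefix")).alias("prefix_matches")
)
print(af_check.collect())
```

In this sketch the second and third rows would come back `false`, flagging policies whose numbering does not match the expected product prefix.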
When applied to a column of `List[String]`, such as a list of associated product features for a policy, the operation is performed element-wise on each string within each list, returning a list of booleans. When to use - Classify policies by prefix to drive product-specific assumptions. - Identify riders with a particular prefix (e.g., primary benefits) when stored in a list column. - Validate codes against expected prefixes coming from another column. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `prefix` | `str | Expr` | The substring to check for at the beginning of each string. Can be a literal string (e.g., "TERM-") or a Polars expression (e.g., pl.col("another_column_with_prefixes")). | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy containing a boolean Series indicating for each input string whether it starts with the prefix. If the input was List[String], the output will be List[bool]. | Examples: **Scalar example – policy prefixes** ```python from gaspatchio_core.frame.base import ActuarialFrame data_policies = { "policy_no": ["TERM-1001", "WL-2002", "TERM-1003", None, "UL-3004", "TERM-1004"], "issue_age": [25, 30, 28, 45, 35, 40] } af = ActuarialFrame(data_policies) # Check if policy_no starts with "TERM-" af_term_policies = af.select( af["policy_no"].str.starts_with("TERM-").alias("is_term_policy") ) print(af_term_policies.collect()) ``` ```text shape: (6, 1) ┌────────────────┐ │ is_term_policy │ │ --- │ │ bool │ ╞════════════════╡ │ true │ │ false │ │ true │ │ null │ │ false │ │ true │ └────────────────┘ ``` **Vector (list) example – rider prefixes** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_policy_riders = { "policy_id": ["P201", "P202", "P203"], "rider_codes_list": [ ["B-ADB", "S-WP", "S-CI"], # B-AccidentalDeathBenefit, S-WaiverOfPremium, S-CriticalIllness ["S-LTC", None, "B-GIO"], # S-LongTermCare, B-GuaranteedInsurabilityOption ["S-WPR", "S-CIR"] ] } af_riders = ActuarialFrame(data_policy_riders).with_columns( pl.col("rider_codes_list").cast(pl.List(pl.String)) ) af_primary_benefit_check = af_riders.select( af_riders["rider_codes_list"].str.starts_with("B-").alias("has_primary_benefit_rider") ) print(af_primary_benefit_check.collect()) ``` ```text shape: (3, 1) ┌───────────────────────────┐ │ has_primary_benefit_rider │ │ --- │ │ list[bool] │ ╞═══════════════════════════╡ │ [true, false, false] │ │ [false, null, true] │ │ [false, false] │ └───────────────────────────┘ ``` ### `strip_chars(characters=None)` Removes specified leading and trailing characters from strings. This is useful for cleaning data, such as removing unwanted prefixes, suffixes, or whitespace from policy numbers, client names, or address fields. It mirrors Polars' `Expr.str.strip_chars`. If no characters are specified, it defaults to removing leading and trailing whitespace. For `List[String]` columns, like a list of addresses for a client, the operation is applied element-wise to each string in the list. When to use - **Cleanse Identifier Fields:** Remove extraneous characters (e.g., spaces, hyphens, special symbols) from policy numbers, claim IDs, or client identifiers to ensure consistency for matching and lookups. For example, "POL- 123\* " could become "POL-123" by stripping " \*". 
- **Standardize Textual Data:** Trim leading/trailing whitespace from free-text fields like occupation descriptions, addresses, or underwriter notes before analysis or storage. - **Prepare Data for Joins:** Ensure that join keys consisting of string data are clean and consistently formatted to avoid join failures due to subtle differences like trailing spaces. - **Sanitize User Input:** Clean user-provided search terms or filter values by removing unwanted characters before using them in queries. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `characters` | `str | Expr` | A string of characters to remove from both ends of each string. Can also be a Polars expression that evaluates to a string of characters. If None (default), removes whitespace (spaces, tabs, newlines, etc.). | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy with the specified characters stripped from the strings. | Examples: **Scalar Example 1: Cleaning policy numbers by removing specific prefixes/suffixes and whitespace** Policy numbers might be recorded with inconsistent characters (e.g., "ID-", "\*", spaces). We want to standardize them by removing these specific characters and any surrounding whitespace. ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_policy_nos = { "raw_policy_id": [ "ID-A123-XYZ*", " B456 ", "ID-C789*", "D012-XYZ", None, " ID-E345* ", ], "chars_to_remove_col": ["ID-*XYZ ", " ", "ID-*", "-XYZ", None, " *ID-"] } af = ActuarialFrame(data_policy_nos) # Example 1a: Remove a fixed set of characters "ID-*XYZ " from policy IDs af_cleaned_fixed = af.select( af["raw_policy_id"].str.strip_chars("ID-*XYZ ").alias("cleaned_fixed_chars") ) print("Cleaned with fixed characters 'ID-*XYZ ':") print(af_cleaned_fixed.collect()) # Example 1b: Remove characters specified in another column # This dynamically strips characters based on the 'chars_to_remove_col' for each row. af_cleaned_dynamic = af.select( af["raw_policy_id"].str.strip_chars(pl.col("chars_to_remove_col")).alias("cleaned_dynamic_chars") ) print("\nCleaned with characters from 'chars_to_remove_col':") print(af_cleaned_dynamic.collect()) # Example 1c: Remove only leading and trailing whitespace af_trimmed_whitespace = af.select( af["raw_policy_id"].str.strip_chars().alias("trimmed_whitespace_only") # characters=None ) print("\nCleaned with default whitespace stripping:") print(af_trimmed_whitespace.collect()) ``` ```text Cleaned with fixed characters 'ID-*XYZ ': shape: (6, 1) ┌─────────────────────┐ │ cleaned_fixed_chars │ │ --- │ │ str │ ╞═════════════════════╡ │ A123 │ │ B456 │ │ C789 │ │ D012 │ │ null │ │ E345 │ └─────────────────────┘ Cleaned with characters from 'chars_to_remove_col': shape: (6, 1) ┌───────────────────────┐ │ cleaned_dynamic_chars │ │ --- │ │ str │ ╞═══════════════════════╡ │ A123 │ │ B456 │ │ C789 │ │ D012 │ │ null │ │ E345 │ └───────────────────────┘ Cleaned with default whitespace stripping: shape: (6, 1) ┌───────────────────────────┐ │ trimmed_whitespace_only │ │ --- │ │ str │ ╞═══════════════════════════╡ │ ID-A123-XYZ* │ │ B456 │ │ ID-C789* │ │ D012-XYZ │ │ null │ │ ID-E345* │ └───────────────────────────┘ ``` **Vector (List Shimming) Example: Cleaning lists of product add-on codes** Product codes for add-ons might be stored in a list, with potential unwanted characters like asterisks, hyphens, or spaces. 
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_addons = { "policy_id": ["P1001", "P1002"], "addon_codes_raw": [ ["*RIDER_A- ", " -RIDER_B*", "BASE_PLAN"], [None, " *-RIDER_C- ", "\tRIDER_D\t*"] ] } af_addons = ActuarialFrame(data_addons).with_columns( pl.col("addon_codes_raw").cast(pl.List(pl.String)) ) # Strip asterisks, hyphens, spaces, and tabs from each code in the lists af_cleaned_addons = af_addons.select( af_addons["addon_codes_raw"].str.strip_chars(" *-#\t").alias("cleaned_addon_codes") # Added '#' to demonstrate it's ignored if not present ) print(af_cleaned_addons.collect()) ``` ```text shape: (2, 1) ┌───────────────────────────────────┐ │ cleaned_addon_codes │ │ --- │ │ list[str] │ ╞═══════════════════════════════════╡ │ ["RIDER_A", "RIDER_B", "BASE_PLA… │ │ [null, "RIDER_C", "RIDER_D"] │ └───────────────────────────────────┘ ``` ### `strip_chars_start(characters=None)` Removes specified leading characters from strings. Useful for standardizing data by removing known prefixes or initial whitespace. For instance, cleaning policy numbers by removing a "TEMP-" prefix or trimming spaces from the beginning of address lines. It mirrors Polars' `Expr.str.strip_chars_start`. If no characters are specified, it defaults to removing leading whitespace. When applied to `List[String]` columns (e.g., a list of historical status codes for a policy), the operation is performed element-wise. When to use - **Normalizing Prefixed Identifiers:** Removing consistent prefixes from identifiers like policy numbers (e.g., "PN-", "TEMP\_"), claim codes (e.g., "CL-"), or agent codes to get the core identifier. - **Cleaning Leading Characters in Text Fields:** Removing leading non-essential characters (e.g., bullets, numbers, special symbols, spaces) from free-text fields like notes, descriptions, or imported data before further processing. - **Standardizing Data from Multiple Sources:** If different source systems prefix the same data differently, this function can help unify them by removing those specific leading characters. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `characters` | `str | Expr` | A string of characters to remove from the start of each string. Can also be a Polars expression that evaluates to a string of characters. If None (default), removes leading whitespace (spaces, tabs, newlines, etc.). | `None` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy with specified leading characters stripped from the strings. | Examples: **Scalar Example: Removing prefixes from legacy system IDs and leading whitespace** Legacy system IDs might have prefixes like "LEG\_", "OLD-", or be padded with spaces. 
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_ids = { "legacy_id": [ "LEG_POL123", " OLD-CLM456", "POL789", None, "LEG_ UW001", # Note the space after LEG_ " TRN999" ], "prefixes_to_strip": ["LEG_", "OLD-", "NONEXISTENT_", None, "LEG_ ", " "] } af = ActuarialFrame(data_ids) # Example 1a: Remove a fixed prefix "LEG_" af_no_leg_prefix = af.select( af["legacy_id"].str.strip_chars_start("LEG_").alias("id_no_leg_prefix") ) print("Stripping fixed prefix 'LEG_':") print(af_no_leg_prefix.collect()) # Example 1b: Remove leading whitespace only (characters=None) af_trimmed_space = af.select( af["legacy_id"].str.strip_chars_start().alias("id_trimmed_leading_space") ) print("\nStripping leading whitespace only:") print(af_trimmed_space.collect()) # Example 1c: Remove prefixes defined in another column # This will strip any character found in the corresponding 'prefixes_to_strip' string from the start. af_dynamic_prefix = af.select( af["legacy_id"].str.strip_chars_start(pl.col("prefixes_to_strip")).alias("id_dynamic_prefix_removed") ) print("\nStripping prefixes from 'prefixes_to_strip' column (character-wise from start):") print(af_dynamic_prefix.collect()) ``` ```text Stripping fixed prefix 'LEG_': shape: (6, 1) ┌────────────────────┐ │ id_no_leg_prefix │ │ --- │ │ str │ ╞════════════════════╡ │ POL123 │ │ OLD-CLM456 │ │ POL789 │ │ null │ │ UW001 │ │ TRN999 │ └────────────────────┘ Stripping leading whitespace only: shape: (6, 1) ┌───────────────────────────┐ │ id_trimmed_leading_space │ │ --- │ │ str │ ╞═══════════════════════════╡ │ LEG_POL123 │ │ OLD-CLM456 │ │ POL789 │ │ null │ │ LEG_ UW001 │ │ TRN999 │ └───────────────────────────┘ Stripping prefixes from 'prefixes_to_strip' column (character-wise from start): shape: (6, 1) ┌─────────────────────────────┐ │ id_dynamic_prefix_removed │ │ --- │ │ str │ ╞═════════════════════════════╡ │ POL123 │ │ CLM456 │ │ POL789 │ │ null │ │ UW001 │ │ TRN999 │ └─────────────────────────────┘ ``` **Vector (List Shimming) Example: Cleaning lists of temporary transaction remarks** Transaction remarks might be stored in lists, with some prefixed by "TEMP: " or spaces. 
```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_remarks = { "policy_id": ["TRN01", "TRN02"], "transaction_remarks_raw": [ ["TEMP: Initial assessment", " Adjustment processed", "Final Review"], [None, "TEMP: Hold for now", "TEMP: Resolved", "Status: OK"] ] } af_remarks = ActuarialFrame(data_remarks).with_columns( pl.col("transaction_remarks_raw").cast(pl.List(pl.String)) ) # Example 2a: Strip fixed prefix "TEMP: " from each remark in the lists af_cleaned_remarks_prefix = af_remarks.select( af_remarks["transaction_remarks_raw"].str.strip_chars_start("TEMP: ").alias("cleaned_remarks_prefix") ) print("Cleaned remarks (prefix 'TEMP: '):") print(af_cleaned_remarks_prefix.collect()) # Example 2b: Strip leading whitespace from list elements af_cleaned_remarks_space = af_remarks.select( af_remarks["transaction_remarks_raw"].str.strip_chars_start().alias("cleaned_remarks_space") ) print("\nCleaned remarks (leading whitespace):") print(af_cleaned_remarks_space.collect()) ``` ```text Cleaned remarks (prefix 'TEMP: '): shape: (2, 1) ┌────────────────────────────────────────────────────────────────────────────┐ │ cleaned_remarks_prefix │ │ --- │ │ list[str] │ ╞════════════════════════════════════════════════════════════════════════════╡ │ ["Initial assessment", " Adjustment processed", "Final Review"] │ │ [null, "Hold for now", "Resolved", "Status: OK"] │ └────────────────────────────────────────────────────────────────────────────┘ Cleaned remarks (leading whitespace): shape: (2, 1) ┌────────────────────────────────────────────────────────────────────────────┐ │ cleaned_remarks_space │ │ --- │ │ list[str] │ ╞════════════════════════════════════════════════════════════════════════════╡ │ ["TEMP: Initial assessment", "Adjustment processed", "Final Review"] │ │ [null, "TEMP: Hold for now", "TEMP: Resolved", "Status: OK"] │ └────────────────────────────────────────────────────────────────────────────┘ ``` ### `strip_prefix(prefix)` Remove a prefix from each string. The prefix is stripped whenever it occurs at the start of the string. Strings without the prefix are returned unchanged. On columns containing lists of strings, the removal happens element by element. When to use - Cleaning temporary identifiers such as `TEMP-123` once a policy is fully underwritten. - Harmonizing product codes from different administration systems before mapping them to an actuarial model. - Stripping `LEGACY-` markers from lists of rider codes imported from historical sources. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `prefix` | `str | Expr` | Prefix to remove. May be a literal string or an expression that evaluates to a string. | *required* | Returns: | Type | Description | | --- | --- | | `'ExpressionProxy'` | ExpressionProxy with the prefix removed. 
| Examples: **Scalar example – cleaning policy IDs** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(set_tbl_width_chars=100): af = ActuarialFrame({"pol_id_raw": ["TEMP-001", "TEMP-002", "003", None]}) cleaned = af.select( af["pol_id_raw"].str.strip_prefix("TEMP-").alias("pol_id") ).collect() print(cleaned) ``` ```text shape: (4, 1) ┌────────┐ │ pol_id │ │ --- │ │ str │ ╞════════╡ │ 001 │ │ 002 │ │ 003 │ │ null │ └────────┘ ``` **Vector example – removing `LEGACY-` from feature codes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame af = ActuarialFrame({ "policy_key": ["POLICY_A", "POLICY_B"], "feature_codes_raw": [ ["LEGACY-RIDER1", "NEW_FEATURE_X", "LEGACY-BENEFIT2"], [None, "LEGACY-COVERAGE_Y", "STANDARD_Z"], ], }) af = af.with_columns( af["feature_codes_raw"].cast(pl.List(pl.String)) ) with pl.Config(set_tbl_width_chars=120, fmt_str_lengths=100): cleaned = af.select( af["feature_codes_raw"].str.strip_prefix("LEGACY-").alias("cleaned_feature_codes") ).collect() print(cleaned) ``` ```text shape: (2, 1) ┌─────────────────────────────────────────┐ │ cleaned_feature_codes │ │ --- │ │ list[str] │ ╞═════════════════════════════════════════╡ │ ["RIDER1", "NEW_FEATURE_X", "BENEFIT2"] │ │ [null, "COVERAGE_Y", "STANDARD_Z"] │ └─────────────────────────────────────────┘ ``` ### `strip_suffix(suffix)` Remove a suffix from each string. If a string does not end with the given suffix, it is returned unchanged. For `List[String]` columns, the operation is applied element-wise. When to use - **Normalizing coverage names** that include trailing version codes such as "-OLD". - **Preparing ledger accounts** by removing year suffixes like "-2024" before comparing periods. - **Cleaning temporary identifiers** imported from external systems (for example, removing a trailing "-TMP"). Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `suffix` | `str | Expr` | The suffix to remove. Either a string literal or an expression resolving to a string. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | The expression with the suffix removed. 
| Examples: **Scalar example – normalize plan names** ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "plan_name_raw": ["Term Basic-OLD", "Income Protection-OLD", "Annuity Plus", None] } af = ActuarialFrame(data) result = af.select( af["plan_name_raw"].str.strip_suffix("-OLD").alias("plan_name") ) print(result.collect()) ``` ```text shape: (4, 1) ┌───────────────────────┐ │ plan_name │ │ --- │ │ str │ ╞═══════════════════════╡ │ Term Basic │ │ Income Protection │ │ Annuity Plus │ │ null │ └───────────────────────┘ ``` **Vector (list) example – clean trailing punctuation in claim notes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame notes_data = { "claim_id": ["C1", "C2"], "notes": [["Approved.", "Paid."], [None, "In Review."]], } af_list = ActuarialFrame(notes_data) af_list = af_list.with_columns( af_list["notes"].cast(pl.List(pl.String)) ) cleaned = af_list.select( af_list["notes"].str.strip_suffix(".").alias("notes_cleaned") ) print(cleaned.collect()) ``` ```text shape: (2, 1) ┌────────────────────────┐ │ notes_cleaned │ │ --- │ │ list[str] │ ╞════════════════════════╡ │ ["Approved", "Paid"] │ │ [null, "In Review"] │ └────────────────────────┘ ``` ### `strptime(dtype, format=None, *, strict=True, exact=True, cache=True, ambiguous='raise', **kwargs)` Convert string values to Date, Datetime, or Time. This method parses textual date or time information into Polars temporal types. For `List[String]` columns, each element is parsed individually. When to use - Convert policy issue or claim reporting dates that are stored as strings in raw data extracts. - Parse lists of event timestamps—such as claim status updates—when building experience studies or exposure models. - Ingest external datasets from underwriting or administration systems where date fields come in a variety of text formats. Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `dtype` | `'PolarsTemporalType'` | The Polars temporal type to convert to (pl.Date, pl.Datetime, or pl.Time). | *required* | | `format` | `Optional[str]` | The strf/strptime format string. If None, the format is inferred where possible. | `None` | | `strict` | `bool` | If True (default), raise an error on parsing failure. | `True` | | `exact` | `bool` | If True (default), require an exact format match. | `True` | | `cache` | `bool` | If True (default), cache parsing results for performance. | `True` | | `ambiguous` | `str | Expr` | How to handle ambiguous datetimes, such as daylight-saving transitions. Options are "raise" (default), "earliest", "latest", or "null". Can also be a Polars expression. | `'raise'` | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | Strings converted to the specified temporal type. 
| Examples: **Scalar Example: Parsing policy issue dates** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data = { "policy_id": ["A100", "B200", "C300"], "issue_date_str": [ "2021-01-15", "20/02/2022", "2023-03-10 14:30:00" ] } af = ActuarialFrame(data) af_parsed_dates = af.select( af["issue_date_str"].str.strptime(pl.Date, "%Y-%m-%d", strict=False).alias("issue_date_strict_fmt"), af["issue_date_str"].str.strptime(pl.Date, "%d/%m/%Y", strict=False).alias("issue_date_dmy_fmt"), af["issue_date_str"].str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S", strict=False).alias("issue_datetime"), ) result = af_parsed_dates.collect() print(result) ``` ```text shape: (3, 3) ┌───────────────────────┬────────────────────┬─────────────────────┐ │ issue_date_strict_fmt ┆ issue_date_dmy_fmt ┆ issue_datetime │ │ --- ┆ --- ┆ --- │ │ date ┆ date ┆ datetime[μs] │ ╞═══════════════════════╪════════════════════╪═════════════════════╡ │ 2021-01-15 ┆ null ┆ null │ │ null ┆ 2022-02-20 ┆ null │ │ null ┆ null ┆ 2023-03-10 14:30:00 │ └───────────────────────┴────────────────────┴─────────────────────┘ ``` **Vector Example: Parsing lists of event timestamps** ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_list = { "claim_id": ["CL001"], "event_timestamps_str": [["2023-04-01T10:00:00", "2023-04-01T10:05:00", "Invalid"]], } af_list = ActuarialFrame(data_list).with_columns( pl.col("event_timestamps_str").cast(pl.List(pl.String)) ) af_parsed_list = af_list.select( af_list["event_timestamps_str"].str.strptime( pl.Datetime, "%Y-%m-%dT%H:%M:%S", strict=False ).alias("event_datetimes_μs") ) result = af_parsed_list.collect() print(result) ``` ```text shape: (1, 1) ┌──────────────────────────────────────────────────┐ │ event_datetimes_μs │ │ --- │ │ list[datetime[μs]] │ ╞══════════════════════════════════════════════════╡ │ [2023-04-01 10:00:00, 2023-04-01 10:05:00, null] │ └──────────────────────────────────────────────────┘ ``` ### `to_lowercase()` Converts all characters in string columns to lowercase. This function standardizes textual data by converting all characters in a string column to lowercase. This is essential for ensuring consistency in data fields critical for actuarial analysis, such as system codes, free-text fields like occupation or medical conditions, or external data sources, facilitating accurate matching, aggregation, and text analysis. When to use - **Normalizing Text for Analysis:** Preparing free-text fields (e.g., underwriting notes, claim descriptions, occupation details) for text mining or NLP by ensuring terms like "SMOKER", "Smoker", and "smoker" are treated identically. - **Improving Data Matching with External Sources:** When integrating data from various systems or third-party providers where case consistency is not guaranteed (e.g., matching addresses, names, or city information). - **Standardizing User Input:** Converting user-entered data (e.g., search terms, filter criteria) to a consistent case before processing or querying. Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | An ExpressionProxy with strings converted to lowercase. | Examples: **Scalar Example: Normalizing occupation descriptions for risk analysis** Occupation descriptions might be entered in various casings. Converting to lowercase helps in standardizing them for consistent risk factor analysis or grouping. 
```python from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id": ["POL001", "POL002", "POL003", "POL004"], "occupation_raw": ["Engineer", "software DEVELOPER", "Teacher", "Project Manager"] } af = ActuarialFrame(data) af_lower_occupation = af.select( af["occupation_raw"].str.to_lowercase().alias("occupation_normalized") ) print(af_lower_occupation.collect()) ``` ```text shape: (4, 1) ┌───────────────────────┐ │ occupation_normalized │ │ --- │ │ str │ ╞═══════════════════════╡ │ engineer │ │ software developer │ │ teacher │ │ project manager │ └───────────────────────┘ ``` **Vector Example: Lowercasing medical condition codes from multiple sources** Medical condition codes might come from different systems with varying casing. Lowercasing them ensures they can be consistently mapped or analyzed. ```python from gaspatchio_core.frame.base import ActuarialFrame import polars as pl data_medical_codes = { "claim_id": ["C001", "C002"], "condition_codes_list": [ ["DIAB_T2", "HBP", "ASTHMA"], # DIAB_T2 = Type 2 Diabetes, HBP = High Blood Pressure ["hbp", None, "copd"] # COPD = Chronic Obstructive Pulmonary Disease ] } af_codes = ActuarialFrame(data_medical_codes) # Ensure the list column has the correct Polars type for the string operation af_codes = af_codes.with_columns( af_codes["condition_codes_list"].cast(pl.List(pl.String)) ) af_lower_codes = af_codes.select( af_codes["condition_codes_list"].str.to_lowercase().alias("lower_condition_codes") ) print(af_lower_codes.collect()) ``` ```text shape: (2, 1) ┌─────────────────────────────────────┐ │ lower_condition_codes │ │ --- │ │ list[str] │ ╞═════════════════════════════════════╡ │ ["diab_t2", "hbp", "asthma"] │ │ ["hbp", null, "copd"] │ └─────────────────────────────────────┘ ``` ### `to_uppercase()` Converts all characters in string columns to uppercase. This function standardizes textual data by converting all characters in a string column to uppercase. This is essential for ensuring consistency in data fields critical for actuarial analysis, such as policy status codes, product identifiers, or geographical regions, facilitating accurate matching, aggregation, and reporting. When to use - **Standardizing Categorical Data:** Ensuring that codes like policy status (e.g., "active", "Lapsed", "ACTIVE" all become "ACTIVE"), gender codes (e.g., "m", "F" become "M", "F"), or smoker status (e.g. "non-smoker", "Smoker" become "NON-SMOKER", "SMOKER") are consistent for grouping and analysis. - **Improving Data Matching:** Facilitating joins and lookups between different datasets where case sensitivity might cause mismatches (e.g., matching policyholder names or addresses from different sources). - **Enhancing Readability and Reporting:** Presenting data in a uniform case for reports and dashboards, especially for identifiers or codes. - **Preparing Text for Analysis:** As a preprocessing step before text mining or natural language processing tasks on fields like claim descriptions or underwriter notes, where case normalization can simplify pattern recognition. - **Simplifying Rule-Based Logic:** When applying business rules that depend on string comparisons (e.g., identifying policies with specific rider codes like "ADB" or "WP" irrespective of their original casing). Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | A new ExpressionProxy with strings converted to uppercase. 
| Examples: **Scalar Example: Standardizing policy status codes** Policy status might be entered in various cases ("active", "lapsed", "ACTIVE"). Converting to uppercase ensures consistency for analysis. ```python from gaspatchio_core.frame.base import ActuarialFrame data = { "policy_id": ["S3001", "S3002", "S3003", "S3004"], "status_raw": ["active", "lapsed", "Active", "PENDING"] } af = ActuarialFrame(data) af_upper_status = af.select( af["status_raw"].str.to_uppercase().alias("status_standardized") ) print(af_upper_status.collect()) ``` ```text shape: (4, 1) ┌─────────────────────┐ │ status_standardized │ │ --- │ │ str │ ╞═════════════════════╡ │ ACTIVE │ │ LAPSED │ │ ACTIVE │ │ PENDING │ └─────────────────────┘ ``` **Vector Example: Uppercasing rider codes for a policy** A policy might have multiple rider codes stored in a list. To ensure uniformity, we can convert all rider codes to uppercase. ```python from gaspatchio_core.frame.base import ActuarialFrame data_policy_riders = { "policy_id": ["R4001", "R4002", "R4003"], "rider_codes_str": [ "adb,wp", "ci,ltc,acc_death", "gio" ] } af_riders = ActuarialFrame(data_policy_riders) # Convert string to list for the string operation af_riders = af_riders.with_columns( af_riders["rider_codes_str"].str.split(",").alias("rider_codes_list") ) af_upper_riders = af_riders.select( af_riders["rider_codes_list"].str.to_uppercase().alias("upper_rider_codes") ) print(af_upper_riders.collect()) ``` ```text shape: (3, 1) ┌────────────────────────────┐ │ upper_rider_codes │ │ --- │ │ list[str] │ ╞════════════════════════════╡ │ ["ADB", "WP"] │ │ ["CI", "LTC", "ACC_DEATH"] │ │ ["GIO"] │ └────────────────────────────┘ ``` ### `zfill(length)` Pad strings with leading zeros to a minimum width. Shorter values are padded on the left with zeros so each entry reaches `length` characters. For list columns, the padding occurs element-wise. When to use - Standardizing policy numbers from different administration systems before merging with valuation data - Preparing zero-padded claim numbers for extracts sent to reinsurers or regulators - Building fixed-width keys when joining to rating tables or mapping grids Parameters: | Name | Type | Description | Default | | --- | --- | --- | --- | | `length` | `int` | The desired minimum length of the string. | *required* | Returns: | Name | Type | Description | | --- | --- | --- | | `ExpressionProxy` | `'ExpressionProxy'` | Strings padded with leading zeros. 
| Examples: **Scalar example – Standardizing policy serial numbers** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data = {"policy_serial": ["123", "45", "6789", None, "1"]} af = ActuarialFrame(data) result = af.select( af["policy_serial"].str.zfill(5).alias("zfilled_serial") ) print(result.collect()) ``` ```text shape: (5, 1) ┌────────────────┐ │ zfilled_serial │ │ --- │ │ str │ ╞════════════════╡ │ 00123 │ │ 00045 │ │ 06789 │ │ null │ │ 00001 │ └────────────────┘ ``` **Vector example – Padding numerical components in claim codes** ```python import polars as pl from gaspatchio_core.frame.base import ActuarialFrame with pl.Config(fmt_str_lengths=100): data = { "claim_batch": ["B01", "B02"], "item_codes": [["A1", "B123", "C04"], [None, "D56"]], } af = ActuarialFrame(data) af = af.with_columns( af["item_codes"].cast(pl.List(pl.String)) ) result = af.select( af["item_codes"].str.zfill(4).alias("zfilled_item_codes") ) print(result.collect()) ``` ```text shape: (2, 1) ┌──────────────────────────┐ │ zfilled_item_codes │ │ --- │ │ list[str] │ ╞══════════════════════════╡ │ ["00A1", "B123", "0C04"] │ │ [null, "0D56"] │ └──────────────────────────┘ ```
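Building on the join-key use case mentioned above, the sketch below zero-pads a key column before matching it against a rating grid whose keys are stored as fixed-width strings. The column names and the four-character width are illustrative assumptions, not part of the API.

```python
from gaspatchio_core.frame.base import ActuarialFrame

# Hypothetical data: branch codes arrive as bare digit strings, while the
# rating grid keys them as four-character, zero-padded strings such as "0042"
data = {
    "policy_id": ["P1", "P2", "P3"],
    "branch_code": ["42", "7", "1234"],
}
af = ActuarialFrame(data)

# Build a fixed-width key that lines up with the rating grid's key format
result = af.select(
    af["branch_code"].str.zfill(4).alias("branch_key")
)
print(result.collect())
```

The resulting `branch_key` values ("0042", "0007", "1234") can then be joined or mapped against the zero-padded keys in the rating grid without any further cleaning.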