Strings
gaspatchio_core.column.namespaces.string_proxy.StringNamespaceProxy
A proxy for Polars expression string (str) namespace operations.
This proxy is typically accessed via the .str attribute of a ColumnProxy or ExpressionProxy that refers to a string or list-of-strings column within an ActuarialFrame. It allows for intuitive, Polars-like string manipulations while remaining integrated with the ActuarialFrame ecosystem.
It automatically handles shimming for List[String] columns, applying string methods element-wise to the contents of the lists.
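The element-wise behavior described above can be pictured with a plain-Python sketch. It makes no claim about gaspatchio's internals: `apply_elementwise` is a hypothetical helper that only illustrates the observable behavior, where a string function is applied directly to a scalar value and to each element of a list, with missing values passed through.

```python
from typing import Callable, List, Optional, Union

StrOrList = Union[Optional[str], List[Optional[str]]]


def apply_elementwise(value: StrOrList, fn: Callable[[str], str]) -> StrOrList:
    """Hypothetical helper: apply fn to a string, or to each string in a list."""
    if value is None:
        return None  # a missing value stays missing
    if isinstance(value, list):
        # List case: map over the elements, keeping missing elements as None.
        return [None if item is None else fn(item) for item in value]
    return fn(value)  # scalar case


names = ["John Doe", "Jane Smith"]
codes = [["TERM", "WL"], ["UL"], ["TERM", None]]

upper_names = [apply_elementwise(n, str.upper) for n in names]
lower_codes = [apply_elementwise(c, str.lower) for c in codes]
print(upper_names)
print(lower_codes)
```

The same call shape works for both column kinds, which is the convenience the proxy aims to provide.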
Examples:
Scalar Example: Uppercasing policyholder names
This demonstrates applying a string operation to a scalar string column. We'll convert policyholder names to uppercase.
from gaspatchio_core.frame.base import ActuarialFrame
data_for_class_doctest = { # Renamed to avoid conflict with other examples
"policy_holder_name": ["John Doe", "Jane Smith", "Robert Jones"],
"policy_type_codes": [["TERM", "WL"], ["UL"], ["TERM", "CI"]]
}
af_scalar = ActuarialFrame(data_for_class_doctest)
af_upper_names = af_scalar.select(
af_scalar["policy_holder_name"].str.to_uppercase().alias("upper_name")
)
print(af_upper_names.collect())
shape: (3, 1)
┌──────────────┐
│ upper_name │
│ --- │
│ str │
╞══════════════╡
│ JOHN DOE │
│ JANE SMITH │
│ ROBERT JONES │
└──────────────┘
Vector (List Shimming) Example: Lowercasing policy type codes
This demonstrates applying a string operation to a list-of-strings column. We'll convert lists of policy type codes to lowercase.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_for_class_doctest = {
"policy_holder_name": ["John Doe", "Jane Smith", "Robert Jones"],
"policy_type_codes": [["TERM", "WL"], ["UL"], ["TERM", "CI"]]
}
af_vector = ActuarialFrame(data_for_class_doctest).with_columns(
pl.col("policy_type_codes").cast(pl.List(pl.String))
)
af_lower_codes = af_vector.select(
af_vector["policy_type_codes"].str.to_lowercase().alias("lower_codes")
)
print(af_lower_codes.collect())
shape: (3, 1)
┌────────────────┐
│ lower_codes │
│ --- │
│ list[str] │
╞════════════════╡
│ ["term", "wl"] │
│ ["ul"] │
│ ["term", "ci"] │
└────────────────┘
__getattr__(name)
Dynamically handle calls to Polars string methods not explicitly defined.
This allows the proxy to support any method available on Polars' str namespace without needing to define each one explicitly on this proxy class.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
name | str | The name of the string method to call. | required |
Returns:
Type | Description |
---|---|
Callable[..., 'ExpressionProxy'] | A callable that, when invoked, will execute the corresponding Polars string method via |
Raises:
Type | Description |
---|---|
AttributeError | If the method does not exist on the Polars string namespace (this is typically raised by |
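The dynamic dispatch described in this section follows a common Python delegation pattern. The sketch below is an illustration only: `StrOps` and `ForwardingProxy` are hypothetical stand-ins, not gaspatchio or Polars classes. It shows how `__getattr__` can forward unknown method names to a backend object and how a missing method surfaces as an `AttributeError`.

```python
class StrOps:
    """Hypothetical backend exposing a few string methods (not Polars)."""

    def __init__(self, value: str) -> None:
        self._value = value

    def to_uppercase(self) -> str:
        return self._value.upper()

    def starts_with(self, prefix: str) -> bool:
        return self._value.startswith(prefix)


class ForwardingProxy:
    """Hypothetical proxy that forwards unknown method names to a backend."""

    def __init__(self, backend: StrOps) -> None:
        self._backend = backend

    def __getattr__(self, name: str):
        # Called only when normal attribute lookup on the proxy fails.
        target = getattr(self._backend, name)  # raises AttributeError if absent

        def call(*args, **kwargs):
            return target(*args, **kwargs)

        return call


proxy = ForwardingProxy(StrOps("term life"))
upper = proxy.to_uppercase()
starts = proxy.starts_with("term")

try:
    proxy.no_such_method
    missing_raises = False
except AttributeError:
    missing_raises = True
```

Because the lookup is deferred to the backend, the proxy does not need to be updated when the backend gains new methods.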
__init__(parent_proxy, parent_af)
Initialize the StringNamespaceProxy.
This constructor is not typically called directly by users. Instances are created by the dispatch mechanism when accessing .str on a ColumnProxy or ExpressionProxy.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
parent_proxy | 'ProxyType' | The parent ColumnProxy or ExpressionProxy from which | required |
parent_af | Optional['ActuarialFrame'] | The parent ActuarialFrame, providing context such as the underlying DataFrame/LazyFrame and schema. | required |
contains(pattern, literal=False, strict=False)
Checks if strings in a column contain a specified pattern.
This method searches for a pattern within string values, returning a boolean indicating if the pattern exists in each string. It's useful for filtering, data categorization, and identifying records with specific text patterns.
When to use
- Identify policies with specific riders or endorsements from description fields
- Find claims that mention particular medical conditions or causes
- Filter customer feedback containing specific keywords for risk analysis
- Segment policyholders based on address information (e.g., rural vs urban)
- Flag policies or claims with special handling notes (e.g., "legal review")
- Screen underwriting notes for high-risk indicators
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern | str \| Expr | The substring or regex pattern to search for. Can be a literal string (e.g., "RiderX") or a Polars expression (e.g., | required |
literal | bool | If True, | False |
strict | bool | If True and | False |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new |
Examples:
Scalar Example: Identifying policies with an Accidental Death Benefit (ADB) rider
Imagine you have a dataset of policy descriptions and you want to flag all policies that include an "ADB" rider.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id": ["POL001", "POL002", "POL003", "POL004"],
"description": [
"Term Life Plan with ADB rider",
"Whole Life - Standard",
"Universal Life, includes ADB rider and Accidental Death Benefit (ADB)",
"Term Life, no Accidental Death Benefit rider"
]
}
af = ActuarialFrame(data)
af_with_adb_rider = af.select(
af["description"].str.contains("ADB rider", literal=True).alias("has_adb_rider")
)
print(af_with_adb_rider.collect())
shape: (4, 1)
┌───────────────┐
│ has_adb_rider │
│ --- │
│ bool │
╞═══════════════╡
│ true │
│ false │
│ true │
│ false │
└───────────────┘
Scalar Example: Checking underwriter notes for high-risk keywords
Suppose each policy has a free-text note from an underwriter. We want to check whether the note for a given policy contains keywords like "medical history" or "hazardous occupation", which might indicate higher risk.
from gaspatchio_core.frame.base import ActuarialFrame
uw_notes_data = {
"policy_id": ["UW001", "UW002", "UW003"],
"underwriter_notes": [
"Standard risk. Family history clear.",
"Applicant works in construction. Reviewed medical history: smoker.",
"No concerning notes. Possible hazardous occupation mentioned."
]
}
af_notes = ActuarialFrame(uw_notes_data)
af_results = af_notes.select(
af_notes["underwriter_notes"].str.contains("medical history").alias("mentions_medical_history"),
af_notes["underwriter_notes"].str.contains("(?i)hazardous occupation").alias("mentions_hazardous_occupation"),
)
print(af_results.collect())
shape: (3, 2)
┌──────────────────────────┬───────────────────────────────┐
│ mentions_medical_history ┆ mentions_hazardous_occupation │
│ --- ┆ --- │
│ bool ┆ bool │
╞══════════════════════════╪═══════════════════════════════╡
│ false ┆ false │
│ true ┆ false │
│ false ┆ true │
└──────────────────────────┴───────────────────────────────┘
Using contains with several patterns (regex and literal)
Suppose we want to check for several keywords in underwriter notes, issuing one contains call per keyword and choosing literal or regex matching for each.
from gaspatchio_core.frame.base import ActuarialFrame
uw_notes_data_multi = { # Renamed to avoid conflict
"policy_id": ["UW001", "UW002", "UW003"],
"underwriter_notes": [
"Standard risk. Family history clear.",
"Applicant works in construction. Reviewed medical history: smoker.",
"No concerning notes. Possible hazardous occupation mentioned."
]
}
af_multi = ActuarialFrame(uw_notes_data_multi)
af_multi_processed = af_multi.select(
# Literal check
af_multi["underwriter_notes"].str.contains("medical history", literal=True).alias("mentions_medical_history_literal"),
# Regex check (case insensitive)
af_multi["underwriter_notes"].str.contains(r"(?i)hazardous occupation").alias("mentions_hazardous_occupation_regex"),
# Another Regex check (case insensitive) for medical history
af_multi["underwriter_notes"].str.contains(r"(?i)medical history").alias("mentions_medical_history_regex")
)
print(af_multi_processed.collect())
shape: (3, 3)
┌──────────────────────────────────┬─────────────────────────────────────┬────────────────────────────────┐
│ mentions_medical_history_literal ┆ mentions_hazardous_occupation_regex ┆ mentions_medical_history_regex │
│ --- ┆ --- ┆ --- │
│ bool ┆ bool ┆ bool │
╞══════════════════════════════════╪═════════════════════════════════════╪════════════════════════════════╡
│ false ┆ false ┆ false │
│ true ┆ false ┆ true │
│ false ┆ true ┆ false │
└──────────────────────────────────┴─────────────────────────────────────┴────────────────────────────────┘
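The difference between literal and regex matching is easiest to see with a pattern that contains regex metacharacters. The following is a plain-Python analogue using the standard `re` module, not the Polars engine. Polars uses the Rust regex crate, whose syntax differs from Python's in places, so treat this as a sketch of the semantics only. The `contains` function here is a hypothetical helper.

```python
import re
from typing import Optional


def contains(text: Optional[str], pattern: str, literal: bool = False) -> Optional[bool]:
    """Plain-Python analogue: None in, None out; regex unless literal=True."""
    if text is None:
        return None
    if literal:
        return pattern in text  # plain substring test
    return re.search(pattern, text) is not None


note = "Universal Life, includes Accidental Death Benefit (ADB)"

# Literal: the parentheses are matched as ordinary characters.
as_literal = contains(note, "Benefit (ADB)", literal=True)
# Regex: the parentheses form a group, so the pattern looks for "Benefit ADB".
as_regex = contains(note, "Benefit (ADB)")
# Inline flag for a case-insensitive regex search.
case_insensitive = contains(note, "(?i)accidental death")
missing = contains(None, "ADB")
```

When the search term comes from user input or contains characters such as parentheses, dots, or asterisks, literal matching avoids surprises.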
ends_with(suffix)
Check if strings end with a specific substring.
This method returns a boolean expression showing whether each string value ends with the provided suffix. For columns containing List[String], the check is applied to every element within each list.
When to use
- Verify that policy identifiers end with region or product codes.
- Flag claim or log entries that end with status markers like "OK" or "PENDING".
- Validate strings against suffixes supplied in another column, such as checking payout account numbers.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
suffix | str \| Expr | The substring to test for at the end of each string. It can be a literal value or a Polars expression. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A boolean result indicating whether each string ends with suffix. For List[String] columns, the result is a list of booleans. |
Examples:
Scalar example – region codes
from gaspatchio_core.frame.base import ActuarialFrame
af = ActuarialFrame({
"policy_id": ["P100-US", "P101-CA", "P102-US", None, "P103-EU"]
})
result = af.select(
af["policy_id"].str.ends_with("-US").alias("is_us_policy")
)
print(result.collect())
shape: (5, 1)
┌──────────────┐
│ is_us_policy │
│ --- │
│ bool │
╞══════════════╡
│ true │
│ false │
│ true │
│ null │
│ false │
└──────────────┘
Vector (list) example – status flags
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
logs = {
"policy_id": ["A100", "A101"],
"update_notes_str": [
"Issued OK,Review PENDING",
"None,Paid OK",
],
}
af_logs = ActuarialFrame(logs)
af_logs = af_logs.with_columns(
af_logs["update_notes_str"].str.split(",").alias("update_notes").map_elements(
lambda x: [None if item == "None" else item for item in x], return_dtype=pl.List(pl.String)
)
)
status_ok = af_logs.select(
af_logs["update_notes"].str.ends_with("OK").alias("ends_with_ok")
)
print(status_ok.collect())
shape: (2, 1)
┌───────────────┐
│ ends_with_ok │
│ --- │
│ list[bool] │
╞═══════════════╡
│ [true, false] │
│ [null, true] │
└───────────────┘
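The use case of validating against a suffix held in another column is not shown in the examples above. A plain-Python sketch of the row-wise semantics follows, assuming missing values propagate as in the scalar example. The `ends_with` helper and the `expected_suffix` column are hypothetical.

```python
from typing import Optional


def ends_with(value: Optional[str], suffix: Optional[str]) -> Optional[bool]:
    """Row-wise check; a missing value or missing suffix yields None."""
    if value is None or suffix is None:
        return None
    return value.endswith(suffix)


policy_ids = ["P100-US", "P101-CA", None, "P103-EU"]
expected_suffix = ["-US", "-US", "-US", "-EU"]  # hypothetical second column

matches = [ends_with(v, s) for v, s in zip(policy_ids, expected_suffix)]
print(matches)
```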
extract(pattern, group_index=1)
Extract a capturing group from a regex pattern.
This method returns the specified group from each string that matches pattern. It operates element-wise on list columns, making it ideal for pulling identifiers or amounts embedded in free-text fields.
When to use
- Retrieve policy or claim numbers from combined identifiers or descriptive text
- Capture monetary amounts from claim notes for validation
- Isolate classification codes embedded within longer strings
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern | str | The regex pattern with capturing groups. | required |
group_index | int | The 1-based index of the group to extract. | 1 |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar Example: Extracting policy numbers from combined IDs
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"full_id": ["POLICY-12345-AB", "CLAIM-67890-CD", "POLICY-ABCDE-FG"],
}
af = ActuarialFrame(data)
af_extracted = af.select(
af["full_id"].str.extract(r"POLICY-([A-Z0-9]+)-.*", group_index=1).alias("policy_num")
)
print(af_extracted.collect())
shape: (3, 1)
┌────────────┐
│ policy_num │
│ --- │
│ str │
╞════════════╡
│ 12345 │
│ null │
│ ABCDE │
└────────────┘
Vector Example: Extracting amounts from transaction descriptions
from gaspatchio_core.frame.base import ActuarialFrame
data_list = {
"policy_id": ["P001"],
"transactions": ["Premium paid: $100.50, Fee: $10.00, Adjustment: $-5.25"],
}
af_list = ActuarialFrame(data_list)
af_list = af_list.with_columns(
af_list["transactions"].str.split(", ").alias("transactions")
)
af_list_extracted = af_list.select(
af_list["transactions"].str.extract(r"\$?([-+]?[0-9]+\.[0-9]{2})", group_index=1).alias("amounts_str")
)
print(af_list_extracted.collect())
shape: (1, 1)
┌──────────────────────────────┐
│ amounts_str │
│ --- │
│ list[str] │
╞══════════════════════════════╡
│ ["100.50", "10.00", "-5.25"] │
└──────────────────────────────┘
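Group numbering is the part of extract that most often causes confusion. The sketch below uses Python's `re` module as an analogue (an assumption; the Polars engine is a different regex implementation) to show how a 1-based group index selects a capture group, and how an extracted amount string can then be converted to a number. The `extract` helper is hypothetical.

```python
import re
from typing import Optional


def extract(text: Optional[str], pattern: str, group_index: int = 1) -> Optional[str]:
    """Return the requested capture group, or None when there is no match."""
    if text is None:
        return None
    match = re.search(pattern, text)
    return match.group(group_index) if match else None


pattern = r"POLICY-([A-Z0-9]+)-([A-Z]+)"
first_group = extract("POLICY-12345-AB", pattern, 1)
second_group = extract("POLICY-12345-AB", pattern, 2)
whole_match = extract("POLICY-12345-AB", pattern, 0)  # in Python's re, group 0 is the full match
no_match = extract("CLAIM-67890-CD", pattern, 1)

# Extracted values are strings; convert explicitly before doing arithmetic.
amount = float(extract("Adjustment: $-5.25", r"\$?([-+]?[0-9]+\.[0-9]{2})"))
```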
extract_all(pattern)
Extract all non-overlapping regex matches as a list.
Mirrors Polars' Expr.str.extract_all. For List[String] columns, the extraction is applied element-wise.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern | str | The regex pattern to search for. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
When to use
- Collect every monetary amount mentioned in claim notes for validation against the claim ledger.
- Extract all policy reference numbers from free-text fields when reconciling cross-policy transactions.
- Gather every ICD code from a medical report to determine claim triggers.
- Capture all state abbreviations from an address string when assessing geographical concentration risk.
Examples:
Scalar example – Extracting amounts from claim descriptions
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"claim_id": ["C1", "C2"],
"details": ["Paid $150.00 and $25.50 fee", "Refunded $10.00"]
}
af = ActuarialFrame(data)
af_amounts = af.select(
af["details"].str.extract_all(r"\$([0-9]+\.[0-9]{2})").alias("amounts")
)
print(af_amounts.collect())
shape: (2, 1)
┌───────────────────────┐
│ amounts │
│ --- │
│ list[str] │
╞═══════════════════════╡
│ ["$150.00", "$25.50"] │
│ ["$10.00"] │
└───────────────────────┘
Vector example – Extracting policy numbers from lists of notes
from gaspatchio_core.frame.base import ActuarialFrame
notes = {
"claim_id": ["C1"],
"notes": ["Policy 12345 reported, Adjustment for policy 98765"]
}
af = ActuarialFrame(notes)
af_list = af.with_columns(
af["notes"].str.split(", ").alias("notes")
)
result = af_list.select(
af_list["notes"].str.extract_all(r"[0-9]+").alias("policy_numbers")
)
print(result.collect())
shape: (1, 1)
┌────────────────────────┐
│ policy_numbers │
│ --- │
│ list[list[str]] │
╞════════════════════════╡
│ [["12345"], ["98765"]] │
└────────────────────────┘
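As the scalar example shows, the returned items are the full matches (including the "$"), not the contents of the capture group. The plain-Python sketch below mirrors that behavior with `re.finditer` and then strips the currency symbol to total the amounts. It is an analogue built on Python's `re`, not the Polars implementation, and `extract_all` is a hypothetical helper.

```python
import re
from typing import List


def extract_all(text: str, pattern: str) -> List[str]:
    """Return every non-overlapping full match of pattern in text."""
    # m.group(0) is the full match; re.findall would return only the capture group.
    return [m.group(0) for m in re.finditer(pattern, text)]


details = "Paid $150.00 and $25.50 fee"
pattern = r"\$([0-9]+\.[0-9]{2})"

full_matches = extract_all(details, pattern)
numeric = [float(m.lstrip("$")) for m in full_matches]
total = sum(numeric)
print(full_matches, total)
```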
len_bytes()
Get the number of bytes in each string.
Calculates the byte length of each string in a column. This is particularly useful when dealing with multi-byte character encodings (like UTF-8) where the number of characters may not equal the number of bytes.
When to use
- Data Storage Estimation: Accurately estimating storage requirements for datasets containing text fields, especially with international character sets (e.g., policyholder names, addresses from various regions).
- System Integration Limits: Ensuring that string data, when exported or sent to other systems, conforms to byte-length restrictions imposed by those systems (e.g., fixed-width file formats or database field constraints defined in bytes).
- Performance Considerations: Recognizing that operations on strings with many multi-byte characters might be more resource-intensive.
- Encoding Issue Detection: While not a direct detection method, unexpected byte lengths compared to character lengths might hint at encoding problems or the presence of unusual characters.
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar Example: Byte length of UTF-8 encoded client names
Scenario: You have client names that may include characters from various languages, and you need to understand their storage size in bytes.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"client_id": ["C001", "C002", "C003", "C004"],
"client_name": ["René", "沐宸", "Zoë", "John Doe"] # French, Chinese, German, English names
}
af = ActuarialFrame(data)
af_byte_len = af.select(
af["client_name"].str.len_bytes().alias("name_byte_length")
)
print(af_byte_len.collect())
shape: (4, 1)
┌──────────────────┐
│ name_byte_length │
│ --- │
│ u32 │
╞══════════════════╡
│ 5 │
│ 6 │
│ 4 │
│ 8 │
└──────────────────┘
Vector Example: Byte length of free-text comments in a list
Scenario: A policy record contains a list of comments, potentially with special characters or different languages. You need to find the byte length of each comment.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_list_comments = {
"policy_id": ["P501", "P502"],
"comments_list": [
["Test € symbol", "Standard comment.", None], # Euro symbol is multi-byte
["Résumé", "日本語のコメント"] # French with accent, Japanese comment
]
}
af_comments = ActuarialFrame(data_list_comments)
# Ensure the list column has the correct Polars type
af_comments = af_comments.with_columns(
af_comments["comments_list"].cast(pl.List(pl.String))
)
af_comment_byte_len = af_comments.select(
af_comments["comments_list"].str.len_bytes().alias("comment_byte_lengths")
)
print(af_comment_byte_len.collect())
shape: (2, 1)
┌──────────────────────────┐
│ comment_byte_lengths │
│ --- │
│ list[u32] │
╞══════════════════════════╡
│ [15, 17, null] │
│ [8, 24] │
└──────────────────────────┘
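The gap between character count and byte count can be checked in plain Python by encoding each string as UTF-8. The sketch below reuses the client names from the scalar example. The 6-byte field limit is a hypothetical value chosen for illustration, and the names are normalized to NFC so that accented letters are single code points.

```python
import unicodedata

names = ["René", "沐宸", "Zoë", "John Doe"]
names = [unicodedata.normalize("NFC", n) for n in names]

char_counts = [len(n) for n in names]                  # Unicode code points
byte_counts = [len(n.encode("utf-8")) for n in names]  # UTF-8 storage size

MAX_BYTES = 6  # hypothetical downstream field limit, in bytes
fits = [b <= MAX_BYTES for b in byte_counts]

for name, chars, size in zip(names, char_counts, byte_counts):
    print(f"{name}: {chars} chars, {size} bytes")
```

ASCII letters take one byte each, accented Latin letters such as "é" take two, and the CJK characters here take three, which is why a limit expressed in bytes must be checked with the byte length.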
len_chars()
Alias for n_chars. Get the number of characters in each string.
Calculates the length of each string in a column, returning an integer representing the number of characters. This is an alias for n_chars().
When to use
- Data Validation: Ensuring identifiers like policy numbers, social security numbers, or postal codes adhere to expected length constraints, helping to identify data entry errors.
- System Integration: Verifying that string data, such as client names or addresses, does not exceed length limitations of downstream systems or databases.
- Feature Engineering: Using the length of free-text fields (e.g., claim descriptions, underwriter notes) as a potential feature in predictive models, where length might correlate with complexity or severity.
- Data Quality Assessment: Identifying outliers or anomalies in string lengths that might indicate corrupted or incomplete data.
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar Example: Validating policy number length
Scenario: You need to check if policy numbers in your dataset conform to an expected length, say 7 characters.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id_raw": ["POL1234", "POL567", "POL89012", None, "POL3456"],
"premium": [100.0, 150.0, 200.0, 50.0, 120.0]
}
af = ActuarialFrame(data)
# Calculate the length of each policy_id_raw
af_len_check = af.select(
af["policy_id_raw"].str.len_chars().alias("policy_id_length")
)
print(af_len_check.collect())
shape: (5, 1)
┌──────────────────┐
│ policy_id_length │
│ --- │
│ u32 │
╞══════════════════╡
│ 7 │
│ 6 │
│ 8 │
│ null │
│ 7 │
└──────────────────┘
Vector Example: Character count of claim notes
Scenario: Each policy may have a list of associated claim notes. You want to find the character length of each note to understand the verbosity or for display purposes.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_list = {
"policy_id": ["P7001", "P7002"],
"claim_notes_list": [
["Short note.", "This is a much longer note regarding the claim details.", None],
["Urgent review needed!", "All clear."]
]
}
af_list_notes = ActuarialFrame(data_list)
# Ensure the list column has the correct Polars type
af_list_notes = af_list_notes.with_columns(
af_list_notes["claim_notes_list"].cast(pl.List(pl.String))
)
af_notes_len = af_list_notes.select(
af_list_notes["claim_notes_list"].str.len_chars().alias("note_char_lengths")
)
print(af_notes_len.collect())
shape: (2, 1)
┌───────────────────────────┐
│ note_char_lengths │
│ --- │
│ list[u32] │
╞═══════════════════════════╡
│ [11, 55, null] │
│ [21, 10] │
└───────────────────────────┘
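Turning the lengths into a validation flag is the usual next step. A plain-Python sketch of that step follows, using the policy IDs and the 7-character rule from the scalar scenario above. The variable names are illustrative, and missing IDs are kept as missing rather than marked invalid, which is an assumption about the desired handling.

```python
EXPECTED_LENGTH = 7  # expected policy number length from the scenario

policy_ids = ["POL1234", "POL567", "POL89012", None, "POL3456"]

# Character length per ID; a missing ID has no length.
lengths = [None if p is None else len(p) for p in policy_ids]
# Compare against the expected length, leaving missing values unflagged.
is_valid = [None if n is None else n == EXPECTED_LENGTH for n in lengths]

print(lengths)
print(is_valid)
```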
ljust(width, fill_char=' ')
Left-align strings by padding on the right.
Strings shorter than width are padded on the right with fill_char.
When the column contains List[String] values, each element is padded individually.
When to use
- Formatting account or policy identifiers for fixed-width exports.
- Preparing ledger extracts where text fields must be left-aligned.
- Normalizing rider or sub-account codes stored as lists so they compare consistently.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
width | int | The desired total length of the string after padding. | required |
fill_char | str | The character to pad with. Defaults to a space. | ' ' |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar example – fixed-width account codes
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data = {"account_code": ["A1", "B123", None, "C"]}
    af = ActuarialFrame(data)
    af_ljust = af.select(
        af["account_code"].str.ljust(6, "-").alias("ljust_code")
    )
    print(af_ljust.collect())
shape: (4, 1)
┌────────────┐
│ ljust_code │
│ --- │
│ str │
╞════════════╡
│ A1---- │
│ B123-- │
│ null │
│ C----- │
└────────────┘
Vector example – padding elements in a list column
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data_list = {
        "batch_id": ["X01"],
        "sub_codes": [["S1", "LONGCODE", "S23"]],
    }
    af_list = ActuarialFrame(data_list)
    af_list = af_list.with_columns(
        af_list["sub_codes"].cast(pl.List(pl.String))
    )
    af_list_ljust = af_list.select(
        af_list["sub_codes"].str.ljust(8, "X").alias("ljust_sub_codes")
    )
    print(af_list_ljust.collect())
shape: (1, 1)
┌──────────────────────────────────────┐
│ ljust_sub_codes │
│ --- │
│ list[str] │
╞══════════════════════════════════════╡
│ ["S1XXXXXX", "LONGCODE", "S23XXXXX"] │
└──────────────────────────────────────┘
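Padding only ever lengthens a string. Python's built-in `str.ljust` behaves this way, and the sketch below uses it as an analogue. Whether gaspatchio's ljust treats over-length values identically is an assumption not confirmed by the examples above, so a fixed-width export may need an explicit truncation step as shown. The `ljust` helper is hypothetical.

```python
from typing import Optional


def ljust(value: Optional[str], width: int, fill_char: str = " ") -> Optional[str]:
    """Pad on the right up to width; missing values stay missing."""
    return None if value is None else value.ljust(width, fill_char)


codes = ["A1", "B123", "LONGCODE", None]

padded = [ljust(c, 6, "-") for c in codes]
# Slice after padding when the target field must be exactly 6 characters.
fixed = [None if p is None else p[:6] for p in padded]

print(padded)
print(fixed)
```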
n_chars()
Get the number of characters in each string.
This function calculates the length of each string in a column, returning an integer representing the number of characters. It's a fundamental operation for understanding string data characteristics.
When to use
- Data Quality Checks: Identifying unexpectedly short or long strings that might indicate data entry errors or truncation (e.g., validating the length of policy numbers, postal codes, or identification numbers).
- Feature Engineering: Creating new features based on string length for predictive models (e.g., the length of a claim description might correlate with claim complexity).
- Data Cleaning & Transformation: Deciding on padding or truncation strategies if string fields need to conform to a fixed length for system integration or reporting.
- Understanding Free-Text Fields: Analyzing the distribution of lengths in fields like underwriter notes or medical descriptions to gauge the amount of detail typically provided.
- Filtering or Segmenting Data: Selecting records based on the length of a specific string field (e.g., finding all policyholder names shorter than 3 characters for review).
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar Example: Length of product names
To understand the typical length of product names in your portfolio, or to identify names that might be too long for certain display formats.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"product_code": ["L-TERM-10", "L-WL-P", "ANN-SDA"],
"product_name": ["Term Life 10 Year", "Whole Life Par", "Single Deferred Annuity"]
}
af = ActuarialFrame(data)
af_len = af.select(
af["product_name"].str.n_chars().alias("name_length")
)
print(af_len.collect())
shape: (3, 1)
┌─────────────┐
│ name_length │
│ --- │
│ u32 │
╞═════════════╡
│ 17 │
│ 14 │
│ 23 │
└─────────────┘
Vector Example: Length of beneficiary names in a list
For policies with multiple beneficiaries, you might want to check the length of each beneficiary's name, perhaps to ensure it fits within system limits or for data validation.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_list = {
"policy_id": ["P001", "P002"],
"beneficiaries": [["John A. Doe", "Jane B. Smith"], ["Robert King", None, "Alice Wonderland"]]
}
af_list_initial = ActuarialFrame(data_list)
af_list = af_list_initial.with_columns(
af_list_initial["beneficiaries"].cast(pl.List(pl.String))
)
af_bene_len = af_list.select(
af_list["beneficiaries"].str.n_chars().alias("beneficiary_name_lengths")
)
print(af_bene_len.collect())
shape: (2, 1)
┌──────────────────────────┐
│ beneficiary_name_lengths │
│ --- │
│ list[u32] │
╞══════════════════════════╡
│ [11, 13] │
│ [11, null, 16] │
└──────────────────────────┘
pad_end(width, fill_char=' ')
Left-align strings by padding on the right.
Strings shorter than width are padded on the right with fill_char.
If the column is List[String], the padding is applied to each element of the list.
When to use
- Format policy numbers or claim identifiers for extracts that require fixed-width fields.
- Pad abbreviations in list columns (such as rider codes) so that they line up cleanly in cross-system feeds.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
width | int | The desired total length of the string after padding. | required |
fill_char | str | The character to pad with. Defaults to a space. | ' ' |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An |
Examples:
Scalar example – fixed-width policy codes
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data = {"policy_code": ["L101", "L20", None]}
    af = ActuarialFrame(data)
    result = af.select(
        af["policy_code"].str.pad_end(6, "0").alias("fixed_length_code")
    )
    print(result.collect())
shape: (3, 1)
┌───────────────────┐
│ fixed_length_code │
│ --- │
│ str │
╞═══════════════════╡
│ L10100 │
│ L20000 │
│ null │
└───────────────────┘
Vector example – padding claim codes in a list
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data_list = {"batch_id": ["B200"], "claim_codes": [["A1", "XYZ", "C1234"]]}
    af_list = ActuarialFrame(data_list).with_columns(
        pl.col("claim_codes").cast(pl.List(pl.String))
    )
    result = af_list.select(
        af_list["claim_codes"].str.pad_end(6, "_").alias("aligned_codes")
    )
    print(result.collect())
shape: (1, 1)
┌────────────────────────────────┐
│ aligned_codes │
│ --- │
│ list[str] │
╞════════════════════════════════╡
│ ["A1____", "XYZ___", "C1234_"] │
└────────────────────────────────┘
pad_start(width, fill_char=' ')
Alias for rjust. Pads the start of strings (right-aligns content).
Adds characters to the beginning of each string until it reaches the given width. This is handy when preparing fixed-width extracts or aligning numeric text fields in actuarial reports.
When to use
- Preparing policy identifiers for legacy mainframe interfaces that expect fixed-width fields.
- Aligning premium or reserve amounts in textual summaries generated for regulators or management.
- Standardizing rider codes stored in lists so that they can be compared consistently across policies.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
width | int | The desired minimum length of the string. | required |
fill_char | str | The character to pad with. Defaults to a space. | ' ' |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An ExpressionProxy with strings padded at the start. |
Examples:
Scalar Example: Align premium amounts in a report
# Test with pl.Config to ensure consistent display
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data = {
        "premium_str": ["1200.5", "85.75", None]
    }
    af = ActuarialFrame(data)
    result = af.select(
        af["premium_str"].str.pad_start(8, " ").alias("padded_premium")
    )
    print(result.collect())
shape: (3, 1)
┌────────────────┐
│ padded_premium │
│ --- │
│ str │
╞════════════════╡
│   1200.5 │
│    85.75 │
│ null │
└────────────────┘
Vector Example: Pad rider codes stored as a list
# Test with pl.Config to ensure consistent display
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data_list = {
        "policy_id": ["P01"],
        "rider_codes": [["RID1", "LONGRID", "R2"]]
    }
    af_list = ActuarialFrame(data_list).with_columns(
        pl.col("rider_codes").cast(pl.List(pl.String))
    )
    result = af_list.select(
        af_list["rider_codes"].str.pad_start(8, "0").alias("padded_rider_codes")
    )
    print(result.collect())
shape: (1, 1)
┌──────────────────────────────────────────┐
│ padded_rider_codes │
│ --- │
│ list[str] │
╞══════════════════════════════════════════╡
│ ["0000RID1", "0LONGRID", "000000R2"] │
└──────────────────────────────────────────┘
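Fixed-width extracts usually combine both padding directions: text fields are padded at the end and numeric text is padded at the start. The plain-Python sketch below illustrates the layout idea with `str.ljust` and `str.rjust`. The field widths (10 and 8) and the `fixed_width_record` helper are hypothetical and not part of gaspatchio.

```python
ID_WIDTH = 10      # hypothetical width of the identifier field
AMOUNT_WIDTH = 8   # hypothetical width of the amount field


def fixed_width_record(policy_id: str, premium: str) -> str:
    """Left-align the identifier and right-align the zero-filled amount."""
    return policy_id.ljust(ID_WIDTH) + premium.rjust(AMOUNT_WIDTH, "0")


record = fixed_width_record("P01", "1200.5")
print(repr(record))
```

Every record built this way has the same length, so a downstream reader can slice fields by position.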
remove_prefix(prefix)
Alias for strip_prefix. Remove a prefix from each string.
The prefix is removed from the beginning of every string. Strings without that prefix remain unchanged. List[String] columns are processed element by element.
When to use
- Standardizing vendor codes before mapping them to your base product dictionary.
- Cleaning temporary policy identifiers created during data migrations.
- Dropping country prefixes from location codes when you need only the state or province.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prefix | str \| Expr | The substring to remove. May be a literal string or an expression resolving to one. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | The expression with the prefix removed. |
Examples:
Scalar example – clean temporary policy IDs
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id_raw": ["TMP-001", "TMP-002", "003", None],
"processing_prefix": ["TMP-", "TMP-", "TMP-", "TMP-"],
}
with pl.Config(set_tbl_width_chars=100):
    af_fixed = ActuarialFrame(data)
    fixed = af_fixed.select(
        af_fixed["policy_id_raw"].str.remove_prefix("TMP-").alias("policy_id")
    ).collect()
    print(fixed)
    af_dynamic = ActuarialFrame(data)
    dynamic = af_dynamic.select(
        af_dynamic["policy_id_raw"].str.remove_prefix(
            af_dynamic["processing_prefix"]
        ).alias("policy_id")
    ).collect()
    print()
    print("Dynamic prefix removal:")
    print(dynamic)
shape: (4, 1)
┌───────────┐
│ policy_id │
│ --- │
│ str │
╞═══════════╡
│ 001 │
│ 002 │
│ 003 │
│ null │
└───────────┘
Dynamic prefix removal:
shape: (4, 1)
┌───────────┐
│ policy_id │
│ --- │
│ str │
╞═══════════╡
│ 001 │
│ 002 │
│ 003 │
│ null │
└───────────┘
Vector example – remove LEGACY- from feature codes
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
af_list = ActuarialFrame({
"policy_key": ["P1", "P2"],
"feature_codes_raw": [
["LEGACY-RIDER1", "BENEFIT_A"],
[None, "LEGACY-OPTION_B"],
],
})
af_list = af_list.with_columns(
af_list["feature_codes_raw"].cast(pl.List(pl.String))
)
with pl.Config(set_tbl_width_chars=100, fmt_str_lengths=100):
    result = af_list.select(
        af_list["feature_codes_raw"].str.remove_prefix("LEGACY-").alias(
            "feature_codes"
        )
    ).collect()
    print(result)
shape: (2, 1)
┌─────────────────────────┐
│ feature_codes │
│ --- │
│ list[str] │
╞═════════════════════════╡
│ ["RIDER1", "BENEFIT_A"] │
│ [null, "OPTION_B"] │
└─────────────────────────┘
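The row-wise semantics of prefix removal can be sketched in plain Python. The `remove_prefix` helper below is hypothetical and only mirrors the behavior shown in the examples above: the prefix is removed when present, other strings are returned unchanged, and missing values stay missing. The last input is an extra illustrative value showing that, in this sketch, only one leading occurrence is removed.

```python
from typing import Optional


def remove_prefix(value: Optional[str], prefix: str) -> Optional[str]:
    """Drop prefix from the start of value if present; otherwise return value as is."""
    if value is None:
        return None
    return value[len(prefix):] if value.startswith(prefix) else value


raw_ids = ["TMP-001", "TMP-002", "003", None, "TMP-TMP-004"]
prefixes = ["TMP-"] * len(raw_ids)  # per-row prefix, as in the dynamic example

cleaned = [remove_prefix(v, p) for v, p in zip(raw_ids, prefixes)]
print(cleaned)
```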
remove_suffix(suffix)
Alias for strip_suffix. Remove a suffix from each string.
This method behaves identically to strip_suffix, removing the specified trailing substring from each string value. If a string does not end with the provided suffix, it is returned unchanged. When the column is a list of strings, the removal is applied element-wise.
When to use
- Normalizing Product Names: Stripping version tags like "-2024" or "_NEW" from product identifiers so that experience can be grouped by the base product.
- Cleaning Import Data: Eliminating temporary indicators such as "-DRAFT" that may be appended to policy numbers imported from administration systems.
- Simplifying Text Fields: Removing trailing notes like "*cancelled" from agent remarks prior to text analytics or matching.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
suffix | str \| Expr | The suffix to remove. Can be a literal string or a Polars expression that evaluates to a string. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new ExpressionProxy with the suffix removed. |
Examples:
Scalar Example: Removing '-OLD' from policy codes
Scenario: Historical policy codes may include a trailing -OLD suffix that should be dropped for reporting.
from gaspatchio_core.frame.base import ActuarialFrame
data = {"policy_code": ["TERM10-OLD", "WL-OLD", "ANN"]}
af = ActuarialFrame(data)
af_clean = af.select(
af["policy_code"].str.remove_suffix("-OLD").alias("code_clean")
)
print(af_clean.collect())
shape: (3, 1)
┌─────────────┐
│ code_clean │
│ --- │
│ str │
╞═════════════╡
│ TERM10 │
│ WL │
│ ANN │
└─────────────┘
Vector (list) example: Removing trailing '*exp' from lists of underwriting notes
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
notes_data = {
"policy_id": [1, 2],
"uw_notes": [
["Declined*exp", "Check later*exp"],
["Approved", None],
],
}
af_notes = ActuarialFrame(notes_data)
af_notes = af_notes.with_columns(
af_notes["uw_notes"].cast(pl.List(pl.String))
)
af_notes_clean = af_notes.select(
af_notes["uw_notes"].str.remove_suffix("*exp").alias("notes_clean")
)
print(af_notes_clean.collect())
shape: (2, 1)
┌────────────────────────────┐
│ notes_clean │
│ --- │
│ list[str] │
╞════════════════════════════╡
│ ["Declined", "Check later"] │
│ ["Approved", null] │
└────────────────────────────┘
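The grouping use case mentioned under "When to use" can be sketched in plain Python. This is an illustration only: it does not import gaspatchio, Python's built-in str.removesuffix (Python 3.9+) stands in for the column operation, and the product codes and exposure amounts are made-up sample values.

```python
from collections import defaultdict

# Hypothetical sample data: (product code, exposure amount)
exposures = [
    ("TERM10_NEW", 120.0),
    ("TERM10", 80.0),
    ("WL_NEW", 40.0),
    ("ANN", 10.0),
]

totals = defaultdict(float)
for product_code, amount in exposures:
    # Drop the version tag only when it is at the very end of the code
    base_product = product_code.removesuffix("_NEW")
    totals[base_product] += amount

totals = dict(totals)
print(totals)
```

Codes without the tag pass through unchanged, so tagged and untagged variants of the same product land in one group.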
replace(pattern, value, literal=False, n=1)
Replace occurrences of a pattern in each string.
This method searches every string in the column for a given substring or regular expression pattern and replaces the first n matches with the provided value. When literal is True, the pattern is treated as a plain string; otherwise it is interpreted as a regex.
When to use
- Updating Legacy Codes: Converting outdated product or policy codes to a new standard so assumption tables align across systems.
- Cleaning Free-Text Fields: Removing or altering specific phrases in underwriting or claim notes prior to text analysis.
- Normalizing Reference Data: Adjusting naming conventions in data feeds before merging them with internal models.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
pattern | str \| Expr | Substring or regex pattern to search for. May also be a Polars expression yielding the pattern. | required |
value | str \| Expr | Replacement text. Can be a string or a Polars expression. | required |
literal | bool | If True, treat pattern as a plain string; if False, interpret it as a regex. | False |
n | int | Maximum number of replacements per string. Defaults to 1. | 1 |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new expression with the specified replacements applied. |
Examples:
Scalar Example: Normalizing policy status descriptions
Scenario: Some policy statuses contain the phrase "IN FORCE". Replace it with "INFORCE" for consistency.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id": ["P1", "P2", "P3"],
"status_raw": ["IN FORCE", "LAPSED", "IN FORCE"],
}
af = ActuarialFrame(data)
af_clean = af.select(
af["status_raw"].str.replace("IN FORCE", "INFORCE", literal=True).alias("status")
)
print(af_clean.collect())
shape: (3, 1)
┌─────────┐
│ status │
│ --- │
│ str │
╞═════════╡
│ INFORCE │
│ LAPSED │
│ INFORCE │
└─────────┘
Vector Example: Removing 'NOTE: ' from lists of claim notes
Scenario: Each policy has a list of claim notes, and some entries start with "NOTE: ". Remove this prefix from each note.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
notes_data = {
"policy_id": ["A1", "A2"],
"claim_notes_str": [
"NOTE: Initial review,Payment authorised",
"None,NOTE: Follow up required",
],
}
af_notes = ActuarialFrame(notes_data)
af_notes = af_notes.with_columns(
af_notes["claim_notes_str"].str.split(",").alias("claim_notes").map_elements(
lambda x: [None if item == "None" else item for item in x], return_dtype=pl.List(pl.String)
)
)
af_clean_notes = af_notes.select(
af_notes["claim_notes"].str.replace("NOTE: ", "", literal=True, n=1).alias("clean_notes")
)
result = af_clean_notes.collect()
print(result)
shape: (2, 1)
┌──────────────────────────────────────────┐
│ clean_notes │
│ --- │
│ list[str] │
╞══════════════════════════════════════════╡
│ ["Initial review", "Payment authorised"] │
│ [null, "Follow up required"] │
└──────────────────────────────────────────┘
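The effect of literal and n can be sketched on single values in plain Python. This is an illustration under assumptions, not the library's implementation: replace_value is a hypothetical helper, and Python's re module stands in for the regex engine used by Polars, whose syntax differs in places.

```python
import re

def replace_value(text, pattern, value, literal=False, n=1):
    """Hypothetical single-value stand-in for .str.replace (assumes n >= 1)."""
    if text is None:
        return None  # nulls pass through untouched
    if literal:
        return text.replace(pattern, value, n)  # plain substring match
    return re.sub(pattern, value, text, count=n)  # regex match

raw = "IN FORCE (v1.0)"

# literal=True: the parentheses and the dot are ordinary characters
as_literal = replace_value(raw, "(v1.0)", "", literal=True)

# literal=False: "(v1.0)" is a capture group and "." matches any character,
# so only the inner "v1.0" is removed and the parentheses remain
as_regex = replace_value(raw, "(v1.0)", "")

# n caps the number of replacements per string
first_only = replace_value("NOTE: a; NOTE: b", "NOTE: ", "", literal=True, n=1)

print(repr(as_literal))
print(repr(as_regex))
print(repr(first_only))
```

When the pattern is fixed text that contains characters such as ".", "(", or "*", passing literal=True avoids the kind of surprise shown in the second call.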
rjust(width, fill_char=' ')
Right-align strings by padding on the left.
Strings shorter than width are padded on the left with fill_char. If the column is List[String], the padding is applied to each element of the list.
When to use
- Aligning premium or claim amounts before exporting to legacy ledger systems.
- Presenting policy identifiers or rider codes in uniformly padded columns for regulatory or management reports.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
width | int | The desired total length of the string after padding. | required |
fill_char | str | The character to pad with. Defaults to a space. | ' ' |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An ExpressionProxy with each string right-aligned to width. |
Examples:
Scalar example – formatting premium amounts
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data = {"premium_str": ["123.45", "7", None]}
af = ActuarialFrame(data)
af_rjust = af.select(
af["premium_str"].str.rjust(8).alias("rjust_premium")
)
with pl.Config(fmt_str_lengths=100, tbl_width_chars=100):
    print(af_rjust.collect())
shape: (3, 1)
┌───────────────┐
│ rjust_premium │
│ ---           │
│ str           │
╞═══════════════╡
│   123.45      │
│        7      │
│ null          │
└───────────────┘
Vector example – aligning claim references
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_list = {
"batch_id": ["B100"],
"claim_refs": [["C1", "C234", "C56789"]],
}
af_list = ActuarialFrame(data_list).with_columns(
pl.col("claim_refs").cast(pl.List(pl.String))
)
result = af_list.select(
af_list["claim_refs"].str.rjust(6, "0").alias("formatted_refs")
)
with pl.Config(fmt_str_lengths=100, tbl_width_chars=100):
    print(result.collect())
shape: (1, 1)
┌────────────────────────────────┐
│ formatted_refs │
│ --- │
│ list[str] │
╞════════════════════════════════╡
│ ["0000C1", "00C234", "C56789"] │
└────────────────────────────────┘
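The padding rule can be checked on single values with Python's built-in str.rjust. This is a sketch only: rjust_value is a hypothetical helper, not part of the library, and the last premium is an extra sample value added to show a string that is already longer than width.

```python
def rjust_value(text, width, fill_char=" "):
    """Hypothetical single-value stand-in for .str.rjust; None is preserved."""
    return None if text is None else text.rjust(width, fill_char)

premiums = ["123.45", "7", None, "1234567.89"]
padded = [rjust_value(p, 8) for p in premiums]

# List column: the same rule applied to every element of each list
claim_refs = [["C1", "C234", "C56789"]]
padded_refs = [[rjust_value(ref, 6, "0") for ref in refs] for refs in claim_refs]

print(padded)
print(padded_refs)
```

Padding never truncates: a value that already meets or exceeds width is returned as-is.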
starts_with(prefix)
Check if strings in a column start with a given substring.
This is useful for categorizing or flagging records based on prefixes in textual data. For example, identifying policies based on product code prefixes (e.g., "TERM-" for term life, "WL-" for whole life) or segmenting claims by a prefix in their claim ID (e.g., "AUTO-" for auto claims).
When applied to a column of List[String], such as a list of associated product features for a policy, the operation is performed element-wise on each string within each list, returning a list of booleans.
When to use
- Classify policies by prefix to drive product-specific assumptions.
- Identify riders with a particular prefix (e.g., primary benefits) when stored in a list column.
- Validate codes against expected prefixes coming from another column.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prefix | str \| Expr | The substring to check for at the beginning of each string. Can be a literal string (e.g., "TERM-") or a Polars expression. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new ExpressionProxy of booleans indicating whether each string starts with the prefix. |
Examples:
Scalar example – policy prefixes
from gaspatchio_core.frame.base import ActuarialFrame
data_policies = {
"policy_no": ["TERM-1001", "WL-2002", "TERM-1003", None, "UL-3004", "TERM-1004"],
"issue_age": [25, 30, 28, 45, 35, 40]
}
af = ActuarialFrame(data_policies)
# Check if policy_no starts with "TERM-"
af_term_policies = af.select(
af["policy_no"].str.starts_with("TERM-").alias("is_term_policy")
)
print(af_term_policies.collect())
shape: (6, 1)
┌────────────────┐
│ is_term_policy │
│ --- │
│ bool │
╞════════════════╡
│ true │
│ false │
│ true │
│ null │
│ false │
│ true │
└────────────────┘
Vector (list) example – rider prefixes
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_policy_riders = {
"policy_id": ["P201", "P202", "P203"],
"rider_codes_list": [
["B-ADB", "S-WP", "S-CI"], # B-AccidentalDeathBenefit, S-WaiverOfPremium, S-CriticalIllness
["S-LTC", None, "B-GIO"], # S-LongTermCare, B-GuaranteedInsurabilityOption
["S-WPR", "S-CIR"]
]
}
af_riders = ActuarialFrame(data_policy_riders).with_columns(
pl.col("rider_codes_list").cast(pl.List(pl.String))
)
af_primary_benefit_check = af_riders.select(
af_riders["rider_codes_list"].str.starts_with("B-").alias("has_primary_benefit_rider")
)
print(af_primary_benefit_check.collect())
shape: (3, 1)
┌───────────────────────────┐
│ has_primary_benefit_rider │
│ --- │
│ list[bool] │
╞═══════════════════════════╡
│ [true, false, false] │
│ [false, null, true] │
│ [false, false] │
└───────────────────────────┘
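Two of the use cases above, validating codes against a prefix taken from another column and flagging riders in a list column, can be sketched in plain Python. The helper starts_with and the expected_prefix values are hypothetical and only mimic the element-wise, null-preserving behavior described here; gaspatchio is not imported.

```python
def starts_with(value, prefix):
    """Hypothetical stand-in: works on a string or a list of strings."""
    if value is None:
        return None  # nulls stay null
    if isinstance(value, list):
        return [starts_with(item, prefix) for item in value]
    return value.startswith(prefix)

# Per-row prefix, as if it came from a second column
policy_no = ["TERM-1001", "WL-2002", None]
expected_prefix = ["TERM-", "TERM-", "WL-"]
prefix_ok = [starts_with(v, p) for v, p in zip(policy_no, expected_prefix)]

# List column: one boolean per rider code
riders = [["B-ADB", "S-WP"], ["S-LTC", None]]
rider_flags = [starts_with(r, "B-") for r in riders]

# Optional follow-up: collapse the per-rider flags to one flag per policy
has_primary = [any(flag is True for flag in flags) for flags in rider_flags]

print(prefix_ok)
print(rider_flags)
print(has_primary)
```

The list result holds one flag per element, so a further reduction such as the last step is needed when a single yes/no per policy is wanted.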
strip_chars(characters=None)
Removes specified leading and trailing characters from strings.
This is useful for cleaning data, such as removing unwanted leading or trailing symbols or whitespace from policy numbers, client names, or address fields. It mirrors Polars' Expr.str.strip_chars: characters is treated as a set, and any run of those characters at either end of the string is removed, so it does not match a literal prefix or suffix (use strip_prefix or strip_suffix for that). If no characters are specified, it defaults to removing leading and trailing whitespace.
For List[String] columns, like a list of addresses for a client, the operation is applied element-wise to each string in the list.
When to use
- Cleanse Identifier Fields: Remove extraneous leading and trailing characters (e.g., spaces, asterisks, special symbols) from policy numbers, claim IDs, or client identifiers to ensure consistency for matching and lookups. For example, " POL-123* " becomes "POL-123" when stripping " *". Characters inside the string are not affected.
- Standardize Textual Data: Trim leading/trailing whitespace from free-text fields like occupation descriptions, addresses, or underwriter notes before analysis or storage.
- Prepare Data for Joins: Ensure that join keys consisting of string data are clean and consistently formatted to avoid join failures due to subtle differences like trailing spaces.
- Sanitize User Input: Clean user-provided search terms or filter values by removing unwanted characters before using them in queries.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
characters | str \| Expr | A string of characters to remove from both ends of each string. Can also be a Polars expression that evaluates to a string of characters. If None (default), removes whitespace (spaces, tabs, newlines, etc.). | None |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new ExpressionProxy with the specified characters removed from both ends of each string. |
Examples:
Scalar Example 1: Cleaning policy numbers by removing specific prefixes/suffixes and whitespace
Policy numbers might be recorded with inconsistent characters (e.g., "ID-", "*", spaces). We want to standardize them by removing these specific characters and any surrounding whitespace.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_policy_nos = {
"raw_policy_id": [
"ID-A123-XYZ*",
" B456 ",
"ID-C789*",
"D012-XYZ",
None,
" ID-E345* ",
],
"chars_to_remove_col": ["ID-*XYZ ", " ", "ID-*", "-XYZ", None, " *ID-"]
}
af = ActuarialFrame(data_policy_nos)
# Example 1a: Remove a fixed set of characters "ID-*XYZ " from policy IDs
af_cleaned_fixed = af.select(
af["raw_policy_id"].str.strip_chars("ID-*XYZ ").alias("cleaned_fixed_chars")
)
print("Cleaned with fixed characters 'ID-*XYZ ':")
print(af_cleaned_fixed.collect())
# Example 1b: Remove characters specified in another column
# This dynamically strips characters based on the 'chars_to_remove_col' for each row.
af_cleaned_dynamic = af.select(
af["raw_policy_id"].str.strip_chars(pl.col("chars_to_remove_col")).alias("cleaned_dynamic_chars")
)
print("\nCleaned with characters from 'chars_to_remove_col':")
print(af_cleaned_dynamic.collect())
# Example 1c: Remove only leading and trailing whitespace
af_trimmed_whitespace = af.select(
af["raw_policy_id"].str.strip_chars().alias("trimmed_whitespace_only") # characters=None
)
print("\nCleaned with default whitespace stripping:")
print(af_trimmed_whitespace.collect())
Cleaned with fixed characters 'ID-*XYZ ':
shape: (6, 1)
┌─────────────────────┐
│ cleaned_fixed_chars │
│ --- │
│ str │
╞═════════════════════╡
│ A123 │
│ B456 │
│ C789 │
│ 012 │
│ null │
│ E345 │
└─────────────────────┘
Cleaned with characters from 'chars_to_remove_col':
shape: (6, 1)
┌───────────────────────┐
│ cleaned_dynamic_chars │
│ --- │
│ str │
╞═══════════════════════╡
│ A123 │
│ B456 │
│ C789 │
│ D012 │
│ null │
│ E345 │
└───────────────────────┘
Cleaned with default whitespace stripping:
shape: (6, 1)
┌───────────────────────────┐
│ trimmed_whitespace_only │
│ --- │
│ str │
╞═══════════════════════════╡
│ ID-A123-XYZ* │
│ B456 │
│ ID-C789* │
│ D012-XYZ │
│ null │
│ ID-E345* │
└───────────────────────────┘
Vector (List Shimming) Example: Cleaning lists of product add-on codes
Product codes for add-ons might be stored in a list, with potential unwanted characters like asterisks, hyphens, or spaces.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_addons = {
"policy_id": ["P1001", "P1002"],
"addon_codes_raw": [
["*RIDER_A- ", " -RIDER_B*", "BASE_PLAN"],
[None, " *-RIDER_C- ", "\tRIDER_D\t*"]
]
}
af_addons = ActuarialFrame(data_addons).with_columns(
pl.col("addon_codes_raw").cast(pl.List(pl.String))
)
# Strip asterisks, hyphens, spaces, and tabs from each code in the lists
af_cleaned_addons = af_addons.select(
af_addons["addon_codes_raw"].str.strip_chars(" *-#\t").alias("cleaned_addon_codes") # Added '#' to demonstrate it's ignored if not present
)
print(af_cleaned_addons.collect())
shape: (2, 1)
┌───────────────────────────────────┐
│ cleaned_addon_codes │
│ --- │
│ list[str] │
╞═══════════════════════════════════╡
│ ["RIDER_A", "RIDER_B", "BASE_PLA… │
│ [null, "RIDER_C", "RIDER_D"] │
└───────────────────────────────────┘
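Because characters is a set rather than a literal string, it can help to try the rule on a few single values. The sketch below uses Python's built-in str.strip, which follows the same character-set convention; strip_chars_value is a hypothetical helper and the codes are made-up sample values.

```python
def strip_chars_value(text, characters=None):
    """Hypothetical single-value stand-in for .str.strip_chars."""
    return None if text is None else text.strip(characters)

codes = ["ID-A123-XYZ*", "  B456  ", "DIX-900*", "A-1*2", None]

# Any run of the listed characters is removed from both ends,
# in any order; stripping stops at the first character not in the set
stripped = [strip_chars_value(c, "ID-*XYZ ") for c in codes]

# characters=None trims whitespace only
trimmed = [strip_chars_value(c) for c in codes]

print(stripped)
print(trimmed)
```

"DIX-900*" loses its leading D, I, X, and hyphen even though it never contained the text "ID-", while the hyphen and asterisk inside "A-1*2" are left alone because they are not at either end.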
strip_chars_start(characters=None)
Removes specified leading characters from strings.
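Useful for standardizing data by removing unwanted leading characters or initial whitespace, for instance trimming spaces from the beginning of address lines or stripping leading marker characters from policy numbers.
It mirrors Polars' Expr.str.strip_chars_start: characters is treated as a set, and every leading character found in that set is removed. Passing "TEMP-" therefore strips any leading run of T, E, M, P, and -, not only the literal prefix; use strip_prefix to remove an exact prefix. If no characters are specified, it defaults to removing leading whitespace.
When applied to List[String] columns (e.g., a list of historical status codes for a policy), the operation is performed element-wise.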
When to use
- Normalizing Prefixed Identifiers: Removing consistent prefixes from identifiers like policy numbers (e.g., "PN-", "TEMP_"), claim codes (e.g., "CL-"), or agent codes to get the core identifier.
- Cleaning Leading Characters in Text Fields: Removing leading non-essential characters (e.g., bullets, numbers, special symbols, spaces) from free-text fields like notes, descriptions, or imported data before further processing.
- Standardizing Data from Multiple Sources: If different source systems prefix the same data differently, this function can help unify them by removing those specific leading characters.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
characters | str \| Expr | A string of characters to remove from the start of each string. Can also be a Polars expression that evaluates to a string of characters. If None (default), removes leading whitespace (spaces, tabs, newlines, etc.). | None |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new ExpressionProxy with the specified leading characters removed. |
Examples:
Scalar Example: Removing prefixes from legacy system IDs and leading whitespace
Legacy system IDs might have prefixes like "LEG_", "OLD-", or be padded with spaces.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_ids = {
"legacy_id": [
"LEG_POL123",
" OLD-CLM456",
"POL789",
None,
"LEG_ UW001", # Note the space after LEG_
" TRN999"
],
"prefixes_to_strip": ["LEG_", "OLD-", "NONEXISTENT_", None, "LEG_ ", " "]
}
af = ActuarialFrame(data_ids)
# Example 1a: Remove a fixed prefix "LEG_"
af_no_leg_prefix = af.select(
af["legacy_id"].str.strip_chars_start("LEG_").alias("id_no_leg_prefix")
)
print("Stripping fixed prefix 'LEG_':")
print(af_no_leg_prefix.collect())
# Example 1b: Remove leading whitespace only (characters=None)
af_trimmed_space = af.select(
af["legacy_id"].str.strip_chars_start().alias("id_trimmed_leading_space")
)
print("\nStripping leading whitespace only:")
print(af_trimmed_space.collect())
# Example 1c: Remove prefixes defined in another column
# This will strip any character found in the corresponding 'prefixes_to_strip' string from the start.
af_dynamic_prefix = af.select(
af["legacy_id"].str.strip_chars_start(pl.col("prefixes_to_strip")).alias("id_dynamic_prefix_removed")
)
print("\nStripping prefixes from 'prefixes_to_strip' column (character-wise from start):")
print(af_dynamic_prefix.collect())
Stripping fixed prefix 'LEG_':
shape: (6, 1)
┌────────────────────┐
│ id_no_leg_prefix │
│ --- │
│ str │
╞════════════════════╡
│ POL123 │
│ OLD-CLM456 │
│ POL789 │
│ null │
│ UW001 │
│ TRN999 │
└────────────────────┘
Stripping leading whitespace only:
shape: (6, 1)
┌───────────────────────────┐
│ id_trimmed_leading_space │
│ --- │
│ str │
╞═══════════════════════════╡
│ LEG_POL123 │
│ OLD-CLM456 │
│ POL789 │
│ null │
│ LEG_ UW001 │
│ TRN999 │
└───────────────────────────┘
Stripping prefixes from 'prefixes_to_strip' column (character-wise from start):
shape: (6, 1)
┌─────────────────────────────┐
│ id_dynamic_prefix_removed │
│ --- │
│ str │
╞═════════════════════════════╡
│ POL123 │
│ OLD-CLM456 │
│ POL789 │
│ null │
│ UW001 │
│ TRN999 │
└─────────────────────────────┘
Vector (List Shimming) Example: Cleaning lists of temporary transaction remarks
Transaction remarks might be stored in lists, with some prefixed by "TEMP: " or spaces.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_remarks = {
"policy_id": ["TRN01", "TRN02"],
"transaction_remarks_raw": [
["TEMP: Initial assessment", " Adjustment processed", "Final Review"],
[None, "TEMP: Hold for now", "TEMP: Resolved", "Status: OK"]
]
}
af_remarks = ActuarialFrame(data_remarks).with_columns(
pl.col("transaction_remarks_raw").cast(pl.List(pl.String))
)
# Example 2a: Strip fixed prefix "TEMP: " from each remark in the lists
af_cleaned_remarks_prefix = af_remarks.select(
af_remarks["transaction_remarks_raw"].str.strip_chars_start("TEMP: ").alias("cleaned_remarks_prefix")
)
print("Cleaned remarks (prefix 'TEMP: '):")
print(af_cleaned_remarks_prefix.collect())
# Example 2b: Strip leading whitespace from list elements
af_cleaned_remarks_space = af_remarks.select(
af_remarks["transaction_remarks_raw"].str.strip_chars_start().alias("cleaned_remarks_space")
)
print("\nCleaned remarks (leading whitespace):")
print(af_cleaned_remarks_space.collect())
Cleaned remarks (prefix 'TEMP: '):
shape: (2, 1)
┌────────────────────────────────────────────────────────────────────────────┐
│ cleaned_remarks_prefix │
│ --- │
│ list[str] │
╞════════════════════════════════════════════════════════════════════════════╡
│ ["Initial assessment", "Adjustment processed", "Final Review"] │
│ [null, "Hold for now", "Resolved", "Status: OK"] │
└────────────────────────────────────────────────────────────────────────────┘
Cleaned remarks (leading whitespace):
shape: (2, 1)
┌────────────────────────────────────────────────────────────────────────────┐
│ cleaned_remarks_space │
│ --- │
│ list[str] │
╞════════════════════════════════════════════════════════════════════════════╡
│ ["TEMP: Initial assessment", "Adjustment processed", "Final Review"] │
│ [null, "TEMP: Hold for now", "TEMP: Resolved", "Status: OK"] │
└────────────────────────────────────────────────────────────────────────────┘
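The difference between stripping a set of leading characters and removing an exact prefix is easy to miss. The sketch below contrasts the two with Python built-ins, str.lstrip for the character-set rule and str.removeprefix (Python 3.9+) for the exact-prefix rule. The IDs are made-up sample values and gaspatchio is not imported.

```python
ids = ["TEMP-PT100", "TEMP-A200", "B300", None]

# Character-set rule (what strip_chars_start follows):
# every leading T, E, M, P, or "-" is removed
as_char_set = [None if s is None else s.lstrip("TEMP-") for s in ids]

# Exact-prefix rule (what strip_prefix follows):
# only the literal text "TEMP-" is removed, at most once
as_prefix = [None if s is None else s.removeprefix("TEMP-") for s in ids]

print(as_char_set)
print(as_prefix)
```

"TEMP-PT100" shows the hazard: the character-set rule also consumes the "PT" that belongs to the identifier, so strip_prefix is the safer choice when the goal is to drop a known prefix.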
strip_prefix(prefix)
Remove a prefix from each string.
The prefix is stripped whenever it occurs at the start of the string. Strings without the prefix are returned unchanged. On columns containing lists of strings, the removal happens element by element.
When to use
- Cleaning temporary identifiers such as TEMP-123 once a policy is fully underwritten.
- Harmonizing product codes from different administration systems before mapping them to an actuarial model.
- Stripping LEGACY- markers from lists of rider codes imported from historical sources.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
prefix | str \| Expr | Prefix to remove. May be a literal string or an expression that evaluates to a string. | required |
Returns:
Type | Description |
---|---|
'ExpressionProxy' | ExpressionProxy with the prefix removed. |
Examples:
Scalar example – cleaning policy IDs
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(set_tbl_width_chars=100):
    af = ActuarialFrame({"pol_id_raw": ["TEMP-001", "TEMP-002", "003", None]})
    cleaned = af.select(
        af["pol_id_raw"].str.strip_prefix("TEMP-").alias("pol_id")
    ).collect()
    print(cleaned)
shape: (4, 1)
┌────────┐
│ pol_id │
│ --- │
│ str │
╞════════╡
│ 001 │
│ 002 │
│ 003 │
│ null │
└────────┘
Vector example – removing LEGACY- from feature codes
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
af = ActuarialFrame({
"policy_key": ["POLICY_A", "POLICY_B"],
"feature_codes_raw": [
["LEGACY-RIDER1", "NEW_FEATURE_X", "LEGACY-BENEFIT2"],
[None, "LEGACY-COVERAGE_Y", "STANDARD_Z"],
],
})
af = af.with_columns(
af["feature_codes_raw"].cast(pl.List(pl.String))
)
with pl.Config(set_tbl_width_chars=120, fmt_str_lengths=100):
    cleaned = af.select(
        af["feature_codes_raw"].str.strip_prefix("LEGACY-").alias("cleaned_feature_codes")
    ).collect()
    print(cleaned)
shape: (2, 1)
┌─────────────────────────────────────────┐
│ cleaned_feature_codes │
│ --- │
│ list[str] │
╞═════════════════════════════════════════╡
│ ["RIDER1", "NEW_FEATURE_X", "BENEFIT2"] │
│ [null, "COVERAGE_Y", "STANDARD_Z"] │
└─────────────────────────────────────────┘
strip_suffix(suffix)
Remove a suffix from each string.
If a string does not end with the given suffix, it is returned unchanged. For List[String] columns, the operation is applied element-wise.
When to use
- Normalizing coverage names that include trailing version codes such as "-OLD".
- Preparing ledger accounts by removing year suffixes like "-2024" before comparing periods.
- Cleaning temporary identifiers imported from external systems (for example, removing a trailing "-TMP").
Parameters:
Name | Type | Description | Default |
---|---|---|---|
suffix | str \| Expr | The suffix to remove. Either a string literal or an expression resolving to a string. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | The expression with the suffix removed. |
Examples:
Scalar example – normalize plan names
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"plan_name_raw": ["Term Basic-OLD", "Income Protection-OLD", "Annuity Plus", None]
}
af = ActuarialFrame(data)
result = af.select(
af["plan_name_raw"].str.strip_suffix("-OLD").alias("plan_name")
)
print(result.collect())
shape: (4, 1)
┌───────────────────────┐
│ plan_name │
│ --- │
│ str │
╞═══════════════════════╡
│ Term Basic │
│ Income Protection │
│ Annuity Plus │
│ null │
└───────────────────────┘
Vector (list) example – clean trailing punctuation in claim notes
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
notes_data = {
"claim_id": ["C1", "C2"],
"notes": [["Approved.", "Paid."], [None, "In Review."]],
}
af_list = ActuarialFrame(notes_data)
af_list = af_list.with_columns(
af_list["notes"].cast(pl.List(pl.String))
)
cleaned = af_list.select(
af_list["notes"].str.strip_suffix(".").alias("notes_cleaned")
)
print(cleaned.collect())
shape: (2, 1)
┌────────────────────────┐
│ notes_cleaned │
│ --- │
│ list[str] │
╞════════════════════════╡
│ ["Approved", "Paid"] │
│ [null, "In Review"] │
└────────────────────────┘
strptime(dtype, format=None, *, strict=True, exact=True, cache=True, ambiguous='raise', **kwargs)
Convert string values to Date, Datetime, or Time.
This method parses textual date or time information into Polars temporal types. For List[String] columns, each element is parsed individually.
When to use
- Convert policy issue or claim reporting dates that are stored as strings in raw data extracts.
- Parse lists of event timestamps—such as claim status updates—when building experience studies or exposure models.
- Ingest external datasets from underwriting or administration systems where date fields come in a variety of text formats.
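Parameters:
Name | Type | Description | Default |
---|---|---|---|
dtype | 'PolarsTemporalType' | The Polars temporal type to convert to (Date, Datetime, or Time). | required |
format | Optional[str] | The strftime/strptime format string. If None, the format is inferred from the data. | None |
strict | bool | If True, raise an error when a value cannot be parsed; if False, unparseable values become null. | True |
exact | bool | If True, require an exact format match; if False, allow the format to match anywhere in the string. | True |
cache | bool | If True, use a cache of unique parsed values to speed up conversion. | True |
ambiguous | str \| Expr | How to handle ambiguous datetimes, such as daylight-saving transitions. Options are 'raise', 'earliest', 'latest', and 'null'. | 'raise' |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | Strings converted to the specified temporal type. |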
Examples:
Scalar Example: Parsing policy issue dates
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data = {
"policy_id": ["A100", "B200", "C300"],
"issue_date_str": [
"2021-01-15",
"20/02/2022",
"2023-03-10 14:30:00"
]
}
af = ActuarialFrame(data)
af_parsed_dates = af.select(
af["issue_date_str"].str.strptime(pl.Date, "%Y-%m-%d", strict=False).alias("issue_date_strict_fmt"),
af["issue_date_str"].str.strptime(pl.Date, "%d/%m/%Y", strict=False).alias("issue_date_dmy_fmt"),
af["issue_date_str"].str.strptime(pl.Datetime, "%Y-%m-%d %H:%M:%S", strict=False).alias("issue_datetime"),
)
result = af_parsed_dates.collect()
print(result)
shape: (3, 3)
┌───────────────────────┬────────────────────┬─────────────────────┐
│ issue_date_strict_fmt ┆ issue_date_dmy_fmt ┆ issue_datetime │
│ --- ┆ --- ┆ --- │
│ date ┆ date ┆ datetime[μs] │
╞═══════════════════════╪════════════════════╪═════════════════════╡
│ 2021-01-15 ┆ null ┆ null │
│ null ┆ 2022-02-20 ┆ null │
│ null ┆ null ┆ 2023-03-10 14:30:00 │
└───────────────────────┴────────────────────┴─────────────────────┘
Vector Example: Parsing lists of event timestamps
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_list = {
"claim_id": ["CL001"],
"event_timestamps_str": [["2023-04-01T10:00:00", "2023-04-01T10:05:00", "Invalid"]],
}
af_list = ActuarialFrame(data_list).with_columns(
pl.col("event_timestamps_str").cast(pl.List(pl.String))
)
af_parsed_list = af_list.select(
af_list["event_timestamps_str"].str.strptime(
pl.Datetime, "%Y-%m-%dT%H:%M:%S", strict=False
).alias("event_datetimes_μs")
)
result = af_parsed_list.collect()
print(result)
shape: (1, 1)
┌──────────────────────────────────────────────────┐
│ event_datetimes_μs │
│ --- │
│ list[datetime[μs]] │
╞══════════════════════════════════════════════════╡
│ [2023-04-01 10:00:00, 2023-04-01 10:05:00, null] │
└──────────────────────────────────────────────────┘
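When a date field arrives in more than one text format, a common pattern is to parse non-strictly with each format and keep the first result that is not null. The sketch below shows that idea with Python's standard datetime module. parse_date and parse_first_match are hypothetical helpers, Python's format directives are assumed to be close enough to those accepted by Polars for these simple patterns, and gaspatchio is not imported.

```python
from datetime import date, datetime

def parse_date(text, fmt, strict=True):
    """Hypothetical single-value stand-in for .str.strptime(pl.Date, fmt)."""
    if text is None:
        return None
    try:
        return datetime.strptime(text, fmt).date()
    except ValueError:
        if strict:
            raise  # strict mode surfaces the bad value
        return None  # non-strict mode yields a null instead

def parse_first_match(text, formats):
    """Try each format in order and keep the first successful parse."""
    for fmt in formats:
        parsed = parse_date(text, fmt, strict=False)
        if parsed is not None:
            return parsed
    return None

raw_dates = ["2021-01-15", "20/02/2022", "not a date", None]
iso_only = [parse_date(s, "%Y-%m-%d", strict=False) for s in raw_dates]
combined = [parse_first_match(s, ["%Y-%m-%d", "%d/%m/%Y"]) for s in raw_dates]

print(iso_only)
print(combined)
```

Values that match none of the formats stay null, which makes them easy to count or inspect before they reach a projection.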
to_lowercase()
Converts all characters in string columns to lowercase.
This function standardizes textual data by converting all characters in a string column to lowercase. This is essential for ensuring consistency in data fields critical for actuarial analysis, such as system codes, free-text fields like occupation or medical conditions, or external data sources, facilitating accurate matching, aggregation, and text analysis.
When to use
- Normalizing Text for Analysis: Preparing free-text fields (e.g., underwriting notes, claim descriptions, occupation details) for text mining or NLP by ensuring terms like "SMOKER", "Smoker", and "smoker" are treated identically.
- Improving Data Matching with External Sources: When integrating data from various systems or third-party providers where case consistency is not guaranteed (e.g., matching addresses, names, or city information).
- Standardizing User Input: Converting user-entered data (e.g., search terms, filter criteria) to a consistent case before processing or querying.
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | An ExpressionProxy with all strings converted to lowercase. |
Examples:
Scalar Example: Normalizing occupation descriptions for risk analysis
Occupation descriptions might be entered in various casings. Converting to lowercase helps in standardizing them for consistent risk factor analysis or grouping.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id": ["POL001", "POL002", "POL003", "POL004"],
"occupation_raw": ["Engineer", "software DEVELOPER", "Teacher", "Project Manager"]
}
af = ActuarialFrame(data)
af_lower_occupation = af.select(
af["occupation_raw"].str.to_lowercase().alias("occupation_normalized")
)
print(af_lower_occupation.collect())
shape: (4, 1)
┌───────────────────────┐
│ occupation_normalized │
│ --- │
│ str │
╞═══════════════════════╡
│ engineer │
│ software developer │
│ teacher │
│ project manager │
└───────────────────────┘
Vector Example: Lowercasing medical condition codes from multiple sources
Medical condition codes might come from different systems with varying casing. Lowercasing them ensures they can be consistently mapped or analyzed.
from gaspatchio_core.frame.base import ActuarialFrame
import polars as pl
data_medical_codes = {
"claim_id": ["C001", "C002"],
"condition_codes_list": [
["DIAB_T2", "HBP", "ASTHMA"], # DIAB_T2 = Type 2 Diabetes, HBP = High Blood Pressure
["hbp", None, "copd"] # COPD = Chronic Obstructive Pulmonary Disease
]
}
af_codes = ActuarialFrame(data_medical_codes)
# Ensure the list column has the correct Polars type for the string operation
af_codes = af_codes.with_columns(
af_codes["condition_codes_list"].cast(pl.List(pl.String))
)
af_lower_codes = af_codes.select(
af_codes["condition_codes_list"].str.to_lowercase().alias("lower_condition_codes")
)
print(af_lower_codes.collect())
shape: (2, 1)
┌─────────────────────────────────────┐
│ lower_condition_codes │
│ --- │
│ list[str] │
╞═════════════════════════════════════╡
│ ["diab_t2", "hbp", "asthma"] │
│ ["hbp", null, "copd"] │
└─────────────────────────────────────┘
to_uppercase()
Converts all characters in string columns to uppercase.
This function standardizes textual data by converting all characters in a string column to uppercase. This is essential for ensuring consistency in data fields critical for actuarial analysis, such as policy status codes, product identifiers, or geographical regions, facilitating accurate matching, aggregation, and reporting.
When to use
- Standardizing Categorical Data: Ensuring that codes like policy status (e.g., "active", "Active", "ACTIVE" all become "ACTIVE"), gender codes (e.g., "m", "F" become "M", "F"), or smoker status (e.g., "non-smoker", "Smoker" become "NON-SMOKER", "SMOKER") are consistent for grouping and analysis.
- Improving Data Matching: Facilitating joins and lookups between different datasets where case sensitivity might cause mismatches (e.g., matching policyholder names or addresses from different sources).
- Enhancing Readability and Reporting: Presenting data in a uniform case for reports and dashboards, especially for identifiers or codes.
- Preparing Text for Analysis: As a preprocessing step before text mining or natural language processing tasks on fields like claim descriptions or underwriter notes, where case normalization can simplify pattern recognition.
- Simplifying Rule-Based Logic: When applying business rules that depend on string comparisons (e.g., identifying policies with specific rider codes like "ADB" or "WP" irrespective of their original casing).
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | A new ExpressionProxy with all strings converted to uppercase. |
Examples:
Scalar Example: Standardizing policy status codes
Policy status might be entered in various cases ("active", "lapsed", "ACTIVE"). Converting to uppercase ensures consistency for analysis.
from gaspatchio_core.frame.base import ActuarialFrame
data = {
"policy_id": ["S3001", "S3002", "S3003", "S3004"],
"status_raw": ["active", "lapsed", "Active", "PENDING"]
}
af = ActuarialFrame(data)
af_upper_status = af.select(
af["status_raw"].str.to_uppercase().alias("status_standardized")
)
print(af_upper_status.collect())
shape: (4, 1)
┌─────────────────────┐
│ status_standardized │
│ --- │
│ str │
╞═════════════════════╡
│ ACTIVE │
│ LAPSED │
│ ACTIVE │
│ PENDING │
└─────────────────────┘
Vector Example: Uppercasing rider codes for a policy
A policy might have multiple rider codes stored in a list. To ensure uniformity, we can convert all rider codes to uppercase.
from gaspatchio_core.frame.base import ActuarialFrame
data_policy_riders = {
"policy_id": ["R4001", "R4002", "R4003"],
"rider_codes_str": [
"adb,wp",
"ci,ltc,acc_death",
"gio"
]
}
af_riders = ActuarialFrame(data_policy_riders)
# Convert string to list for the string operation
af_riders = af_riders.with_columns(
af_riders["rider_codes_str"].str.split(",").alias("rider_codes_list")
)
af_upper_riders = af_riders.select(
af_riders["rider_codes_list"].str.to_uppercase().alias("upper_rider_codes")
)
print(af_upper_riders.collect())
shape: (3, 1)
┌────────────────────────────┐
│ upper_rider_codes │
│ --- │
│ list[str] │
╞════════════════════════════╡
│ ["ADB", "WP"] │
│ ["CI", "LTC", "ACC_DEATH"] │
│ ["GIO"] │
└────────────────────────────┘
zfill(length)
Pad strings with leading zeros to a minimum width.
Shorter values are padded on the left with zeros so each entry reaches `length` characters. For list columns, the padding is applied element-wise.
When to use
- Standardizing policy numbers from different administration systems before merging with valuation data
- Preparing zero-padded claim numbers for extracts sent to reinsurers or regulators
- Building fixed-width keys when joining to rating tables or mapping grids
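To make the fixed-width key use case concrete, here is a small plain-Python sketch. It does not use gaspatchio: the rating table, band codes, and the `lookup_factor` helper are all hypothetical, and Python's built-in `str.zfill` stands in for the column-level operation.

```python
# Hypothetical rating table keyed by a 5-character, zero-padded band code.
rating_table = {"00012": 1.05, "00345": 1.20, "06789": 0.95}

# Hypothetical policy extract where leading zeros were dropped upstream.
policy_bands = ["12", "345", "6789", None]


def lookup_factor(band, width=5):
    """Return the rating factor for a band code, or None if missing or unmatched."""
    if band is None:
        return None
    # Pad to the table's key width before looking up.
    return rating_table.get(band.zfill(width))


factors = [lookup_factor(b) for b in policy_bands]
print(factors)
```

Looking up the unpadded code `"12"` directly would find nothing, which is the mismatch that padding before the join avoids.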
Parameters:
Name | Type | Description | Default |
---|---|---|---|
length | int | The desired minimum length of the string. | required |
Returns:
Name | Type | Description |
---|---|---|
ExpressionProxy | 'ExpressionProxy' | Strings padded with leading zeros. |
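The behavior described above (minimum width, element-wise handling of lists, nulls passed through) can be modeled in a few lines of plain Python. This is an illustrative sketch only, not the library's implementation: `zfill_model` is a hypothetical helper built on Python's `str.zfill`, which also treats a leading sign character specially, something this page does not state for the library method.

```python
from typing import List, Optional, Union

Value = Optional[str]


def zfill_model(value: Union[Value, List[Value]], length: int):
    """Plain-Python model: pad strings to `length`, recurse into lists, keep None."""
    if value is None:
        return None
    if isinstance(value, list):
        # List input: apply the padding to each element in turn.
        return [zfill_model(item, length) for item in value]
    # Strings already at or beyond `length` are returned unchanged.
    return value.zfill(length)


print(zfill_model("45", 5))
print(zfill_model("6789", 3))
print(zfill_model(["A1", "B123", "C04"], 4))
print(zfill_model([None, "D56"], 4))
```

Because `length` is a minimum width, values that are already long enough are never truncated.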
Examples:
Scalar Example: Standardizing policy serial numbers
```python
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data = {"policy_serial": ["123", "45", "6789", None, "1"]}
    af = ActuarialFrame(data)
    result = af.select(
        af["policy_serial"].str.zfill(5).alias("zfilled_serial")
    )
    print(result.collect())
```
```text
shape: (5, 1)
┌────────────────┐
│ zfilled_serial │
│ --- │
│ str │
╞════════════════╡
│ 00123 │
│ 00045 │
│ 06789 │
│ null │
│ 00001 │
└────────────────┘
```
Vector Example: Padding numerical components in claim codes
```python
import polars as pl
from gaspatchio_core.frame.base import ActuarialFrame
with pl.Config(fmt_str_lengths=100):
    data = {
        "claim_batch": ["B01", "B02"],
        "item_codes": [["A1", "B123", "C04"], [None, "D56"]],
    }
    af = ActuarialFrame(data)
    # Ensure the column is typed as a list of strings before padding
    af = af.with_columns(
        af["item_codes"].cast(pl.List(pl.String))
    )
    result = af.select(
        af["item_codes"].str.zfill(4).alias("zfilled_item_codes")
    )
    print(result.collect())
```
```text
shape: (2, 1)
┌──────────────────────────┐
│ zfilled_item_codes │
│ --- │
│ list[str] │
╞══════════════════════════╡
│ ["00A1", "B123", "0C04"] │
│ [null, "0D56"] │
└──────────────────────────┘
```