Examples
- Example 1: Medical Notes
- Example 2: AI/ML Fine-Tuning Dataset
- Example 3: Customer Support Logs in S3
Example 1: Medical Notes
In this example, we will see how to mask "FIRST_NAME" consistently in both structured and unstructured data within a medical_records table.
version: "1.0"
tasks:
  - type: mask_table
    table: medical_records
    key: row_id
    rules:
      - column: clinical_notes
        hash_columns:
          - patient_id
        masks:
          - type: unstructured_text
            matchers:
              context_sources:
                - label: "FIRST_NAME"
                  column: first_name
            masks:
              - label: "FIRST_NAME"
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
      - column: first_name
        hash_columns:
          - patient_id
        masks:
          - type: from_file
            seed_file: DataMasque_firstNames_mixed.csv
            seed_column: firstname-mixed
| patient_id | first_name (before) | first_name (after) | clinical_notes (before) | clinical_notes (after) |
|---|---|---|---|---|
| pt-001 | Bradley | Korrie | Patient Bradley still unwell. | Patient Korrie still unwell. |
| pt-001 | Bradley | Korrie | Follow-up for Bradley. | Follow-up for Korrie. |
| pt-002 | Amir | Janah | Amir has surgery scheduled. | Janah has surgery scheduled. |
Key ideas:
- Matchers detect labels.
- Each label is assigned a mask.
- context_sources associates values from the first_name column with matches in clinical_notes.
- hash_columns keeps the replacement consistent for every row that shares the same patient_id.
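The consistency idea above can be sketched in a few lines of Python. This is an illustration only, not DataMasque's implementation: the replacement pool, the SHA-256 selection scheme, and the helper names pick_replacement and mask_row are all assumptions made for the example.

```python
import hashlib

# Hypothetical replacement pool; the ruleset above draws from a seed file instead.
REPLACEMENTS = ["Korrie", "Janah", "Talia", "Marek", "Oska"]


def pick_replacement(hash_value: str, pool: list) -> str:
    """Deterministically choose a pool entry from the hash column value."""
    digest = hashlib.sha256(hash_value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


def mask_row(row: dict, pool: list) -> dict:
    """Mask first_name and each occurrence of it inside clinical_notes."""
    replacement = pick_replacement(row["patient_id"], pool)
    return {
        **row,
        "first_name": replacement,
        # The structured column value acts as the "context source" for the text.
        "clinical_notes": row["clinical_notes"].replace(row["first_name"], replacement),
    }


rows = [
    {"patient_id": "pt-001", "first_name": "Bradley", "clinical_notes": "Patient Bradley still unwell."},
    {"patient_id": "pt-001", "first_name": "Bradley", "clinical_notes": "Follow-up for Bradley."},
    {"patient_id": "pt-002", "first_name": "Amir", "clinical_notes": "Amir has surgery scheduled."},
]
masked = [mask_row(row, REPLACEMENTS) for row in rows]
for row in masked:
    print(row["patient_id"], row["first_name"], "|", row["clinical_notes"])
```

Because the replacement depends only on patient_id, both pt-001 rows receive the same name in the column and in the free text. The specific names chosen depend on the hash and will not match the table above.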
Next steps: See Deterministic Masking for more patterns.
Example 2: AI/ML Fine-Tuning Dataset
In this example, we will see how to mask sensitive data in a fine-tuning dataset using the ai_detect matcher
while preserving consistency with hash_sources.
Note: Using the ai_detect matcher requires a connection to the DataMasque AI Engine.
version: "1.0"
tasks:
  - type: mask_table
    table: training_data
    key: row_id
    rules:
      - column: conversation_logs
        masks:
          - type: unstructured_text
            hash_sources:
              - self: entity
            matchers:
              ai_detect:
                - label: "FIRST_NAME"
                  annotation_config:
                    use_preset: "FIRST_NAME"
                - label: "BIRTH_CITY"
                  annotation_config:
                    guidelines:
                      - "Cities/towns matching a user's place of birth"
                    valid_examples:
                      - text: "I was born in [Sydney]"
                    invalid_examples:
                      - text: "calling from our [Sydney] office"
                        reason: "not birth city"
            masks:
              - label: "FIRST_NAME"
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
              - label: "BIRTH_CITY"
                masks:
                  - type: from_file
                    seed_file: DataMasque_US_addresses_large.csv
                    seed_column: city
| conversation_logs (before) | conversation_logs (after) |
|---|---|
| Hi, I'm Sara&h. For verification, I was born in Sydney | Hi, I'm Jalisha. For verification, I was born in Albion |
| Thank you Sara&h. I've verified your birth city of Sydney | Thank you Jalisha. I've verified your birth city of Albion |
Note: Sara&h is deliberately misspelt to showcase ai_detect's handling of typos.
Key ideas:
- ai_detect can match on entities with typos or ambiguities.
- hash_sources ensures consistent masking within a document.
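To make the consistency point concrete, here is a minimal Python sketch. It assumes that "self: entity" means the matched entity text itself is the hash input, which is an interpretation rather than something stated above. Detection is out of scope: the entities are supplied by hand, and the pools and function names are hypothetical stand-ins for the seed files and the real engine.

```python
import hashlib
import re

# Hypothetical pools standing in for the seed files named in the ruleset.
POOLS = {
    "FIRST_NAME": ["Jalisha", "Korrie", "Janah"],
    "BIRTH_CITY": ["Albion", "Dayton", "Fairview"],
}


def replacement_for(entity: str, label: str) -> str:
    """Pick a replacement by hashing the matched entity text (assumed behaviour)."""
    pool = POOLS[label]
    digest = hashlib.sha256(entity.encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def mask_entities(text: str, entities: dict) -> str:
    """Replace each known entity in one pass; entities maps text to label."""
    pattern = re.compile("|".join(re.escape(entity) for entity in entities))
    return pattern.sub(
        lambda match: replacement_for(match.group(0), entities[match.group(0)]),
        text,
    )


# Entities are given here; in the ruleset they would come from ai_detect.
detected = {"Sara&h": "FIRST_NAME", "Sydney": "BIRTH_CITY"}
logs = [
    "Hi, I'm Sara&h. For verification, I was born in Sydney",
    "Thank you Sara&h. I've verified your birth city of Sydney",
]
masked_logs = [mask_entities(line, detected) for line in logs]
for line in masked_logs:
    print(line)
```

Since the same entity text always hashes to the same pool entry, every mention of a given name or city is replaced identically across both lines.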
Next steps: See ai_detect for more on AI-powered entity detection.
Example 3: Customer Support Logs in S3
In this example, we will see how to match only customer account codes with the help of a context filter.
version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
      - glob: "*.txt"
    rules:
      - masks:
          - type: unstructured_text
            matchers:
              seed_files:
                - label: "ACCOUNT_CODE"
                  seed_file: customer_account_codes.csv # Contains AL, TK, MS, ...
                  seed_column: account_code
                  context_filters:
                    - type: match_whole_words_only
            masks:
              - label: "ACCOUNT_CODE"
                masks:
                  - type: from_fixed
                    value: "XX"
| File content | File content (masked, without context_filters) | File content (masked, with context_filters) |
|---|---|---|
| ALERT!! Account AL reported invoice discrepancy. Contact by end of day? | XXERT!! Account XX reported invoice discrepancy. Contact by end of day? | ALERT!! Account XX reported invoice discrepancy. Contact by end of day? |
Key ideas:
- context_filters can refine how matchers identify entities in text.
- Specifically, match_whole_words_only prevents partial matches within larger words.
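The difference between the two masked columns in the table can be reproduced with a small Python sketch. It approximates whole-word matching with the regex word boundary \b, which is an assumption about how such a filter might behave rather than a description of DataMasque's internals; the function names are hypothetical.

```python
import re

# The three codes listed in the ruleset comment; the real file may hold more.
ACCOUNT_CODES = ["AL", "TK", "MS"]
TEXT = "ALERT!! Account AL reported invoice discrepancy. Contact by end of day?"


def mask_substrings(text: str, codes: list, value: str = "XX") -> str:
    """Replace every occurrence of a code, even inside a longer word."""
    pattern = "|".join(re.escape(code) for code in codes)
    return re.sub(pattern, value, text)


def mask_whole_words(text: str, codes: list, value: str = "XX") -> str:
    """Replace a code only when it is bounded by non-word characters."""
    alternatives = "|".join(re.escape(code) for code in codes)
    return re.sub(r"\b(?:" + alternatives + r")\b", value, text)


print(mask_substrings(TEXT, ACCOUNT_CODES))
# XXERT!! Account XX reported invoice discrepancy. Contact by end of day?
print(mask_whole_words(TEXT, ACCOUNT_CODES))
# ALERT!! Account XX reported invoice discrepancy. Contact by end of day?
```

Without the boundary check, the "AL" at the start of "ALERT" is treated as an account code; with it, only the standalone "AL" is masked.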
Next steps: See Context Filters for all filter types.