DataMasque Portal

Examples

Example 1: Medical Notes

In this example, we will see how to mask "FIRST_NAME" consistently in both structured and unstructured data within a medical_records table.

version: "1.0"
tasks:
  - type: mask_table
    table: medical_records
    key: row_id
    rules:
      - column: clinical_notes
        hash_columns:
          - patient_id
        masks:
          - type: unstructured_text
            matchers:
              context_sources:
                - label: "FIRST_NAME"
                  column: first_name

            masks:
              - label: "FIRST_NAME"
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed

      - column: first_name
        hash_columns:
          - patient_id
        masks:
          - type: from_file
            seed_file: DataMasque_firstNames_mixed.csv
            seed_column: firstname-mixed
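
| patient_id | first_name (before) | first_name (after) | clinical_notes (before)        | clinical_notes (after)        |
|------------|---------------------|--------------------|--------------------------------|-------------------------------|
| pt-001     | Bradley             | Korrie             | Patient Bradley still unwell.  | Patient Korrie still unwell.  |
| pt-001     | Bradley             | Korrie             | Follow-up for Bradley.         | Follow-up for Korrie.         |
| pt-002     | Amir                | Janah              | Amir has surgery scheduled.    | Janah has surgery scheduled.  |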

Key ideas:

  • Matchers detect labels.
  • Each label is assigned a mask.
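  • context_sources uses each row's first_name value to locate that name inside clinical_notes.
  • hash_columns makes the masking deterministic per patient_id, so the same patient receives the same replacement name in both columns and in every row.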
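
The ideas above can be illustrated with a small Python sketch. This is not DataMasque's implementation. It assumes, for illustration only, that the hash input is fed through SHA-256 to pick an entry from a replacement pool; `REPLACEMENT_NAMES`, `pick_replacement` and `mask_row` are hypothetical names, and the pool stands in for the seed file.

```python
import hashlib
import re

# Hypothetical replacement pool; the ruleset above draws from a seed file instead.
REPLACEMENT_NAMES = ["Korrie", "Janah", "Mika", "Tobin", "Leila"]


def pick_replacement(hash_value: str, pool: list[str]) -> str:
    """Deterministically choose a pool entry from the hash input."""
    digest = hashlib.sha256(hash_value.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(pool)
    return pool[index]


def mask_row(row: dict) -> dict:
    """Mask first_name and each whole-word occurrence of it in clinical_notes."""
    # The patient_id plays the role of hash_columns: same id, same replacement.
    replacement = pick_replacement(row["patient_id"], REPLACEMENT_NAMES)
    # The row's own first_name plays the role of context_sources.
    pattern = re.compile(rf"\b{re.escape(row['first_name'])}\b")
    return {
        **row,
        "first_name": replacement,
        "clinical_notes": pattern.sub(lambda _m: replacement, row["clinical_notes"]),
    }


rows = [
    {"patient_id": "pt-001", "first_name": "Bradley",
     "clinical_notes": "Patient Bradley still unwell."},
    {"patient_id": "pt-001", "first_name": "Bradley",
     "clinical_notes": "Follow-up for Bradley."},
    {"patient_id": "pt-002", "first_name": "Amir",
     "clinical_notes": "Amir has surgery scheduled."},
]
masked = [mask_row(row) for row in rows]
for row in masked:
    print(row["patient_id"], row["first_name"], "|", row["clinical_notes"])
```

Both pt-001 rows print the same replacement name, in the column and in the note text, because the choice depends only on patient_id. The actual names chosen will differ from those in the table above.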

Next steps: See Deterministic Masking for more patterns.

Example 2: AI/ML Fine-Tuning Dataset

In this example, we will see how to mask sensitive data in a fine-tuning dataset using the ai_detect matcher while preserving consistency with hash_sources.

Note: Using the ai_detect matcher requires a connection with the DataMasque AI Engine.

version: "1.0"
tasks:
  - type: mask_table
    table: training_data
    key: row_id
    rules:
      - column: conversation_logs
        masks:
          - type: unstructured_text
            hash_sources:
              - self: entity
            matchers:
              ai_detect:
                - label: "FIRST_NAME"
                  annotation_config:
                    use_preset: "FIRST_NAME"

                - label: "BIRTH_CITY"
                  annotation_config:
                    guidelines:
                      - "Cities/towns matching a user's place of birth"
                    valid_examples:
                      - text: "I was born in [Sydney]"
                    invalid_examples:
                      - text: "calling from our [Sydney] office"
                        reason: "not birth city"

            masks:
              - label: "FIRST_NAME"
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed

              - label: "BIRTH_CITY"
                masks:
                  - type: from_file
                    seed_file: DataMasque_US_addresses_large.csv
                    seed_column: city
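
| conversation_logs (before)                                | conversation_logs (after)                                 |
|-----------------------------------------------------------|-----------------------------------------------------------|
| Hi, I'm Sara&h. For verification, I was born in Sydney    | Hi, I'm Jalisha. For verification, I was born in Albion   |
| Thank you Sara&h. I've verified your birth city of Sydney | Thank you Jalisha. I've verified your birth city of Albion |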

Note: Sara&h is deliberately mis-spelt to showcase ai_detect's handling of typos.

Key ideas:

  • ai_detect can match on entities with typos or ambiguities.
  • hash_sources ensures consistent masking within a document.
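
A conceptual Python sketch of the consistency idea follows. It is not DataMasque's implementation and does not call the AI Engine: the detected entities are supplied by hand, and it assumes that hashing on the entity itself means the matched text is the hash input. `POOLS`, `pick_replacement` and `mask_document` are hypothetical names.

```python
import hashlib
import re

# Hypothetical replacement pools standing in for the seed files.
POOLS = {
    "FIRST_NAME": ["Jalisha", "Korrie", "Janah"],
    "BIRTH_CITY": ["Albion", "Salem", "Dover"],
}


def pick_replacement(label: str, entity_text: str) -> str:
    """Choose a replacement from the label's pool, keyed on the entity text."""
    pool = POOLS[label]
    digest = hashlib.sha256(f"{label}:{entity_text}".encode("utf-8")).digest()
    return pool[int.from_bytes(digest[:8], "big") % len(pool)]


def mask_document(lines: list[str], detected: list[tuple[str, str]]) -> list[str]:
    """Replace every detected (label, text) entity in a single pass per line."""
    mapping = {text: pick_replacement(label, text) for label, text in detected}
    if not mapping:
        return list(lines)
    # Longest entities first so a shorter entity cannot pre-empt a longer one.
    alternation = "|".join(
        re.escape(text) for text in sorted(mapping, key=len, reverse=True)
    )
    pattern = re.compile(alternation)
    return [pattern.sub(lambda m: mapping[m.group(0)], line) for line in lines]


lines = [
    "Hi, I'm Sara&h. For verification, I was born in Sydney",
    "Thank you Sara&h. I've verified your birth city of Sydney",
]
# Hand-written stand-in for what a detector might return.
detected = [("FIRST_NAME", "Sara&h"), ("BIRTH_CITY", "Sydney")]
masked_lines = mask_document(lines, detected)
for line in masked_lines:
    print(line)
```

Because the replacement depends only on the label and the matched text, both lines receive the same substitute name and the same substitute city. The values chosen will not necessarily match the table above.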

Next steps: See ai_detect for more on AI-powered entity detection.

Example 3: Customer Support Logs in S3

In this example, we will see how to use a context filter so that account codes are matched only when they appear as whole words, not as fragments of longer words.

version: "1.0"
tasks:
  - type: mask_file
    recurse: true
    include:
      - glob: "*.txt"
    rules:
      - masks:
          - type: unstructured_text
            matchers:
              seed_files:
                - label: "ACCOUNT_CODE"
                  seed_file: customer_account_codes.csv  # Contains AL, TK, MS, ...
                  seed_column: account_code
                  context_filters:
                    - type: match_whole_words_only

            masks:
              - label: "ACCOUNT_CODE"
                masks:
                  - type: from_fixed
                    value: "XX"

| File content | File content (masked, without context_filters) | File content (masked, with context_filters) |
|--------------|------------------------------------------------|---------------------------------------------|
| ALERT!! Account AL reported invoice discrepancy. Contact by end of day? | XXERT!! Account XX reported invoice discrepancy. Contact by end of day? | ALERT!! Account XX reported invoice discrepancy. Contact by end of day? |
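
The difference between the two masked results can be reproduced with a plain regular-expression sketch in Python. This only illustrates whole-word matching using `\b` word boundaries; it is not how DataMasque implements match_whole_words_only, and `mask_codes` is a hypothetical helper. The code list reuses the sample values from the seed-file comment.

```python
import re

ACCOUNT_CODES = ["AL", "TK", "MS"]  # sample values from the seed-file comment


def mask_codes(text: str, codes: list[str], whole_words_only: bool) -> str:
    """Replace each account code with "XX", optionally only at word boundaries."""
    alternation = "|".join(re.escape(code) for code in codes)
    if whole_words_only:
        pattern = rf"\b(?:{alternation})\b"
    else:
        pattern = rf"(?:{alternation})"
    return re.sub(pattern, "XX", text)


line = "ALERT!! Account AL reported invoice discrepancy. Contact by end of day?"

print(mask_codes(line, ACCOUNT_CODES, whole_words_only=False))
# XXERT!! Account XX reported invoice discrepancy. Contact by end of day?

print(mask_codes(line, ACCOUNT_CODES, whole_words_only=True))
# ALERT!! Account XX reported invoice discrepancy. Contact by end of day?
```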

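Key ideas:

  • The seed_files matcher labels any value found in the account_code column of the seed file.
  • The match_whole_words_only context filter discards matches that fall inside a longer word, so the AL in ALERT is left unmasked.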

Next steps: See Context Filters for all filter types.