DataMasque Portal

Deterministic Masking

What is Deterministic Masking?

Deterministic Masking allows the same entity to receive the same mask, both within and across documents.

For example:

| Patient ID | Original Text | Masked Text |
|---|---|---|
| PATIENT_001 | Tarzan is sleeping | Jefferie is sleeping |
| PATIENT_001 | Tarzan is happy. | Jefferie is happy. |
| PATIENT_002 | Jane is recovering | Adin is recovering |

All names that share a patient_id receive the same masked value.

This can be useful for:

  • preserving referential integrity
  • preserving consistency in datasets for training downstream AI/ML models

Deterministic masking is not unique to unstructured masking; for more information, see Deterministic masking.

Which Hash Source to Use?

The right hash source depends on which values should share a mask. For example, when masking PATIENT_FIRST_NAME, you might want consistency based on different criteria:

| Use Case | Recommended Approach | Example |
|---|---|---|
| Same patient ID, same mask | hash_columns with patient_id | patient_id=001: "John" → "Alice", "Jane" → "Alice"; patient_id=002: "John" → "Bob" |
| Same name, same mask | hash_sources with self: entity | "John" → "Alice" (everywhere); "Jane" → "Carol" (everywhere) |
| Same doctor, same mask | hash_sources with label: "DOCTOR" | Text with Dr. "Smith": "John" → "Alice", "Jane" → "Alice"; Text with Dr. "Jones": "John" → "Bob" |
| Same patient ID and name, same mask | hash_columns with patient_id + hash_sources with self: entity | patient_id=001: "John" → "Alice", "Jane" → "Carol"; patient_id=002: "John" → "Bob" |

hash_columns

Use hash_columns to hash on column values from the current row. This works the same as deterministic masking for structured data.

hash_columns can be specified at both the task level (applying to all columns) and the column level (overriding the task-level default).

- column: raw_text
  hash_columns:
    - patient_id
  masks:
    - type: unstructured_text
      matchers:
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"

| patient_id | raw_text | masked_text |
|---|---|---|
| PATIENT_001 | Tarzan sleeps well. | Jefferie sleeps well. |
| PATIENT_001 | Tarzan greets Jane | Jefferie greets Jefferie |
| PATIENT_002 | Tarzan is recovering | Adin is recovering |
| PATIENT_002 | Jane visited today | Adin visited today |

All FIRST_NAME entities with the same patient_id will receive the same masked value.
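
One way to picture this behaviour is as a lookup keyed by a hash of the column value. The sketch below is a conceptual model only, not DataMasque's implementation: the REPLACEMENTS pool, the mask_name helper, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib

# Hypothetical pool of replacement names (illustration only).
REPLACEMENTS = ["Jefferie", "Adin", "Bryor", "Maeola", "Khalik"]


def mask_name(hash_key: str) -> str:
    """Pick a replacement name that depends only on hash_key."""
    digest = hashlib.sha256(hash_key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(REPLACEMENTS)
    return REPLACEMENTS[index]


rows = [
    ("PATIENT_001", "Tarzan"),
    ("PATIENT_001", "Jane"),
    ("PATIENT_002", "Tarzan"),
]

# The original name is ignored: only patient_id feeds the hash,
# so every name in a PATIENT_001 row maps to the same replacement.
masked = [(patient_id, mask_name(patient_id)) for patient_id, _name in rows]
```

With a pool this small, two different patient IDs can land on the same replacement; the point of the sketch is only that equal keys always give equal masks.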

hash_sources

Use hash_sources to hash on entities detected within unstructured text. For the full specification of hash_sources parameters, see Ruleset YAML Specification.

self: entity

Use self: entity to hash on the detected entity itself.

- column: raw_text
  masks:
    - type: unstructured_text
      hash_sources:
        - self: entity
      matchers:
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"

| patient_id | raw_text | masked_text |
|---|---|---|
| PATIENT_001 | Tarzan sleeps well. | Bryor sleeps well. |
| PATIENT_001 | Tarzan greets Jane | Bryor greets Maeola |
| PATIENT_002 | Tarzan is recovering | Bryor is recovering |
| PATIENT_002 | Jane visited today | Maeola visited today |

All FIRST_NAME entities with the same text will receive the same masked value. For example, "Tarzan" is always masked to "Bryor", and "Jane" is always masked to "Maeola".

Label-based hashing

Use label: "LABEL" to hash on specific entities within the text.

- column: raw_text
  masks:
    - type: unstructured_text
      hash_sources:
        - label: "PATIENT_ID"
      matchers:
        regex:
          - label: "PATIENT_ID"
            pattern: 'PATIENT_\d{3}'
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"

| raw_text | masked_text |
|---|---|
| PATIENT_001: Tarzan sleeps well. Tarzan eats well. | PATIENT_001: Bryor sleeps well. Bryor eats well. |
| PATIENT_001: Tarzan is recovering. | PATIENT_001: Bryor is recovering. |
| PATIENT_002: Tarzan greets Jane. | PATIENT_002: Khalik greets Khalik. |

All FIRST_NAME entities in a document receive the same masked value, derived from the first PATIENT_ID found in that document.
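
As a rough illustration of where the hash input comes from in this mode, the sketch below pulls the first PATIENT_ID out of each document with the same pattern as the regex matcher and hashes it. The helper names (hash_key_from_label, stable_digest) and the use of SHA-256 are assumptions for illustration, not DataMasque internals; what the product does when no matching label is found is not covered here.

```python
import hashlib
import re
from typing import Optional

# Same pattern as the PATIENT_ID regex matcher in the ruleset above.
PATIENT_ID = re.compile(r"PATIENT_\d{3}")


def hash_key_from_label(document: str) -> Optional[str]:
    """Return the first PATIENT_ID in the document, or None if there is none."""
    found = PATIENT_ID.search(document)
    return found.group(0) if found else None


def stable_digest(key: str) -> str:
    """Hash the key; equal keys always give equal digests."""
    return hashlib.sha256(key.encode("utf-8")).hexdigest()


doc_a = "PATIENT_001: Tarzan sleeps well. Tarzan eats well."
doc_b = "PATIENT_001: Tarzan is recovering."
doc_c = "PATIENT_002: Tarzan greets Jane."

# doc_a and doc_b share a key, so their names would share a mask;
# doc_c has a different key, so "Tarzan" there is masked independently.
keys = [hash_key_from_label(doc) for doc in (doc_a, doc_b, doc_c)]
```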

match and match_until

The optional match parameter selects which occurrence of the label to hash on, by zero-based index; negative values count from the end. This is useful when the text contains multiple matches for the same label.

Adding the optional match_until parameter defines an inclusive range of matches to hash on, starting at match and ending at match_until.

hash_sources:
  - label: "PATIENT_ID"
    match: 0
    match_until: 1

| Parameter(s) | Hashes on |
|---|---|
| match: 0 | First entity (default if not specified) |
| match: 1 | Second entity |
| match: -1 | Last entity |
| match: 0, match_until: 1 | First two entities (concatenated) |
| match: -2, match_until: -1 | Last two entities (concatenated) |
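
The indexing rules in the table can be modelled as an inclusive slice with Python-style negative indices. This is a sketch of the selection semantics as described above, not DataMasque code: select_hash_entities is a hypothetical helper, and raising IndexError for out-of-range values is an assumption of the sketch.

```python
from typing import List, Optional


def select_hash_entities(
    entities: List[str], match: int = 0, match_until: Optional[int] = None
) -> List[str]:
    """Return the entities addressed by match / match_until (inclusive range)."""
    n = len(entities)
    # Negative indices count back from the end of the list.
    start = match if match >= 0 else n + match
    if match_until is None:
        end = start  # a single entity
    else:
        end = match_until if match_until >= 0 else n + match_until
    if not (0 <= start <= end < n):
        raise IndexError("match / match_until outside the detected entities")
    return entities[start : end + 1]


ids = ["PATIENT_001", "PATIENT_002", "PATIENT_003"]

first = select_hash_entities(ids)
second = select_hash_entities(ids, match=1)
last = select_hash_entities(ids, match=-1)
first_two = select_hash_entities(ids, match=0, match_until=1)
last_two = select_hash_entities(ids, match=-2, match_until=-1)
```

When a range selects several entities, the table says they are concatenated before hashing; the exact concatenation format is not specified here, so the sketch returns them as a list.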

Combining hashes

Chaining together multiple hash_columns and hash_sources will create a composite hash.

Important: The order will affect the final result.

- column: raw_text
  hash_columns:
    - patient_id
    - employee_id
  masks:
    - type: unstructured_text
      hash_sources:
        - label: "HOSPITAL_ID"
        - self: entity
      matchers:
        regex:
          - label: "HOSPITAL_ID"
            pattern: 'H_\d{3}'
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"

In the example above, FIRST_NAME will be hashed (in order) on:

| Hash Input | Example |
|---|---|
| The patient_id | P_001 |
| The employee_id | E_001 |
| The first HOSPITAL_ID in the document | H_001 |
| The discovered entity text itself | Tarzan |
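
To see why order matters, consider a generic composite hash that feeds its inputs to a digest one after another. The sketch below assumes SHA-256 and a length-prefixed encoding; both are illustrative choices, and composite_digest is a hypothetical helper rather than the function DataMasque uses.

```python
import hashlib
from typing import Sequence


def composite_digest(parts: Sequence[str]) -> str:
    """Hash an ordered sequence of inputs into a single digest."""
    hasher = hashlib.sha256()
    for part in parts:
        data = part.encode("utf-8")
        # Length-prefix each part so ("ab", "c") and ("a", "bc") stay distinct.
        hasher.update(len(data).to_bytes(4, "big"))
        hasher.update(data)
    return hasher.hexdigest()


# Inputs in the order listed in the table above.
in_order = composite_digest(["P_001", "E_001", "H_001", "Tarzan"])

# Same values with the first two inputs swapped.
swapped = composite_digest(["E_001", "P_001", "H_001", "Tarzan"])
```

Here in_order and swapped differ even though they are built from the same four values, which is why reordering hash_columns or hash_sources entries changes the masked output.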

Limitations

Hash sources scope:

  • Hash sources only reference entities within the current document. You cannot hash on entities detected within other unstructured documents.
  • As a workaround, if the entities are easily extractable, consider extracting them as metadata into a separate column and then using hash_columns.
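
A minimal sketch of that workaround as a preprocessing step, run before masking: it finds a PATIENT_ID in any document of a group and copies it onto every row of that group as a new column. The case_id grouping key, the row layout, and the extract_shared_ids helper are all hypothetical; adapt them to however related documents are linked in your data.

```python
import re
from typing import Dict, List

PATIENT_ID = re.compile(r"PATIENT_\d{3}")


def extract_shared_ids(rows: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Copy the first PATIENT_ID found in a case onto every row of that case."""
    id_by_case: Dict[str, str] = {}
    for row in rows:
        found = PATIENT_ID.search(row["raw_text"])
        if found and row["case_id"] not in id_by_case:
            id_by_case[row["case_id"]] = found.group(0)
    # Rows whose case has no ID anywhere get an empty string.
    return [
        {**row, "patient_id": id_by_case.get(row["case_id"], "")} for row in rows
    ]


rows = [
    {"case_id": "C1", "raw_text": "PATIENT_001: Tarzan sleeps well."},
    {"case_id": "C1", "raw_text": "Tarzan greets Jane."},
    {"case_id": "C2", "raw_text": "PATIENT_002: Jane visited today."},
]

enriched = extract_shared_ids(rows)
```

The second row contains no ID of its own but now carries patient_id from its sibling document, so the new column can be listed under hash_columns as in the hash_columns example earlier on this page.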

No JSON/XML path support:

  • Unlike file masking, hash_sources for unstructured text does not support json_path, xpath, or file_path parameters.
  • Workarounds: