Deterministic Masking
- What is Deterministic Masking?
- Which Hash Source to Use?
  - hash_columns
  - hash_sources
- Combining hashes
- Limitations
What is Deterministic Masking?
Deterministic Masking allows the same entity to receive the same mask, both within and across documents.
For example:
| Patient ID | Original Text | Masked Text |
|---|---|---|
| PATIENT_001 | Tarzan is sleeping | Jefferie is sleeping |
| PATIENT_001 | Tarzan is happy. | Jefferie is happy. |
| PATIENT_002 | Jane is recovering | Adin is recovering |
All names within the same patient_id receive the same masked value.
This can be useful for:
- preserving referential integrity
- preserving consistency in datasets for training downstream AI/ML models
Deterministic masking is not unique to unstructured masking; for more, see Deterministic masking.
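The idea can be illustrated with a minimal Python sketch. It is not the product's implementation: the function name `deterministic_mask`, the pool of replacement names, and the use of SHA-256 are all assumptions, chosen only to show how hashing a key gives a repeatable choice of mask.

```python
import hashlib

# Hypothetical pool of replacement names (illustration only).
REPLACEMENT_NAMES = ["Jefferie", "Adin", "Bryor", "Maeola", "Khalik"]


def deterministic_mask(hash_key: str, candidates=REPLACEMENT_NAMES) -> str:
    """Pick a replacement that depends only on hash_key."""
    digest = hashlib.sha256(hash_key.encode("utf-8")).digest()
    # Turn the first 8 bytes of the digest into an index into the pool.
    index = int.from_bytes(digest[:8], "big") % len(candidates)
    return candidates[index]


rows = [
    ("PATIENT_001", "Tarzan is sleeping"),
    ("PATIENT_001", "Tarzan is happy."),
    ("PATIENT_002", "Jane is recovering"),
]

# Every row with the same patient ID gets the same replacement name.
masked = [(patient_id, deterministic_mask(patient_id)) for patient_id, _ in rows]
```

Because the pool in this sketch is small, two different keys can still map to the same replacement; the sketch only guarantees that one key always maps to one value.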
Which Hash Source to Use?
The right hash source depends on which values should share a mask. For example, when masking PATIENT_FIRST_NAME, you might want consistency based on different criteria:
| Use Case | Recommended Approach | Example |
|---|---|---|
| Same patient ID, same mask | hash_columns with patient_id | patient_id=001: "John" → "Alice", "Jane" → "Alice"; patient_id=002: "John" → "Bob" |
| Same name, same mask | hash_sources with self: entity | "John" → "Alice" (everywhere); "Jane" → "Carol" (everywhere) |
| Same doctor, same mask | hash_sources with label: "DOCTOR" | Text with Dr. "Smith": "John" → "Alice", "Jane" → "Alice"; text with Dr. "Jones": "John" → "Bob" |
| Same patient ID and name, same mask | hash_columns with patient_id + hash_sources with self: entity | patient_id=001: "John" → "Alice", "Jane" → "Carol"; patient_id=002: "John" → "Bob" |
hash_columns
Use hash_columns to hash on column values from the current row.
This works the same as deterministic masking
for structured data.
hash_columns can be specified at both the task level (applying to all columns)
and the column level (overriding the task-level default).
- column: raw_text
  hash_columns:
    - patient_id
  masks:
    - type: unstructured_text
      matchers:
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"
| patient_id | raw_text | masked_text |
|---|---|---|
| PATIENT_001 | Tarzan sleeps well. | Jefferie sleeps well. |
| PATIENT_001 | Tarzan greets Jane | Jefferie greets Jefferie |
| PATIENT_002 | Tarzan is recovering | Adin is recovering |
| PATIENT_002 | Jane visited today | Adin visited today |
All FIRST_NAME entities with the same patient_id
will receive the same masked value.
hash_sources
Use hash_sources to hash on entities detected within unstructured text.
For the full specification of hash_sources parameters,
see Ruleset YAML Specification.
self: entity
Use self: entity to hash on the detected entity itself.
- column: raw_text
  masks:
    - type: unstructured_text
      hash_sources:
        - self: entity
      matchers:
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"
| patient_id | raw_text | masked_text |
|---|---|---|
| PATIENT_001 | Tarzan sleeps well. | Bryor sleeps well. |
| PATIENT_001 | Tarzan greets Jane | Bryor greets Maeola |
| PATIENT_002 | Tarzan is recovering | Bryor is recovering |
| PATIENT_002 | Jane visited today | Maeola visited today |
All FIRST_NAME entities with the same text will receive the same masked value.
For example, "Tarzan" is always masked to "Bryor", and "Jane" is always masked to "Maeola".
Label-based hashing
Use label: "LABEL" to hash on specific entities within the text.
- column: raw_text
  masks:
    - type: unstructured_text
      hash_sources:
        - label: "PATIENT_ID"
      matchers:
        regex:
          - label: "PATIENT_ID"
            pattern: 'PATIENT_\d{3}'
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"
| raw_text | masked_text |
|---|---|
| PATIENT_001: Tarzan sleeps well. Tarzan eats well. | PATIENT_001: Bryor sleeps well. Bryor eats well. |
| PATIENT_001: Tarzan is recovering. | PATIENT_001: Bryor is recovering. |
| PATIENT_002: Tarzan greets Jane. | PATIENT_002: Khalik greets Khalik. |
All FIRST_NAME entities will receive the same masked value
based on the first PATIENT_ID found.
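As a rough illustration of hashing on the first matching label, the Python sketch below extracts the first PATIENT_ID from each text with the same regex pattern as the ruleset and derives a hash key from it. The helper names `first_patient_id` and `hash_key_for_text`, and the choice of SHA-256, are assumptions made for this example and do not describe the product's internals.

```python
import hashlib
import re
from typing import Optional

# Same pattern as the regex matcher in the ruleset above.
PATIENT_ID_PATTERN = re.compile(r"PATIENT_\d{3}")


def first_patient_id(text: str) -> Optional[str]:
    """Return the first PATIENT_ID found in the text, or None."""
    found = PATIENT_ID_PATTERN.search(text)
    return found.group(0) if found else None


def hash_key_for_text(text: str) -> Optional[str]:
    """Derive a hex digest from the first PATIENT_ID (hypothetical scheme)."""
    patient_id = first_patient_id(text)
    if patient_id is None:
        return None
    return hashlib.sha256(patient_id.encode("utf-8")).hexdigest()


key_a = hash_key_for_text("PATIENT_001: Tarzan sleeps well. Tarzan eats well.")
key_b = hash_key_for_text("PATIENT_001: Tarzan is recovering.")
key_c = hash_key_for_text("PATIENT_002: Tarzan greets Jane.")
```

Here `key_a` and `key_b` are equal because both texts start with the same PATIENT_ID, while `key_c` differs, which mirrors the table above.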
match and match_until
The optional match parameter selects, by index, which match of the label to hash on.
This is useful when the text contains multiple matches for the same label.
Adding the optional match_until parameter
defines a range of matches to hash on.
hash_sources:
  - label: "PATIENT_ID"
    match: 0
    match_until: 1
| Parameter(s) | Hashes on |
|---|---|
| match: 0 | First entity (default if not specified) |
| match: 1 | Second entity |
| match: -1 | Last entity |
| match: 0, match_until: 1 | First two entities (concatenated) |
| match: -2, match_until: -1 | Last two entities (concatenated) |
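The indexing behaviour in the table can be mimicked with a short Python sketch. The function `select_hash_input` is hypothetical, and two details are assumptions inferred from the table rather than confirmed behaviour: that match_until is inclusive, and that the selected entities are joined with no separator.

```python
from typing import Optional, Sequence


def select_hash_input(
    matches: Sequence[str],
    match: int = 0,
    match_until: Optional[int] = None,
) -> str:
    """Select one match, or an inclusive range of matches, to hash on."""
    count = len(matches)

    def to_position(index: int) -> int:
        # Accept Python-style negative indices; reject out-of-range values.
        if not -count <= index < count:
            raise IndexError(f"index {index} is out of range for {count} matches")
        return index % count

    start = to_position(match)
    if match_until is None:
        return matches[start]
    end = to_position(match_until)
    # Assumption: the range is inclusive and values are joined directly.
    return "".join(matches[start:end + 1])


patient_ids = ["PATIENT_001", "PATIENT_002", "PATIENT_003"]
first = select_hash_input(patient_ids)                               # match: 0
last = select_hash_input(patient_ids, match=-1)                      # match: -1
last_two = select_hash_input(patient_ids, match=-2, match_until=-1)  # last two
```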
Combining hashes
Chaining together multiple hash_columns and hash_sources will create a composite hash.
Important: The order will affect the final result.
- column: raw_text
  hash_columns:
    - patient_id
    - employee_id
  masks:
    - type: unstructured_text
      hash_sources:
        - label: "HOSPITAL_ID"
        - self: entity
      matchers:
        regex:
          - label: "HOSPITAL_ID"
            pattern: 'H_\d{3}'
        ai_detect:
          - label: "FIRST_NAME"
            annotation_config:
              use_preset: "FIRST_NAME"
In the example above, FIRST_NAME will be hashed (in order) on:
| Hash Input | Example |
|---|---|
| The patient_id | P_001 |
| The employee_id | E_001 |
| The first HOSPITAL_ID in the document | H_001 |
| The discovered entity text itself | Tarzan |
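To see why the order of hash inputs matters, consider the following Python sketch of a composite hash. It is an illustration only: the function `composite_hash`, the separator byte, and SHA-256 are assumptions, not the algorithm the product uses.

```python
import hashlib
from typing import Iterable


def composite_hash(inputs: Iterable[str]) -> str:
    """Feed each input into one hash, in order (hypothetical scheme)."""
    hasher = hashlib.sha256()
    for value in inputs:
        hasher.update(value.encode("utf-8"))
        # Separator so that ["ab", "c"] and ["a", "bc"] hash differently.
        hasher.update(b"\x1f")
    return hasher.hexdigest()


# Inputs in the order listed in the table above.
in_order = composite_hash(["P_001", "E_001", "H_001", "Tarzan"])

# Same values with the first two swapped.
reordered = composite_hash(["E_001", "P_001", "H_001", "Tarzan"])

order_matters = in_order != reordered
```

The same values in a different order yield a different digest, so in a scheme like this one, reordering hash_columns or hash_sources would change the masked output.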
Limitations
Hash sources scope:
- Hash sources only reference entities within the current document. You cannot hash on entities detected within other unstructured documents.
- As a workaround, if the entities are easily extractable,
consider extracting them as metadata into a separate column
and then using hash_columns.
No JSON/XML path support:
- Unlike file masking, hash_sources for unstructured text does not support json_path, xpath, or file_path parameters.
- Workarounds:
  - For JSON/XML files: Use file masking with hash_sources that support json_path and xpath
  - For JSON/XML columns in databases: Use hash_columns with json_path (see Deterministic masking with databases)