Ruleset YAML Specification for unstructured_text
- `matchers` (required): A dictionary defining which matchers to use for entity detection. Each Matcher Type is a key containing a list of matcher configurations.
  - `context_sources` (optional): Detect entities by matching values from other columns.
    - `label` (required): The entity label (must be `UPPER_CASE`).
    - `column` (required): The source column containing entity values.
    - `case_sensitive` (optional): Whether to match case-sensitively. Defaults to `false`.
    - `context_filters` (optional): List of Context Filters to refine matches.
  - `seed_files` (optional): Detect entities by matching values from seed files.
    - `label` (required): The entity label (must be `UPPER_CASE`).
    - `seed_file` (required): The name of a user-provided seed file (see Files guide).
    - `seed_column` (required): The name of the seed file column that will provide entity values.
    - `case_sensitive` (optional): Whether to match case-sensitively. Defaults to `false`.
    - `context_filters` (optional)
  - `ai_detect` (optional): AI-powered entity detection using AWS Bedrock (requires DataMasque AI Engine).
    - `label` (required): The entity label (must be `UPPER_CASE`).
    - `annotation_config` (required): Configuration for AI-based detection.
      - `use_preset` (required): Use a built-in preset (e.g., `"FULL_NAME"`)
      - OR provide a custom annotation config:
        - `guidelines` (required): List of guidelines for custom entity types.
        - `valid_examples` (optional): Examples of valid entity matches.
        - `invalid_examples` (optional): Examples of invalid entity matches.
    - `context_filters` (optional)
  - `regex` (optional): Detects entities matching regular expressions.
    - `label` (required): The entity label (must be `UPPER_CASE`).
    - `pattern` (required): The regular expression used to detect entities.
    - `context_filters` (optional)
  - `checksum` (optional): Detects entities matching checksum algorithms.
    - `label` (required): The entity label (must be `UPPER_CASE`).
    - `type` (required): Checksum type. See `set_checksum` for supported types.
    - `context_filters` (optional)
- `masks` (required): A list of label-to-mask mappings defining how to mask each detected entity.
  - `label` (required): The entity label (must match a label from `matchers`).
  - `masks` (required): A list of masks to apply to entities with this label.
- `hash_sources` (optional): List of Hash Source configurations for deterministic masking.
  - `self: entity` (optional): Hash on the detected entity text itself.
  - `label` (optional): Hash on a detected entity.
  - `match` (optional): Zero-based index of which detected entity to hash on (defaults to 0).
  - `match_until` (optional): Hash on a range of entities from `[match:match_until]`.
- `fallback_masks` (optional): List of masks to apply to the entire text if AI Engine processing fails.
- `image_handling` (optional): Controls how images are handled in Word documents. Has no effect on text columns.
  - `"redact"` (default): Replaces all images with a black rectangle (PNG).
  - `"retain"`: Preserves original images.

  Note: Image properties (alt text, title, description) are always removed regardless of this setting. Non-image drawings (charts, SmartArt, shapes, etc.) are always redacted.
Important: Every label defined in `matchers` must have a corresponding entry in `masks`. Use `do_nothing` if you want to detect without masking.

Note: When multiple matchers detect overlapping entities, DataMasque merges them by extending boundaries to cover all overlapping text. See Overlap Handling for examples.
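
The label rule can be sketched with a short fragment: every label declared under the matchers has a matching entry in the label-to-mask list, and a label that should only be detected is given `do_nothing`. This is an illustrative sketch rather than a tested ruleset. The `TICKET_REF` label and its pattern are hypothetical, and the fragment is assumed to sit under a rule's `masks` list.

```yaml
# Fragment of an unstructured_text mask; TICKET_REF and its pattern are hypothetical.
- type: unstructured_text
  matchers:
    regex:
      - label: "TICKET_REF"
        pattern: 'TKT-[0-9]{6}'    # hypothetical ticket reference format
    context_sources:
      - label: "CUSTOMER_NAME"
        column: customer_name
  masks:
    # Detected but deliberately left unchanged.
    - label: "TICKET_REF"
      masks:
        - type: do_nothing
    # Detected and replaced with a value from a seed file.
    - label: "CUSTOMER_NAME"
      masks:
        - type: from_file
          seed_file: DataMasque_firstNames_mixed.csv
          seed_column: firstname-mixed
```

Leaving out the `TICKET_REF` entry under `masks` would break the rule that every matcher label needs a corresponding mask entry.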
Basic Example
The following ruleset matches customer names in the `description` column.
After matching, each name is deterministically masked
with a first name from the `DataMasque_firstNames_mixed.csv` seed file.
```yaml
version: "1.0"
tasks:
  - type: mask_table
    table: support_tickets
    key: ticket_id
    rules:
      - column: description
        masks:
          - type: unstructured_text
            hash_sources:
              - self: entity
            matchers:
              context_sources:
                - label: "CUSTOMER_NAME"
                  column: customer_name
            masks:
              - label: "CUSTOMER_NAME"
                masks:
                  - type: from_file
                    seed_file: DataMasque_firstNames_mixed.csv
                    seed_column: firstname-mixed
```

| description (before) | description (after) |
|---|---|
| Caleb Jones reported issue with login | Marnie Taren reported issue with login |
| Follow-up for Sally Reefton | Follow-up for Korrie Marwood |
| Caleb Jones escalated to management | Marnie Taren escalated to management |
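
In the above example, `hash_sources` with `self: entity` hashes on the detected entity text itself,
so repeated names found by the `context_sources` matcher receive the same masked value:
both occurrences of "Caleb Jones" become "Marnie Taren".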
- For more on other matchers see Matchers and Labels.
- For more detailed examples, see Examples.
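
As a further sketch, the fragment below combines the `ai_detect` matcher with `fallback_masks` and `image_handling`, using the key names from the specification. Several parts are assumptions made for illustration:

- The `PERSON_NAME` and `PROJECT_CODENAME` labels, the guideline text and the example strings are hypothetical.
- `valid_examples` and `invalid_examples` are assumed to take lists of strings.
- The `from_fixed` mask with a `value` parameter is assumed to be available as a fixed-value mask. Any supported mask type could be used instead.

```yaml
# Fragment of an unstructured_text mask (assumed to sit under a rule's `masks` list).
- type: unstructured_text
  image_handling: "retain"            # keep images in Word documents; default is "redact"
  matchers:
    ai_detect:
      # Built-in preset named in the specification.
      - label: "PERSON_NAME"          # hypothetical label
        annotation_config:
          use_preset: "FULL_NAME"
      # Custom annotation config (all values hypothetical).
      - label: "PROJECT_CODENAME"
        annotation_config:
          guidelines:
            - "Internal project codenames, usually a single capitalised word."
          valid_examples:
            - "Falcon"
          invalid_examples:
            - "the project"
  masks:
    - label: "PERSON_NAME"
      masks:
        - type: from_file
          seed_file: DataMasque_firstNames_mixed.csv
          seed_column: firstname-mixed
    - label: "PROJECT_CODENAME"
      masks:
        - type: do_nothing            # detect only
  # Applied to the entire text if AI Engine processing fails.
  fallback_masks:
    - type: from_fixed                # assumption: fixed-value mask type
      value: "[REDACTED]"             # hypothetical placeholder
```

With `fallback_masks` set, a failure in AI Engine processing results in the whole text being masked rather than only the detected entities.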