DataMasque Portal

Matchers and Labels

Matcher Types

context_sources

Note: context_sources can only be used with mask_table or mask_tabular_file tasks.

The context_sources matcher references values from other columns or fields.

first_name | clinical_notes
Sarah | Patient [Sarah] presented with acute migraine.
Bradley | Following the consultation, [Bradley] reported improved mobility.

Matched text is shown in [brackets].

Use context_sources if you have structured data alongside each text entry you want masked.

matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name

Case Sensitivity

By default, case_sensitive is false, so matching ignores case.

first_name | case_sensitive | clinical_notes
Thomas | false | [THOMAS] arrived
Thomas | true | THOMAS arrived

matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name
      case_sensitive: true
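
As a rough illustration of the behaviour described above, the sketch below searches a text field for the value held in another column of the same row. It is not DataMasque's implementation; the function name `find_context_matches` and the row layout are hypothetical.

```python
def find_context_matches(text, value, case_sensitive=False):
    """Return (start, end) spans where `value` occurs in `text`."""
    if not value:
        return []
    if case_sensitive:
        haystack, needle = text, value
    else:
        # lower() is adequate for ASCII; some Unicode characters change
        # length when lowercased, which this sketch does not handle.
        haystack, needle = text.lower(), value.lower()
    spans = []
    pos = haystack.find(needle)
    while pos != -1:
        spans.append((pos, pos + len(needle)))
        pos = haystack.find(needle, pos + len(needle))
    return spans


row = {"first_name": "Thomas", "clinical_notes": "THOMAS arrived"}
print(find_context_matches(row["clinical_notes"], row["first_name"]))
print(find_context_matches(row["clinical_notes"], row["first_name"], case_sensitive=True))
```

The first call finds the name regardless of case; the second finds nothing because the casing differs.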

seed_files

The seed_files matcher references values from seed files loaded into DataMasque. Given a medications.csv seed file, you can detect the following:

clinical_notes
Prescribed [Aspirin] 100mg daily.
Bradley allergic to [Morphine].

Use seed_files if you have a list of entities you want masked within each text entry.

matchers:
  seed_files:
    - label: "MEDICATION_NAME"
      seed_file: "medications.csv"
      seed_column: "medication_names"
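
The idea can be sketched outside DataMasque as follows: load one column of a seed file, then look for each value in the text. The CSV contents here are hypothetical (the real medications.csv is not shown), and `load_seed_column` and `find_seed_matches` are illustrative names, not DataMasque functions.

```python
import csv
import io

# Hypothetical contents of medications.csv.
MEDICATIONS_CSV = "medication_names\nAspirin\nMorphine\n"


def load_seed_column(csv_text, column):
    """Read one column of a seed file into a list of non-empty values."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row[column] for row in reader if row[column]]


def find_seed_matches(text, values):
    """Return the seed values that occur in `text`, ignoring case."""
    lowered = text.lower()
    return [v for v in values if v.lower() in lowered]


seeds = load_seed_column(MEDICATIONS_CSV, "medication_names")
print(find_seed_matches("Prescribed Aspirin 100mg daily.", seeds))
print(find_seed_matches("Bradley allergic to Morphine.", seeds))
```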

Case Sensitivity

By default, case_sensitive is false, so matching ignores case.

case_sensitive | clinical_notes
false | [THOMAS] arrived
true | THOMAS arrived

matchers:
  seed_files:
    - label: "FIRST_NAME"
      seed_file: "DataMasque_firstNames_mixed.csv"
      seed_column: "firstname-mixed"
      case_sensitive: true

ai_detect

Note: ai_detect is only available if the DataMasque AI Engine is configured.

The ai_detect matcher is powered by the DataMasque AI Engine and uses a language model for detection.

Text | PATIENT_NAME | DOCTOR_NAME
Jones: Patient Amir reported chest pain. | Amir | Jones
Emil&y reports knee pain to Dr. Jones. | Emil&y | Dr. Jones

Note: Emil&y is deliberately misspelt to showcase ai_detect's handling of typos.

Use ai_detect if you want to detect:

  • novel or ambiguous entities requiring clear guidelines, or
  • entities in texts containing misspellings or OCR artefacts.

Warning: The ai_detect matcher has a maximum text length of 10,000 characters per field.

If a field's text exceeds this limit, DataMasque raises an error unless fallback_masks is configured.
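
Before a run, it can help to know which rows would hit this limit. The sketch below is a hypothetical pre-check over rows already loaded as dictionaries. It assumes the limit is counted in characters as Python's `len` counts them and that exactly 10,000 characters is still accepted; neither detail is confirmed here.

```python
AI_DETECT_MAX_CHARS = 10_000  # limit stated in the warning above


def oversized_rows(rows, field):
    """Return the indexes of rows whose `field` text exceeds the limit."""
    return [
        index
        for index, row in enumerate(rows)
        if len(row.get(field) or "") > AI_DETECT_MAX_CHARS
    ]


rows = [
    {"clinical_notes": "Patient Amir reported chest pain."},
    {"clinical_notes": "x" * 10_001},
    {"clinical_notes": None},
]
print(oversized_rows(rows, "clinical_notes"))
```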

DataMasque includes 23 built-in presets for common entity types. The preset below matches on FIRST_NAME according to that preset's definition:

matchers:
  ai_detect:
    - label: "FIRST_NAME"
      annotation_config:
        use_preset: "FIRST_NAME"

For entities not covered by presets (e.g. PATIENT_FIRST_NAME), you can define custom detection rules with guidelines:

matchers:
  ai_detect:
    - label: "PATIENT_FIRST_NAME"
      annotation_config:
        guidelines:
          - First names of patients

If a single simple guideline is insufficient for downstream masking, or matches entities other than those intended, you can add multiple guidelines alongside valid_examples and invalid_examples.

In the guidelines below, we exclude titles because they will be masked separately (with the TITLE preset). Likewise, we explicitly exclude friends and family to reduce false matches.

matchers:
  ai_detect:
    - label: "PATIENT_FIRST_NAME"
      annotation_config:
        guidelines:
          - First names only of patients
          - Exclude titles, and names of family/friends
        valid_examples:
          - text: "Patient [Smith] reports pain"
        invalid_examples:
          - text: "Surgery was performed on [Ms.Smith] Thompson"
            reason: "Contains title"

Note: The ai_detect matcher calls the Bedrock API, so it processes rows more slowly than the other matchers.

Use the max_rows run option to validate configuration before running on full datasets.

Token usage and costs

Assuming Claude Sonnet 4.5 is used as the DataMasque AI Engine's underlying model, the cost for processing 100 documents (with 1,000 characters each, alongside 20 labels) is approximately $1 USD. Costs scale linearly with each of the following: document length, the number of labels, and annotation config complexity.
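
The linear scaling can be turned into a back-of-the-envelope estimate. The sketch below scales the quoted baseline by document length and label count. It additionally assumes cost is linear in the number of documents, and it does not model annotation config complexity. Treat the output as an order-of-magnitude guide, not a quote.

```python
BASELINE_USD = 1.0       # approximate cost quoted above
BASELINE_DOCS = 100
BASELINE_CHARS = 1_000   # characters per document
BASELINE_LABELS = 20


def estimate_cost_usd(docs, chars_per_doc, labels):
    """Scale the quoted baseline linearly in each factor."""
    return (
        BASELINE_USD
        * (docs / BASELINE_DOCS)
        * (chars_per_doc / BASELINE_CHARS)
        * (labels / BASELINE_LABELS)
    )


print(estimate_cost_usd(100, 1_000, 20))
print(estimate_cost_usd(5_000, 2_000, 10))
```

For example, 5,000 documents of 2,000 characters with 10 labels is 50 × 2 × 0.5 times the baseline.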

Note: For more details on token usage and cost calculation, see the README.md provided with the DataMasque AI Engine.

Writing effective guidelines

1. Start simple
  • Language models come with a lot of built-in world knowledge.
  • A short, simple guideline is usually enough to match 95%+ of entities.
2. Work iteratively
  • If you are able to access the data after masking, it helps to view the results both before and after. Tweak the guidelines after each attempt. Use max_rows to mask only a subset of your data.

  • If you find that certain entities tend not to be matched, consider adding them to your valid_examples,
    and if other entities tend to be unintentionally matched, consider adding them to your invalid_examples.

3. Consider downstream masks
  • Lastly, consider how your guidelines apply to downstream masks and the realism of your masked data.

  • Be specific. In the PATIENT_FIRST_NAME example above, TITLE was excluded because we wanted to mask it separately.

regex

The regex matcher finds text matching regular expression patterns.

Pattern | Text
\b\d{5}\b | ZIP Code: [90210]
\b[A-Z]{3}-\d{4}\b | Here: [ABC-1234] is my reference
\b\d{3}-\d{3}-\d{4}\b | Call [555-867-5309] for details

Use regex if you want to detect data following well-defined patterns.

matchers:
  regex:
    - label: "ZIP_CODE"
      pattern: '\b\d{5}\b'
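
Patterns can be tried out locally before adding them to a ruleset. The sketch below runs the three patterns from the table through Python's `re` module. DataMasque's regex engine may differ from Python's in edge cases, and the REFERENCE and PHONE label names are illustrative only.

```python
import re

PATTERNS = {
    "ZIP_CODE": r"\b\d{5}\b",
    "REFERENCE": r"\b[A-Z]{3}-\d{4}\b",
    "PHONE": r"\b\d{3}-\d{3}-\d{4}\b",
}

SAMPLES = [
    "ZIP Code: 90210",
    "Here: ABC-1234 is my reference",
    "Call 555-867-5309 for details",
]


def find_all(text):
    """Return {label: [matched strings]} for every pattern that matches."""
    found = {}
    for label, pattern in PATTERNS.items():
        hits = re.findall(pattern, text)
        if hits:
            found[label] = hits
    return found


for sample in SAMPLES:
    print(sample, "->", find_all(sample))
```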

checksum

The checksum matcher validates numbers using checksum algorithms.

Number | Valid Credit Card?
4532-1234-5678-9010 | Yes
The cc: <1234-5678-9012-3456> | No
\n5105105105105100\n | Yes

Use checksum if you want to match only those entities which pass validation algorithms.

matchers:
  checksum:
    - label: "CREDIT_CARD"
      type: "credit_card"
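
For intuition, credit card numbers are conventionally validated with the Luhn algorithm. The sketch below assumes that the credit_card checksum type corresponds to Luhn, which this section does not state, and it strips every non-digit character before checking, which may be more lenient than DataMasque's own matching.

```python
import re


def luhn_valid(candidate):
    """Return True if the digits in `candidate` pass the Luhn check."""
    digits = [int(c) for c in re.sub(r"\D", "", candidate)]
    if len(digits) < 2:
        return False
    total = 0
    # Walk from the rightmost digit, doubling every second one.
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return total % 10 == 0


print(luhn_valid("\n5105105105105100\n"))
print(luhn_valid("The cc: <1234-5678-9012-3456>"))
```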

For a list of supported checksums, see the set_checksum documentation.

Overlap Handling

When matches overlap, DataMasque merges them by extending boundaries to cover all overlapping text.

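The final label is determined by considering, in order:

  1. Start position: the left-most match wins.
  2. Length: the longest match wins.
  3. Matcher priority.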

Note: Adjacent entities that touch but don't overlap are masked separately.

Same-label overlap

Multiple matchers may be assigned to the same label:

matchers:
  context_sources:
    - label: "LAST_NAME"
      column: last_name
  ai_detect:
    - label: "LAST_NAME"
      annotation_config:
        use_preset: "LAST_NAME"

matcher | label | clinical_notes
context_sources | LAST_NAME | Patient Rebecca de Santis arrived.
ai_detect | LAST_NAME | Patient Rebecca de Santis arrived.
combined | LAST_NAME | Patient Rebecca de Santis arrived.

As both detected entities have the same label, the combined LAST_NAME match is extended to cover both.

Cross-label overlap

Multiple matchers may overlap across different labels:

matchers:
  context_sources:
    - label: "ORGANISATION"
      column: org_name
  ai_detect:
    - label: "LOCATION"
      annotation_config:
        use_preset: "STREET_ADDRESS"

matcher | label | text
context_sources | ORGANISATION | Visit Acme Bank, London …
ai_detect | LOCATION | Visit Acme Bank, London
combined | LOCATION | Visit Acme Bank, London

If entities with different labels overlap, then the winning label is the one with the left-most start position.

If entities have the same left-most start position, the winning label is the longest one. In the above example, the LOCATION label is used because it is the longest.

When both start positions and lengths are equal, then matcher priorities apply.

Matcher Priorities

From highest to lowest:

  1. context_sources
  2. seed_files
  3. ai_detect
  4. checksum
  5. regex
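
The overlap rules and priorities above can be sketched as a small merge routine. This is an illustrative reading, not DataMasque's code: the `Match` type and function names are hypothetical, spans are half-open character offsets, and when three or more matches chain together the sketch picks the winner among all of them at once, which is an assumption.

```python
from dataclasses import dataclass

# Matcher priorities from the list above, highest first.
PRIORITY = ["context_sources", "seed_files", "ai_detect", "checksum", "regex"]


@dataclass(frozen=True)
class Match:
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    matcher: str


def _rank(match):
    # Left-most start, then longest span, then highest matcher priority.
    return (match.start, -(match.end - match.start), PRIORITY.index(match.matcher))


def _close(group):
    winner = min(group, key=_rank)
    return Match(group[0].start, max(m.end for m in group), winner.label, winner.matcher)


def merge_overlaps(matches):
    """Merge overlapping matches; spans that only touch stay separate."""
    merged, group, group_end = [], [], 0
    for match in sorted(matches, key=lambda m: (m.start, m.end)):
        if group and match.start < group_end:
            group.append(match)
            group_end = max(group_end, match.end)
        else:
            if group:
                merged.append(_close(group))
            group, group_end = [match], match.end
    if group:
        merged.append(_close(group))
    return merged


# "Visit Acme Bank, London": both matches start at offset 6.
org = Match(6, 15, "ORGANISATION", "context_sources")
loc = Match(6, 23, "LOCATION", "ai_detect")
print(merge_overlaps([org, loc]))
```

Here both matches share a start position, so the longer LOCATION span wins even though context_sources has the higher matcher priority.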

Context Filters

Context filters refine matches and can be applied to any matcher.

match_whole_words_only

By default, matchers will match partial words. match_whole_words_only prevents matching within words.

first_name | Without filter | With match_whole_words_only
Tom | [Tom] arrived | [Tom] arrived
Tom | [Tom]orrow | Tomorrow

Use match_whole_words_only to exclude partial word matches.

matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name
      context_filters:
        - type: match_whole_words_only
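
One way to picture the whole-word check is a search that rejects hits with a word character on either side. The sketch below uses Python's `\w` definition of a word character; how DataMasque defines word boundaries is not specified here, so treat this as an approximation. `whole_word_spans` is a hypothetical helper.

```python
import re


def whole_word_spans(text, value):
    """Return spans of `value` in `text` that are not part of a longer word."""
    pattern = r"(?<!\w)" + re.escape(value) + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text, flags=re.IGNORECASE)]


print(whole_word_spans("Tom arrived", "Tom"))
print(whole_word_spans("Tomorrow", "Tom"))
```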

exclude_pattern

Discards matches that match a regex pattern.

Without filter | Exclude "@example.com"
Contact: [john.smith@example.com] | Contact: john.smith@example.com
Email: [jose.garcia@email.com] | Email: [jose.garcia@email.com]

Use exclude_pattern to drop matches that fit a regex pattern.

matchers:
  regex:
    - label: "EMAIL"
      pattern: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
      context_filters:
        - type: exclude_pattern
          pattern: '@example\.com$'

include_pattern

Keeps only matches that match a regex pattern.

Without filter | Include " Hospital"
Royal Melbourne Hospital treats… | Royal Melbourne Hospital treats…
Royal Melbourne clinic is closed | Royal Melbourne clinic is closed

Use include_pattern to keep only the matches that fit a regex pattern.

matchers:
  seed_files:
    - label: "HOSPITAL_NAME"
      seed_file: "hospitals.csv"
      seed_column: "hospital_name"
      context_filters:
        - type: include_pattern
          pattern: " Hospital$"
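
The exclude_pattern and include_pattern filters can be pictured as a pass over the candidate matches. The sketch below assumes each pattern is tested against the matched text itself rather than the surrounding text, which the `$` anchors in the examples suggest but which is not stated outright. `apply_pattern_filters` is a hypothetical helper.

```python
import re


def apply_pattern_filters(matches, include=None, exclude=None):
    """Keep matches that satisfy `include` (if given) and not `exclude`."""
    kept = []
    for match in matches:
        if include is not None and not re.search(include, match):
            continue
        if exclude is not None and re.search(exclude, match):
            continue
        kept.append(match)
    return kept


emails = ["john.smith@example.com", "jose.garcia@email.com"]
print(apply_pattern_filters(emails, exclude=r"@example\.com$"))

hospitals = ["Royal Melbourne Hospital", "Royal Melbourne"]
print(apply_pattern_filters(hospitals, include=r" Hospital$"))
```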

add_prefix

Extends matches to include a prefix if present.

Without filter | Extended with "Dr. "
Dr. [Jones] consulted | [Dr. Jones] consulted
[Jones] arrived | [Jones] arrived

Use add_prefix to include leading context.

matchers:
  context_sources:
    - label: "DOCTOR_NAME"
      column: last_name
      context_filters:
        - type: add_prefix
          prefix: "Dr. "

add_suffix

Extends matches to include a suffix if present.

hospital_name | Without filter | Extended with " Hospital"
Royal Melbourne | [Royal Melbourne] treats | [Royal Melbourne] treats
Royal Melbourne | [Royal Melbourne] Hospital | [Royal Melbourne Hospital]

Use add_suffix to include trailing context.

matchers:
  seed_files:
    - label: "HOSPITAL_NAME"
      seed_file: "hospitals.csv"
      seed_column: "hospital_name"
      context_filters:
        - type: add_suffix
          suffix: " Hospital"

Combining filters

Context Filters are applied in two phases:

  1. Validation filters (include_pattern, exclude_pattern, match_whole_words_only) are applied first.
  2. Extension filters (add_prefix, add_suffix) are applied second.

matchers:
  seed_files:
    - label: "DOCTOR_NAME"
      seed_file: "DataMasque_lastNames.csv"
      seed_column: "lastnames"
      context_filters:
        - type: match_whole_words_only
        - type: add_prefix
          prefix: "Dr. "

Text input | Match | Step 1: match_whole_words_only | Step 2: add_prefix | Final result
"Dr. Smith consulted" | Smith | whole word ✓ | extends to Dr. Smith | Dr. Smith masked
"Smithson arrived" | Smith | part of word ✗ | - | Not masked
"Jones arrived" | Jones | whole word ✓ | no prefix found | Jones masked

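The two phases can be sketched in a few lines: validate each candidate first, then extend the survivors. This is a simplified illustration with a hypothetical helper, `match_with_filters`, covering only match_whole_words_only followed by add_prefix; word boundaries follow Python's `\w`, which may not match DataMasque's definition.

```python
import re


def match_with_filters(text, value, prefix):
    """Return the text of each whole-word match, extended by `prefix` if present."""
    results = []
    # Phase 1 (validation): keep only whole-word occurrences of `value`.
    for found in re.finditer(r"(?<!\w)" + re.escape(value) + r"(?!\w)", text):
        start, end = found.span()
        # Phase 2 (extension): grow the match leftwards over the prefix.
        if text[max(0, start - len(prefix)):start] == prefix:
            start -= len(prefix)
        results.append(text[start:end])
    return results


print(match_with_filters("Dr. Smith consulted", "Smith", "Dr. "))
print(match_with_filters("Smithson arrived", "Smith", "Dr. "))
print(match_with_filters("Jones arrived", "Jones", "Dr. "))
```

Because validation runs first, "Smithson" is rejected before any prefix is considered.
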
Configuring Masks for Labels

After defining which labels to match, define a mask for each label.

Important: DataMasque will raise an error if a matcher defines a label without an associated mask. If you wish to detect entities without masking them, use the do_nothing mask.

Note: Labels must be UPPER_CASE.

Detected label | Mask Used | Original text | Masked text
PHONE | from_fixed | Call 555-123-4567. | Call XXX-XXX-XXXX.
REFERENCE_ID | imitate | Reference ID: AC-891234 | Reference ID: BK-472851

matchers:
  ...

masks:
  - label: "PHONE"
    masks:
      - type: from_fixed
        value: "XXX-XXX-XXXX"

  - label: "REFERENCE_ID"
    masks:
      - type: imitate

For a complete list of masks, see Masking Functions Overview.
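
A ruleset can be sanity-checked before it is submitted. The sketch below works on a plain dictionary that mirrors the YAML layout shown above and reports labels that have a matcher but no mask, plus labels that are not upper case. The helper name, the regex used to interpret "UPPER_CASE", and the REFERENCE_ID pattern are all assumptions for illustration.

```python
import re

# Assumed reading of "UPPER_CASE": capitals, digits and underscores.
LABEL_RE = re.compile(r"^[A-Z][A-Z0-9_]*$")


def check_labels(ruleset):
    """Report matcher labels lacking a mask and labels that are not upper case."""
    matcher_labels = {
        entry["label"]
        for entries in ruleset.get("matchers", {}).values()
        for entry in entries
    }
    mask_labels = {entry["label"] for entry in ruleset.get("masks", [])}
    all_labels = matcher_labels | mask_labels
    return {
        "unmasked": sorted(matcher_labels - mask_labels),
        "not_upper_case": sorted(l for l in all_labels if not LABEL_RE.match(l)),
    }


ruleset = {
    "matchers": {
        "regex": [
            {"label": "PHONE", "pattern": r"\b\d{3}-\d{3}-\d{4}\b"},
            {"label": "REFERENCE_ID", "pattern": r"\b[A-Z]{2}-\d{6}\b"},
        ]
    },
    "masks": [
        {"label": "PHONE", "masks": [{"type": "from_fixed", "value": "XXX-XXX-XXXX"}]}
    ],
}
print(check_labels(ruleset))
```

In this example REFERENCE_ID is reported as unmasked, which is the situation the Important note above warns about.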

Fallback Masks

If the ai_detect matcher fails to process text, for example because of a network timeout or content guardrails, fallback_masks can be applied to the entire text instead of raising an error.

The following ruleset uses the ai_detect matcher to match first names. If a network issue occurs, the entire text is replaced.

matchers:
  ai_detect:
    - label: "FIRST_NAME"
      annotation_config:
        use_preset: "FIRST_NAME"

masks:
  - label: "FIRST_NAME"
    masks:
      - type: from_file
        seed_file: DataMasque_firstNames_mixed.csv
        seed_column: firstname-mixed

fallback_masks:
  - type: from_fixed
    value: "[REDACTED]"

AI Engine Status | clinical_notes (before) | clinical_notes (after)
Online | Patient Heiko arrived | Patient Marnie arrived
Network Error | Patient Heiko arrived | [REDACTED]

The fallback replaces the entire text, not individual entities.
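
The control flow can be pictured as a try/except around detection. In the sketch below, every function is a hypothetical stand-in: `online_detect` and `offline_detect` imitate the AI Engine succeeding or failing, and `replace_with_marnie` imitates a per-entity mask. Only the shape of the fallback logic is meant to carry over.

```python
def mask_text(text, detect, mask_entities, fallback_value=None):
    """Mask detected entities, or replace the whole text if detection fails."""
    try:
        entities = detect(text)
    except Exception:
        if fallback_value is None:
            raise  # no fallback configured: surface the error
        return fallback_value
    return mask_entities(text, entities)


def online_detect(text):
    return [(8, 13)]  # span of "Heiko" in the sample text


def offline_detect(text):
    raise ConnectionError("AI Engine unreachable")


def replace_with_marnie(text, entities):
    # Replace from right to left so earlier offsets stay valid.
    for start, end in sorted(entities, reverse=True):
        text = text[:start] + "Marnie" + text[end:]
    return text


sample = "Patient Heiko arrived"
print(mask_text(sample, online_detect, replace_with_marnie, "[REDACTED]"))
print(mask_text(sample, offline_detect, replace_with_marnie, "[REDACTED]"))
```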

Appendix: ai_detect Presets

FIRST_NAME

Guidelines:

  • Given name only
  • Excludes titles, middle names, initials

Valid examples:

  • "Student [Alice] Walker submitted the assignment"

Invalid examples:

  • "Manager [J.] approved the request"
    • Initial

LAST_NAME

Guidelines:

  • Surname only
  • Excludes titles, middle names, initials

Valid examples:

  • "Ms. [Smith] will lead the meeting"
  • "Professor [van der Berg] published the paper"
    • Compound surname as unit

FULL_NAME

Note: Use FIRST_NAME and LAST_NAME if this preset is not behaving as expected.

Guidelines:

  • First and last name together, optionally includes middle names
  • Excludes titles, initials

Valid examples:

  • "Award winner [Mary Elizabeth Johnson] will speak"

INITIAL_NAME

Guidelines:

  • Single or multi-letter person-related initials only
  • Excludes titles e.g. Mr, Miss

Valid examples:

  • "[J. K.] Rowling wrote the book"

Invalid examples:

  • "The [U.S.A.] is a large country"
    • Does not identify an individual

TITLE

Guidelines:

  • Professional titles and honorifics, including dot

Valid examples:

  • "[Dr.] Smith will see you now"
  • "Welcome [Ms] Johnson to the stage"

AGE

Guidelines:

  • Numeric ages and age ranges of individuals

Valid examples:

  • "Applicant is [45 years old]"
  • "Participants aged [30-35] were selected"

Invalid examples:

  • "DataMasque is [5] years old"
    • Not individual
  • "Targeted at [elderly] population"
    • Non-numeric

DATE_OF_BIRTH

Guidelines:

  • Dates when context clearly indicates birth

Valid examples:

  • "DOB: [January 1]"
  • "he was born yesterday (today is [01/05/2024])"

NATIONALITY

Guidelines:

  • National origin, citizenship when describing identity

Valid examples:

  • "She is a [British] citizen"

GENDER

Guidelines:

  • Gender identity or biological sex
  • Excludes pronouns

Valid examples:

  • "The patient (27[M]) is recovering well"
  • "Participant is [non-binary]"

RACE

Guidelines:

  • Racial or ethnic identity

Valid examples:

  • "Patient identified as [African American]"

EMAIL

Guidelines:

  • Email addresses with @ and domain
  • Includes partial/obfuscated emails

Valid examples:

  • "Contact: [user+tag@company.co.uk]"
  • "Reach out at [bob@gmail] for details"

FAX

Guidelines:

  • Fax numbers only when context indicates fax

Valid examples:

  • "Send documents to fax [487.416.6741x713]"

PHONE_NUMBER

Guidelines:

  • Phone numbers in any format
  • Includes extensions, partial numbers
  • Excludes fax numbers and placeholders

Valid examples:

  • "Extension [555.123.4567 ext. 890]"
  • "Heya my number is: *** *** *** [1234]"
    • Excludes placeholder *s

USERNAME

Guidelines:

  • Online handles and account identifiers

Valid examples:

  • "Follow [@johndoe123] on twitter"

STREET_ADDRESS

Guidelines:

  • Street number and name only, may include unit/apartment
  • Includes postal codes, PO boxes
  • Excludes city, state, country

Valid examples:

  • "Ship to [195 Main Street], Boston, MA [02101]"
  • "Office located at [456 Oak Avenue Apt 2B]"

CITY

Guidelines:

  • Cities, towns, villages

Valid examples:

  • "Traveling to [Melbourne] next week"

COUNTRY

Guidelines:

  • Country names, including abbreviations

Valid examples:

  • "Visiting the [UK] next month"
  • "Born in [Canada]"

STATE

Guidelines:

  • States, provinces, territories, includes abbreviations

Valid examples:

  • "Shipping to [NY]"
  • "Property in [New South Wales]"

ZIPCODE

Guidelines:

  • ZIP codes and postal codes in any format
  • Can be part of an address or standalone

Valid examples:

  • "Postcode [SW1A 1AA]"

JOB_TITLE

Guidelines:

  • Specific, professional roles and job titles

Valid examples:

  • "Promoted to [Senior Vice President of Marketing]"

Invalid examples:

  • "He was a good [manager]"
    • Generic, non-specific

ORGANIZATION

Guidelines:

  • Specific organizations, including appropriate prefixes/suffixes

Valid examples:

  • "Enrolled at [Stanford University] for fall"

Invalid examples:

  • "Works at [a hospital]"
    • Generic, non-specific

MEDICAL_CONDITION

Guidelines:

  • Diseases, illnesses, diagnoses

Valid examples:

  • "Diagnosed with [hypertension]"
  • "Treatment for [Type 2 diabetes]"