Matchers and Labels
- Matcher Types
- Overlap Handling
- Context Filters
- Configuring Masks for Labels
- Fallback Masks
- Appendix: ai_detect Presets
Matcher Types
context_sources
Note: context_sources can only be used with mask_table or mask_tabular_file tasks.
The context_sources matcher references values from other columns or fields.
| first_name | clinical_notes |
|---|---|
| Sarah | Patient Sarah presented with acute migraine. |
| Bradley | Following the consultation, Bradley reported improved mobility. |
Use context_sources if you have structured data alongside each text entry you want masked.
matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name
Case Sensitivity
By default, case_sensitive: false.
| first_name | case_sensitive | clinical_notes |
|---|---|---|
| Thomas | false | THOMAS arrived |
| Thomas | true | THOMAS arrived |
matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name
      case_sensitive: true
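The lookup described above can be sketched in Python. This is an illustrative sketch only, not DataMasque's implementation: `find_context_matches` is a hypothetical helper that searches one row's text field for the value held in another column of the same row.

```python
import re

def find_context_matches(text, value, case_sensitive=False):
    """Return (start, end) spans where `value` occurs in `text`.

    Hypothetical helper that mimics a context_sources lookup for one row.
    """
    flags = 0 if case_sensitive else re.IGNORECASE
    # re.escape treats the column value as literal text, not as a pattern.
    return [m.span() for m in re.finditer(re.escape(value), text, flags)]

row = {"first_name": "Thomas", "clinical_notes": "THOMAS arrived"}

# Default (case_sensitive: false): "THOMAS" matches the value "Thomas".
print(find_context_matches(row["clinical_notes"], row["first_name"]))
# case_sensitive: true: no match, because the case differs.
print(find_context_matches(row["clinical_notes"], row["first_name"],
                           case_sensitive=True))
```

The first call reports one span covering `THOMAS`; the second reports none.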
seed_files
The seed_files matcher references values from seed files loaded into DataMasque.
Given a medications.csv seed file, you can detect the following:
| clinical_notes |
|---|
| Prescribed Aspirin 100mg daily. |
| Bradley allergic to Morphine. |
Use seed_files if you have a list of entities you want masked within each text entry.
matchers:
  seed_files:
    - label: "MEDICATION_NAME"
      seed_file: "medications.csv"
      seed_column: "medication_names"
Case Sensitivity
By default, case_sensitive: false.
| case_sensitive | clinical_notes |
|---|---|
| false | THOMAS arrived |
| true | THOMAS arrived |
matchers:
  seed_files:
    - label: "FIRST_NAME"
      seed_file: "DataMasque_firstNames_mixed.csv"
      seed_column: "firstname-mixed"
      case_sensitive: true
ai_detect
Note: ai_detect is only available if the DataMasque AI Engine is configured.
The ai_detect matcher is powered by the DataMasque AI Engine and uses a language model for detection.
| Text | PATIENT_NAME | DOCTOR_NAME |
|---|---|---|
| Jones: Patient Amir reported chest pain. | Amir | Jones |
| Emil&y reports knee pain to Dr. Jones. | Emil&y | Dr. Jones |
Note: Emil&y is deliberately misspelt to showcase ai_detect's handling of typos.
Use ai_detect if you want to detect:
- novel or ambiguous entities requiring clear guidelines,
- entities in texts containing misspellings or OCR artefacts
Warning: The ai_detect matcher has a maximum text length of 10,000 characters per field. If text exceeds this limit, it will raise an error unless fallback_masks is configured.
DataMasque includes 23 built-in presets for common entity types.
The preset below matches on FIRST_NAME according to that preset's definition:
matchers:
  ai_detect:
    - label: "FIRST_NAME"
      annotation_config:
        use_preset: "FIRST_NAME"
For entities not covered by presets (e.g. PATIENT_FIRST_NAME),
you can define custom detection rules with guidelines:
matchers:
  ai_detect:
    - label: "PATIENT_FIRST_NAME"
      annotation_config:
        guidelines:
          - First names of patients
If a single guideline is insufficient for downstream masking,
or matches entities other than those intended,
multiple guidelines may be added,
alongside valid_examples and invalid_examples.
In the example below, titles are excluded
because they will be masked separately (with the TITLE preset).
Likewise, friends and family are explicitly excluded
to reduce false matches.
matchers:
  ai_detect:
    - label: "PATIENT_FIRST_NAME"
      annotation_config:
        guidelines:
          - First names only of patients
          - Exclude titles, and names of family/friends
        valid_examples:
          - text: "Patient [Smith] reports pain"
        invalid_examples:
          - text: "Surgery was performed on [Ms.Smith] Thompson"
            reason: "Contains title"
Note: The ai_detect matcher calls the Bedrock API, so it processes rows more slowly than the other matchers. Use the max_rows run option to validate your configuration before running on full datasets.
Token usage and costs
Assuming Claude Sonnet 4.5 is used as the DataMasque AI Engine's underlying model, the cost for processing 100 documents (with 1,000 characters each, alongside 20 labels) is approximately $1 USD. Costs scale linearly with each of the following: document length, the number of labels, and annotation config complexity.
Note: For more details on token usage and cost calculation, see the README.md provided with the DataMasque AI Engine.
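As a rough worked example of the linear scaling described above, the sketch below scales the quoted baseline (100 documents of 1,000 characters with 20 labels for about $1 USD). The function name and its parameters are hypothetical, annotation config complexity is ignored, and real costs depend on the underlying model and its token pricing, so treat the result as an order-of-magnitude estimate only.

```python
def estimate_cost_usd(documents, chars_per_document, labels,
                      baseline_cost_usd=1.0):
    """Scale the quoted baseline linearly (hypothetical helper).

    Baseline from the text: 100 documents x 1,000 characters x 20 labels
    costs roughly $1 USD. Annotation config complexity is not modelled.
    """
    return (baseline_cost_usd
            * (documents / 100)
            * (chars_per_document / 1_000)
            * (labels / 20))

# The baseline itself.
print(estimate_cost_usd(100, 1_000, 20))
# 50x the documents, 2x the length, half the labels.
print(estimate_cost_usd(5_000, 2_000, 10))
```

Under these assumptions the second call gives roughly 50 times the baseline cost.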
Writing effective guidelines
1. Start simple
- Language models have a lot of pre-built world knowledge.
- A small simple guideline is usually enough to match on 95%+ of entities.
2. Work iteratively
If you are able to access the data after masking, it helps to compare the results before and after. Tweak the guidelines after each attempt. Use max_rows to mask only a subset of your data.
If you find that certain entities tend not to be matched, consider adding them to your valid_examples; if other entities tend to be matched unintentionally, consider adding them to your invalid_examples.
3. Consider downstream masks
Lastly, consider how your guidelines apply to downstream masks and the realism of your masked data. Be specific. In the PATIENT_FIRST_NAME example above, TITLE was excluded because we wanted to mask it separately.
regex
The regex matcher finds text matching regular expression patterns.
| Pattern | Text |
|---|---|
| \b\d{5}\b | ZIP Code: 90210 |
| \b[A-Z]{3}-\d{4}\b | Here: ABC-1234 is my reference |
| \b\d{3}-\d{3}-\d{4}\b | Call 555-867-5309 for details |
Use regex if you want to detect data following well-defined patterns.
matchers:
  regex:
    - label: "ZIP_CODE"
      pattern: '\b\d{5}\b'
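To try patterns before adding them to a ruleset, a quick check with Python's `re` module can help. This is a sketch under the assumption that the patterns behave the same way in Python as in DataMasque's regex engine, which the documentation here does not state; `find_matches` is a hypothetical helper.

```python
import re

# Pattern/text pairs taken from the table above.
examples = [
    (r"\b\d{5}\b", "ZIP Code: 90210"),
    (r"\b[A-Z]{3}-\d{4}\b", "Here: ABC-1234 is my reference"),
    (r"\b\d{3}-\d{3}-\d{4}\b", "Call 555-867-5309 for details"),
]

def find_matches(pattern, text):
    """Return every substring of `text` matched by `pattern`."""
    return [m.group(0) for m in re.finditer(pattern, text)]

for pattern, text in examples:
    print(pattern, "->", find_matches(pattern, text))
```

Each pattern picks out only the structured token (`90210`, `ABC-1234`, `555-867-5309`) and leaves the surrounding words alone.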
checksum
The checksum matcher validates numbers using checksum algorithms.
| Number | Valid Credit Card? |
|---|---|
| 4532-1234-5678-9010 | Yes |
| The cc: <1234-5678-9012-3456> | No |
| \n5105105105105100\n | Yes |
Use checksum if you want to match only those entities which pass validation algorithms.
matchers:
  checksum:
    - label: "CREDIT_CARD"
      type: "credit_card"
For a list of supported checksums,
see the set_checksum documentation.
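As an illustration of checksum-based matching, the sketch below finds 16-digit candidates and keeps those that pass the Luhn algorithm, which is the checksum commonly used for payment card numbers. That DataMasque's credit_card type uses plain Luhn is an assumption, and the 16-digit candidate pattern is a simplification; both function names are hypothetical.

```python
import re

def luhn_valid(number):
    """Return True if the digits in `number` pass the Luhn checksum."""
    digits = [int(ch) for ch in number if ch.isdigit()]
    if not digits:
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0

def find_card_numbers(text):
    """Return 16-digit candidates (optionally dash-separated) that pass Luhn."""
    candidates = re.findall(r"\b(?:\d{4}-?){3}\d{4}\b", text)
    return [c for c in candidates if luhn_valid(c)]

print(find_card_numbers("The cc: <1234-5678-9012-3456>"))
print(find_card_numbers("\n5105105105105100\n"))
```

The first call finds a candidate but rejects it, and the second keeps its candidate, in line with the last two rows of the table. Under plain Luhn the first table row (4532-1234-5678-9010) does not validate, so DataMasque's check may differ from this sketch or that row may be purely illustrative.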
Overlap Handling
When matches overlap, DataMasque merges them by extending boundaries to cover all overlapping text.
The final label is determined by considering:
- the start positions,
- the match lengths,
- and finally the matcher priority.
Note: Adjacent entities that touch but don't overlap are masked separately.
Same-label overlap
Multiple matchers may be assigned to the same label:
matchers:
  context_sources:
    - label: "LAST_NAME"
      column: last_name
  ai_detect:
    - label: "LAST_NAME"
      annotation_config:
        use_preset: "LAST_NAME"
| matcher | label | clinical_notes |
|---|---|---|
| context_sources | LAST_NAME | Patient Rebecca de Santis arrived. |
| ai_detect | LAST_NAME | Patient Rebecca de Santis arrived. |
| combined | LAST_NAME | Patient Rebecca de Santis arrived. |
As both detected entities have the same label,
the combined LAST_NAME is the extension of both.
Cross-label overlap
Multiple matchers may overlap across different labels:
matchers:
  context_sources:
    - label: "ORGANISATION"
      column: org_name
  ai_detect:
    - label: "LOCATION"
      annotation_config:
        use_preset: "STREET_ADDRESS"
| matcher | label | text |
|---|---|---|
| context_sources | ORGANISATION | Visit Acme Bank, London … |
| ai_detect | LOCATION | Visit Acme Bank, London … |
| combined | LOCATION | Visit Acme Bank, London … |
If entities with different labels overlap, the winning label is the one whose match has the left-most start position.
If several matches share the left-most start position, the winning label is that of
the longest match. In the above example, the LOCATION label is used because its match is the longest.
When both start positions and lengths are equal, matcher priorities apply.
Matcher Priorities
From highest to lowest:
1. context_sources
2. seed_files
3. ai_detect
4. checksum
5. regex
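The overlap rules in this section can be sketched as a small merge routine. This is a minimal sketch, not DataMasque's implementation: the `Match` type, the function names, and the character offsets in the usage example are hypothetical.

```python
from dataclasses import dataclass

# Matcher priority from highest to lowest, as listed above.
PRIORITY = ["context_sources", "seed_files", "ai_detect", "checksum", "regex"]

@dataclass(frozen=True)
class Match:
    start: int    # inclusive character offset
    end: int      # exclusive character offset
    label: str
    matcher: str

def _rank(m):
    # Earlier start wins, then the longer match, then matcher priority.
    return (m.start, -(m.end - m.start), PRIORITY.index(m.matcher))

def _close(group, group_end):
    # The merged span covers the whole group; the label comes from the winner.
    winner = min(group, key=_rank)
    return (group[0].start, group_end, winner.label)

def merge_overlaps(matches):
    """Merge overlapping matches into (start, end, label) tuples."""
    merged, group, group_end = [], [], None
    for m in sorted(matches, key=lambda m: (m.start, m.end)):
        if group and m.start < group_end:   # strict overlap only
            group.append(m)
            group_end = max(group_end, m.end)
        else:                               # touching spans stay separate
            if group:
                merged.append(_close(group, group_end))
            group, group_end = [m], m.end
    if group:
        merged.append(_close(group, group_end))
    return merged

text = "Visit Acme Bank, London"
found = [
    Match(6, 15, "ORGANISATION", "context_sources"),  # assumed span: "Acme Bank"
    Match(6, 23, "LOCATION", "ai_detect"),            # assumed span: "Acme Bank, London"
]
print(merge_overlaps(found))
```

With these assumed spans, both matches start at the same position, so the longer LOCATION match supplies the label for the merged span.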
Context Filters
Context filters refine matches and can be applied to any matcher.
match_whole_words_only
By default, matchers will match partial words.
match_whole_words_only prevents matching within words.
| first_name | Without filter | With match_whole_words_only |
|---|---|---|
| Tom | Tom arrived | Tom arrived |
| Tom | Tomorrow | Tomorrow |
Use match_whole_words_only to exclude partial word matches.
matchers:
  context_sources:
    - label: "PATIENT_FIRST_NAME"
      column: first_name
      context_filters:
        - type: match_whole_words_only
exclude_pattern
Excludes matches which match a regex pattern.
| Without filter | Exclude "@example.com" |
|---|---|
| Contact: john.smith@example.com | Contact: john.smith@example.com |
| Email: jose.garcia@email.com | Email: jose.garcia@email.com |
Use exclude_pattern to discard matches that match a regex pattern.
matchers:
  regex:
    - label: "EMAIL"
      pattern: '[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}'
      context_filters:
        - type: exclude_pattern
          pattern: '@example\.com$'
include_pattern
Includes only matches which match a regex pattern.
| Without filter | Include " Hospital" |
|---|---|
| Royal Melbourne Hospital treats… | Royal Melbourne Hospital treats… |
| Royal Melbourne clinic is closed | Royal Melbourne clinic is closed |
Use include_pattern to keep only matches that match a regex pattern.
matchers:
  seed_files:
    - label: "HOSPITAL_NAME"
      seed_file: "hospitals.csv"
      seed_column: "hospital_name"
      context_filters:
        - type: include_pattern
          pattern: " Hospital$"
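The effect of exclude_pattern and include_pattern can be sketched as a filter over candidate matches. This sketch assumes each pattern is searched within the matched text itself, which the examples above suggest but do not state outright; `apply_pattern_filters` and the candidate lists are hypothetical.

```python
import re

def apply_pattern_filters(matches, include=None, exclude=None):
    """Keep matched strings that satisfy the include/exclude patterns.

    Assumption: each pattern is searched within the matched text itself.
    """
    kept = []
    for text in matches:
        if include is not None and not re.search(include, text):
            continue  # include_pattern: drop matches lacking the pattern
        if exclude is not None and re.search(exclude, text):
            continue  # exclude_pattern: drop matches containing the pattern
        kept.append(text)
    return kept

emails = ["john.smith@example.com", "jose.garcia@email.com"]
print(apply_pattern_filters(emails, exclude=r"@example\.com$"))

hospitals = ["Royal Melbourne Hospital", "Royal Melbourne"]
print(apply_pattern_filters(hospitals, include=r" Hospital$"))
```

Only `jose.garcia@email.com` survives the exclusion, and only `Royal Melbourne Hospital` survives the inclusion.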
add_prefix
Extends matches to include prefixes.
| Without filter | Extended with "Dr. " |
|---|---|
| Dr. Jones consulted | Dr. Jones consulted |
| Jones arrived | Jones arrived |
Use add_prefix to include leading context.
matchers:
  context_sources:
    - label: "DOCTOR_NAME"
      column: last_name
      context_filters:
        - type: add_prefix
          prefix: "Dr. "
add_suffix
Extends matches to include a suffix if present.
| hospital_name | Without filter | Extended with " Hospital" |
|---|---|---|
| Royal Melbourne | Royal Melbourne treats | Royal Melbourne treats |
| Royal Melbourne | Royal Melbourne Hospital | Royal Melbourne Hospital |
Use add_suffix to include trailing context.
matchers:
  seed_files:
    - label: "HOSPITAL_NAME"
      seed_file: "hospitals.csv"
      seed_column: "hospital_name"
      context_filters:
        - type: add_suffix
          suffix: " Hospital"
Combining filters
Context filters are applied in two phases:
- Validation filters (include_pattern, exclude_pattern, match_whole_words_only) are applied first
- Extension filters (add_prefix, add_suffix) are applied second
matchers:
  seed_files:
    - label: "DOCTOR_NAME"
      seed_file: "DataMasque_lastNames.csv"
      seed_column: "lastnames"
      context_filters:
        - type: match_whole_words_only
        - type: add_prefix
          prefix: "Dr. "
| Text input | Match | Step 1: match_whole_words_only | Step 2: add_prefix | Final result |
|---|---|---|---|---|
| "Dr. Smith consulted" | Smith ✓ | whole word ✓ | extends to Dr. Smith ✓ | Dr. Smith masked |
| "Smithson arrived" | Smith ✓ | part of word ✗ | - | Not masked |
| "Jones arrived" | Jones ✓ | whole word ✓ | no prefix found | Jones masked |
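The two-phase order shown in the table can be sketched as follows. This is a minimal sketch, not DataMasque's implementation: the helper names are hypothetical, and a "word" is approximated here as a run of alphanumeric characters.

```python
import re

def whole_word(text, start, end):
    """True if text[start:end] is not embedded in a longer word."""
    before = text[start - 1] if start > 0 else " "
    after = text[end] if end < len(text) else " "
    return not (before.isalnum() or after.isalnum())

def add_prefix(text, start, end, prefix):
    """Extend the span leftwards if `prefix` immediately precedes it."""
    if text[max(0, start - len(prefix)):start] == prefix:
        return start - len(prefix), end
    return start, end

def detect(text, names, prefix="Dr. "):
    """Return the final matched strings for `names` found in `text`."""
    results = []
    for name in names:
        for m in re.finditer(re.escape(name), text):
            start, end = m.span()
            # Phase 1: validation filters can reject the match.
            if not whole_word(text, start, end):
                continue
            # Phase 2: extension filters widen the surviving match.
            start, end = add_prefix(text, start, end, prefix)
            results.append(text[start:end])
    return results

names = ["Smith", "Jones"]
for line in ["Dr. Smith consulted", "Smithson arrived", "Jones arrived"]:
    print(repr(line), "->", detect(line, names))
```

Running validation first means a rejected match such as the `Smith` inside `Smithson` is never extended, which is why the order of the two phases matters.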
Configuring Masks for Labels
After defining which labels to match for, a mask must be defined for each label.
Important: DataMasque will raise an error if a matcher defines a label without an associated mask. If you wish to detect entities without masking them, you can use the do_nothing mask.
Note: Labels must be UPPER_CASE.
| Detected label | Mask Used | Original text | Masked text |
|---|---|---|---|
| PHONE | from_fixed | Call 555-123-4567. | Call XXX-XXX-XXXX. |
| REFERENCE_ID | imitate | Reference ID: AC-891234 | Reference ID: BK-472851 |
matchers:
  ...
masks:
  - label: "PHONE"
    masks:
      - type: from_fixed
        value: "XXX-XXX-XXXX"
  - label: "REFERENCE_ID"
    masks:
      - type: imitate
For a complete list of masks, see Masking Functions Overview.
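Before running a ruleset, the two label rules above can be checked with a small pre-flight script. This is a sketch only: `check_labels` is a hypothetical helper, and the exact character set DataMasque accepts in labels is assumed here to be upper-case letters, digits, and underscores.

```python
import re

# Assumed reading of "UPPER_CASE": upper-case letters, digits, underscores.
LABEL_RE = re.compile(r"^[A-Z][A-Z0-9_]*$")

def check_labels(matcher_labels, mask_labels):
    """Return a list of problems; an empty list means the labels look consistent."""
    problems = []
    for label in sorted(set(matcher_labels) | set(mask_labels)):
        if not LABEL_RE.match(label):
            problems.append(f"label {label!r} is not UPPER_CASE")
    # Every label used by a matcher needs an associated mask.
    for label in sorted(set(matcher_labels) - set(mask_labels)):
        problems.append(f"label {label!r} has no associated mask")
    return problems

print(check_labels(["PHONE", "REFERENCE_ID"], ["PHONE"]))
```

Here the check reports that `REFERENCE_ID` has no associated mask.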
Fallback Masks
If the ai_detect matcher fails to process text due to a network timeout, content guardrails, or otherwise,
fallback_masks can be applied to the entire text instead of raising an error.
The following ruleset uses the ai_detect matcher to match on first names.
If a networking issue occurs, the entire text will be replaced.
matchers:
  ai_detect:
    - label: "FIRST_NAME"
      annotation_config:
        use_preset: "FIRST_NAME"
masks:
  - label: "FIRST_NAME"
    masks:
      - type: from_file
        seed_file: DataMasque_firstNames_mixed.csv
        seed_column: firstname-mixed
fallback_masks:
  - type: from_fixed
    value: "[REDACTED]"
| AI Engine Status | clinical_notes (before) | clinical_notes (after) |
|---|---|---|
| Online | Patient Heiko arrived | Patient Marnie arrived |
| Network Error | Patient Heiko arrived | [REDACTED] |
The fallback replaces the entire text, not individual entities.
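The fallback behaviour can be sketched as a try/except wrapper. This is an illustrative sketch, not DataMasque's implementation: `DetectionError`, `mask_text`, and the `detect_and_mask` callable are hypothetical stand-ins for the AI Engine call and its failure modes.

```python
MAX_AI_DETECT_CHARS = 10_000  # per-field limit quoted in the ai_detect warning

class DetectionError(Exception):
    """Hypothetical error type standing in for any ai_detect failure."""

def mask_text(text, detect_and_mask, fallback_value=None):
    """Mask `text`, or replace it wholesale if detection fails.

    `detect_and_mask` is a hypothetical callable that masks individual entities.
    """
    try:
        if len(text) > MAX_AI_DETECT_CHARS:
            raise DetectionError("text exceeds maximum length")
        return detect_and_mask(text)
    except DetectionError:
        if fallback_value is None:
            raise  # no fallback configured: surface the error
        return fallback_value  # the whole text is replaced, not single entities

def online(text):
    return text.replace("Heiko", "Marnie")

def offline(text):
    raise DetectionError("network error")

print(mask_text("Patient Heiko arrived", online, "[REDACTED]"))
print(mask_text("Patient Heiko arrived", offline, "[REDACTED]"))
```

The first call masks only the name; the second returns the fixed fallback value for the whole field.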
Appendix: ai_detect Presets
FIRST_NAME
Guidelines:
- Given name only.
- Excludes titles, middle names, initials
Valid examples:
- "Student [Alice] Walker submitted the assignment"
Invalid examples:
- "Manager [J.] approved the request"
  - Initial
LAST_NAME
Guidelines:
- Surname only
- Excludes titles, middle names, initials
Valid examples:
- "Ms. [Smith] will lead the meeting"
- "Professor [van der Berg] published the paper"
  - Compound surname as unit
FULL_NAME
Note: Use FIRST_NAME and LAST_NAME if this preset
is not behaving as expected.
Guidelines:
- First and last name together, optionally includes middle names
- Excludes titles, initials
Valid examples:
- "Award winner [Mary Elizabeth Johnson] will speak"
INITIAL_NAME
Guidelines:
- Single or multi-letter person-related initials only
- Excludes titles e.g. Mr, Miss
Valid examples:
- "[J. K.] Rowling wrote the book"
Invalid examples:
- "The [U.S.A.] is a large country"
  - Not an identifying individual
TITLE
Guidelines:
- Professional titles and honorifics, including dot
Valid examples:
- "[Dr.] Smith will see you now"
- "Welcome [Ms] Johnson to the stage"
AGE
Guidelines:
- Numeric ages and age ranges of individuals
Valid examples:
- "Applicant is [45 years old]"
- "Participants aged [30-35] were selected"
Invalid examples:
- "DataMasque is [5] years old"
  - Not individual
- "Targeted at [elderly] population"
  - Non-numeric
DATE_OF_BIRTH
Guidelines:
- Dates when context clearly indicates birth
Valid examples:
- "DOB: [January 1]"
- "he was born yesterday (today is [01/05/2024])"
NATIONALITY
Guidelines:
- National origin, citizenship when describing identity
Valid examples:
- "She is a [British] citizen"
GENDER
Guidelines:
- Gender identity or biological sex
- Excludes pronouns
Valid examples:
- "The patient (27[M]) is recovering well"
- "Participant is [non-binary]"
RACE
Guidelines:
- Racial or ethnic identity
Valid examples:
- "Patient identified as [African American]"
EMAIL
Guidelines:
- Email addresses with @ and domain
- Includes partial/obfuscated emails
Valid examples:
- "Contact: [user+tag@company.co.uk]"
- "Reach out at [bob@gmail] for details"
FAX
Guidelines:
- Fax numbers only when context indicates fax
Valid examples:
- "Send documents to fax [487.416.6741x713]"
PHONE_NUMBER
Guidelines:
- Phone numbers in any format
- Includes extensions, partial numbers
- Excludes fax numbers and placeholders
Valid examples:
- "Extension [555.123.4567 ext. 890]"
- "Heya my number is: *** *** *** [1234]"
  - Excludes placeholder *s
USERNAME
Guidelines:
- Online handles and account identifiers
Valid examples:
- "Follow [@johndoe123] on twitter"
STREET_ADDRESS
Guidelines:
- Street number and name only, may include unit/apartment
- Includes postal codes, PO boxes
- Excludes city, state, country
Valid examples:
- "Ship to [195 Main Street], Boston, MA [02101]"
- "Office located at [456 Oak Avenue Apt 2B]"
CITY
Guidelines:
- Cities, towns, villages
Valid examples:
- "Traveling to [Melbourne] next week"
COUNTRY
Guidelines:
- Country names, including abbreviations
Valid examples:
- "Visiting the [UK] next month"
- "Born in [Canada]"
STATE
Guidelines:
- States, provinces, territories, includes abbreviations
Valid examples:
- "Shipping to [NY]"
- "Property in [New South Wales]"
ZIPCODE
Guidelines:
- ZIP codes and postal codes in any format
- Can be part of an address or standalone
Valid examples:
- "Postcode [SW1A 1AA]"
JOB_TITLE
Guidelines:
- Specific, professional roles and job titles
Valid examples:
- "Promoted to [Senior Vice President of Marketing]"
Invalid examples:
- "He was a good [manager]"
  - Generic, non-specific
ORGANIZATION
Guidelines:
- Specific organizations, including appropriate prefixes/suffixes
Valid examples:
- "Enrolled at [Stanford University] for fall"
Invalid examples:
- "Works at [a hospital]"
  - Generic, non-specifying
MEDICAL_CONDITION
Guidelines:
- Diseases, illnesses, diagnoses
Valid examples:
- "Diagnosed with [hypertension]"
- "Treatment for [Type 2 diabetes]"