Unique Masks
Unique masks are masks that will ensure every value is replaced with a unique value for that column. For other masks see all mask functions.
- Imitate Unique (
imitate_unique)
Masks strings or numbers while preserving format and uniqueness - Imitate NZ IRD (
imitate_nz_ird)
Masks New Zealand IRD numbers while preserving validity and format - Imitate UUID (
imitate_uuid)
Masks UUID/GUID values while preserving format and uniqueness - From Unique Imitate (
from_unique_imitate) (deprecated)
Transforms strings or numbers to be random, retaining format and uniqueness - From Unique (
from_unique) (deprecated)
Generates random but unique strings or numbers, from a format string (databases only)
Imitate Unique (imitate_unique)
Note: When used in a
mask_tabletask, this mask type will be skipped in a dry run as it requires writing to the database.
The imitate_unique mask replaces characters with others
from the same configured character group,
retaining format and uniqueness.
- Characters in the configured charset groups are replaced with other characters from the same group.
- Characters not in any configured charset group are left unchanged.
- The masked values are guaranteed to be consistent for the same given input, and different for different inputs.
By default, the character set groups are
uppercase letters (A-Z), lowercase letters (a-z), and digits (0-9).
Custom charset groups can be configured
to support additional characters such as non-Latin scripts.
This mask can only be used on columns of integer or string (char / varchar) type.
null values will always be left as null.
Empty strings are preserved as empty strings.
Important!
imitate_uniquedoes not supportIDENTITYcolumns in Oracle or Microsoft SQL Server.
imitate_uniquedoes not support directly masking partitioning tables (child tables of partitions) in Postgres 10+. Please mask the parent table instead.
imitate_uniquedoes not support directly masking hybrid tables in Snowflake.Note on Sequence Reset Warnings: When masking columns that have default values containing function calls (e.g.
gen_random_uuid(),now()in PostgreSQL, or custom functions), you may see a warning: "Unable to reset the next value of the sequence for column… If a sequence exists, it may generate a value that already exists in the column."This warning occurs because DataMasque cannot determine whether functions in default values internally use sequences. The warning is informational and harmless for columns where the default function does not use a sequence. DataMasque takes this defensive approach to ensure that if a sequence does exist within a function call, you are notified that it could not be reset. For columns with non-sequence function defaults, this warning can be safely ignored and masking will work correctly.
You can apply an imitate_unique mask to a primary key column
or a column that is used as a foreign key in another table.
References will be updated automatically.
Composite primary keys are supported.
Parameters
charsets(optional): A list of charset groups to use for masking. Each group is masked independently; characters are only replaced with other characters from the same group. Each entry has atypeof eithernamedorcustom:type: namedwithcharset: one ofDIGITS(digits0-9),ALPHA(uppercase and lowercase lettersA-Zanda-z),ALPHA_UPPERCASE(uppercase lettersA-Z), orALPHA_LOWERCASE(lowercase lettersa-z).type: customwithcharacters: a string of the characters to include.
Characters must not appear in more than one configured group. If
charsetsis omitted, the default groups areDIGITS,ALPHA_UPPERCASE, andALPHA_LOWERCASE.checksum(optional): A checksum algorithm to apply to the masked output. The available options are the same as forfrom_unique_imitate:brazilian_cpf,credit_card,luhn,icp,australian_business_number,australian_company_number, orweighted_dual_checksum. For alphanumeric checksums (such asicp), the configured charsets must include both letters and digits.on_invalid(optional): How to handle values that fail checksum validation. Can only be used in combination with thechecksumparameter. One of:error(default): Raise an error and stop masking.mask: Mask the value with an invalid checksum if the format is correct, or mask without consideration of a checksum otherwise.skip: Leave the value unchanged. See Usingon_invalidfor detailed behaviour.
retain_prefix_length(optional): The number of maskable characters (characters in the configured charset groups) to retain at the start of the value. Must leave at least one maskable character available. By default, no prefix is retained. See Retaining Prefixes for details and examples.retain_suffix_length(optional): The number of maskable characters (characters in the configured charset groups) to retain at the end of the value. By default, no suffix is retained.include_leading_zeros(optional): Whether to mask leading zeros in input values and allow leading zeros in masked values. One of:true: Leading zeros will be masked.false: Leading zeros will be preserved.warn(default): Likefalse, but also issues a warning in the run log if a string value with all zero digit(s) is encountered. See thefrom_unique_imitatedocumentation for detailed behaviour and examples.
min_digits(optional): Only applicable wheninclude_leading_zerosistrue. Pads integer-type values to the given number of digits before masking. Valid values are between 1 and 18.on_too_long(optional): Only applicable wheninclude_leading_zerosistrueandmin_digitsis specified. Determines what action to take when an integer value has more thanmin_digitsdigits. One of:error(default): Raise an error and stop masking.mask_suffix: The lastmin_digitsdigits will be masked; earlier digits will be left as-is.mask_all: The entire value will be masked.
seed(optional): An additional seed value to adjust the RNG. Use this to produce different masked outputs for the same input across different columns or tables.
Example
version: '1.0'
tasks:
- type: mask_table
table: customers
key: id
rules:
- column: customer_code
masks:
- type: imitate_unique
charsets:
- type: named
charset: ALPHA_UPPERCASE
- type: named
charset: DIGITS
- type: custom
characters: "等级的活动"
Show result
| Before | After |
|
|
|---|
Limitations
The maskable segment length is limited to 65,535 characters (characters in the configured charset groups after any prefix/suffix preservation).
Imitate NZ IRD (imitate_nz_ird)
Note: When used in a
mask_tabletask, this mask type will be skipped in a dry run as it requires writing to the database.
The imitate_nz_ird mask is designed for masking New Zealand Inland Revenue Department (IRD) numbers.
Valid IRD numbers are masked to other valid IRD numbers with correct checksums.
- The mask is format-preserving: valid IRD numbers are masked to valid IRD numbers.
- The masked values are guaranteed to be consistent for the same given input, and different for different inputs.
- IRD numbers can be either 8 or 9 digits. An 8-digit IRD number is equivalent to a 9-digit number
with a leading zero (e.g.,
12345678is equivalent to012345678). The mask handles both formats.
This mask can only be used on columns of integer or string (char / varchar) type.
null values will always be left as null.
Empty strings are preserved as empty strings.
Note: The IRD numbers shown in this documentation are randomly generated for illustrative purposes. Any resemblance to actual IRD numbers is coincidental.
Important!
imitate_nz_irddoes not supportIDENTITYcolumns in Oracle or Microsoft SQL Server.
imitate_nz_irddoes not support directly masking partitioning tables (child tables of partitions) in Postgres 10+. Please mask the parent table instead.
imitate_nz_irddoes not support directly masking hybrid tables in Snowflake.
You can apply an imitate_nz_ird mask to a primary key column
or a column that is used as a foreign key in another table.
References will be updated automatically.
Composite primary keys are supported.
When imitate_nz_ird is applied to a primary key or unique key column,
DataMasque temporarily disables the constraint, masks the values,
then re-enables the constraint after masking completes.
Parameters
on_invalid(optional): A string to specify the action to take if the value is not a valid IRD number. One of:error(default): Raise an error and stop masking.mask: For 8-9 digit values with an invalid checksum, mask to an output that also has an invalid checksum. For values with fewer or more digits, mask usingfrom_unique_imitate(length-preserving).skip: Skip to the next value, the value remains unchanged.
retain_length(optional): A boolean to preserve the digit count of the input. Whentrue, 8-digit inputs produce 8-digit outputs and 9-digit inputs produce 9-digit outputs. Defaults tofalse(generally preferable), meaning the output may vary between 8 and 9 digits. See details and examples below.skip_letters(optional): A boolean to enable masking of alphabetic characters separately. Whentrue, letters are masked viafrom_unique_imitatewhile digits follow IRD masking logic. Defaults tofalse(letters are preserved as-is).seed(optional): A string or integer combined with the instance secret and run secret to control randomness. Allows different columns using the same mask type to produce different outputs for identical inputs. Note that separate runs with the same seed will produce different output unless the instance secret and run secret are the same.
A value is considered invalid if:
- It has fewer than 8 or more than 9 digits
- It is outside the valid IRD range (10,000,000 to 149,999,999)
- It has 8-9 digits but the checksum is incorrect
retain_length Details
By default, imitate_nz_ird treats 8-digit and 9-digit IRD numbers as equivalent
(since 12345678 and 012345678 represent the same IRD).
This means the same IRD stored as 12345678 (integer) and "012345678" (string) will mask to the same
numeric value. String outputs preserve their original length when possible—an 8-digit string stays 8 digits
if the output value is under 100,000,000, and a 9-digit string always stays 9 digits.
Setting retain_length: true guarantees the exact digit count is always preserved, regardless of the
output value. This is useful when your data contains both "12345678" and "012345678" as distinct
string values in the same column that should remain distinguishable after masking.
Examples of
retain_length behavior
With retain_length: false (default):
| Input | Type | Example Output | Notes |
|---|---|---|---|
"28678258" |
string | "47448549" |
8-digit string stays 8 digits when output < 100M |
"028678258" |
string | "047448549" |
9-digit string stays 9 digits |
28678258 |
integer | 47448549 |
Same numeric value as both strings above |
Note: An 8-digit input can produce a 9-digit output if the masked value is 100,000,000 or greater. The numeric value is always consistent across equivalent IRD representations.
With retain_length: true:
| Input | Type | Example Output | Notes |
|---|---|---|---|
"28678258" |
string | "14196102" |
Stays 8 digits |
"028678258" |
string | "107859721" |
Stays 9 digits, different from 8-digit input |
28678258 |
integer | 14196102 |
Stays 8 digits |
When to use retain_length: true:
Use this option when a string column contains the same IRD value stored both with and without
a leading zero (e.g., both "12345678" and "012345678" exist as distinct rows in the same column).
Without retain_length: true, these would mask to the same output, potentially causing uniqueness
constraint violations.
Caveats:
- The string
"028678258"and the integer28678258will NOT mask to consistent values. - Integer columns cannot benefit from this option since integers cannot have leading zeros.
When to leave retain_length as false (default):
In most cases, leave retain_length as false. This provides consistent masking between
8-digit and 9-digit representations of the same IRD, and between string and integer columns.
Example Rulesets
Basic IRD Masking (Integer Column)
version: '1.0'
tasks:
- type: mask_table
table: taxpayers
key: id
rules:
- column: ird_number
masks:
- type: imitate_nz_ird
Show result
| Before | After |
|
|
|---|
Masking with Format Preservation (String Column)
version: '1.0'
tasks:
- type: mask_table
table: records
key: id
rules:
- column: formatted_ird
masks:
- type: imitate_nz_ird
Show result
| Before | After |
|
|
|---|
Using retain_length for Consistent Digit Count (String Column)
version: '1.0'
tasks:
- type: mask_table
table: mixed_format_irds
key: id
rules:
- column: ird_number
masks:
- type: imitate_nz_ird
retain_length: true
Show result
| Before | After |
|
|
|---|
Note: With retain_length: true, the 8-digit input stays 8 digits, and the 9-digit input (with leading zero) stays 9 digits.
Handling Invalid Values
version: '1.0'
tasks:
- type: mask_table
table: legacy_data
key: id
rules:
- column: ird_number
masks:
- type: imitate_nz_ird
on_invalid: mask
Show result
| Before | After |
|
|
|---|
Imitate UUID (imitate_uuid)
Note: When used in a
mask_tabletask, this mask type will be skipped in a dry run as it requires writing to the database.
The imitate_uuid mask is designed specifically for masking UUID (Universally Unique Identifier)
and GUID (Globally Unique Identifier) values while preserving their format and maintaining uniqueness.
This mask automatically detects and handles various UUID formats:
- Standard hyphenated format:
550e8400-e29b-41d4-a716-446655440000 - Compact format (no hyphens):
550e8400e29b41d4a716446655440000 - Prefixed formats:
urn:uuid:550e8400-e29b-41d4-a716-446655440000,uuid:550e8400-e29b-41d4-a716-446655440000 - Binary/byte format: 16-byte binary representation (e.g. Oracle RAW(16) columns)
Key features:
- Deterministic masking: The same input UUID always produces the same masked output, ensuring consistency across related data.
- Format preservation: The original UUID formatting (hyphens, prefixes, case) is maintained in the output.
- Referential integrity: When applied to primary keys, foreign key references are automatically updated.
- Uniqueness guarantee: Each distinct input UUID maps to a unique output UUID.
null values will always be left as null.
Empty strings are preserved as empty strings.
Binary/Byte Format Support
imitate_uuid can mask UUIDs stored as binary data (e.g. Oracle RAW(16) columns). When processing byte values:
- The input must be exactly 16 bytes to be recognised as a valid UUID
- The same UUID represented as bytes or string will produce equivalent masked results
- All retention parameters work identically for byte and string formats
- The output maintains the byte format when the input is bytes
Important: The on_invalid: mask option has limitations with byte values. Since from_unique_imitate
could produce non-hexadecimal characters, it cannot safely be used for byte masking. When byte input is invalid:
on_invalid: errorraises an error (default behaviour)on_invalid: skipreturns the original bytes unchangedon_invalid: maskreverts to error behaviour for safety
Important!
imitate_uuiddoes not supportIDENTITYcolumns in Oracle or Microsoft SQL Server.imitate_uuiddoes not support directly masking partitioning tables (child tables of partitions) in Postgres 10+. Please mask the parent table instead.imitate_uuiddoes not support directly masking hybrid tables in Snowflake.Note on Sequence Reset Warnings: When masking UUID columns that have default values containing function calls (e.g.
gen_random_uuid()in PostgreSQL), you may see a warning: "Unable to reset the next value of the sequence for column… If a sequence exists, it may generate a value that already exists in the column."This warning occurs because DataMasque cannot determine whether functions in default values internally use sequences. The warning is informational and harmless for UUID columns, as UUID generation functions typically do not use sequences. DataMasque takes this defensive approach to ensure that if a sequence does exist within a function call, you are notified that it could not be reset. For UUID columns with function defaults, this warning can be safely ignored and masking will work correctly.
You can apply an imitate_uuid mask to a primary key column
or a column that is used as a foreign key in another table.
References will be updated automatically.
Composite primary keys containing UUID columns are supported.
Parameters
seed(optional): An additional seed value to create different masking patterns while maintaining determinism. Useful for generating different UUID mappings for different environments or columns.retain_uuid_version(optional): A boolean to preserve the UUID version indicator (13th hexadecimal character). Whentrue(default), the version nibble is retained to maintain UUID version consistency.retain_prefix_length(optional): Number of hexadecimal characters to retain from the beginning of the UUID. Valid values are 0-31. Must not overlap with suffix retention.retain_suffix_length(optional): Number of hexadecimal characters to retain from the end of the UUID. Valid values are 0-31. Must not overlap with prefix retention.on_invalid(optional): Action to take when the input value is not a valid UUID. One of:error(default): Raise an error and stop masking.skip: Leave the value unchanged.mask: Fall back to character-by-character masking usingfrom_unique_imitate. Note: For byte values, this option reverts toerrorbehaviour sincefrom_unique_imitatecannot safely mask binary data.
default_case(optional): The case to use for output when the input UUID contains only digits or has mixed case. Options:upper,lower(default).warn_if_masking_invalid(optional): A boolean that controls whether to log warnings when invalid values are masked (only applies whenon_invalidismask). Defaults totrue.warn_on_mixed_case(optional): A boolean that controls whether to log warnings when UUIDs contain mixed uppercase and lowercase characters. Defaults totrue.uuid_format(optional): Controls the format compliance of masked UUIDs. Options:default,rfc_4122. When set torfc_4122, masked UUIDs are guaranteed to have a valid RFC 4122 variant (variant bits set to10xx) and a valid version (1-5). This is useful when downstream systems validate UUID format strictly (e.g. yup). Defaults todefault.
Examples
Basic Operation
This example shows standard UUID masking with default settings:
version: '1.0'
tasks:
- type: mask_table
table: users
key: id
rules:
- column: user_id
masks:
- type: imitate_uuid
| Input | Output |
|---|---|
550e8400-e29b-41d4-a716-446655440000 |
a4f7c832-7b5e-4d8a-9c23-8e9f2a1b6d45 |
550e8400e29b41d4a716446655440000 |
a4f7c8327b5e4d8a9c238e9f2a1b6d45 |
The format (with or without hyphens) is preserved.
Using Seed for Different Mappings
The seed parameter creates different masking patterns. Compare the same UUID masked with and without a seed:
version: '1.0'
tasks:
- type: mask_table
table: audit_log
key: id
rules:
- column: uuid_with_seed
masks:
- type: imitate_uuid
seed: "custom_seed"
- column: uuid_without_seed
masks:
- type: imitate_uuid
| Input UUID | Output without seed | Output with seed="custom_seed" |
|---|---|---|
123e4567-e89b-42d3-a456-426614174000 |
8d2e7a5b-9f1c-4e3d-a7b2-5e9c8f3a2d1e |
f3c2a8e5-7b4d-49f1-8e3a-9d2f5c8a1b7e |
The seed ensures different columns get different masked values for the same input UUID. Note that DataMasque's run secret and instance secret are combined with the seed to control the random output, i.e. separate runs with the same seed will produce different output unless the instance secret and run secret are the same.
Note: The seed parameter is used for controlling masking patterns:
- Use different seeds for different columns to ensure the same UUID doesn't map to the same output across columns
- Example: Without different seeds,
if user UUID
123e4567...appears in bothcreated_byandmodified_bycolumns, it would mask to the same value in both places, potentially revealing relationships
Prefix and Suffix Retention
Retain specific hex characters from the start or end (hyphens not counted):
version: '1.0'
tasks:
- type: mask_table
table: devices
key: id
rules:
- column: device_uuid
masks:
- type: imitate_uuid
retain_prefix_length: 10 # First 10 hex chars (dashes not counted)
retain_suffix_length: 4 # Last 4 hex chars
| Input | Output |
|---|---|
550e8400-e29b-41d4-a716-446655440000 |
550e8400-e25e-4d8a-9c23-8e9f2a440000 |
The first 10 hex characters (550e8400e2) and last 4 (0000) are preserved (note that dashes are not counted).
Note: When using retain_prefix_length and retain_suffix_length:
- Only hexadecimal characters are counted (0-9, a-f, A-F)
- Hyphens and other formatting characters are ignored in the count
- Total retention (prefix + suffix) must be less than 32 characters
- Example: In
550e8400-e29b-41d4-a716-446655440000, the first 12 hex characters are550e8400e29b, not550e8400-e29
Handling Invalid Values
The on_invalid parameter controls behaviour for non-UUID values:
Skip invalid values (leave unchanged):
- type: imitate_uuid
on_invalid: skip
| Input | Output |
|---|---|
DEVICE-001 |
DEVICE-001 |
Mask invalid values (use from_unique_imitate):
- type: imitate_uuid
on_invalid: mask
| Input | Output |
|---|---|
DEVICE-001 |
ZXCVBN-123 |
Error on invalid values (stop masking):
- type: imitate_uuid
on_invalid: error # Default
| Input | Output |
|---|---|
DEVICE-001 |
An error is raised due to invalid UUID format |
Note: When on_invalid: mask is set:
- Non-UUID values are processed using
from_unique_imitate - This maintains uniqueness but doesn't produce valid UUIDs
- Original format and character types are preserved where possible
- Example:
DEVICE-001might becomeZXCVBN-123
Case Handling
The default_case parameter applies when input has mixed case or only digits.
When the input has consistent casing (all uppercase or all lowercase),
that case is retained in the output:
- type: imitate_uuid
default_case: upper # or 'lower' (default)
| Input | Output with lower |
Output with upper |
|---|---|---|
550E8400-e29b-41D4-a716-446655440000 (mixed) |
a4f7c832-7b5e-4d8a-9c23-8e9f2a1b6d45 |
A4F7C832-7B5E-4D8A-9C23-8E9F2A1B6D45 |
11111111-2222-3333-4444-555555555555 (digits) |
c8f2a8e5-7b4d-49f1-8e3a-9d2f5c8a1b7e |
C8F2A8E5-7B4D-49F1-8E3A-9D2F5C8A1B7E |
Warning Messages
Warning messages are logged to the run log in these situations:
warn_on_mixed_case: Logs once per column per masking run when mixed-case UUIDs are detected (default: true)warn_if_masking_invalid: Logs once per column per masking run when invalid values are masked withon_invalid: mask(default: true)
These warnings appear only once per affected column during each masking run, regardless of how many mixed-case or invalid values are encountered in that column.
If multiple columns use imitate_uuid, each column can generate its own warning.
The warnings help identify data quality issues without interrupting the masking process or cluttering the log with repeated messages.
Warning:
imitate_uuidis designed for masking genuine, randomly-generated UUIDs. When applied to sequential or predictable UUID-formatted data (such as incremental counters in UUID format) combined with high retention settings that leave only a few characters to be masked, collisions may occur due to limited output space. For such data, consider usingfrom_unique_imitateinstead.
From unique (from_unique)
A mask that generates string values that are guaranteed to be unique within the target column.
Note: To automatically cascade primary and unique key values to
foreign keys or mask composite keys, consider using
mask_unique_key instead.
Parameters
format(required): The format that will be used to generate values. See format string syntax for details on how to construct a format string.
version: '1.0'
tasks:
- type: mask_table
table: drivers
key: id
rules:
- column: licence_plate
masks:
- type: from_unique
format: "{[A-Z],3}{[0-9],3}"
Show result
| Before | After |
|
|
|---|
Note
- All values produced by
from_uniquewill be strings. To convert values to other data types, you canchainyourfrom_uniquemask with atypecastmask. - Unlike
from_format_string,from_uniqueensures generated values are unique. This is achieved using the same underlying unique value generation procedure asmask_unique_key. - The specified
formatmust allow for a sufficient number of unique values to cover the full number of rows in the target table or file (rounded up to the nearest multiple of thebatch_sizeformask_tabletasks), otherwise an error will be reported when executing the run. - Using the
from_uniquemask type differs from themask_unique_keytask type in the following ways:from_uniquecannot perform any cascading of values to related columns, such as foreign keys.from_uniquecannot be used to guarantee joint uniqueness across the columns in a composite key, unless guaranteeing uniqueness within a single column within the composite key is sufficient for the target use case.from_uniquecan be used to update unique columns at the same time as other columns in amask_tabletask, without the need for a separate database update performed by amask_unique_keytask.from_uniquecan be used to generate unique values inmask_tabular_filetasks or inmask_tabletasks for databases that do not supportmask_unique_key(such as Amazon DynamoDB).
- Using
from_uniquein amask_tabletask whereworker_count> 1 is only supported for Oracle and Amazon DynamoDB connections. from_uniquedoes not support directly masking partitioning tables (child tables of partitions) in Postgres 10+. Please mask the parent table instead.from_uniquedoes not support directly masking hybrid tables in Snowflake.from_uniquecannot be used in the rules of amask_filetask.- Like other mask functions,
from_uniquecannot be used to update thekeycolumns of amask_tabletask (except for databases that allowmask_tableto update key columns, such as Amazon DynamoDB). - All
from_uniquemasks within a given task will be produced from the same sequence of unique values, such that twofrom_uniquemasks with identicalformatwill produce the same value for the same row. This can be useful to ensure values produced by identicalfrom_uniquemasks in differentifbranches are jointly unique.
From Unique Imitate (from_unique_imitate)
Deprecated: Consider using
imitate_uniqueinstead, which provides deterministic, collision-free masking with configurable character sets.Note: When used in a
mask_tabletask, this mask type will be skipped in a dry run as it requires writing to the database.
The from_unique_imitate mask is very similar to the imitate mask:
- alphabetical characters
a-zandA-Zare replaced by other alphabetical characters of the same case - digits are replaced by other digits
- all other characters (whitespace, symbols such as
%, and so on) are left as-is.
However, there are the following differences:
- Most notably,
from_unique_imitatecan also be used on primary keys and columns that have aUNIQUEconstraint. Indeed, it is designed specifically for data that must take unique values per row. - The masked values are guaranteed to be consistent for the same given input, and different for different inputs.
(The specific values generated will vary across runs unless you control the seeding.)
- For example, if a column has a
UNIQUEconstraint and the existing data satisfies that constraint, the resulting masked values will all be unique. - If a column contains the same value in every row, the content of the column after masking will also have the same value in every row.
- For example, if a column has a
The from_unique_imitate mask will always create a different result to its input,
as if it has an implicit force_change: true.
The mask may occasionally preserve individual letters and digits in the same position,
but it is always guaranteed that at least one (and very likely almost all)
of the alphanumeric characters will be different.
For example, ABC-123 may be masked to BYC-457.
This mask can only be used on columns of integer or string (char / varchar) type.
null values will always be left as null.
Important!
from_unique_imitatedoes not supportIDENTITYcolumns in Oracle or Microsoft SQL Server.
from_unique_imitatedoes not support directly masking partitioning tables (child tables of partitions) in Postgres 10+. Please mask the parent table instead.
from_unique_imitatedoes not support directly masking hybrid tables in Snowflake.
from_unique_imitatewill try to enable row movement for partitioned parent tables in Oracle Enterprise Edition if it has not already been enabled.Note on Sequence Reset Warnings: When masking columns that have default values containing function calls (e.g.
gen_random_uuid(),now()in PostgreSQL, or custom functions), you may see a warning: "Unable to reset the next value of the sequence for column… If a sequence exists, it may generate a value that already exists in the column."This warning occurs because DataMasque cannot determine whether functions in default values internally use sequences. The warning is informational and harmless for columns where the default function does not use a sequence. DataMasque takes this defensive approach to ensure that if a sequence does exist within a function call, you are notified that it could not be reset. For columns with non-sequence function defaults, this warning can be safely ignored and masking will work correctly.
You can apply a from_unique_imitate mask to a primary key column
or a column that is used as a foreign key in another table.
References will be updated automatically.
Composite primary keys are supported.
Parameters
skip_letters(optional): A boolean to enable or disable the skipping alphabetical characters from being masked. Defaults tofalse(alphabetical characters will be replaced).skip_digits(optional): A boolean to enable or disable the skipping of digits from being masked. Defaults tofalse(digits will be replaced).checksum(optional): A string to specify an algorithm to use to generate unique valid replacements. Options:brazilian_cpf,credit_card,luhn,icp,australian_business_number,australian_company_number, orweighted_dual_checksum.on_invalid(optional): A string to specify the action to take if the value fails checksum validation. One of:error(default): Raise an error and stop masking.mask: If the input length and format is valid (ie, the input value is only invalid because of an invalid checksum), the input value will be masked with a new uniquely assigned invalid checksum. Otherwise, the input value will be masked without consideration of a checksum.skip: Skip to the next value, the value remains unchanged.
retain_prefix_length(optional): The number of characters of the input value to retain. See Retaining Prefixes below. By default, no prefix is retained (i.e. the entire input value is masked). Note that retaining prefixes does not work with a prefix based checksum (australian_business_number)include_leading_zeros(optional): Whether to mask leading zeros in input values and allow leading zeros in masked values. See the table below for details and examples of how this option works. One of:true: Leading zeros will be masked.false: Leading zeros will be preserved.warn(default): Likefalse, but also issues a warning in the run log if a string value with all zero digit(s) is encountered, reminding the user that the digits (or perhaps the entire value) will not be masked.
min_digits(optional): Only applicable wheninclude_leading_zerosistrue. Pads integer-type values to the given number of digits before masking. Valid values formin_digitsare between 1 and 18.on_too_long(optional): Only applicable wheninclude_leading_zerosistrueandmin_digitsis specified. Determines what action to take when an integer value with more thanmin_digitsdigits is to be masked. (The leading-sign in a negative number does not count as a digit.) Valid values are:error(default): Raise an error and stop masking.mask_suffix: The lastmin_digitsdigits of the value will be masked; earlier digits will be left as-is.mask_all: The entire value will be masked.
disable_warning_on_skipped_characters(optional): A boolean which, when set totrue, will prevent warnings about characters being skipped being logged during masking. Defaults tofalse.
The
include_leading_zerosoption has no effect for integer data unlessmin_digitsis also specified. By default, integer values are always masked to values with the same number of digits as the input.Because it can create duplicate values, masking negative integers with
min_digitsis not supported. If this option is set and a negative integer value is encountered, DataMasque raises an error and stops masking.
Details and examples about the
include_leading_zeros, min_digits, and on_too_long options
Without min_digits, the include_leading_zeros option only has an effect when the input is a string.
The following table details the behaviour for string values.
include_leading_zeros = false or warn (default) |
include_leading_zeros = true |
|
|---|---|---|
| Leading zeros in input | All zeros appearing before any other digit in the value are preserved. If the value only contains zero digits, no digits are masked. |
All digits, including leading zeros, are masked. |
| Leading zeros in output | The masked value cannot have a zero as the first digit character in the string. The first non-zero digit cannot be masked to a zero digit. |
Any masked value of the same format is possible, including those with leading zero digits. The first non-zero digit can be masked to a zero digit. |
The min_digits and on_too_long options are only applicable when include_leading_zeros is true,
and allow include_leading_zeros to apply to integer values as well.
The idea behind min_digits is to pad integer values to a certain length,
so all values with at most that many digits are treated the same.
- Without
min_digits, values 0-9 mask to 0-9, 10-99 to 10-99, and so on. - With
min_digits: 6, as an example, all 1-6 digit numbers (i.e. 0-999,999) mask to any other value 0-999,999.
String inputs
The following table shows how the various options work on string inputs. All values are of string datatype.
| Input | include_leading_zeros |
min_digits |
on_too_long |
Example output | Explanation |
|---|---|---|---|---|---|
"ABC-00123-0044" |
false |
- | - | "ZQD-00257-3498" |
The leading zero characters in the value are left unchanged when include_leading_zeros is false (or warn). |
"ABC-00123-0044" |
true |
- | - | "ZQD-13256-0349" |
By setting include_leading_zeros to true, the first two digits in the value can now be masked. |
"DFR-08112-1123" |
false |
- | - | "PLX-08432-0091" |
The rule about digits not being masked to zeros when include_leading_zeros is false only applies to the digits before the first non-zero digit (here the 8). Hence, the subsequent 1123 can be masked to a value starting with one or more zero digits. |
"DFR-08112-1123" |
true |
- | - | "PLX-00315-5440" |
With include_leading_zeros set to true, any digit can be masked to any digit. Here the 0 happened to be masked to 0 and the 8 was masked to another 0. |
"AA000000" |
warn |
- | - | "NF000000",warning issued |
All the digits in the value are zeros, so DataMasque issues a warning as none of the digits will be masked. You can suppress the warning by setting include_leading_zeros to false. |
"AA000000" |
true |
- | - | "NF493281" |
By setting include_leading_zeros to true, the value can now be masked. |
"1234" |
true |
5 |
Any | "0781" |
The input is a string, but the options min_digits and on_too_long only affect integer values. |
Integer inputs
The following table shows how the various options work on integer inputs. All values are of integer datatype.
| Input | include_leading_zeros |
min_digits |
on_too_long |
Example output | Explanation |
|---|---|---|---|---|---|
1234 |
true |
- | - | 4873 |
Integer inputs cannot have leading zeros by definition, so include_leading_zeros has no effect for integer values unless min_digits is also specified. |
1234 |
true |
6 |
- | 192766 |
The value is padded to 6 digits and then masked. |
4321 |
true |
6 |
- | 5( 000005) |
Like the previous example, the value is padded to 6 digits and then masked. The output might have leading zeros, which are stripped off when the output is converted to integer datatype. As a result, the final masked value can have between 1 and 6 digits inclusive. |
1234567 |
true |
4 |
Not specified or error |
Error | The input value has 7 digits, but min_digits is set to 4. The default behaviour of on_too_long is to raise an error. |
1234567 |
true |
4 |
mask_suffix |
1230991 |
The input has more digits than min_digits. With the mask_suffix option, DataMasque only masks the last min_digits (here 4) digits. The other digits 123 are left as-is. |
1234567 |
true |
4 |
mask_all |
5098437 |
The input has more digits than min_digits. When on_too_long is set to mask_all, all digits are masked. The output always has the same length as the input when using mask_all, otherwise duplicate values can occur. |
0 |
false |
- | - | 0 |
If include_leading_zeros is not true, the integer value zero is always masked to zero. |
0 |
true |
- | - | 8 |
With include_leading_zeros set to true, the zero value is treated the same as any other one-digit number. |
0 |
true |
2 |
Any | 19 |
Same as the previous example - zero is treated like any other value, so padded to a two-digit number (since min_digits is 2) and then masked. |
-1234567 |
false or warn |
- | - | -4132891 |
When include_leading_zeros is not true, integer inputs always mask to numbers with the same number of digits. The sign of the value is preserved. |
-1234 |
true |
6 |
- | Error | Masking negative numbers with min_digits is not supported. |
Obtaining consistent masking between string and integer inputs
One of the main uses of min_digits is to enable numeric input values to mask to the same output,
regardless of whether the input is of string or integer datatype.
To do this, set include_leading_zeros to true, and min_digits to the number of digits in the largest input value.
The string values must be stored as zero-padded values with this many digits, for example "000123" when using min_digits: 6.
Note: Integer inputs must all be non-negative (0 or higher). Masking negative integers with
min_digitsis not supported.Note: For this masking case, omit
on_too_long, or set it to the default value oferror. That way, the masking run will fail if it encounters a value with more digits than expected, avoiding any cases wherefrom_unique_imitatewould produce inconsistent masking between string and integer values. Use ofmask_suffixormask_allwill not produce consistent results.
| Input | include_leading_zeros |
min_digits |
Example output | Explanation |
|---|---|---|---|---|
"000123" |
true |
- | "481657" |
Since include_leading_zeros is true, all digits of the input are masked. |
123 |
true |
6 |
481657 |
The integer equivalent of the above input string is masked to the integer equivalent of the above output string. |
"004511" |
true |
- | "000008" |
Since include_leading_zeros is true, it is possible for digits to be masked to zeros. |
4511 |
true |
6 |
8( 000008) |
Again, the integer equivalent of the input is masked to the integer equivalent of the output. Any output value between 0 and 999,999 is possible. |
123 |
true |
Not specified | 892 |
Without min_digits, a three-digit integer is always masked to another three-digit integer. |
You can also leave include_leading_zeros as the default value of warn,
which still produces consistent results between zero-padded strings and integers,
but the output is less secure as the output value always has the same number of significant digits as the input value.
You might use this option if you do not know the number of digits in the largest (in magnitude) input value,
or if the input data includes negative numbers.
Note: For a given input, the output when using
include_leading_zeros: false(orwarn) may be markedly different from the output when usinginclude_leading_zeros: true, even when there are no leading zeros.
| Input | include_leading_zeros |
Example output | Explanation |
|---|---|---|---|
"000123" |
false or warn |
"000892" |
The input value has three leading zeros. Without include_leading_zeros, the output value retains exactly three leading zeros. |
123 |
false or warn |
892 |
The integer equivalent of the above input string is masked to the integer equivalent of the above output string. |
"-004511" |
false or warn |
"-003669" |
There are two leading zeros. The minus sign is not masked. |
-4511 |
false or warn |
-3669 |
Again, the integer equivalent of the input is masked to the integer equivalent of the output. |
Example ruleset
Consider a table like the following:
idPrimary key |
value_as_stringCHAR(6) |
value_as_integerINTEGER |
|---|---|---|
1 |
"000123" |
123 |
2 |
"381731" |
381731 |
| etc. |
To mask this such that the consistency between the two value columns is preserved, you can use a ruleset like the following.
version: "1.0"
tasks:
- type: mask_table
table: '"my_table"'
key: '"id"'
rules:
- column: '"value_as_string"'
masks:
- type: from_unique_imitate
include_leading_zeros: true
- column: '"value_as_integer"'
masks:
- type: from_unique_imitate
include_leading_zeros: true
min_digits: 6
Invalid Parameter Combinations
- Setting both
skip_digitsandskip_letterstotrueis prohibited as no masking would take place.- Using the
checksumoption requires masking digits, so if any checksum is specified then you cannot setskip_digitstotrue. Further, theicpchecksum is an alphanumeric checksum, so you cannot setskip_digitsnorskip_letterstotruewhen using this checksum.- Because it would have no effect,
include_leading_zeroscannot be specified ifskip_digitsis set totrue.- The
min_digitsoption is not compatible withchecksum, nor withretain_prefix_length.- Prefix based checksums cannot be used with
retain_prefix_length. This applies toaustralian_business_number.
Values Requiring Checksums
When the output values must satisfy a checksum, specify the name of the checksum as the checksum parameter.
Unique values will be generated that satisfy that checksum algorithm.
The available options for checksum are:
brazilian_cpfcredit_cardluhnicpaustralian_business_numberaustralian_company_numberweighted_dual_checksum
For each checksum, the input value must contain a certain number of digits 0-9 and no letters A-Z or a-z.
Other non-letter characters that are used for formatting are retained in the output.
The replacement value will conform to the checksum algorithm,
even if the input did not, provided it is of the correct length.
To handle masking of values that may not match the checksum, or that may contain letters,
specify the on_invalid parameter.
Please refer to Using on_invalid for a detailed explanation of the behavior of each on_invalid
option.
brazilian_cpf
Use this checksum type to generate values that satisfy the Brazilian CPF (Cadastro de Pessoas Físicas) number
checksum.
For valid CPFs to be generated, the input value must contain 11 digits (and may contain spaces or punctuation).
For handling of invalid input values (for example, incorrect length, bad checksum or the presence of letters),
please refer to Using on_invalid.
The table below shows example input and output data, based on the default parameters.
| Input Example | Description | Output Example | Output Description |
|---|---|---|---|
| 298.056.372-20 | Valid, formatted CPF | 886.972.870-65 | Valid CPF with formatting retained |
| 2980,5637,220 | Valid CPF, with other formatting | 8869,7287,065 | Valid CPF with formatting retained |
| 29805637220 | Valid CPF, digits only | 88697287065 | Valid CPF, digits only |
| 298.056.372-29 | 11-digit, formatted number, that is not a CPF | 886.972.870-65 | Valid CPF with formatting retained |
| 29805637229 | 11-digit number that is not a CPF | 88697287065 | Valid CPF, digits only |
| 298056372 | 9-digit number | – | No output, error is raised and masking stops due to invalid length |
| 298A056B372C20 | String with letters | – | No output, error is raised and masking stops due to invalid characters |
credit_card and luhn
The credit_card and luhn checksums both generate values that satisfy the Luhn checksum algorithm.
The difference is how they each validate the length of the number:
luhnmay be applied to any number containing two or more digits.credit_cardis only valid for numbers of length 12 to 19, inclusive.
For values of length 12-19 characters, the behaviour of both checksums is identical.
If you are masking only credit cards, then credit_card should be preferred,
as it will also validate the length of existing values.
luhn should be used when generating values of other lengths,
for example, mobile phone IMEI numbers.
For handling of invalid input values (for example, incorrect length, bad checksum or the presence or letters),
please refer to Using on_invalid.
| Input Example | Description | Output Example | Output Description |
|---|---|---|---|
| 4111 1111 1111 1111 | Valid, formatted card number | 2260 5651 2623 0906 | Number that satisfies the Luhn checksum, with formatting retained |
| 2980,5637/2204 | Number with other formatting | 8869,7287/0655 | Number that satisfies the Luhn checksum, with formatting retained |
| 4111111111111111 | Valid card number, digits only | 2260565126230906 | Number that satisfies the Luhn checksum, digits only |
| 1234 1234 5678 5678 | Formatted card number that does not satisfy the Luhn algorithm | 2260 5651 2623 0906 | Number that satisfies the Luhn checksum, with formatting retained |
| 298A056B372C20 | String with letters | – | No output, error is raised and masking stops due to invalid characters |
icp
Use this checksum type to generate values that satisfy the New Zealand Installation Control Point (ICP) checksum.
For valid ICPs to be generated, the input value must contain 15 digits and letters (and may contain spaces or
punctuation). For handling of invalid input values (for example, incorrect length, or bad checksum),
please refer to Using on_invalid.
The table below shows example input and output data, based on the default parameters.
| Input Example | Description | Output Example | Output Description |
|---|---|---|---|
| 1234567890XYD51 | Valid ICP | 7972434682KR014 | Valid ICP |
| 12345-67890-XY-D51 | Valid ICP, with other formatting | 79724-34682-KR-014 | Valid ICP with formatting retained |
| 0123456789XYD51 | 15-alpha numeric, that is not an ICP | - | No output, error is raised and masking stops due to invalid checksum |
| 29805637229 | 11-digit number that is not an ICP | - | No output, error is raised and masking stops due to invalid checksum |
australian_business_number
Use this checksum type to generate values that satisfy the Australian Business Number (ABN) checksum.
For valid ABNs to be generated, the input value must contain 11 digits (and may contain spaces, hyphens,
or punctuation), and retain_prefix_length may not be specified. For handling of invalid input values (for example,
incorrect length, bad checksum, or when containing letters) please refer to Using on_invalid.
The table below shows example input and output data, based on the default parameters.
| Input Example | Description | Output Example | Output Description |
|---|---|---|---|
| 17009623009 | Valid ABN. | 53576799896 | Valid ABN. |
| 73 649 387 613 | Valid ABN, with other formatting. | 17 652 954 083 | Valid ABN with formatting retained. |
| 41262121252 | 11-digit number, with invalid checksum digits. | - | No output, error is raised and masking stops due to invalid checksum. |
| 553111431AE | 11 alphanumeric characters - ABN can't have letters. | - | No output, error is raised and masking stops due to invalid characters. |
| 551241 | 6-digit number - ABN requires exactly 11 digits. | - | No output, error is raised and masking stops due to invalid length. |
australian_company_number
Use this checksum type to generate values that satisfy the Australian Company Number (ACN) checksum.
For valid ACNs to be generated, the input value must contain 9 digits (and may contain spaces or
punctuation). For handling of invalid input values (for example incorrect length, bad checksum or letters),
please refer to Using on_invalid.
The table below shows example input and output data, based on the default parameters.
| Input Example | Description | Output Example | Output Description |
|---|---|---|---|
| 004085616 | Valid ACN. | 009830068 | Valid ACN |
| 004 085 616 | Valid ACN with formatting. | 009 830 068 | Valid ACN with formatting retained |
| 60522999C | 9 alphanumeric characters, ACN can't contain letters. | - | No output, error is raised and masking stops due to invalid checksum |
| 605229993 | 9 digit number with invalid checksum. | - | No output, error is raised and masking stops due to invalid checksum |
| 298056 | 6 digit number - ACN requires exactly 9 digits. | - | No output, error is raised and masking stops due to invalid checksum |
| 1234567890 | 10 digit number - ACN requires exactly 9 digits. | - | No output, error is raised and masking stops due to invalid checksum |
Example
This example will apply from_unique_imitate masks to the vehicle_id, license_plate and validation_code columns.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- column: vehicle_id
masks:
- type: from_unique_imitate
- column: license_plate
masks:
- type: from_unique_imitate
- column: validation_code
masks:
- type: from_unique_imitate
Show result
| Before | After |
|
|
|---|
weighted_dual_checksum
Use this checksum type to generate values that satisfy the dual weighted checksum,
where each digit in the value is given its own weight,
but may generate a high or low checksum digit,
to match the high/low value on the original checksum.
Using on_invalid
The on_invalid parameter can be used to control how invalid values are handled by from_unique_imitate.
It can only be used in combination with the checksum parameter,
since without specifying a checksum there is no way of considering a value "invalid".
A value is considered invalid if:
- It is too short for the checksum, or,
- The checkdigit(s) are not valid for the checksum, or,
- It contains letters – this can be controlled with the
skip_lettersparameter. See Determining and Handling Invalid Values for more detail.
Note that null is a special case and is not considered invalid.
null input is masked to null output, regardless of settings.
The following table illustrates the behaviour of from_unique_imitate based on different on_invalid parameters.
on_invalid |
Behaviour when encountering invalid value |
|---|---|
error (default if not specified) |
The masking task stops with an error. |
skip |
The invalid value is retained. |
mask |
The value is masked, and warnings are logged to the run log. See Determining and Handling Invalid Values for more detail. |
Determining and Handling Invalid Values
There two are main ways that a value can be invalid:
- It contains letters.
- It has the wrong length or checksum.
Note that special characters, punctuation, and spaces, do not affect the validity of values. For example,
from_unique_imitateconsiders the values123 456,123-456and123456the same.
Determining the validity based on the presence of letters can be controlled with the skip_letters parameter.
When set to true, a value will not be invalid if it contains letters.
However, if the digits themselves do not satisfy the given checksum,
then the value would be considered invalid.
The following table shows the validity of some example values for the brazilian_cpf algorithm,
with and without the use of skip_letters: true.
The rules are applicable to any checksum.
| Input Value | skip_letters |
Valid | Reason |
|---|---|---|---|
12175488403 Valid CPF number |
true or false |
Yes | |
121.754.884-03 Valid CPF number with formatting |
true or false |
Yes | Punctuation is ignored. |
AB121.754.884-03 Valid CPF number containing letters |
false |
No | The value contains letters. |
AB121.754.884-03 Valid CPF number containing letters |
true |
Yes | The letters are ignored due to the use of skip_letters: true. |
121.754.884-00 Invalid CPF, bad check digits. |
true or false |
No | Invalid due to bad check digits. |
AB121.754.884-00 Invalid CPF, contains letters and bad check digits. |
true or false |
No | Invalid due to bad check digits, regardless of skipping letters or not. |
121.754.884 Invalid CPF, bad length. |
true or false |
No | Invalid due to length. |
AB121.754.884 Invalid CPF, bad length. |
true or false |
No | Invalid due to bad length, regardless of skipping letters or not. |
Once a value is determined to be invalid:
- If using
on_invalid: errorthen the masking task will stop with an error. - If using
on_invalid: skipthen the value will be returned unmasked. - If using
on_invalid: maskthen masking will continue, and is described in more detail below.
For invalid values, the behaviour of the mask will change based on the reason for it being invalid.
If masking a value that is invalid only because of a bad check checksum, the output will mask uniquely to a new value with a uniquely masked, invalid checksum. If the value is invalid for another other reason, the string will be masked uniquely.
For example, the brazilian_cpf algorithm requires 11-digit values.
An 11-digit value that is not a valid CPF number will be masked to an invalid CPF number
(an 11-digit number that does not have a valid CPF number checksum).
However, a number that is not 11-digits will not be masked to a valid CPF number, as the number of digits is not correct for that algorithm, but the numbers will be masked.
The following table shows the minimum and maximum value length to which the checksum applies.
checksum |
Minimum length (inclusive) | Maximum length (inclusive) |
|---|---|---|
brazilian_cpf |
11 | 11 |
credit_card |
12 | 19 |
luhn |
2 | 10,000 |
icp |
15 | 15 |
australian_business_number |
11 | 11 |
australian_company_number |
9 | 9 |
If the value is invalid because it contains letters:
- The letters will be masked if
skip_digitsisfalse. - The letters remain unchanged if
skip_digitsistrue.
The following table gives examples of masked outputs for invalid inputs,
for different skip_letters options.
on_invalid is set to mask, otherwise no masking would occur.
The example uses the brazilian_cpf algorithm but the rules are applicable to any checksum.
| Input Value | skip_letters |
Output Value | Explanation |
|---|---|---|---|
AB121.754.884-03 Valid CPF with letters |
false |
DF149.758.055-29 |
Valid CPF in output, with letters masked. |
AB121.754.884-03 Valid CPF with letters |
true |
AB149.758.055-29 |
Valid CPF in output, letters not masked. |
121.754.884-99 Invalid CPF |
true or false |
149.758.055-30 |
Invalid CPF in output. |
121.754.884-98 Invalid CPF |
true or false |
149.758.055-61 |
Invalid CPF in output. |
121.754.884-99AB Invalid CPF with letters |
false |
149.758.055-30DF |
Invalid CPF in output, with letters masked. |
121.754.884-99AB Invalid CPF with letters |
true |
149.758.055-30AB |
Invalid CPF in output, letters not masked. |
121.754.884 Too short for CPF |
true or false |
246.016.536 |
Not a valid CPF, as input too short. |
121.754.884.692 Too long for CPF |
true or false |
246.016.536.420 |
Not a valid CPF, as input too long. |
AB121.754.884 Too short for CPF, with letters |
false |
DF246.016.536 |
Not a valid CPF, as input too short, with letters masked. |
AB121.754.884.692 Too long for CPF, with letters |
false |
DF246.016.536.420 |
Not a valid CPF, as input too long, with letters masked. |
AB121.754.884 Too short for CPF, with letters |
true |
AB246.016.536 |
Not a valid CPF, as input too short, letters not masked. |
AB121.754.884.692 Too long for CPF, with letters |
true |
AB246.016.536.420 |
Not a valid CPF, as input too long, letters not masked. |
Warning: When using
on_invalid: maskwith thebrazilian_cpfchecksum, 10 of the possible 1,000,000,000 digit combinations are not maskable with this algorithm.Where the data to mask has 11 digits, and the first nine digits are all the same digit (eg,
111.111.111-xxregardless of the checksum), the number itself is considered to be invalid.Where this data is required to be masked, it is recommended that the
from_unique_imitaterule is wrapped by anifrule that detects this data, and directs the engine to apply an alternative algorithm.When the masking process is configured to uniquely mask a CPF number that is invalid due to repeated digits, the run log will note a warning, and these entries will not be masked. This does not affect other checksum masking as this situation is specific to Brazilian CPF unique masking.
Example
This example will apply from_unique_imitate masks to the cpf_number and apply a redaction to known invalid numbers
by leveraging an if conditional masking rule.
version: '1.0'
tasks:
- type: mask_table
table: employees
key: id
rules:
- if:
- column: cpfnumber
matches: '(?:.*?(\d)\1\1){3}.*'
rules:
- column: cpfnumber
masks:
- type: from_fixed
value: 'redacted'
else_rules:
- column: cpfnumber
masks:
- type: from_unique_imitate
checksum: brazilian_cpf
on_invalid: mask
Retaining Prefixes
When masking values with from_unique_imitate, the retain_prefix_length option be used to specify the number of
prefix characters of the input to retain in the output.
The length takes into account only characters that would be masked based on the parameters of the mask.
Since from_unique_imitate doesn't mask punctuation or spaces, then these are not counted towards the prefix length.
Similarly, if using skip_letters/skip_digits, then letters or digits (respectively) won't be counted in the prefix.
If the retain_prefix_length is equal to or longer than the values to be masked,
then an error will be raised during masking.
The following table shows the retained prefix for example values, based on different parameters.
| Input Value | retain_prefix_length |
skip_letters |
skip_digits |
Retained Prefix | Example Output |
|---|---|---|---|---|---|
A1B2C3D4E5 |
4 | false |
false |
A1B2 |
A1B2F6G7H8 |
A1 B2 C3 D4 E5 |
4 | false |
false |
A1 B2 |
A1 B2 F6 G7 H8 |
A1 B2 C3 D4 E5 |
4 | true |
false |
A1 B2 C3 D4 |
A1 B2 C3 D4 E8 |
A1 B2 C3 D4 E5 |
4 | false |
true |
A1 B2 C3 D |
A1 B2 C3 D4 J5 |
A1B |
4 | true/false |
true/false |
- | No output, error is raised as the prefix is >= the length of the value |
AAA111 |
3 | false |
false |
AAA |
AAA456 |
AAA111 |
3 | true |
false |
- | No output, error is raised as the prefix is >= than number of digits |
AAA111 |
3 | false |
true |
- | No output, error is raised as the prefix is >= than number of letters |
The retain_prefix_length parameter can be combined with the checksum parameter with a
suffix-based checksum (may not be specified when selecting australian_business_number).
The checksum will be generated after combining the prefix with the masked values.
For example, using the luhn checksum, with retain_prefix_length of 5.
The value to be masked is 211287932175 (which is valid for the Luhn checksum).
The order of masking is:
- Extract 5 characters as a prefix:
21128 - Apply unique masking to all digits but the checksum digit
793217, giving (for example)123456 - This value is combined with the prefix, giving
21128123456 - The checksum digit is calculated from this prefix and masked value, which is
0 - The checksum digit is appended to the masked value, giving final output value of
211281234560, which is also valid for the Luhn algorithm.
When retaining the prefix and using a checksum, the number of digits must be less than the number of digits in the original value, minus the length of the check digits. For example, to mask a Brazilian CPF, a maximum of 8 digits can be retained, since the value is made up of 11 digits with the final 2 being checkdigits.
Checksum validity is checked when retaining the prefix, and validation is on the original value (i.e. including the prefix). To control how to handle invalid values, refer to Determining and Handling Invalid Values.
The following example ruleset shows masking a column using the credit_card checksum,
retaining the first 4 digits.
version: "1.0"
tasks:
- type: mask_table
table: customers
key: customer_id
rules:
- column: credit_card_number
masks:
- type: from_unique_imitate
retain_prefix_length: 4
skip_letters: true
checksum: credit_card
Show result
| Before | After |
|
|
|---|
RedShift Limitations
Due to the current method in which we mask with RedShift, please note the current limitations:
PRIMARY KEYconstrained columns will be transformed into columns which areUNIQUEconstrained instead ofPRIMARY KEYconstrained.NOT NULLconstrained columns will lose theirNOT NULLconstraint. However, this is not just limited tofrom_unique_imitatemasking.