Data Pattern Masks
Data pattern masks are used to mask very specific patterns of data.
- Credit Card (credit_card): Replaces credit card values with random ones
- Brazilian CPF (brazilian_cpf): Replaces Brazilian CPF numbers with random ones
- Social security number (social_security_number): Replaces social security numbers with random ones
- Set checksum (set_checksum): Calculates and sets valid checksum values
Credit card (credit_card)
This mask provides a number of methods for masking and generating credit card numbers. Parameters can be set to control the prefix, PAN formatting and Luhn checksum validity of the generated numbers.
There are three modes of operation of this mask.
- Card numbers can be replaced with generated numbers (generate_card_number set to true).
- Card numbers can have the middle digits obscured (using a # character by default), leaving just the first 6 and last 4 digits readable (pan_format set to true).
- Both the above modes can be combined (by setting both parameters to true), which will generate a card number and obscure the middle digits.
Please note that at least one of generate_card_number or pan_format must be true.
If they are both false then the masking run will fail as no masking would occur.
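As a rough illustration of the PAN formatting mode described above, the following Python sketch keeps the first 6 and last 4 digits and conceals the rest. The function name pan_format and the choice to leave separators in place are assumptions made for this example, not DataMasque's implementation.

```python
def pan_format(card_number: str, pan_character: str = "#") -> str:
    """Keep the first 6 and last 4 digits; conceal the digits in between.

    Assumption: non-digit characters (such as separators) stay where they are.
    """
    if len(pan_character) != 1:
        raise ValueError("pan_character must be a single character")
    total_digits = sum(ch.isdigit() for ch in card_number)
    out = []
    seen = 0  # number of digits consumed so far
    for ch in card_number:
        if ch.isdigit():
            seen += 1
            keep = seen <= 6 or seen > total_digits - 4
            out.append(ch if keep else pan_character)
        else:
            out.append(ch)
    return "".join(out)


print(pan_format("4111111111111111"))     # 411111######1111
print(pan_format("4111-1111-1111-1111"))  # 4111-11##-####-1111
```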
Parameters
- generate_card_number (optional): If true, new credit card numbers will be generated. Set to false to not generate card numbers (which improves performance), if the pan_format argument is to be used. generate_card_number defaults to true.
- pan_format (optional): If true, mask the card number by replacing the digits between the first six and last four with pan_character. pan_format defaults to false.
- pan_character (optional): The character to use to conceal credit card digits, if pan_format is true. Must be a single character string. Defaults to #.
- generate_luhn (optional): If true, the generated card number will pass the Luhn checksum. Set to false to generate random credit cards instead, which slightly improves performance by skipping Luhn digit generation. generate_luhn defaults to the opposite of pan_format, or true if pan_format is not set.
- retain_prefix_length (optional): The number of digits of the input card's prefix to retain, or automatic to automatically determine the length of the prefix from the issuer. See Retaining Prefixes below. By default, no prefix is retained (i.e. the entire credit card number is generated randomly).
- issuer_names (optional): The generated card will have the specified issuer's prefix(es) and card lengths. If left empty all card issuers can be used to generate the card number. Not valid to use if retain_prefix_length is specified. Please refer to Card issuer names below.
- apply_weighting (optional): If true, randomly select prefixes based on the actual popularity of prefixes. This increases the accuracy of generated data but slightly decreases performance. See Random Selection and Weightings below. apply_weighting defaults to false.
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated credit card number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid credit card number. One of:
  - mask (default): Always overwrite without validating the credit card number. If an input value is not a valid credit card number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
- segment_separators (optional): An array of characters to allow as separators when validating credit card numbers. See Validating Card Numbers below. segment_separators defaults to [" ", "/", "-"].
Invalid Parameter Combinations
Some combinations of parameters are invalid as they would be redundant or cause no masking to occur. These combinations will cause an error and the masking run will fail.
- generate_card_number and pan_format cannot both be false, since no masking would occur. Both may be true, however, which will mean card numbers will be generated and then have PAN formatting applied.
- Using retain_prefix_length with pan_format only (i.e. generate_card_number is false) is invalid, as there is no reason to try to retain a prefix when not generating the card number.
- generate_luhn and pan_format cannot both be true. It is redundant to try to generate the Luhn digit when the middle characters will be unknown in the output.
- A list of issuer_names cannot be provided when retain_prefix_length is specified, as this may create an unresolvable scenario if trying to retain the prefix of a credit card number that is not in the list of specified issuers.
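The four rules above can be summarised as a small pre-flight check. This is a minimal Python sketch, assuming the parameters arrive as a plain dictionary; the function find_invalid_combinations and its messages are hypothetical and only mirror the documented defaults.

```python
def find_invalid_combinations(params: dict) -> list[str]:
    """Return one message per invalid combination found in `params`."""
    generate_card_number = params.get("generate_card_number", True)
    pan_format = params.get("pan_format", False)
    # Documented default: the opposite of pan_format.
    generate_luhn = params.get("generate_luhn", not pan_format)
    retain_prefix_length = params.get("retain_prefix_length")
    issuer_names = params.get("issuer_names") or []

    problems = []
    if not generate_card_number and not pan_format:
        problems.append("generate_card_number and pan_format are both false")
    if retain_prefix_length is not None and not generate_card_number:
        problems.append("retain_prefix_length needs generate_card_number")
    if generate_luhn and pan_format:
        problems.append("generate_luhn and pan_format are both true")
    if issuer_names and retain_prefix_length is not None:
        problems.append("issuer_names used with retain_prefix_length")
    return problems


print(find_invalid_combinations({"pan_format": True, "generate_luhn": True}))
```

An empty list means none of the documented conflicts were found; it does not guarantee that the ruleset is otherwise valid.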
Retaining Prefixes
When generating card numbers there are three options for retaining the prefix of the input credit card number. The
first is to not retain the prefix at all, which means the entire credit card number will be randomly generated. This
is the default behaviour, if retain_prefix_length is omitted from the ruleset.
The second option is to specify a number of digits to retain. For example, to retain the first 4 digits of each input credit card, use the following ruleset.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: 4
            generate_card_number: true
If retain_prefix_length is more than half of the length of a credit card that is encountered when masking, an error
will be raised. For example, if retain_prefix_length is 7 and a credit card number of 14 or fewer digits is found,
this will cause an error and the masking run will stop.
Finally, the credit_card mask can be configured to automatically retain the prefix of the issuer, by specifying
automatic for retain_prefix_length. The length of the prefix will depend on the issuer and card length. The
longest matching prefix will be retained; for example, the prefixes 62 and 622126 both exist. The card
number 623… would retain just the 62 prefix, whereas a card number 6221264… would retain the 622126
prefix: although it matches both prefixes, the longest one is selected.
If no prefixes match a card number, then the mask will fall back to just retaining the first digit.
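The longest-match rule and the first-digit fallback can be sketched in Python as follows. KNOWN_PREFIXES holds only the two prefixes mentioned above and prefix_to_retain is a hypothetical helper; the sketch also ignores card length, which the real lookup takes into account.

```python
# Tiny illustrative subset; the real prefix list is far larger.
KNOWN_PREFIXES = {"62", "622126"}


def prefix_to_retain(card_digits: str, prefixes=KNOWN_PREFIXES) -> str:
    """Return the longest known prefix of the card, or its first digit."""
    matches = [p for p in prefixes if card_digits.startswith(p)]
    if matches:
        return max(matches, key=len)
    return card_digits[:1]  # fallback: retain just the first digit


print(prefix_to_retain("6231234567890123"))  # 62
print(prefix_to_retain("6221264000000000"))  # 622126
print(prefix_to_retain("4111111111111111"))  # 4
```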
This next ruleset shows how to use automatic prefix retaining.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true
The retain_prefix_length parameter is not valid if a list of issuer_names is provided.
DataMasque contains a list of over 105,000 prefixes which are used when the retain_prefix_length: automatic parameter
is set. If a prefix is not found, then DataMasque falls back to preserving just the first digit.
A full list of prefixes can be found in the credit_card_prefixes.csv file (1.9MB CSV).
Card issuer names
These card issuer names can be used as arguments to the issuer_names parameter. They are not case-sensitive.
| Visa | Mastercard | American Express |
| China T-Union | China Unionpay | Dankort |
| Diners Club International | Diners Club United States & Canada | Discover Card |
| Instapayment | Interpayment | JCB |
| Lankapay | Maestro | Maestro UK |
| MIR | NPS Pridnestrovie | Rupay |
| Troy | Ukrcard | Verce |
Random Selection and Weightings
When not choosing to retain the prefix, one will be randomly selected so that generated card numbers have a valid issuer prefix.
Note that the following rules apply after issuers have been filtered down to those specified in the
issuer_names parameter (if provided). If this parameter is not provided then the selection is made from all issuers.
If apply_weighting is enabled, the credit_card mask uses weighting to select a prefix to use: first based on the
popularity of the issuer and then based on the count of credit card numbers inside that prefix.
The approximate weighting of issuers is as shown (note that weightings are approximate and apply relative to each other, so they do not add to 100% exactly).
| Issuer | Approximate Weighting |
|---|---|
| Visa | 53% |
| Mastercard | 33% |
| Discover Card | 8% |
| American Express | 8% |
| Other Issuers | 0.1% each |
Once an issuer has been chosen, a prefix is selected weighted on the length of cards. For example, a prefix of length one for a 16 digit card has 10^15 combinations, whereas a prefix of length one for a 14 digit card has 10^13 combinations. Therefore, the 16 digit card number is one hundred times more likely to be chosen than the 14 digit card.
If apply_weighting is false then each issuer, and all prefixes for the chosen issuer, are equally likely to be
chosen.
Once a prefix and length have been chosen, a random card number is generated with this prefix and length. If
generate_luhn is true, the generated card number will pass the Luhn checksum.
Applying weightings will produce a more realistic distribution of generated card numbers, at the cost of a slight performance penalty.
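The difference between weighted and unweighted selection can be sketched in Python with random.choices. The weights are the approximate figures from the table above, JCB stands in for the "other issuers" row, and choose_issuer and remaining_combinations are hypothetical helpers rather than part of DataMasque.

```python
import random

# Approximate weightings from the table above; JCB represents "other issuers".
ISSUER_WEIGHTS = {
    "Visa": 53.0,
    "Mastercard": 33.0,
    "Discover Card": 8.0,
    "American Express": 8.0,
    "JCB": 0.1,
}


def choose_issuer(rng, apply_weighting, issuer_names=None):
    """Pick an issuer, optionally filtered by (case-insensitive) issuer_names."""
    wanted = None if issuer_names is None else {n.lower() for n in issuer_names}
    candidates = [n for n in ISSUER_WEIGHTS if wanted is None or n.lower() in wanted]
    if not apply_weighting:
        return rng.choice(candidates)  # every issuer equally likely
    weights = [ISSUER_WEIGHTS[n] for n in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]


def remaining_combinations(card_length: int, prefix: str) -> int:
    """Count the card numbers that can follow a prefix (Luhn digit not considered)."""
    return 10 ** (card_length - len(prefix))


rng = random.Random(0)
print(choose_issuer(rng, apply_weighting=True))
print(remaining_combinations(16, "4"), remaining_combinations(14, "4"))
```

Relative weights do not need to sum to 100, which matches the note that the published figures are approximate.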
Validating Card Numbers
The credit_card mask can be configured to perform different actions when a value that is not a credit card number is
encountered. A value is considered a valid credit card number if:
- It is of integer or string type.
- It is between 12 and 19 characters long (inclusive).
- The digits satisfy the Luhn algorithm.
Note that a null value is not covered by these rules, as it is evaluated based on the on_null parameter (see
Handling Nulls below).
For string types, validation takes into account the segment_separators argument. Take, for example, the
card number 4111111111111111, which is valid (correct length and passes the Luhn checksum). Since
segment_separators includes - (dash) by default, the value 4111-1111-1111-1111 would also be
considered valid. However, the value 4111_1111_1111_1111 would not be considered a valid credit card number,
since _ (underscore) is not in the list of segment_separators.
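The validation rules above can be approximated in Python as shown here. This is a sketch under stated assumptions: separators are stripped from strings before the length is counted, and looks_like_card_number and luhn_valid are hypothetical names, not DataMasque functions.

```python
DEFAULT_SEPARATORS = (" ", "/", "-")


def luhn_valid(digits: str) -> bool:
    """Luhn check: double every second digit from the right, sum, test mod 10."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def looks_like_card_number(value, separators=DEFAULT_SEPARATORS) -> bool:
    """Apply the type, length and Luhn rules to an integer or string value."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    text = str(value)
    if isinstance(value, str):
        for sep in separators:  # only strings carry separators
            text = text.replace(sep, "")
    if not (text.isascii() and text.isdigit()):
        return False
    return 12 <= len(text) <= 19 and luhn_valid(text)


print(looks_like_card_number("4111-1111-1111-1111"))  # True
print(looks_like_card_number("4111_1111_1111_1111"))  # False
```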
Specifying the segment_separators parameter replaces the existing list of segment separators. This means that _
can't be added by itself as a separator; the existing separators must also be specified. For example:
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            segment_separators:
              - "_"
              - " "
              - "/"
              - "-"
When on_invalid is skip, invalid values will be returned as is, i.e. no masking will occur on these values.
When on_invalid is error, the masking run will stop and an error will be displayed in the run log.
When on_invalid is mask, the value will be masked with the imitate mask, which is configured
to replace digits in the value. Note that the imitate mask can only be applied to strings. A value that is not
a valid credit card number falls back to the imitate mask, but if that value is not a string then imitate cannot
be applied, so an error is raised and the masking run halts. Therefore, setting on_invalid to
mask is not a foolproof way of masking any type of data.
Handling Nulls
When on_null is skip, null values will be returned as is, i.e. the null value will be retained.
When on_null is error, the masking run will stop and an error will be displayed in the run log.
When on_null is mask, a credit card number will be generated based on the behaviour described in Random Selection
and Weightings above. However, if retain_prefix_length is set then an error will be raised and the masking run will
fail, as a null value has no prefix.
Example
This example generates credit card numbers that pass the Luhn checksum, with card issuer set to either MasterCard, Visa, or American Express.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            issuer_names:
              - VISA
              - MASTERCARD
              - AMERICAN EXPRESS
            generate_luhn: true
            pan_format: false
This example generates credit card numbers that retain the original card prefix.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            retain_prefix_length: automatic
            generate_card_number: true
This example does not generate card numbers; it just applies PAN formatting. The output_format_choice parameter is set
to numeric, so the input format is not retained; instead, the output is normalized to just the (concealed) numbers.
version: "1.0"
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      - column: credit_card_number
        masks:
          - type: credit_card
            generate_card_number: false
            pan_format: true
            output_format_choice: numeric
Brazilian CPF (brazilian_cpf)
This mask provides a method for masking Brazilian CPF numbers using random (but valid) CPF numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.
The conventional format for CPF numbers is XXX.XXX.XXX-XX. This format can be retained, or if the input doesn't adhere
to this format, it can be standardised to this format. Alternatively, the input can be forced to display only the digits.
Parameters
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated CPF number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid CPF number. One of:
  - mask (default): Always overwrite without validating the generated CPF number. If an input value is not a valid CPF number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - formatted: Always return with the conventional formatting (XXX.XXX.XXX-XX) if the masked value is of correct length (11).
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
Notes: In the cases where the digits are replaced by the imitate mask:
- If output_format_choice: formatted, the value must be of correct length (11) to be formatted. If not, an error will be raised.
- If output_format_choice: numeric and there are non-numeric characters, an error will be raised.
Example
In this example, CPF numbers are generated to replace the values in the tax_number column of the employees table.
The numbers are forced into the standardised format XXX.XXX.XXX-XX, ensuring that both null and invalid values are replaced for consistency.
version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: tax_number
        masks:
          - type: brazilian_cpf
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted
Social security number (social_security_number)
This mask provides a method for masking social security numbers using random (but valid) social security numbers. Parameters can be adjusted to determine which actions to take based on the input value and the desired output.
The conventional format for social security numbers is XXX-XX-XXXX. This format can be retained, or if the input doesn't adhere
to this format, it can be standardised to this format. Alternatively, the input can be forced to display only the digits.
A value is considered a valid social security number if:
- It is of integer or string type.
- It consists of three groups of digits.
  - Group 1 (area number) comprises 3 digits, the value must be in the range of 001 to 899 and should not be 666.
  - Group 2 (group number) comprises 2 digits, the value must be in the range of 01 to 99.
  - Group 3 (serial number) comprises 4 digits, the value must be in the range of 0001 to 9999.
- These three groups can be written together or separated by either a hyphen - or a space.
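The rules above translate into a short check. The Python sketch below is illustrative only: is_valid_ssn is a hypothetical helper, it assumes both separators must be the same character, and it does not zero-pad integers that have lost leading zeros.

```python
import re

# Three digit groups; the second separator must match the first (assumption).
SSN_PATTERN = re.compile(r"([0-9]{3})([- ]?)([0-9]{2})\2([0-9]{4})")


def is_valid_ssn(value) -> bool:
    """Check type, grouping and the allowed range of each group."""
    if isinstance(value, bool) or not isinstance(value, (int, str)):
        return False
    match = SSN_PATTERN.fullmatch(str(value))
    if match is None:
        return False
    area = int(match.group(1))
    group = int(match.group(3))
    serial = int(match.group(4))
    return 1 <= area <= 899 and area != 666 and group >= 1 and serial >= 1


print(is_valid_ssn("123-45-6789"))  # True
print(is_valid_ssn("666-45-6789"))  # False
```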
Parameters
- on_null (optional): A string to specify the action to take if the value is null. One of:
  - skip (default): Skip to the next value, the value remains unchanged (i.e. the value stays null).
  - mask: Overwrite the null value with a generated social security number.
  - error: Raise an error and stop masking.
- on_invalid (optional): A string to specify the action to take if the value is an invalid social security number. One of:
  - mask (default): Always overwrite without validating the generated social security number. If an input value is not a valid social security number the imitate mask will be used to replace the digits.
  - skip: Skip to the next value, the value remains unchanged.
  - error: Raise an error and stop masking.
- output_format_choice (optional): A string to specify the desired output format. One of:
  - retained (default): Detect the input format (also based on the input value type i.e. numeric types have no format) and output in the same format. This is done by replacing each digit in the original value with the generated digits, in order.
  - formatted: Always return with the conventional formatting (XXX-XX-XXXX) if the masked value is of correct length (9).
  - numeric: Always return just the digits, as long as there are only digits in the masked value.
Example
In this example, social security numbers are generated to replace the values in the social_number column of the employees table.
The numbers are forced into the standardised format XXX-XX-XXXX, ensuring that both null and invalid values are replaced for consistency.
version: "1.0"
tasks:
  - type: mask_table
    table: employees
    key: employee_id
    rules:
      - column: social_number
        masks:
          - type: social_security_number
            on_null: mask
            on_invalid: mask
            output_format_choice: formatted
Set checksum (set_checksum)
This mask provides a method for calculating and setting valid checksum values for identifiers like VINs, ICP numbers, and credit card numbers. The mask accepts an input value where the checksum may be invalid, calculates the correct checksum, and modifies the value to have a valid checksum.
Note: If the input value to a set_checksum mask already has a valid checksum according to the checksum selected, the set_checksum mask will not change the value.
The following checksums are supported:
- Australian Business Number (ABN)
- Australian Company Number (ACN)
- Brazilian CPF (CPF)
- Credit Card (CC)
- Installation Control Point (ICP)
- Luhn
- Vehicle Identification Number (VIN)
The set_checksum algorithm only accepts string values. It will raise an error if passed other data types or null values.
By using the on_invalid_characters and on_invalid_format options,
the set_checksum algorithm can handle cases where the input isn't formatted correctly for the checksum, for example:
- the input is an incorrect length for the selected checksum
- the input contains characters which are not valid for the selected checksum (for example, letters in a credit card number).
This mask can be combined with the checksum_valid conditional test
to handle more complex situations.
Parameters
- checksum: A string to specify an algorithm to use to generate the checksum. Options:
  - australian_business_number
  - australian_company_number
  - brazilian_cpf
  - credit_card
  - icp
  - luhn
  - vin
- on_invalid_characters: Specifies what to do when the input contains characters which are not valid for the selected checksum. Options:
  - retain: Retain all invalid characters in their original positions.
  - remove: Remove all invalid characters when masking.
  - skip: If the value contains any invalid characters, skip to the next value; the value remains unchanged.
  - error (default): If the value contains invalid characters, stop masking with an error message.
Valid and invalid characters
In the set_checksum mask, the characters in the input string are divided into valid and invalid characters.
Valid characters are ones that are recognized by the selected checksum algorithm.
For most checksums, the valid characters are digits 0-9;
the exceptions are the vin and icp checksums,
where the valid characters are all letters A-Z, a-z and digits 0-9.
Everything else is an invalid character.
As an example, when the credit_card checksum is in use, if we are given the input AA1234-5678-9012-3456%,
the valid characters are the digits 1234567890123456,
and the invalid characters are everything else: two As, three hyphens, and a % character.
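The split into valid and invalid characters can be reproduced with a few lines of Python. split_characters is a hypothetical helper written for this example; the alphanumeric flag stands in for the vin and icp checksums.

```python
import string


def split_characters(value: str, alphanumeric: bool = False):
    """Return (valid, invalid) characters, each in original order."""
    valid_set = set(string.digits)
    if alphanumeric:  # vin and icp also accept A-Z and a-z
        valid_set |= set(string.ascii_letters)
    valid = "".join(ch for ch in value if ch in valid_set)
    invalid = "".join(ch for ch in value if ch not in valid_set)
    return valid, invalid


valid, invalid = split_characters("AA1234-5678-9012-3456%")
print(valid)    # 1234567890123456
print(invalid)  # AA---%
```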
- on_invalid_format: Specifies what to do when the value to be masked is not a valid format or value for the checksum. For example, 1234983921 is not a valid Brazilian CPF number as it has only ten digits; CPFs require eleven digits. Likewise, 4829ABI0828Q908XY is not a valid VIN: although it is the correct length (17 characters), it contains the characters I and Q which are not valid in VINs. Options:
  - skip: If the value is not valid for the checksum, skip to the next value; the value remains unchanged.
  - error (default): If the value is not valid for the checksum, stop masking with an error message.
For the purpose of on_invalid_format, invalid characters are ignored.
For example, given the input 123.498.392-1 for the Brazilian CPF checksum,
the dots and hyphen are invalid characters,
so the checksum will only verify whether the digits in the string, 1234983921, are valid for the checksum.
In this case they are not, as per above (there are only ten digits but there should be eleven digits).
Note: Because the purpose of the set_checksum mask is to correct erroneous check digits, a value with the wrong check digit(s) is not considered to be of an "invalid format". This behaviour differs from the on_invalid option used in other masks.
Note: The check for invalid characters happens before the check for an invalid format. Thus, in the case where the value to be masked both has invalid characters and is of an invalid format, on_invalid_characters: skip and on_invalid_characters: error both take precedence over on_invalid_format.
Character casing for alphanumeric checksums
The icp and vin checksums work on alphanumeric values.
The canonical form of values for these checksums is digits 0-9 and uppercase letters A-Z.
When using an alphanumeric checksum, the set_checksum mask handles character casing as follows:
- The input value can have any case, including mixed case.
- The generated replacement check digit(s), where they are letters, will always be in uppercase.
- Even if the input value has the correct check digit(s), they will still be converted to uppercase.
- Other characters in the value will retain their original case.
As an example, the correct check digits (last three characters) for an ICP of the form 04LJGQXK6BJ1xxx are D0A.
Thus, the values 04LjgQxK6bJ1ABc and 04LjgQxK6bJ1d0a will both mask to 04LjgQxK6bJ1D0A:
the lowercase j, g, x and b in the value are preserved,
while the check digits are always generated in uppercase.
You can use the transform_case mask
to adjust the character casing of the masked value.
Basic example: masking VINs
In this example, a table of both valid and invalid VINs
is updated to replace the check digit with a valid value.
Note that we are using skip_defaults to skip both null
and empty string values.
version: "1.0"
skip_defaults:
  - ''
  - null
tasks:
  - type: mask_table
    table: vehicles
    key: id
    rules:
      - column: vin
        masks:
          - type: set_checksum
            checksum: vin
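For readers unfamiliar with the VIN check digit, the Python sketch below computes it using the standard North American scheme (letter transliteration, position weights, remainder modulo 11 with 10 written as X). Whether DataMasque follows exactly this scheme is an assumption, and vin_check_digit and set_vin_checksum are hypothetical helpers.

```python
# Letter values for VIN characters; I, O and Q are never used in a VIN.
VIN_VALUES = {str(d): d for d in range(10)}
VIN_VALUES.update(zip("ABCDEFGH", range(1, 9)))
VIN_VALUES.update(zip("JKLMN", range(1, 6)))
VIN_VALUES.update({"P": 7, "R": 9})
VIN_VALUES.update(zip("STUVWXYZ", range(2, 10)))
# Position weights; the check digit itself (position 9) has weight 0.
VIN_WEIGHTS = (8, 7, 6, 5, 4, 3, 2, 10, 0, 9, 8, 7, 6, 5, 4, 3, 2)


def vin_check_digit(vin: str) -> str:
    """Compute the check digit for a 17-character VIN (any letter case)."""
    if len(vin) != 17:
        raise ValueError("a VIN has exactly 17 characters")
    total = 0
    for index, (ch, weight) in enumerate(zip(vin.upper(), VIN_WEIGHTS)):
        if index == 8:
            continue  # the existing check digit does not contribute
        if ch not in VIN_VALUES:
            raise ValueError(f"character {ch!r} is not valid in a VIN")
        total += VIN_VALUES[ch] * weight
    remainder = total % 11
    return "X" if remainder == 10 else str(remainder)


def set_vin_checksum(vin: str) -> str:
    """Replace position 9 with the computed (uppercase) check digit."""
    return vin[:8] + vin_check_digit(vin) + vin[9:]


print(set_vin_checksum("1M8GDM9A0KP042788"))  # 1M8GDM9AXKP042788
```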
Examples of the on_invalid_characters and on_invalid_format options
The table below shows the behaviour of various combinations of the on_invalid_characters and on_invalid_format options.
The checksum in use in these examples is brazilian_cpf.
| Input | on_invalid_characters | on_invalid_format | Masked result | Explanation |
|---|---|---|---|---|
| "04164106860" | Any | Any | "04164106859" | Valid format CPF with no invalid characters. The check digits (last two digits) are corrected from 60 to 59. |
| "041.641.068-60" | retain | Any | "041.641.068-59" | Valid format CPF. Invalid characters (dots and hyphen) are retained. |
| "041.641.068-60" | remove | Any | "04164106859" | Valid format CPF. Invalid characters are removed. |
| "041.641.068-60" | skip | Any | "041.641.068-60" (skipped) | Due to presence of invalid characters, this row is skipped. |
| "041.641.068-60" | error | Any | Error | Due to presence of invalid characters, an error is raised. |
| "0416410686" | Any | skip | "0416410686" (skipped) | Invalid format CPF (should have exactly 11 digits, but has only 10). This row is skipped. |
| "CPF 041.641.068-6" | retain or remove | skip | "CPF 041.641.068-6" (skipped) | After handling the invalid characters (everything except the digits), the resulting digits form an invalid format CPF, so this row is skipped. |
| "04164106861234" | Any | error | Error | Invalid format CPF (14 digits given but exactly 11 are required). An error is raised. |
| "041.641.068-61234" | retain or remove | error | Error | After handling the invalid characters, the resulting digits form an invalid format CPF, so an error is raised. |
| "041.641.068-61234" | skip | error | "041.641.068-61234" (skipped) | An invalid format CPF with invalid characters. The check for invalid characters happens first, so on_invalid_characters: skip takes precedence in this case. |
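The rows above can be reproduced with a small Python model. It uses the standard CPF check digit calculation (weighted sums modulo 11) and a hypothetical set_cpf_checksum function whose option handling is an assumption modelled on the table, not DataMasque's code.

```python
def cpf_check_digits(first_nine: str) -> str:
    """Compute the two CPF check digits for the first nine digits."""
    digits = [int(ch) for ch in first_nine]
    for _ in range(2):
        # Weights count down to 2: 10..2 for the first digit, 11..2 for the second.
        weights = range(len(digits) + 1, 1, -1)
        remainder = sum(d * w for d, w in zip(digits, weights)) % 11
        digits.append(0 if remainder < 2 else 11 - remainder)
    return f"{digits[9]}{digits[10]}"


def set_cpf_checksum(value, on_invalid_characters="error", on_invalid_format="error"):
    """Hypothetical model of the option handling shown in the table."""
    digits = "".join(ch for ch in value if ch in "0123456789")
    has_invalid = len(digits) != len(value)
    # Invalid characters are checked first, so skip and error win here.
    if has_invalid and on_invalid_characters == "skip":
        return value
    if has_invalid and on_invalid_characters == "error":
        raise ValueError("value contains invalid characters")
    if len(digits) != 11:
        if on_invalid_format == "skip":
            return value
        raise ValueError("a CPF needs exactly 11 digits")
    corrected = digits[:9] + cpf_check_digits(digits[:9])
    if on_invalid_characters == "remove":
        return corrected
    # retain: write the corrected digits back over the original digit positions.
    replacement = iter(corrected)
    return "".join(next(replacement) if ch in "0123456789" else ch for ch in value)


print(set_cpf_checksum("04164106860"))                # 04164106859
print(set_cpf_checksum("041.641.068-60", "retain"))   # 041.641.068-59
print(set_cpf_checksum("041.641.068-60", "remove"))   # 04164106859
```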