Ruleset Generator
- Overview
- Ruleset Generator
- Keywords
- Generated YAML Ruleset
- Troubleshooting Generated Rulesets
- Ruleset Editor
Overview
Use DataMasque's Ruleset Generator to generate a prototype YAML ruleset to mask database tables on a connection.
The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables.
Navigate to the Ruleset Generator page to utilise this functionality.
For more information about the methodology behind schema discovery, see the Schema Discovery guide.
More information on the run_schema_discovery task can be found
on the Ruleset Specification page.
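As a minimal sketch, a ruleset that does nothing but run discovery might contain a single task, as below. The version value and the absence of task options are assumptions; the Ruleset Specification page lists the options that run_schema_discovery actually accepts.

```yaml
# Hypothetical minimal ruleset: one task that only runs schema discovery.
# Check the Ruleset Specification for the options this task supports.
version: "1.0"
tasks:
  - type: run_schema_discovery
```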
Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).
Caution: Be aware of the memory usage limitations.
Ruleset Generator
Select an existing connection from the dropdown box.
If no run_schema_discovery task has previously been run against this connection
(either from the ruleset generator page or by including a run_schema_discovery task in a ruleset),
click the Run Discovery button to create a new run with the run_schema_discovery task.
Otherwise, the table will populate with the most recent run data.
You can click the Rerun Discovery button to run schema discovery again,
which will overwrite the previous run's results.
Additional Custom Data Classification and Ignored keywords can be added for schema discovery. For more information please refer to the Additional Keywords section.
Schemas
By default, schema discovery will run against the schema
configured on the database connection or,
if none is configured there, against the database user's default schema.
Alternatively, you can specify the schemas to discover
by clicking on the Configure schemas button
and entering the schema names, or uploading them from a CSV file.
Notes:
- MySQL and MariaDB don't have the concept of a schema; instead, they use databases to represent this concept (a grouping of tables). When a MySQL or MariaDB database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.
- Schema (or database, for MySQL/MariaDB) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering myschema will match only myschema, not mySCHEMA nor myschema_1.
In-data discovery
The toggle switch marked In-data discovery allows you to enable or disable in-data discovery
for this schema discovery run.
When in-data discovery finds a column that may contain sensitive information,
the column will be marked as such in the results (under Flagged by).
When generating a ruleset, DataMasque will suggest suitable mask types for the columns
based on the type of data that in-data discovery thinks the column contains.
Schema discovery results
Once the run is completed the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report will be downloaded as a CSV similar to the Sensitive Data Discovery report.
The CSV report contains the following columns:
| Table schema | The schema of the table discovered. |
| Table name | The name of the table discovered. |
| Column name | The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords if keyword matches are selected. |
| Constraint | Whether the column is a Primary or Unique key. In parentheses it will list the columns in which the constraint is present. |
| Table Hash Column | Indicates whether the column is selected as a table-level hash column for deterministic masking. This selection affects how hash_columns is configured in the generated ruleset. |
| Hash Columns Override | For selected columns to be masked, shows the specific hash columns chosen to override the default table-level hash columns. Allows fine-grained control over which columns are used for hash calculation on a per-column basis. |
| Data Type | The column data type specified in the database metadata. |
| Foreign Keys | A list of any foreign keys that reference this column, described in the following pattern (fk_name, referenced_column). |
| Max Length | If the column is a text field, this contains the max length of the column, otherwise empty string. |
| Numeric Precision | If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string. |
| Numeric Scale | If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string. |
| Reason for flag | Description of pattern which caused the column to be flagged for sensitive data. |
| Flagged by | Whether the column was flagged for sensitive data through in-data discovery or through the standard sensitive data discovery / keyword matching process. |
| Data classifications | A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords). |
The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.
After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.
Notes:
- Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.
Hash Columns Configuration
Hash columns enable deterministic masking, ensuring that the same input data always produces the same masked output, which is crucial for maintaining referential integrity and consistency across related data. The ruleset generator provides two columns for configuring Hash Columns.
For comprehensive information about hash columns, their benefits, and advanced configuration options, see the Hash Columns documentation.
Table Hash Column
The Table Hash Column column contains checkboxes that allow you to select which columns should be used as table-level hash columns for deterministic masking. When you select columns here:
- The selected columns will be used to calculate hash values for generating consistent masked data
- This creates a table-level hash_columns configuration in the generated ruleset
- All masked columns in the table will use these hash columns by default unless overridden
Hash Columns Override
The Hash Columns Override column provides dropdown menus that allow you to override the default table-level hash columns for specific masked columns,
by adding hash_columns at the column level.
This column is only available for:
- Columns that are selected for masking (checked in the leftmost column)
- Columns that are not primary keys (primary keys use unique masking functions)
- Columns with available key columns (primary keys, unique keys, or foreign keys) in the same table
The dropdown options include:
- Available key columns: Shows columns that have primary key (PK), unique key (UK), or foreign key (FK) constraints
- "No Hash Columns": Explicitly excludes hash columns for this specific masked column
- Multiple selections: You can select multiple columns to use as hash columns for a specific masked column
When no override is specified for a masked column, it will use the table-level hash columns if any are configured.
If multiple hash columns are selected, they will be sorted alphabetically when added to the ruleset. This means if multiple tables have hash columns with the same name, they will have the hash columns added in the same order, regardless of the order of the columns in the data dictionary.
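The effect of the alphabetical ordering can be sketched with two tables that share the same hash column names. This is an illustrative excerpt only: the table and column names are hypothetical, and the key and rules entries are left out for brevity.

```yaml
# Illustrative excerpt; table and column names are hypothetical,
# and key/rules are omitted for brevity.
tasks:
  - type: mask_table
    table: customers
    hash_columns:
      - account_id    # alphabetical order: account_id before region_code
      - region_code
  - type: mask_table
    table: orders
    hash_columns:
      - account_id    # same order as in customers
      - region_code
```

Because both tasks list the hash columns in the same order, rows with the same account_id and region_code values are hashed from the same sequence of inputs in both tables.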
Keywords
Built-in Keywords
Built-in keywords can be enabled or disabled. Disabling them will only stop the classification of columns relating to PII, PHI or PCI, and the reasons for those flags.
Additional Keywords
Additional keywords can be configured for a run_schema_discovery task run on a connection.
A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.
The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are toggled on.
For more information about keywords please refer to:
Generated YAML Ruleset
After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button.
This will automatically generate a ruleset containing mask_table tasks for those columns.
Hash Columns in Generated Rulesets
When hash columns are configured through the ruleset generator UI,
they are translated into hash_columns entries in the generated YAML ruleset:
Table Hash Columns
When columns are selected in the Table Hash Column column,
a task-level hash_columns configuration is added to the ruleset:
tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - '"user_id"'
This configuration applies to all masked columns in the table unless overridden at the column level.
Column-Level Hash Column Overrides
When specific hash columns are selected in the Hash Columns Override column,
they create column-level hash_columns entries:
tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - user_id  # Default for all columns
    rules:
      - column: email
        masks:
          - type: imitate  # Uses task-level hash_columns (user_id)
      - column: backup_email
        hash_columns:
          - '"backup_email"'  # Override: hash on itself instead
When No Hash Columns is selected for a specific column, it generates:
tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - user_id  # Default for all columns
    rules:
      - column: name
        hash_columns: null  # Disable for this column
For more detailed information about hash column configuration and syntax, see the Hash Columns documentation.
Ruleset Generation Process
The generation of the ruleset is as follows:
- mask_table tasks are generated for the selected columns.
- Selected unique columns (unique keys, primary keys, and foreign key targets) are masked with unique-preserving masks: imitate_unique for text and integer columns, imitate_uuid for UUID columns, or from_unique as a fallback for data types that imitate_unique cannot handle (e.g. dates, times, floats). Unique columns masked in this way will be listed in a documentation block at the top of the generated YAML ruleset.
  - If one or more columns appear in multiple composite unique or primary keys or foreign key targets, a unique-preserving mask will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column sets is a subset of another, the subset will always be masked in order to guarantee the uniqueness of both the subset and superset.
- If a ruleset contains any mask_table tasks, a skip_defaults option to exclude masking of null values and empty strings will be added to the ruleset. This can be removed, or overridden in individual tasks to mask these blank values in those tasks.
- For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
- The key of a mask_table task is a column name, or list of column names. It determines which rows will be masked with the same value. To maximise the security of the masked data, ideally no two rows would be deliberately masked to the same value and hence every row needs a different value in the key column (or a different combination of values across all key columns, where the ruleset specifies more than one). The following describes how the ruleset generator chooses the key column(s) to try to achieve this.

  The ruleset generator automatically validates column types when selecting keys, excluding columns with data types that are unsuitable for use as keys (such as XML, JSON, or certain database-specific types). Only columns with valid, supported data types are considered as potential keys. For a complete list of unsupported key column types by database, refer to the Database Connections documentation.

  For Oracle databases, the key for the mask_table task is always generated as ROWID (a unique number that Oracle gives to each row). For other databases, the key for the mask_table task is selected from the following options (in order of precedence):
  - The table's primary key, if it is not selected for masking in this task.
  - The unique key with the fewest columns, where none of those columns are selected for masking in this task.
  - The columns participating in the smallest (in terms of number of columns) unique index on the table, where none of those columns are selected for masking in this task.
  - The column or set of columns targeted by a foreign key with the fewest columns, where none of those columns are selected for masking in this task. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL), it is expected to be sufficiently unique to act as the key for mask_table.
  - Any column(s) that are flagged as Identification/Identifiers by sensitive data discovery, and are not selected for masking.
  - The first three column(s) alphabetically by name amongst those not selected for masking. The columns are selected as a set; the ruleset generator will always select exactly three columns in this case.

  Note: when all masks in a task are unique-preserving (e.g. imitate_unique), the key columns may include columns that are being masked, since these masks guarantee that uniqueness is maintained.

  The ruleset generator adds a comment above every key in a mask_table task explaining which key column(s) was/were chosen and why.

  The ruleset generator adds warning comments to the top of the generated ruleset in several situations:
  - Non-unique keys: When the selected key columns may not contain unique values, a warning is added. Review the generator's choice of key column(s) to ensure they are sufficiently unique for the table.
  - No valid keys available: When the generator cannot select any valid key columns (either because no suitable columns exist or all candidate columns have unsupported data types), it adds a warning and includes a commented-out key of REPLACE_ME. You must uncomment this line and replace REPLACE_ME with appropriate column name(s).
  - Column type validation: If columns were excluded from key selection due to unsupported data types, this is noted in the warnings.
  - Referential integrity: When columns cannot be masked or additional columns are masked to maintain referential integrity, these are documented in the warning block.
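The overall shape of a generated ruleset can be sketched as follows. This is an illustration rather than real generator output: the table and column names are hypothetical, the comment wording is invented, and the exact form of the skip_defaults entry is an assumption to check against the Ruleset Specification.

```yaml
# Illustrative sketch only; names and comment wording are hypothetical.
version: "1.0"
skip_defaults:        # assumed form: skip null values and empty strings
  - null
  - ''
tasks:
  - type: mask_table
    table: employees
    # Key chosen: the table's primary key, which is not selected for masking.
    key: employee_id
    rules:
      - column: badge_number      # unique key, so a unique-preserving mask is used
        masks:
          - type: imitate_unique
      - column: email             # mask chosen by keyword match or data type
        masks:
          - type: imitate
```

For an Oracle connection, the key line in a sketch like this would be ROWID rather than a column name.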
Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.
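One modification that is always required is supplying a key when the generator could not choose one. A task in that state might look roughly like the sketch below; the table and column names and the warning text are hypothetical, and only the commented-out key of REPLACE_ME comes from the behaviour described above.

```yaml
# Illustrative sketch; table, column, and warning wording are hypothetical.
tasks:
  - type: mask_table
    table: audit_log
    # WARNING: no valid key columns could be selected for this table.
    # key: REPLACE_ME
    rules:
      - column: user_email
        masks:
          - type: imitate
```

Before running the ruleset, uncomment the key line and replace REPLACE_ME with a column, or list of columns, whose values are unique for every row.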
Notes:
- In certain circumstances, the generated ruleset may not mask all selected columns, such as:
  - Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys
- In certain circumstances, additional columns that were not selected may also be masked, such as:
  - Foreign keys referencing masked columns
  - Unselected columns in groups of jointly unique columns where at least one column is selected, including: composite unique keys, primary keys, and the targets of foreign keys.
- In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.
JSON Columns
Any json or jsonb type columns detected by the Ruleset Generator will be masked with a
from_fixed mask with the value {} (empty JSON
object/dictionary). This provides a safe default by effectively blanking out any JSON columns.
For proper masking of JSON columns, please use a json mask instead. The json
mask can traverse a JSON document and update individual elements while retaining its structure.
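The difference between the two approaches can be sketched as below. The table, column, and path names are hypothetical. The quoting of the {} value and the transforms/path layout of the json mask are assumptions, so check them against the json mask documentation before use.

```yaml
# Illustrative sketch; table, column, and path names are hypothetical.
tasks:
  - type: mask_table
    table: customers
    key: customer_id
    rules:
      # Generated default: replace the whole document with an empty object.
      - column: preferences
        masks:
          - type: from_fixed
            value: '{}'           # quoting is an assumption
      # Hand-written alternative: mask one element and keep the structure.
      # The transforms/path layout is an assumption.
      - column: profile
        masks:
          - type: json
            transforms:
              - path: ['contact', 'email']
                masks:
                  - type: from_fixed
                    value: 'redacted@example.com'
```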
Troubleshooting Schema Discovery
If, after running discovery in the Ruleset Generator, an error or warning appears and results are not displayed, refer to the following troubleshooting guidance. You can also click on the link in the message to view the run log for the discovery run, which might contain more details about why discovery failed.
"Schema discovery failed"
- Check that DataMasque is licensed.
- Test the connection from the Database Masking page, and fix any connection issues such as the password being incorrect.
- Check there is no masking run against the target database currently in progress.
- Ensure the database user configured on the connection has sufficient database privileges.
"Schema discovery on connection successful, but no tables were discovered"
- Check the schema(s) or database(s) specified for discovery. These are case-sensitive.
- If there are no schemas or databases explicitly specified in the options, then check the default schema or database specified on the connection. This is also case-sensitive.
- Check the database user has access to the target schema(s) or database(s).
Troubleshooting Generated Rulesets
Failures to satisfy primary or unique key constraints
If a mask_table task fails to satisfy a primary or unique key
constraint, it could be due to one of the following issues:
- For columns masked with from_unique (used as a fallback for data types that imitate_unique cannot handle): the range of values generated by from_unique overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configure from_unique to generate values that do not overlap with the existing contents of the column, or add run_sql tasks to disable unique constraints during masking and re-enable them after.
- The key of the mask_table task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change the key to a column or set of columns that is guaranteed to contain only unique values.
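The run_sql approach mentioned above might be sketched like this. The constraint, table, and column names are hypothetical, the ALTER TABLE statements use Oracle-style syntax (which differs between databases), and the from_unique parameters are deliberately left out.

```yaml
# Illustrative sketch; constraint, table, and column names are hypothetical.
# The SQL is Oracle-style; other databases use different statements.
tasks:
  - type: run_sql
    sql: ALTER TABLE employees DISABLE CONSTRAINT employees_start_date_uk
  - type: mask_table
    table: employees
    key: ROWID
    rules:
      - column: start_date
        masks:
          - type: from_unique   # parameters omitted; see the from_unique documentation
  - type: run_sql
    sql: ALTER TABLE employees ENABLE CONSTRAINT employees_start_date_uk
```

Re-enabling the constraint will fail if the masked values are not unique, which acts as a check that the mask preserved uniqueness.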
Generated ruleset warnings
Warning comments at the top of a generated ruleset indicate issues that require manual review:
- "Key columns may not be unique": The selected key columns might contain duplicate values. Review and potentially change the key to ensure uniqueness.
- "No valid key columns could be selected": No columns met the criteria for use as keys (either due to data type constraints or lack of unique columns). You'll need to manually specify appropriate key columns.
- "Columns excluded due to unsupported types": Some columns were not considered as keys because their data types are not supported for key operations.
- "Columns could not be masked": Review these to understand which columns were affected by referential integrity constraints.
Always review and address these warnings before running the generated ruleset.
Ruleset Editor
The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.