DataMasque Portal

Ruleset Generator

Overview

Use DataMasque's Ruleset Generator to generate a prototype YAML ruleset to mask database tables on a connection.

The Ruleset Generator runs a run_schema_discovery task in the background to discover database tables. Navigate to the Ruleset Generator page to utilise this functionality.

For more information about the methodology behind schema discovery, see the Schema Discovery guide. More information on the run_schema_discovery task can be found on the Ruleset Specification page.

Note: The schema discovery feature does not currently support Amazon DynamoDB or Microsoft SQL Server (Linked Server).

Caution: Be aware of the memory usage limitations.

Ruleset Generator

Select an existing connection from the dropdown box. If no run_schema_discovery task has previously been run against this connection (either from the ruleset generator page or by including a run_schema_discovery task in a ruleset), click the Run Discovery button to create a new run with the run_schema_discovery task. Otherwise, the table will populate with the most recent run data. You can click the Rerun Discovery button to run schema discovery again, which will overwrite the previous run's results.

Additional Custom Data Classification and Ignored keywords can be added for schema discovery. For more information please refer to the Additional Keywords section.

Schemas

By default, schema discovery runs against the schema configured on the database connection or, if none is configured there, against the database user's default schema. Alternatively, you can specify the schemas to discover by clicking the Configure schemas button and either entering the schema names or uploading them from a CSV file.

Notes:

  • MySQL and MariaDB don't have the concept of a schema; instead, they use databases to represent this concept (a grouping of tables). When a MySQL or MariaDB database connection is selected, the word "schema" in the UI will be replaced by "database" to reflect this.
  • Schema (or database, for MySQL/MariaDB) names must be complete matches and are case-sensitive. Partial matches and wildcards are not supported. For example, entering myschema will match only myschema, not mySCHEMA nor myschema_1.
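The matching rule above can be illustrated with a minimal Python sketch. The function name `schema_is_selected` and the sample names are hypothetical; the sketch only mirrors the complete, case-sensitive comparison described in the note and is not DataMasque's implementation.

```python
def schema_is_selected(schema_name, configured_schemas):
    """Return True only when the name is a complete, case-sensitive match."""
    return schema_name in configured_schemas


configured = ["myschema"]  # hypothetical names entered under "Configure schemas"

for candidate in ["myschema", "mySCHEMA", "myschema_1"]:
    # Only the first candidate matches; no wildcard or partial matching is applied.
    print(candidate, schema_is_selected(candidate, configured))
```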

In-data discovery

The toggle switch marked In-data discovery allows you to enable or disable in-data discovery for this schema discovery run. When in-data discovery finds a column that may contain sensitive information, the column will be marked as such in the results (under Flagged by). When generating a ruleset, DataMasque will suggest suitable mask types for the columns based on the type of data that in-data discovery thinks the column contains.

Schema discovery results

Once the run has completed, the table will populate with the report data from that run. The report data can be downloaded by clicking the "Download Report" button. The report is downloaded as a CSV file similar to the Sensitive Data Discovery report.

The CSV report contains the following columns:

  • Table schema: The schema of the table discovered.
  • Table name: The name of the table discovered.
  • Column name: The name of the column discovered and matched against built-in keywords, Global Custom Data Classification keywords or Custom keywords, if keyword matches are selected.
  • Constraint: Whether the column is a Primary or Unique key. The columns in which the constraint is present are listed in parentheses.
  • Table Hash Column: Indicates whether the column is selected as a table-level hash column for deterministic masking. This selection affects how hash_columns is configured in the generated ruleset.
  • Hash Columns Override: For columns selected for masking, shows the specific hash columns chosen to override the default table-level hash columns. This allows fine-grained control over which columns are used for hash calculation on a per-column basis.
  • Data Type: The column data type specified in the database metadata.
  • Foreign Keys: A list of any foreign keys that reference this column, each described in the pattern (fk_name, referenced_column).
  • Max Length: If the column is a text field, this contains the max length of the column. Otherwise, this value is an empty string.
  • Numeric Precision: If the column is a numeric field, this contains the numeric precision of the column: the maximum number of digits allowed for the number. Otherwise, this value is an empty string.
  • Numeric Scale: If the column is a numeric field, this contains the numeric scale of the column: the number of digits that are present after the decimal point. Otherwise, this value is an empty string.
  • Reason for flag: Description of the pattern that caused the column to be flagged for sensitive data.
  • Flagged by: Whether the column was flagged for sensitive data through in-data discovery or through the standard sensitive data discovery / keyword matching process.
  • Data classifications: A comma-separated list of classifications for the flagged sensitive data. Possible classifications include PII (Personally Identifiable Information), PHI (Personal Health Information), PCI (Payment Card Information) and Custom (User specified custom keywords).
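As an illustration of working with the downloaded report, the following Python sketch lists the columns carrying a given data classification. It assumes the CSV header names match the column names listed above exactly, which may not hold for a real report; the sample rows and the function name `columns_with_classification` are hypothetical.

```python
import csv
import io

# Hypothetical sample data; a real report contains more columns and its
# header spelling may differ from what is assumed here.
SAMPLE_REPORT = '''Table schema,Table name,Column name,Data classifications
public,employees,email,PII
public,employees,card_number,"PII, PCI"
public,employees,created_at,
'''


def columns_with_classification(report_text, classification):
    """Return (schema, table, column) tuples whose classifications include the label."""
    matches = []
    for row in csv.DictReader(io.StringIO(report_text)):
        # The classifications cell is a comma-separated list, possibly empty.
        labels = [label.strip() for label in row["Data classifications"].split(",")]
        if classification in labels:
            matches.append((row["Table schema"], row["Table name"], row["Column name"]))
    return matches


print(columns_with_classification(SAMPLE_REPORT, "PCI"))
```

To use it against a real file, read the downloaded report into a string first and pass that string in place of `SAMPLE_REPORT`.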

The columns intended to be masked can be selected from the table. Once all the intended columns have been selected the ruleset can be generated by clicking the "Generate Ruleset" button.

After the ruleset has been generated it can be previewed, downloaded, or sent to the ruleset editor.

Notes:

  • Foreign key columns cannot be selected in the user interface, as they should only be updated as the result of masking the columns they reference.

Hash Columns Configuration

Hash columns enable deterministic masking, ensuring that the same input data always produces the same masked output, which is crucial for maintaining referential integrity and consistency across related data. The ruleset generator provides two columns for configuring Hash Columns.

For comprehensive information about hash columns, their benefits, and advanced configuration options, see the Hash Columns documentation.

Table Hash Column

The Table Hash Column column contains checkboxes that allow you to select which columns should be used as table-level hash columns for deterministic masking. When you select columns here:

  • The selected columns will be used to calculate hash values for generating consistent masked data
  • This creates a table-level hash_columns configuration in the generated ruleset
  • All masked columns in the table will use these hash columns by default unless overridden

Hash Columns Override

The Hash Columns Override column provides dropdown menus that allow you to override the default table-level hash columns for specific masked columns, by adding hash_columns at the column level. This column is only available for:

  • Columns that are selected for masking (checked in the leftmost column)
  • Columns that are not primary keys (primary keys use unique masking functions)
  • Columns with available key columns (primary keys, unique keys, or foreign keys) in the same table

The dropdown options include:

  • Available key columns: Shows columns that have primary key (PK), unique key (UK), or foreign key (FK) constraints
  • "No Hash Columns": Explicitly excludes hash columns for this specific masked column
  • Multiple selections: You can select multiple columns to use as hash columns for a specific masked column

When no override is specified for a masked column, it will use the table-level hash columns if any are configured.

If multiple hash columns are selected, they will be sorted alphabetically when added to the ruleset. This means if multiple tables have hash columns with the same name, they will have the hash columns added in the same order, regardless of the order of the columns in the data dictionary.
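The effect of alphabetical ordering can be sketched in Python as follows. This is only an illustration: the function name and column names are hypothetical, and the use of Python's default string ordering (which is case-sensitive) is an assumption, since the exact collation used by the ruleset generator is not stated here.

```python
def ordered_hash_columns(selected):
    """Return the selected hash columns in alphabetical order, without duplicates."""
    # Assumption: plain code-point ordering stands in for "alphabetical".
    return sorted(set(selected))


# The same two columns picked in a different order on two hypothetical tables.
orders_selection = ["region", "customer_id"]
invoices_selection = ["customer_id", "region"]

# Both tables end up with an identical hash column order.
print(ordered_hash_columns(orders_selection))
print(ordered_hash_columns(invoices_selection))
```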

Keywords

Built-in Keywords

Built-in keywords can be enabled or disabled. Disabling them only stops the classification of columns as PII, PHI or PCI, along with the reasons for those flags.

Additional Keywords

Additional keywords can be configured for a run_schema_discovery task run on a connection.

A modal will be opened in which keywords can be added manually to the list, or a CSV file with additional keywords can be uploaded. The format and interpretation of additional custom data classification keywords and ignored keywords entered on the ruleset generator page is exactly the same as for the global keywords - see the links below.

The global keywords set on the Settings page will also be included if the "Include Global Custom Data Classification Keywords" or "Include Sensitive Data Discovery Ignored Keywords" toggles are switched on.

For more information about keywords please refer to:

Generated YAML Ruleset

After schema discovery has been run and the columns intended to be masked have been selected, the YAML ruleset can be generated by clicking the "Generate Ruleset" button. This will automatically generate a ruleset containing mask_table tasks for those columns.

Hash Columns in Generated Rulesets

When hash columns are configured through the ruleset generator UI, they are translated into hash_columns entries in the generated YAML ruleset:

Table Hash Columns

When columns are selected in the Table Hash Column column, a task-level hash_columns configuration is added to the ruleset:

tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - '"user_id"'

This configuration applies to all masked columns in the table unless overridden at the column level.

Column-Level Hash Column Overrides

When specific hash columns are selected in the Hash Columns Override column, they create column-level hash_columns entries:

tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - user_id  # Default for all columns
    rules:
      - column: email
        masks:
          - type: imitate  # Uses task-level hash_columns (user_id)
      - column: backup_email
        hash_columns:
          - '"backup_email"'  # Override: hash on itself instead

When No Hash Columns is selected for a specific column, it generates:

tasks:
  - type: mask_table
    table: employees
    hash_columns:
      - user_id  # Default for all columns
    rules:
      - column: name
        hash_columns: null  # Disable for this column
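The precedence between task-level and column-level hash_columns shown in the examples above can be summarised with a small Python sketch. It operates on plain dictionaries shaped like the parsed YAML; the function name `effective_hash_columns` is hypothetical, and the logic is a reading of the behaviour described in this section rather than DataMasque's own code.

```python
def effective_hash_columns(task, rule):
    """Resolve which hash columns apply to one column rule."""
    if "hash_columns" in rule:
        # A column-level entry always wins; null (None) disables hashing.
        return list(rule["hash_columns"] or [])
    # Otherwise fall back to the task-level default, if any.
    return list(task.get("hash_columns") or [])


# Dictionary form of the examples above (identifier quoting omitted).
task = {
    "type": "mask_table",
    "table": "employees",
    "hash_columns": ["user_id"],
    "rules": [
        {"column": "email"},
        {"column": "backup_email", "hash_columns": ["backup_email"]},
        {"column": "name", "hash_columns": None},
    ],
}

for rule in task["rules"]:
    print(rule["column"], effective_hash_columns(task, rule))
```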

For more detailed information about hash column configuration and syntax, see the Hash Columns documentation.

Ruleset Generation Process

The ruleset is generated as follows:

  • mask_table tasks are generated for the selected columns.

    • Selected unique columns (unique keys, primary keys, and foreign key targets) are masked with unique-preserving masks: imitate_unique for text and integer columns, imitate_uuid for UUID columns, or from_unique as a fallback for data types that imitate_unique cannot handle (e.g. dates, times, floats). Unique columns masked in this way will be listed in a documentation block at the top of the generated YAML ruleset.
    • If one or more columns appear in multiple composite unique or primary keys or foreign key targets, a unique-preserving mask will only be generated for one of those column sets in order to not break uniqueness or referential integrity. If one of those column sets is a subset of another, the subset will always be masked in order to guarantee the uniqueness of both the subset and superset.
    • If a ruleset contains any mask_table tasks, a skip_defaults option to exclude masking of null values and empty strings will be added to the ruleset. This can be removed, or overridden in individual tasks to mask these blank values in those tasks.
    • For other columns, the column names are first matched to the Built-in Keywords using the same method as sensitive data discovery in order to select an appropriate mask. If a match is not found, then an appropriate mask type is selected based on the column's data type.
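The choice of unique-preserving mask described above can be sketched as a simple lookup. The type category labels ("text", "integer", "uuid" and so on) and the function name are hypothetical simplifications; real database type names vary by engine, and this is not the generator's actual code.

```python
def unique_preserving_mask(type_category):
    """Pick a unique-preserving mask type for a selected unique column."""
    if type_category in {"text", "integer"}:
        return "imitate_unique"
    if type_category == "uuid":
        return "imitate_uuid"
    # Fallback for types imitate_unique cannot handle (e.g. dates, times, floats).
    return "from_unique"


for category in ["text", "integer", "uuid", "date", "float"]:
    print(category, "->", unique_preserving_mask(category))
```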

    The key of a mask_table task is a column name, or list of column names. It determines which rows will be masked with the same value. To maximise the security of the masked data, ideally no two rows would be deliberately masked to the same value and hence every row needs a different value in the key column (or a different combination of values across all key columns, where the ruleset specifies more than one). The following describes how the ruleset generator chooses the key column(s) to try to achieve this.

    The ruleset generator automatically validates column types when selecting keys, excluding columns with data types that are unsuitable for use as keys (such as XML, JSON, or certain database-specific types). Only columns with valid, supported data types are considered as potential keys. For a complete list of unsupported key column types by database, refer to the Database Connections documentation.

    For Oracle databases, the key for the mask_table task is always generated as ROWID (a pseudocolumn holding the unique address that Oracle assigns to each row). For other databases, the key for the mask_table task is selected from the following options (in order of precedence):

    • The table's primary key, if it is not selected for masking in this task.
    • The unique key with the fewest columns, where none of those columns are selected for masking in this task.
    • The columns participating in the smallest (in terms of number of columns) unique index on the table, where none of those columns are selected for masking in this task.
    • The column or set of columns targeted by a foreign key with the fewest columns where none of those columns are selected for masking in this task. While the target of a foreign key is not guaranteed to be unique for all connections (e.g. MySQL) it is expected to be sufficiently unique to act as the key for mask_table.
    • Any column(s) that are flagged as Identification / Identifiers by sensitive data discovery, and are not selected for masking.
    • The first three columns alphabetically by name amongst those not selected for masking. The columns are selected as a set; the ruleset generator will always select exactly three columns in this case.
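The order of precedence above can be sketched in Python for non-Oracle databases. Everything here is a simplified illustration under stated assumptions: `TableInfo`, `choose_key` and the sample table are hypothetical, a composite primary key is treated as unusable if any of its columns is masked, and data type validation and the unique-preserving-mask exception are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TableInfo:
    columns: list
    primary_key: list = field(default_factory=list)
    unique_keys: list = field(default_factory=list)     # each entry is a list of columns
    unique_indexes: list = field(default_factory=list)  # each entry is a list of columns
    fk_targets: list = field(default_factory=list)      # each entry is a list of columns
    identifier_columns: list = field(default_factory=list)


def _smallest_unmasked(column_sets, masked):
    """Return the candidate with the fewest columns, none of which are masked."""
    usable = [cols for cols in column_sets if cols and not masked.intersection(cols)]
    return min(usable, key=len) if usable else None


def choose_key(table, masked):
    """Return (key_columns, reason) following the documented precedence."""
    masked = set(masked)
    if table.primary_key and not masked.intersection(table.primary_key):
        return table.primary_key, "primary key"
    for column_sets, reason in [
        (table.unique_keys, "unique key"),
        (table.unique_indexes, "unique index"),
        (table.fk_targets, "foreign key target"),
    ]:
        choice = _smallest_unmasked(column_sets, masked)
        if choice:
            return choice, reason
    identifiers = [c for c in table.identifier_columns if c not in masked]
    if identifiers:
        return identifiers, "flagged as identifier"
    # Last resort: the first three unmasked columns in alphabetical order.
    fallback = sorted(c for c in table.columns if c not in masked)[:3]
    return fallback, "first three unmasked columns alphabetically"


employees = TableInfo(
    columns=["id", "email", "name", "dept", "badge_no"],
    primary_key=["id"],
    unique_keys=[["email"], ["dept", "badge_no"]],
)
print(choose_key(employees, masked={"name"}))
print(choose_key(employees, masked={"id", "email"}))
```

In the second call both the primary key and the single-column unique key are being masked, so the sketch falls through to the two-column unique key.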

    Note: when all masks in a task are unique-preserving (e.g. imitate_unique), the key columns may include columns that are being masked, since these masks guarantee that uniqueness is maintained.

    The ruleset generator adds a comment above every key in a mask_table task explaining which key columns were chosen and why. It also adds warning comments to the top of the generated ruleset in several situations:

    • Non-unique keys: When the selected key columns may not contain unique values, a warning is added. Review the generator's choice of key column(s) to ensure they are sufficiently unique for the table.
    • No valid keys available: When the generator cannot select any valid key columns (either because no suitable columns exist or all candidate columns have unsupported data types), it adds a warning and includes a commented-out key of REPLACE_ME. You must uncomment this line and replace REPLACE_ME with appropriate column name(s).
    • Column type validation: If columns were excluded from key selection due to unsupported data types, this is noted in the warnings.
    • Referential integrity: When columns cannot be masked or additional columns are masked to maintain referential integrity, these are documented in the warning block.

Further modifications to the ruleset may be required to achieve the intended mask on the database, which can be completed after passing the ruleset to the Ruleset Editor.

Notes:

  • In certain circumstances, the generated ruleset may not mask all selected columns, such as:
    • Columns where no masking approach can be determined that would not break referential integrity for one or more foreign keys
  • In certain circumstances, additional columns that were not selected may also be masked, such as:
    • Foreign keys referencing masked columns
    • Unselected columns in groups of jointly unique columns where at least one column is selected, including: composite unique keys, primary keys, and the targets of foreign keys.
  • In both of the above cases, the columns that could not be masked or were additionally masked will be listed in a documentation block at the top of the generated YAML ruleset.

JSON Columns

Any json or jsonb type columns detected by the Ruleset Generator will be masked with a from_fixed mask with the value {} (empty JSON object/dictionary). This provides a safe default by effectively blanking out any JSON columns.

For proper masking of JSON columns, please use a json mask instead. The json mask can traverse a JSON document and update individual elements while retaining its structure.

Troubleshooting Schema Discovery

If, after running discovery in the Ruleset Generator, an error or warning appears and results are not displayed, refer to the following troubleshooting guidance. You can also click on the link in the message to view the run log for the discovery run, which might contain more details about why discovery failed.

"Schema discovery failed"

  • Check that DataMasque is licensed.
  • Test the connection from the Database Masking page, and fix any connection issues such as the password being incorrect.
  • Check there is no masking run against the target database currently in progress.
  • Ensure the database user configured on the connection has sufficient database privileges.

"Schema discovery on connection successful, but no tables were discovered"

  • Check the schema(s) or database(s) specified for discovery. These are case-sensitive.
  • If there are no schemas or databases explicitly specified in the options, then check the default schema or database specified on the connection. This is also case-sensitive.
  • Check the database user has access to the target schema(s) or database(s).

Troubleshooting Generated Rulesets

Failures to satisfy primary or unique key constraints

If a mask_table task fails to satisfy a primary or unique key constraint, it could be due to one of the following issues:

  1. For columns masked with from_unique (used as a fallback for data types that imitate_unique cannot handle): the range of values generated by from_unique overlaps with the range of existing values in the column, resulting in duplicate values mid-masking. You should configure from_unique to generate values that do not overlap with the existing contents of the column, or add run_sql tasks to disable unique constraints during masking and re-enable them after.
  2. The key of the mask_table task is not a column or set of columns containing unique values (e.g. it is a non-unique set of columns referenced by a foreign key, which the ruleset generator assumes will typically be unique). You should change the key to a column or set of columns that is guaranteed to contain only unique values.
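The first cause can be illustrated with a small simulation. The sketch below applies replacement values one row at a time to a column with a unique constraint and reports the first violation. It is a deliberate simplification (real databases may apply updates in batches and check constraints at different points), and the function name and values are hypothetical.

```python
def first_collision(existing, replacements):
    """Apply replacements row by row under a unique constraint.

    Returns (row_index, new_value) for the first violation, or None.
    """
    current = list(existing)
    for i, new_value in enumerate(replacements):
        # Values held by every other row, masked or not yet masked.
        others = current[:i] + current[i + 1:]
        if new_value in others:
            return i, new_value
        current[i] = new_value
    return None


existing_ids = [1, 2, 3, 4, 5]

# Overlapping range: the first row's new value 3 still exists in an unmasked row.
print(first_collision(existing_ids, [3, 4, 5, 6, 7]))

# Disjoint range: no replacement can clash with an existing value.
print(first_collision(existing_ids, [1001, 1002, 1003, 1004, 1005]))
```

The replacement values are unique among themselves in both cases; the failure in the first case comes only from the overlap with values that have not been masked yet.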

Generated ruleset warnings

Warning comments at the top of a generated ruleset indicate issues that require manual review:

  • "Key columns may not be unique": The selected key columns might contain duplicate values. Review and potentially change the key to ensure uniqueness.
  • "No valid key columns could be selected": No columns met the criteria for use as keys (either due to data type constraints or lack of unique columns). You'll need to manually specify appropriate key columns.
  • "Columns excluded due to unsupported types": Some columns were not considered as keys because their data types are not supported for key operations.
  • "Columns could not be masked": Review these to understand which columns were affected by referential integrity constraints.

Always review and address these warnings before running the generated ruleset.
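When reviewing a "Key columns may not be unique" warning, it can help to check a sample of the table's rows for repeated key values. The following Python sketch shows one way to do that on rows already loaded as dictionaries; the function name and sample rows are hypothetical. Against a live database, an equivalent GROUP BY query with HAVING COUNT(*) > 1 on the key columns serves the same purpose.

```python
from collections import Counter


def duplicate_key_values(rows, key_columns):
    """Return each key value combination that occurs more than once, with its count."""
    counts = Counter(tuple(row[column] for column in key_columns) for row in rows)
    return {key: count for key, count in counts.items() if count > 1}


sample_rows = [
    {"dept": "sales", "badge_no": 1},
    {"dept": "sales", "badge_no": 2},
    {"dept": "ops", "badge_no": 1},
]

# "dept" alone repeats, so it is not a suitable key on its own.
print(duplicate_key_values(sample_rows, ["dept"]))

# The combination of both columns is unique in this sample.
print(duplicate_key_values(sample_rows, ["dept", "badge_no"]))
```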

Ruleset Editor

The Ruleset Editor is the same as the editor used when creating/editing a ruleset. See the Ruleset Editor guide for more information on this feature.