DataMasque Portal

Configuration Reference

Complete reference for DataMasque subsetting ruleset YAML structure, field definitions, and validation rules.

Ruleset Structure

Square brackets [] mean the value is a list.

version: "1.0"                       # Required: Ruleset schema version
tasks:                               # Required: Array of tasks
  - type: subset                     # Required: Task type
    mode: in_place                   # Required: Processing mode
    defaults:                        # Required: Default settings
      target_percent: integer        # Required: Target percentage (1-99)
      batch_size: integer            # Optional: Batch size (1-1,000,000)
    include: [pattern]               # Optional: Tables to include
    skip: [pattern]                  # Optional: Tables to skip
    drop: [pattern]                  # Optional: Tables to drop entirely
    truncate: [pattern]              # Optional: Tables to empty

Pattern Structure

Each pattern in include, skip, drop, and truncate lists must contain exactly one of:

  • table: Exact fully-qualified table name (e.g., schema.tablename)
  • regex: Regular expression pattern
  • glob: Glob pattern with wildcards (*, ?, [abc])

Patterns in include can also override default settings:

include:
  - table: schema.tablename          # Basic pattern
    target_percent: 5                # Optional override
    additional_cascades: [...]       # Optional custom relationships
  - glob: schema.orders_*            # Pattern with override
    target_percent: 15

Task Properties

The following sections describe the properties available for subsetting tasks.

Subset task (subset)

The subset task reduces the size of databases while maintaining referential integrity.

Parameter Type Required? Description
type String Yes Must be "subset".
mode String Yes Processing mode.
Currently only "in_place" is supported, meaning the subset data will replace the original data.
defaults Object Yes Default configuration for the subsetting process.
See Default Configuration.
include List of include patterns No* Tables to include in the subset.
See Include Patterns.
skip List of table patterns No* Tables to skip during subsetting.
See Skip Patterns.
drop List of table patterns No* Tables to drop entirely from the database.
See Drop Patterns.
truncate List of table patterns No* Tables to truncate (empty) after subsetting.
See Truncate Patterns.

* At least one of include, skip, drop, or truncate must be specified.

subset tasks cannot be used within a parallel block or serial block; they must appear only in the top level tasks section of the ruleset.

Default Configuration

The defaults object configures default behavior for the entire subsetting operation.

Parameter Type Required? Description
target_percent Integer Yes Target percentage of data to retain (1 to 99 inclusive).
Applies to root tables in dependency trees (tables with no parent relationships).
batch_size Integer No Controls the batch size for subsetting operations (1 to 1,000,000).
Defaults to 50,000.
When specified, overrides the batch size set in run options.
Use smaller values to reduce memory usage for large tables.

The target_percent applies only to the root tables in any dependency trees (tables that are not children in any foreign keys or configured additional cascades). The retained data percentage in child tables is determined by which rows relate to the parent records that are retained, unless overridden in the include patterns.

Note: For tables with self-referential foreign keys (recursive foreign keys), the actual retention percentage may exceed the configured subset_percent_goal to maintain referential integrity. The subsetting operation will complete successfully and the run will finish with a finished_with_warnings status.

The batch_size parameter controls how many rows are processed at once during subsetting operations. When specified in the ruleset, it overrides the batch size set in the run options. This allows for fine-tuned control over memory usage on a per-ruleset basis. Smaller batch sizes reduce memory consumption but may increase processing time, while larger batch sizes can improve performance for tables with many rows.

Include Patterns

The include list specifies which tables to include in the subset. If not specified, any tables that don't match at least one drop, truncate, or skip pattern will be included by default.

Each pattern must specify the table(s) to match using one of three pattern types, and can optionally override default settings for matched tables.

Note: Tables with foreign keys to included tables are automatically included to maintain referential integrity, even if not explicitly listed or if they match a skip pattern. This is recursive: if table A is a parent of B and B is a parent of C, including A will automatically include both B and C.

Pattern Types

Each pattern must contain exactly one of the following fields:

Parameter Type Required? Description
table String No* Exact fully-qualified table name (must use schema.table format).
For case-sensitive identifiers, use quoting: table: '"Schema"."CaseSensitiveTable"'.
glob String No* Glob pattern match supporting wildcards (must use schema.table format with explicit schema):
* (zero or more characters), ? (single character), [abc] (character set).
The schema prefix must be explicit and cannot use wildcards.
Case sensitivity configured via case_sensitive.
regex String No* Regular expression pattern match (matches against schema.table format).
The schema prefix must be explicit in the regex pattern.
Must be a valid regular expression.
Case sensitivity configured via case_sensitive.
case_sensitive Boolean No Whether pattern matching is case-sensitive (defaults to false).
Only applies to glob and regex patterns; ignored for table patterns which follow database quoting rules.

* Exactly one of table, glob, or regex must be specified.

All patterns must use fully-qualified schema.table format. For glob and regex patterns, the schema portion must be explicit (no wildcards or pattern matching in the schema name). Patterns like glob: "customer_*" (missing schema) or glob: "*.temp" (wildcard schema) will result in validation errors. Unquoted names are interpreted according to the target database engine's rules.

Note: Enclose patterns in single or double quotes ('' or "") to avoid potential misinterpretation due to YAML special characters. Backslashes must be escaped as \\ because the backslash is an escape character in YAML.

Pattern Overrides

Include patterns can override default configuration for matched tables:

Parameter Type Required? Description
target_percent Integer No Override the default percentage for matching tables (1 to 99).
Cannot be higher than the default percentage.
additional_cascades List of cascade objects No Table-specific custom relationships.
See Additional Cascades.

The target_percent represents the desired retention for matching tables. Actual retention may differ: it may be lower if parent tables were heavily subsetted, or higher if NOT NULL foreign key relationships require additional rows to maintain referential integrity. See Retention and Foreign Key Relationships for details.

When using additional_cascades in an include pattern, only table patterns are supported (not glob or regex).

Examples

include:
  # Exact table match (case-sensitive due to quoting)
  - table: '"Schema1"."Customers"'

  # Exact table match (case follows database rules - typically case-insensitive)
  - table: public.orders

  # Exact table with target_percent override
  - table: public.customers
    target_percent: 5

  # Glob pattern matching all tables starting with "user" in the analytics schema
  - glob: analytics.user*
    case_sensitive: false

  # Glob pattern with override
  - glob: sales.order_*
    target_percent: 15

  # Regex pattern matching tables ending with year suffix
  - regex: "^metrics\\.(analytics|stats)_\\d{4}$"
    target_percent: 1

  # Table with custom relationships
  - table: crm.accounts
    additional_cascades:
      - source_columns: [account_id]
        target_table: external.contacts
        target_columns: [acct_ref]

Skip Patterns

The skip list specifies tables to exclude from subsetting. Skipped tables will still be included if they have foreign keys to included tables, to maintain referential integrity.

Skip patterns use the same pattern types as include patterns, but do not support overrides.

skip:
  # Skip specific table
  - table: public.audit_log

  # Skip all temp tables in public schema
  - glob: "public.temp_*"
    case_sensitive: false

  # Skip backup tables in public schema
  - regex: "public\\..*_backup$"

Drop Patterns

The drop list specifies tables to completely remove from the database (DROP TABLE). These tables are dropped at the end of the subsetting operation.

Drop patterns use the same pattern types as skip patterns.

Warning: DROP is a destructive operation that permanently removes both table structure and data.

drop:
  # Drop specific staging table
  - table: staging.import_temp

  # Drop all tables in staging schema
  - glob: staging.*

  # Drop old partition tables
  - regex: "data\\.events_\\d{4}_\\d{2}$"  # Matches: data.events_2023_01, etc.

Truncate Patterns

The truncate list specifies tables to empty (TRUNCATE TABLE) after subsetting is complete. These tables will have all data removed but the table structure is preserved.

Truncate patterns use the same pattern types as skip patterns.

Note: TRUNCATE removes all rows but keeps the table structure, indexes, and constraints. It resets auto-increment counters.

truncate:
  # Empty audit log table
  - table: public.audit_log

  # Empty all cache tables
  - glob: cache.*
    case_sensitive: false

  # Empty session tables
  - regex: "app\\.session_\\w+$"

Additional Cascades

The additional_cascades list defines custom relationships between tables. These are relationships not represented by foreign key constraints in the database.

Additional cascades are defined within include patterns for specific tables:

include:
  - table: schema.parent_table
    additional_cascades:
      - source_columns: [col1, col2]      # Columns in this table (parent)
        target_table: schema.child_table  # Related table (child)
        target_columns: [ref1, ref2]      # Columns in related table

Additional cascades can refer to tables in schemas that aren't defined in the schemas section, and refer to tables that have not been explicitly included.

Parameter Type Required? Description
source_columns List of strings Yes List of column names in the source (parent) table (minimum 1 column).
target_table String Yes Fully qualified target (child) table name in schema.table format.
Must be a different table to the source table.
target_columns List of strings Yes List of column names in the target table (minimum 1 column).
Must contain the same number of elements as source_columns.

Ensure the order of the target_columns matches that of source_columns according to the relationship.

For both table and column names, use quoting where required for case-sensitive identifiers.

All specified tables and columns must exist in the database. It is an error to specify an additional cascade that reverses a foreign key, as this makes the relationship direction ambiguous. If an additional cascade is identical to an existing foreign key, a warning is issued and the cascade is ignored.

Example

include:
  - table: '"Orders_Schema"."Orders"'
    additional_cascades:
      - source_columns: [year, '"Order_REF_ID"']
        target_table: '"Accounts_Schema"."Invoices"'
        target_columns: [invoice_year, order_id]

Pattern Precedence Rules

When a table matches multiple patterns, the following precedence rules apply.

Important: The precedence rules are quite complex. It is strongly recommended to perform subsetting with a dry run first (click the Preview Dry Run button) to verify that the included tables and relationships are as you expect.

Precedence of table, glob and regex

When a table matches multiple patterns within the same action (include, skip, drop, or truncate):

  • If a table matches a table pattern, this has the highest precedence. Multiple table matches in the same action are considered an error.

  • If a table matches glob or regex patterns:

    • If there is exactly one matching pattern, that pattern applies.
    • If there are multiple matching patterns with different target_percent overrides, the lowest target_percent is used.
    • Multiple matches with conflicting non-percentage overrides cause an error.

Action precedence

When a table matches patterns in different actions:

  1. drop has highest precedence - if a table matches a drop pattern, it will be dropped regardless of other matches

  2. truncate has second precedence - if a table matches a truncate pattern (and no drop pattern), it will be truncated regardless of include/skip

  3. skip has third precedence:

    • If a table matches both include and skip patterns (glob or regex), skip takes precedence
    • Multiple table pattern matches in both include and skip is an error
  4. include has lowest precedence - used when no other actions match

  5. No match - tables that match no patterns are included in the subset by default, unless they are in a schema not configured for subsetting

Foreign Key Relationship Constraints

In addition to pattern precedence, tables with foreign key relationships must follow action compatibility rules:

  • Child tables (tables with foreign keys referencing parent tables) must have actions that are equally or more destructive than their parent tables
  • Destructiveness order: SKIPINCLUDETRUNCATEDROP (least to most destructive)
  • Circular dependencies: All tables in a circular foreign key relationship must have the same action

Note: These constraints ensure referential integrity is maintained throughout the subsetting process. See Action Limitations with Foreign Key Relationships in the advanced features guide for detailed examples and validation error messages.

Quick Examples

Basic Ruleset

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
      batch_size: 10000              # Optional: Use smaller batches
    include:
      - glob: main_data.*            # Include all tables in main_data schema

With Filtering

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 8
      batch_size: 250000             # Larger batches for better performance
    include:
      - table: ecommerce.customers   # Always include
        target_percent: 15           # Override to 15%
      - glob: ecommerce.customer_*   # Pattern match
      - glob: ecommerce.order*       # Pattern match
    skip:
      - glob: ecommerce.*_temp       # Skip temp tables
      - regex: "ecommerce\\..*_backup$"  # Skip backup tables

With DROP and TRUNCATE

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      - glob: production.*
    skip:
      - glob: production.temp_*
    drop:
      - glob: staging.*              # Remove all staging tables
      - table: production.old_data   # Remove specific table
    truncate:
      - table: production.audit_log  # Empty audit log
      - glob: production.cache_*     # Empty all cache tables

Case-Insensitive Patterns

include:
  - glob: mixed_case_db.user*
    case_sensitive: false            # Matches user*, USER*, User*
skip:
  - regex: "mixed_case_db\\..*_temp$"
    case_sensitive: false            # Matches *_temp, *_TEMP, *_Temp

For complete examples, see Examples and Scenarios.

For troubleshooting, see Troubleshooting.