DataMasque Portal

Advanced Subsetting Features

Learn how to enhance your basic subsetting configuration with advanced features. This guide builds on the Basic Setup and Use tutorial, showing you how to add filtering patterns, per-table overrides, custom relationships, DROP/TRUNCATE operations, and other advanced capabilities to create precisely tailored subsets.

Building Your Configuration

Every DataMasque subsetting configuration, including the one built in the Basic Setup guide, follows this structure:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      # Default settings for all tables
    include:
      # Tables to include (optional)
    skip:
      # Tables to skip (optional)
    drop:
      # Tables to drop entirely (optional)
    truncate:
      # Tables to empty (optional)

Let's explore how to enhance this basic structure with advanced features.

Pattern Types in Detail

All subsetting operations use patterns to match tables. Understanding each pattern type helps you create precise configurations.

Table Patterns (Exact Match)

Table patterns provide exact table name matching and have the highest precedence. All patterns must use fully-qualified schema.table format:

include:
  # Matches exactly "public.customers" (case follows database rules)
  - table: public.customers

  # Matches precisely "Sales"."Customer_Profiles" (case-sensitive due to quoting)
  - table: '"Sales"."Customer_Profiles"'

  # With override
  - table: analytics.user_events
    target_percent: 5

Use when:

  • You want to match specific tables by name
  • You need to apply overrides to specific tables
  • You want the highest pattern precedence

Glob Patterns

Glob patterns use wildcards for flexible matching:

include:
  # Matches all tables in the customer_data schema
  - glob: customer_data.*

  # Matches "customer_profiles", "customer_orders", etc. in public schema
  - glob: public.customer_*

  # Matches "temp_daily_backup", "temp_weekly_backup", etc.
  - glob: staging.temp_*_backup

  # Case-insensitive matching
  - glob: public.user*
    case_sensitive: false     # Matches user*, USER*, User*

  # With override
  - glob: sales.order_*
    target_percent: 15

Wildcards:

  • * - Zero or more characters
  • ? - Any single character
  • [abc] - Character set
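
These wildcard semantics follow standard shell-style globbing. As an illustration only (not DataMasque's actual matcher), Python's `fnmatch` module behaves the same way:

```python
from fnmatch import fnmatchcase

# '*' matches zero or more characters
assert fnmatchcase("public.customer_orders", "public.customer_*")
assert fnmatchcase("public.customer_", "public.customer_*")   # zero characters also match

# '?' matches exactly one character
assert fnmatchcase("public.log_a", "public.log_?")
assert not fnmatchcase("public.log_ab", "public.log_?")

# '[abc]' matches one character from the set
assert fnmatchcase("public.shard_a", "public.shard_[abc]")
assert not fnmatchcase("public.shard_d", "public.shard_[abc]")
```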

Important: The schema prefix must be explicit and cannot use wildcards. Patterns like glob: "customer_*" (missing schema) or glob: "*.temp" (wildcard schema) will result in validation errors.

Use when:

  • You want to match multiple tables with similar names
  • You need simple pattern matching
  • Schema or prefix-based matching is sufficient

Regex Patterns

Regular expressions provide precise matching with full regex syntax:

include:
  # Matches tables starting with "customer_" in public schema
  - regex: "^public\\.customer_.*$"

  # Matches specific table names in public schema
  - regex: "^public\\.(users|customers|orders)$"

  # Matches tables ending with "_archive_" followed by 4 digits (e.g., _archive_2023)
  - regex: "analytics\\..*_archive_\\d{4}$"

  # Case-insensitive matching
  - regex: "public\\.temp_.*"
    case_sensitive: false

  # With override
  - regex: "metrics\\.(analytics|stats)_\\d{4}$"
    target_percent: 1

Important:

  • The schema portion of the regex must be explicit - patterns like ".*\\.(users|customers)" (wildcard schema) are not allowed
  • Use anchors ^ (start) and $ (end) to avoid partial matches
  • Escape dots in schema/table names as \\. (backslash itself needs escaping in YAML)
  • Escape special regex characters like \d as \\d
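
The double escaping comes from two layers: the YAML parser turns `\\.` into `\.`, and the regex engine then reads `\.` as a literal dot. A quick demonstration with Python's `re` module (used here purely for illustration; DataMasque's regex dialect may differ in minor details):

```python
import re

# What the regex engine receives after YAML unescapes "\\." to "\.":
pattern = r"^public\.customer_.*$"
assert re.fullmatch(pattern, "public.customer_orders")

# Without the escaped dot, '.' matches ANY character, so an
# unintended table name would also match:
loose = r"^publiccustomer_.*$".replace("public", "public.")
assert re.fullmatch(loose, "publicXcustomer_orders")

# Anchors (^ and $) prevent partial matches:
assert not re.search(pattern, "archive.public.customer_orders")
```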

Use when:

  • You need precise pattern matching
  • Simple glob wildcards aren't sufficient
  • You want to match complex naming patterns

Pattern Precedence Rules

When a table matches multiple patterns:

  1. table patterns have highest precedence - exact matches always win
  2. Multiple glob/regex matches - the most restrictive (lowest target_percent) applies
  3. Action precedence: drop > truncate > skip > include

Example:

include:
  - glob: public.customer_*      # Matches customer_data_2015, customer_data_2025
  - table: public.audit_backup   # Exact match
skip:
  - regex: "public\\..*_backup$" # Matches tables ending with "_backup"
  - regex: "public\\..*_20[01][0-9]$"  # Matches tables ending with year 2000-2019

# Results:
# - public.customer_data_2015 → SKIPPED (matches include AND skip; skip wins)
# - public.customer_data_2025 → INCLUDED (matches include, not skip)
# - public.transactions_backup → SKIPPED (matches skip)
# - public.audit_backup → INCLUDED (exact table match has highest precedence)
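
The resolution above can be sketched as a small precedence function. This is a simplified illustration, not DataMasque's implementation: exact `table` matches win outright, and among glob/regex matches, skip beats include.

```python
import re
from fnmatch import fnmatchcase

include_tables = {"public.audit_backup"}
include_globs = ["public.customer_*"]
skip_regexes = [r"public\..*_backup$", r"public\..*_20[01][0-9]$"]

def action(table: str) -> str:
    if table in include_tables:                 # exact match: highest precedence
        return "INCLUDED"
    if any(re.search(p, table) for p in skip_regexes):  # skip beats include
        return "SKIPPED"
    if any(fnmatchcase(table, g) for g in include_globs):
        return "INCLUDED"
    return "NOT MATCHED"

assert action("public.customer_data_2015") == "SKIPPED"
assert action("public.customer_data_2025") == "INCLUDED"
assert action("public.transactions_backup") == "SKIPPED"
assert action("public.audit_backup") == "INCLUDED"
```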

For complete precedence rules, see Configuration Reference.

Per-Table Overrides

You can override default settings for specific tables by adding fields to include patterns:

defaults:
  target_percent: 8      # Default for all tables

include:
  # Basic patterns use defaults
  - glob: public.user_*
  - glob: public.order_*

  # Override for specific table
  - table: public.customers
    target_percent: 15   # Keep more customers

  # Override with glob pattern
  - glob: analytics.*
    target_percent: 2    # Keep less analytics data

  # Override with regex
  - regex: "logs\\..*_archive_\\d{4}$"
    target_percent: 1    # Minimal archive data

Available overrides:

  • target_percent - Override the default percentage (1-99)

Use cases:

  • Core tables need higher data retention
  • Analytics/logs can be heavily reduced
  • Different business domains have different testing needs

Note: The target_percent represents the desired retention. Actual retention may be lower if parent tables were heavily subsetted, or higher if NOT NULL foreign key relationships require additional rows. See Retention and Foreign Key Relationships for details.

Skip Patterns

Skip patterns exclude tables from subsetting:

include:
  - glob: public.*      # Include all tables in public schema

skip:
  - table: public.audit_log           # Skip specific table
  - glob: public.temp_*               # Skip all temp tables
  - regex: "public\\..*_backup$"      # Skip backup tables
  - glob: public.*_cache              # Skip cache tables
    case_sensitive: false

Important: Skipped tables will still be included if they have foreign keys to included tables, to maintain referential integrity. See Conflict Resolution.

Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.

Drop and Truncate Operations

Dropping Tables

The drop list specifies tables to completely remove from the database (DROP TABLE):

drop:
  # Drop specific staging table
  - table: staging.import_temp

  # Drop all tables in staging schema
  - glob: staging.*

  # Drop old partition tables
  - regex: "data\\.events_\\d{4}_\\d{2}$"  # Matches: data.events_2023_01, etc.

  # Drop with case-insensitive matching
  - glob: temp_schema.*
    case_sensitive: false

Warning: drop is a destructive operation that permanently removes both table structure and data. Always take a backup before running subsetting with drop patterns.

Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.

Truncating Tables

The truncate list specifies tables to empty (TRUNCATE TABLE) after subsetting:

truncate:
  # Empty audit log table
  - table: public.audit_log

  # Empty all cache tables
  - glob: cache.*

  # Empty session tables
  - regex: "app\\.session_\\w+$"

  # Case-insensitive truncate
  - glob: public.temp_*
    case_sensitive: false

Note: truncate removes all rows but keeps table structure, indexes, and constraints. It is faster than DELETE and resets auto-increment counters.

Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.

Use Cases

Dropping tables:

  • Remove staging/temporary schemas entirely
  • Clean up old partition tables
  • Remove development-only tables before promoting to higher environments

Truncating tables:

  • Clear audit logs while preserving structure
  • Empty cache/session tables
  • Reset counters while keeping table definitions

Example combining both:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      - glob: production.*
    skip:
      - glob: production.temp_*
    drop:
      - glob: staging.*              # Remove entire staging schema
      - table: production.old_data   # Remove specific obsolete table
    truncate:
      - table: production.audit_log  # Keep structure, remove data
      - glob: production.cache_*     # Empty all cache tables

Action Limitations with Foreign Key Relationships

When tables have foreign key relationships, the actions you assign must be compatible to maintain referential integrity. For example, you cannot skip a child table while dropping its parent table, as the child records would reference parent rows that no longer exist.

Important: Child tables (tables with foreign keys referencing parent tables) must have actions that are equally or more destructive than their parent tables. The destructiveness order, from least to most destructive, is: SKIP → INCLUDE → TRUNCATE → DROP.

Understanding the Constraint

DataMasque validates your configuration to ensure that foreign key relationships remain valid. If a parent table is configured with a particular action, any child tables that reference it must use an action of equal or greater destructiveness.

Valid action combinations:

  • Parent: SKIP → Child: SKIP, INCLUDE, TRUNCATE, or DROP
  • Parent: INCLUDE → Child: INCLUDE, TRUNCATE, or DROP
  • Parent: TRUNCATE → Child: TRUNCATE or DROP
  • Parent: DROP → Child: DROP

Invalid action combinations:

  • Parent: INCLUDE → Child: SKIP
  • Parent: TRUNCATE → Child: SKIP or INCLUDE
  • Parent: DROP → Child: SKIP, INCLUDE, or TRUNCATE
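
A compatibility check consistent with these rules can be sketched as follows. This is a hypothetical validator, assuming a numeric destructiveness ranking, not DataMasque's actual code:

```python
# Destructiveness ranking, least to most destructive.
RANK = {"SKIP": 0, "INCLUDE": 1, "TRUNCATE": 2, "DROP": 3}

def child_action_valid(parent: str, child: str) -> bool:
    # A child's action must be equally or more destructive than its parent's.
    return RANK[child] >= RANK[parent]

assert child_action_valid("SKIP", "DROP")          # parent SKIP allows anything
assert child_action_valid("INCLUDE", "TRUNCATE")
assert not child_action_valid("INCLUDE", "SKIP")   # child less destructive: invalid
assert not child_action_valid("DROP", "TRUNCATE")
```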

Circular Foreign Key Relationships

For tables with circular foreign key relationships, all tables within the cycle must have the same action. This protects the foreign key constraints, which would otherwise need to be disabled or dropped to allow one table in the cycle to be left with missing rows after subsetting.

Validation Errors

If you configure incompatible actions, you'll receive a validation error during configuration validation or at the start of subsetting:

Cannot apply 'INCLUDE' action to table `sales.order_items`
when parent table `sales.orders` is marked for 'TRUNCATE'.
Child actions must be equally or more destructive.

Resolution: Adjust your configuration so child tables have actions equal to or more destructive than their parents, or ensure all tables in circular relationships use the same action.

Additional Cascades

Use additional cascades to define custom relationships that aren't enforced as foreign key constraints in the database; these are often referred to as "logical" relationships.

Additional cascades are defined within include patterns for specific tables:

include:
  # Table with custom relationship
  - table: customer_data.customers
    additional_cascades:
      # Link customers to external logging system
      - source_columns: [customer_id]
        target_table: logging_system.user_activity
        target_columns: [user_ref_id]

  # Another table with multiple cascades
  - table: order_data.orders
    additional_cascades:
      # Connect orders to warehouse system
      - source_columns: [order_id]
        target_table: warehouse_system.shipments
        target_columns: [order_reference]

      # Multi-column relationship
      - source_columns: [order_id, line_number]
        target_table: external_fulfillment.order_lines
        target_columns: [ext_order_id, ext_line_num]

Requirements:

  • Only table patterns support additional_cascades (not glob or regex)
  • Source table is the parent in the relationship
  • Target table is the child in the relationship
  • Column count must match between source and target
  • Column order matters for composite keys
  • All tables and columns must exist in the database
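
The column rules can be illustrated with a minimal sketch (a hypothetical validator, not the product's code): counts must match, and columns are paired positionally, which is why order matters for composite keys.

```python
def validate_cascade(source_columns, target_columns):
    # Column count must match between source and target.
    if len(source_columns) != len(target_columns):
        raise ValueError("source/target column counts differ")
    # Columns are paired by position, so order matters for composite keys.
    return list(zip(source_columns, target_columns))

pairs = validate_cascade(["order_id", "line_number"],
                         ["ext_order_id", "ext_line_num"])
assert pairs == [("order_id", "ext_order_id"),
                 ("line_number", "ext_line_num")]
```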

For complete additional cascades documentation, see Configuration Reference.

Batch Size Configuration

The batch_size parameter in defaults provides fine-grained control over subsetting performance and memory usage. This parameter determines how many rows are processed in each batch during subsetting operations.

Configuration Options

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
      batch_size: 25000              # Override default of 50,000
    include:
      - glob: large_data_schema.*

Precedence and Defaults

The batch size configuration follows this precedence order:

  1. Ruleset-level: defaults.batch_size (highest priority)
  2. Run options level: Set via the UI or API run options (default: 50,000)

When batch_size is specified in the ruleset, it overrides the run options setting. This allows different rulesets to use different batch sizes based on their specific requirements.

Performance Considerations

  • Memory usage: Smaller batches use less memory but may increase total processing time
  • Database load: Larger batches reduce the number of database round trips
  • Progress visibility: Smaller batches provide more frequent progress updates in the logs
  • Error recovery: Smaller batches can make it easier to identify problematic data ranges
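
The round-trip trade-off can be seen in a simple batching sketch (illustrative only; DataMasque's internals differ): a smaller batch_size means more batches, and therefore more database round trips, but a smaller in-memory working set per batch.

```python
def iter_batches(row_ids, batch_size):
    # Yield row IDs in fixed-size batches, as a subsetting pass might.
    for start in range(0, len(row_ids), batch_size):
        yield row_ids[start:start + batch_size]

rows = list(range(120_000))

# batch_size 50,000 (the run-options default) -> 3 batches
assert sum(1 for _ in iter_batches(rows, 50_000)) == 3

# batch_size 10,000 -> 12 batches, but each holds far fewer rows
assert sum(1 for _ in iter_batches(rows, 10_000)) == 12
```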

Example for memory-constrained environments:

defaults:
  target_percent: 5
  batch_size: 10000               # Small batches for large tables

Example for performance optimization:

defaults:
  target_percent: 15
  batch_size: 100000              # Large batches for speed

Configuration Validation

DataMasque validates your configuration before processing:

  • All patterns must use fully-qualified schema.table format
  • Include patterns must match at least one table
  • For additional cascades:
    • each additional cascade references fully-qualified tables
    • the tables exist, and the specified columns in those tables also exist
    • the additional cascade does not reverse or duplicate a foreign key relationship (see below)
    • the additional cascade has distinct source and target tables
  • There are no chains of foreign keys creating a circular dependency

Clarification for additional cascades

Additional cascades that reverse a foreign key relationship cause an error, as it's not clear in which direction the relationship should be processed.

Additional cascades that duplicate a foreign key relationship that was already discovered (exactly the same tables and columns, regardless of the order of the columns) are ignored in favour of the discovered foreign key.

The following example additional cascades demonstrate this. Assume that there are two tables which have been included for subsetting, schema.customers (parent) and schema.orders (child), with a foreign key between them: order_id in orders maps to customer_id in customers.

include:
  - table: schema.customers
    additional_cascades:
      # This cascade will be ignored as it duplicates the existing foreign key
      - source_columns: [customer_id]
        target_table: schema.orders
        target_columns: [order_id]

      # This cascade specification is an error,
      # as it reverses the direction of the existing foreign key
      # (orders -> customers instead of customers -> orders)
      - source_columns: [my_column]
        target_table: schema.orders
        target_columns: [my_column]

Cross-Schema Relationships

DataMasque automatically discovers relationships across schemas and includes necessary tables.

Example:

include:
  - glob: customer_data.*     # Contains customers table
  - glob: order_data.*        # Contains orders referencing customers
  # Note: payments_data not included

If order_data.orders has a foreign key to payments_data.payments, DataMasque will automatically include payments_data.payments and show a message in the logs about the conflict resolution.

Conflict Resolution

When foreign key relationships are discovered, but the child tables match a skip pattern or aren't explicitly included, these tables will still be included in the subsetting process to maintain referential integrity.

In the run log, you'll see:

INFO: ==== SUBSETTING CONFLICTS RESOLVED ====
INFO: Table `payments_data.payments` is not matched by include patterns,
      but was automatically added to preserve FK relationship with `order_data.orders`.
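
Conceptually, this resolution is a transitive closure over the foreign key graph: starting from the included tables, any table reachable through a foreign key is pulled in. A sketch with assumed table names (not the actual algorithm):

```python
# Foreign key edges, child -> referenced tables (assumed example schema).
fk_targets = {
    "order_data.orders": ["customer_data.customers", "payments_data.payments"],
}

def resolve_includes(included):
    # Pull in every table reachable via FK edges to keep referential integrity.
    resolved, stack = set(included), list(included)
    while stack:
        table = stack.pop()
        for target in fk_targets.get(table, []):
            if target not in resolved:
                resolved.add(target)
                stack.append(target)
    return resolved

result = resolve_includes({"customer_data.customers", "order_data.orders"})
assert "payments_data.payments" in result   # auto-added to preserve the FK
```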

Dry Runs

By clicking the Preview Dry Run button rather than the Preview Run button on the Database Masking screen, you can execute a dry run. This lets you validate that your configuration is correct: the run log shows exactly which operations the subsetting task would perform, but DataMasque does not make any changes to the database.


Complete Example Rulesets

These examples show how to combine multiple configuration features to create precisely tailored subsets for different use cases.

E-commerce Platform with Multiple Concerns

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 8

    include:
      # Core customer data - keep more for realistic testing
      - table: customer_management.customers
        target_percent: 15
      - glob: customer_management.customer_*
      - regex: "customer_management\\.user_profiles$"

      # Order processing - standard subset
      - glob: order_processing.order*
      - table: order_processing.shipments

      # Financial data
      - table: financial_data.payments
      - table: financial_data.invoices

      # Analytics - minimal data
      - glob: analytics.user_events
        target_percent: 2
      - glob: analytics.product_views
        target_percent: 2

      # Custom relationships
      - table: customer_management.customers
        additional_cascades:
          - source_columns: [customer_id]
            target_table: external_crm.contact_history
            target_columns: [customer_external_id]

      - table: order_processing.orders
        additional_cascades:
          - source_columns: [order_id]
            target_table: warehouse_system.shipment_tracking
            target_columns: [order_reference]

    skip:
      - glob: customer_management.*_audit
      - glob: customer_management.*_archive
      - glob: order_processing.*_temp
      - glob: financial_data.tax_*
      - regex: "analytics\\..*_daily_.*"

    drop:
      - glob: staging.*                    # Remove entire staging schema

    truncate:
      - table: analytics.raw_events        # Keep structure, clear data
      - glob: customer_management.*_cache  # Clear cache tables

Development Environment Optimization

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 5
      batch_size: 30000              # Moderate batch size

    include:
      # Core business logic tables with per-table retention overrides
      - table: application_core.users
        target_percent: 12
      - table: application_core.accounts
        target_percent: 3
      - regex: "application_core\\.(users|accounts|profiles)$"

    skip:
      # Skip session, cache, backup and test tables
      - glob: application_core.*_session*
      - glob: application_core.*_cache*
      - glob: application_core.*_backup
      - regex: "application_core\\..*_test_.*"

      # Skip all reference data tables
      - glob: reference_data.*

      # Skip all audit/logging tables
      - glob: audit_logging.*

    drop:
      - glob: temp_data.*                  # Remove temp schema
      - table: application_core.old_migrations

    truncate:
      - glob: application_core.session_*   # Clear sessions
      - table: audit_logging.api_calls     # Clear logs but keep structure

Next Steps