Advanced Subsetting Features
Learn how to enhance your basic subsetting configuration with advanced features. This guide builds on the Basic Setup and Use tutorial, showing you how to add filtering patterns, per-table overrides, custom relationships, DROP/TRUNCATE operations, and other advanced capabilities to create precisely tailored subsets.
- Building Your Configuration
- Pattern Types in Detail
- Per-Table Overrides
- Skip Patterns
- Drop and Truncate Operations
- Additional Cascades
- Batch Size Configuration
- Configuration Validation
- Complete Example Rulesets
- Next Steps
Building Your Configuration
Building on the configuration from the Basic Setup guide, every DataMasque subsetting configuration follows this structure:
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      # Default settings for all tables
    include:
      # Tables to include (optional)
    skip:
      # Tables to skip (optional)
    drop:
      # Tables to drop entirely (optional)
    truncate:
      # Tables to empty (optional)
Let's explore how to enhance this basic structure with advanced features.
Pattern Types in Detail
All subsetting operations use patterns to match tables. Understanding each pattern type helps you create precise configurations.
Table Patterns (Exact Match)
Table patterns provide exact table name matching and have the highest precedence.
All patterns must use fully-qualified schema.table format:
include:
  # Matches exactly "public.customers" (case follows database rules)
  - table: public.customers
  # Matches precisely "Sales"."Customer_Profiles" (case-sensitive due to quoting)
  - table: '"Sales"."Customer_Profiles"'
  # With override
  - table: analytics.user_events
    target_percent: 5
Use when:
- You want to match specific tables by name
- You need to apply overrides to specific tables
- You want the highest pattern precedence
Glob Patterns
Glob patterns use wildcards for flexible matching:
include:
  # Matches all tables in the customer_data schema
  - glob: customer_data.*
  # Matches "customer_profiles", "customer_orders", etc. in public schema
  - glob: public.customer_*
  # Matches "temp_daily_backup", "temp_weekly_backup", etc.
  - glob: staging.temp_*_backup
  # Case-insensitive matching
  - glob: public.user*
    case_sensitive: false  # Matches user*, USER*, User*
  # With override
  - glob: sales.order_*
    target_percent: 15
Wildcards:
- `*` - Zero or more characters
- `?` - Any single character
- `[abc]` - Character set
Important: The schema prefix must be explicit and cannot use wildcards. Patterns like `glob: "customer_*"` (missing schema) or `glob: "*.temp"` (wildcard schema) will result in validation errors.
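As an illustration of these matching rules, here is a minimal Python sketch (not DataMasque's implementation) that applies a glob pattern to a fully-qualified name, treating the schema prefix as literal and rejecting wildcard schemas:

```python
import fnmatch

def glob_matches(pattern: str, qualified_name: str) -> bool:
    """Match a schema-qualified glob pattern against a schema.table name.

    The schema prefix must be literal; wildcards are only allowed in the
    table part, mirroring the validation rule described above.
    """
    schema_pat, _, table_pat = pattern.partition(".")
    schema, _, table = qualified_name.partition(".")
    if any(ch in schema_pat for ch in "*?["):
        raise ValueError("schema prefix must be explicit (no wildcards)")
    # fnmatchcase applies *, ? and [abc] wildcards, case-sensitively
    return schema == schema_pat and fnmatch.fnmatchcase(table, table_pat)

print(glob_matches("public.customer_*", "public.customer_orders"))      # True
print(glob_matches("staging.temp_*_backup", "staging.temp_daily_backup"))  # True
print(glob_matches("public.customer_*", "sales.customer_orders"))       # False
```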
Use when:
- You want to match multiple tables with similar names
- You need simple pattern matching
- Schema or prefix-based matching is sufficient
Regex Patterns
Regular expressions provide precise matching with full regex syntax:
include:
  # Matches tables starting with "customer_" in public schema
  - regex: "^public\\.customer_.*$"
  # Matches specific table names in public schema
  - regex: "^public\\.(users|customers|orders)$"
  # Matches tables ending with "_archive_" followed by 4 digits (e.g., _archive_2023)
  - regex: "analytics\\..*_archive_\\d{4}$"
  # Case-insensitive matching
  - regex: "public\\.temp_.*"
    case_sensitive: false
  # With override
  - regex: "metrics\\.(analytics|stats)_\\d{4}$"
    target_percent: 1
Important:
- The schema portion of the regex must be explicit - patterns like `".*\\.(users|customers)"` (wildcard schema) are not allowed
- Use anchors `^` (start) and `$` (end) to avoid partial matches
- Escape dots in schema/table names as `\\.` (the backslash itself needs escaping in YAML)
- Escape special regex characters like `\d` as `\\d`
Use when:
- You need precise pattern matching
- Simple glob wildcards aren't sufficient
- You want to match complex naming patterns
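The effect of anchors can be demonstrated with plain Python `re` (illustrative only; DataMasque's matching engine is not shown here):

```python
import re

tables = ["public.users", "public.users_archive", "staging.users"]

# Without anchors, the pattern also matches longer names that merely
# contain "public.users" as a substring.
loose = [t for t in tables if re.search(r"public\.users", t)]

# With ^ and $, only the exact fully-qualified name matches.
exact = [t for t in tables if re.search(r"^public\.users$", t)]

print(loose)  # ['public.users', 'public.users_archive']
print(exact)  # ['public.users']
```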
Pattern Precedence Rules
When a table matches multiple patterns:
1. `table` patterns have the highest precedence - exact matches always win
2. Multiple glob/regex matches - the most restrictive (lowest `target_percent`) applies
3. Action precedence: `drop` > `truncate` > `skip` > `include`
Example:
include:
  - glob: public.customer_*            # Matches customer_data_2015, customer_data_2025
  - table: public.audit_backup         # Exact match
skip:
  - regex: "public\\..*_backup$"       # Matches tables ending with "_backup"
  - regex: "public\\..*_20[01][0-9]$"  # Matches tables ending with years 2000-2019

# Results:
# - public.customer_data_2015 → SKIPPED (matches include AND skip; skip wins)
# - public.customer_data_2025 → INCLUDED (matches include, not skip)
# - public.transactions_backup → SKIPPED (matches skip)
# - public.audit_backup → INCLUDED (exact table match has highest precedence)
For complete precedence rules, see Configuration Reference.
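The precedence rules above can be sketched in Python; the function and rule structures here are illustrative, not DataMasque's API:

```python
# Hedged sketch of the precedence rules: exact table patterns beat
# glob/regex patterns, and among multiple glob/regex matches the lowest
# (most restrictive) target_percent wins. Names are illustrative.
def resolve_target_percent(table_name, table_rules, pattern_rules, default):
    """table_rules: {exact_name: percent}; pattern_rules: [(match_fn, percent)]."""
    if table_name in table_rules:
        return table_rules[table_name]   # exact match: highest precedence
    hits = [pct for match, pct in pattern_rules if match(table_name)]
    return min(hits) if hits else default  # most restrictive pattern wins

rules = {"public.audit_backup": 50}
patterns = [
    (lambda t: t.startswith("public.customer_"), 15),  # like glob: public.customer_*
    (lambda t: t.endswith("_2015"), 5),                # like regex: ..._2015$
]
print(resolve_target_percent("public.customer_data_2015", rules, patterns, 8))  # 5
print(resolve_target_percent("public.audit_backup", rules, patterns, 8))        # 50
```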
Per-Table Overrides
You can override default settings for specific tables by adding fields to include patterns:
defaults:
  target_percent: 8  # Default for all tables
include:
  # Basic patterns use defaults
  - glob: public.user_*
  - glob: public.order_*
  # Override for a specific table
  - table: public.customers
    target_percent: 15  # Keep more customers
  # Override with glob pattern
  - glob: analytics.*
    target_percent: 2  # Keep less analytics data
  # Override with regex
  - regex: "logs\\..*_archive_\\d{4}$"
    target_percent: 1  # Minimal archive data
Available overrides:
- `target_percent` - Override the default percentage (1-99)
Use cases:
- Core tables need higher data retention
- Analytics/logs can be heavily reduced
- Different business domains have different testing needs
Note: The `target_percent` represents the desired retention. Actual retention may be lower if parent tables were heavily subsetted, or higher if NOT NULL foreign key relationships require additional rows. See Retention and Foreign Key Relationships for details.
Skip Patterns
Skip patterns exclude tables from subsetting:
include:
  - glob: public.*                # Include all tables in public schema
skip:
  - table: public.audit_log       # Skip specific table
  - glob: public.temp_*           # Skip all temp tables
  - regex: "public\\..*_backup$"  # Skip backup tables
  - glob: public.*_cache          # Skip cache tables
    case_sensitive: false
Important: Skipped tables will still be included if they have foreign keys to included tables, to maintain referential integrity. See Conflict Resolution.
Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.
Drop and Truncate Operations
Dropping Tables
The drop list specifies tables to completely remove from the database (DROP TABLE):
drop:
  # Drop specific staging table
  - table: staging.import_temp
  # Drop all tables in staging schema
  - glob: staging.*
  # Drop old partition tables
  - regex: "data\\.events_\\d{4}_\\d{2}$"  # Matches: data.events_2023_01, etc.
  # Drop with case-insensitive matching
  - glob: temp_schema.*
    case_sensitive: false
Warning: `drop` is a destructive operation that permanently removes both table structure and data. Always take a backup before running subsetting with drop patterns.

Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.
Truncating Tables
The truncate list specifies tables to empty (TRUNCATE TABLE) after subsetting:
truncate:
  # Empty audit log table
  - table: public.audit_log
  # Empty all cache tables
  - glob: cache.*
  # Empty session tables
  - regex: "app\\.session_\\w+$"
  # Case-insensitive truncate
  - glob: public.temp_*
    case_sensitive: false
Note: `truncate` removes all rows but keeps table structure, indexes, and constraints. It is faster than DELETE and resets auto-increment counters.

Note: Tables that have foreign keys have limitations on which actions can be used. See Action Limitations with Foreign Key Relationships for details.
Use Cases
Dropping tables:
- Remove staging/temporary schemas entirely
- Clean up old partition tables
- Remove development-only tables before promoting to higher environments
Truncating tables:
- Clear audit logs while preserving structure
- Empty cache/session tables
- Reset counters while keeping table definitions
Example combining both:
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      - glob: production.*
    skip:
      - glob: production.temp_*
    drop:
      - glob: staging.*              # Remove entire staging schema
      - table: production.old_data   # Remove specific obsolete table
    truncate:
      - table: production.audit_log  # Keep structure, remove data
      - glob: production.cache_*     # Empty all cache tables
Action Limitations with Foreign Key Relationships
When tables have foreign key relationships, the actions you assign must be compatible to maintain referential integrity. For example, you cannot skip a child table while dropping its parent table, as the child records would reference parent rows that no longer exist.
Important: Child tables (tables with foreign keys referencing parent tables) must have actions that are equally or more destructive than their parent tables. The destructiveness order is: SKIP → INCLUDE → TRUNCATE → DROP (least to most destructive).
Understanding the Constraint
DataMasque validates your configuration to ensure that foreign key relationships remain valid. If a parent table is configured with a particular action, any child tables that reference it must use an action of equal or greater destructiveness.
Valid action combinations:
- Parent: SKIP → Child: SKIP, INCLUDE, TRUNCATE, or DROP
- Parent: INCLUDE → Child: INCLUDE, TRUNCATE, or DROP
- Parent: TRUNCATE → Child: TRUNCATE or DROP
- Parent: DROP → Child: DROP
Invalid action combinations:
- Parent: INCLUDE → Child: SKIP
- Parent: TRUNCATE → Child: SKIP or INCLUDE
- Parent: DROP → Child: SKIP, INCLUDE, or TRUNCATE
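The compatibility rule above boils down to a simple ordering check; the following sketch (illustrative, not DataMasque code) makes it concrete:

```python
# Destructiveness ordering from the text: SKIP → INCLUDE → TRUNCATE → DROP
# (least to most destructive). A child's action is valid if it is at least
# as destructive as its parent's.
DESTRUCTIVENESS = {"SKIP": 0, "INCLUDE": 1, "TRUNCATE": 2, "DROP": 3}

def actions_compatible(parent_action: str, child_action: str) -> bool:
    """Return True if the child table's action is allowed given the parent's."""
    return DESTRUCTIVENESS[child_action] >= DESTRUCTIVENESS[parent_action]

print(actions_compatible("INCLUDE", "DROP"))      # True  (child more destructive)
print(actions_compatible("TRUNCATE", "INCLUDE"))  # False (child less destructive)
print(actions_compatible("DROP", "DROP"))         # True  (equal destructiveness)
```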
Circular Foreign Key Relationships
For tables with circular foreign key relationships, all tables within the cycle must have the same action. This is to protect foreign key relationships, which would otherwise need to be left disabled or dropped to allow for one of the tables in the cycle to have rows missing after subsetting.
Validation Errors
If you configure incompatible actions, you'll receive a validation error during configuration validation or at the start of subsetting:
Cannot apply 'INCLUDE' action to table `sales.order_items`
when parent table `sales.orders` is marked for 'TRUNCATE'.
Child actions must be equally or more destructive.
Resolution: Adjust your configuration so child tables have actions equal to or more destructive than their parents, or ensure all tables in circular relationships use the same action.
Additional Cascades
Use additional cascades to define custom relationships that aren't declared as foreign key constraints in the database; these are often referred to as "logical" relationships.
Additional cascades are defined within include patterns for specific tables:
include:
  # Table with custom relationship
  - table: customer_data.customers
    additional_cascades:
      # Link customers to external logging system
      - source_columns: [customer_id]
        target_table: logging_system.user_activity
        target_columns: [user_ref_id]
  # Another table with multiple cascades
  - table: order_data.orders
    additional_cascades:
      # Connect orders to warehouse system
      - source_columns: [order_id]
        target_table: warehouse_system.shipments
        target_columns: [order_reference]
      # Multi-column relationship
      - source_columns: [order_id, line_number]
        target_table: external_fulfillment.order_lines
        target_columns: [ext_order_id, ext_line_num]
Requirements:
- Only `table` patterns support `additional_cascades` (not `glob` or `regex`)
- The source table is the parent in the relationship
- Target table is the child in the relationship
- Column count must match between source and target
- Column order matters for composite keys
- All tables and columns must exist in the database
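A minimal sketch of the column-count and distinct-table requirements above (illustrative only; DataMasque's validator is not shown):

```python
# Sketch of the cascade requirements: source and target column lists must
# have matching lengths, and the source and target tables must differ.
# Columns are paired positionally, so order matters for composite keys.
def validate_cascade(source_table, source_columns, target_table, target_columns):
    if source_table == target_table:
        raise ValueError("cascade must have distinct source and target tables")
    if len(source_columns) != len(target_columns):
        raise ValueError("source and target column counts must match")
    return list(zip(source_columns, target_columns))  # positional pairing

pairs = validate_cascade(
    "order_data.orders", ["order_id", "line_number"],
    "external_fulfillment.order_lines", ["ext_order_id", "ext_line_num"],
)
print(pairs)  # [('order_id', 'ext_order_id'), ('line_number', 'ext_line_num')]
```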
For complete additional cascades documentation, see Configuration Reference.
Batch Size Configuration
The batch_size parameter in defaults provides fine-grained control over subsetting performance and memory usage.
This parameter determines how many rows are processed in each batch during subsetting operations.
Configuration Options
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
      batch_size: 25000  # Override default of 50,000
    include:
      - glob: large_data_schema.*
Precedence and Defaults
The batch size configuration follows this precedence order:
1. Ruleset-level: `defaults.batch_size` (highest priority)
2. Run options level: set via the UI or API run options (default: 50,000)

When `batch_size` is specified in the ruleset, it overrides the run options setting. This allows different rulesets to use different batch sizes based on their specific requirements.
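This precedence amounts to a one-line lookup; the sketch below is illustrative, with assumed dictionary-style inputs rather than DataMasque's actual configuration objects:

```python
# Sketch of batch-size precedence: a value in the ruleset's defaults
# overrides the run-options value; otherwise the run-options value
# (default 50,000) applies.
def effective_batch_size(ruleset_defaults: dict, run_options_value: int = 50_000) -> int:
    return ruleset_defaults.get("batch_size", run_options_value)

print(effective_batch_size({"target_percent": 10, "batch_size": 25_000}))  # 25000
print(effective_batch_size({"target_percent": 10}))                        # 50000
```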
Performance Considerations
- Memory usage: Smaller batches use less memory but may increase total processing time
- Database load: Larger batches reduce the number of database round trips
- Progress visibility: Smaller batches provide more frequent progress updates in the logs
- Error recovery: Smaller batches can make it easier to identify problematic data ranges
Example for memory-constrained environments:
defaults:
  target_percent: 5
  batch_size: 10000  # Small batches for large tables
Example for performance optimization:
defaults:
  target_percent: 15
  batch_size: 100000  # Large batches for speed
Configuration Validation
DataMasque validates your configuration before processing:
- All patterns must use fully-qualified `schema.table` format
- Include patterns must match at least one table
- For additional cascades:
  - each additional cascade references fully-qualified tables
  - the tables exist, and the specified columns in those tables also exist
  - the additional cascade does not reverse or duplicate a foreign key relationship (see below)
  - the additional cascade has distinct source and target tables
- There are no chains of foreign keys creating a circular dependency
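The circular-dependency check in the last item can be illustrated with a standard depth-first search; this sketch and its graph representation are assumptions for demonstration, not DataMasque internals:

```python
# Detect a cycle in a foreign-key graph with depth-first search.
# fk_graph maps a table to the tables it has foreign keys pointing to.
def find_cycle(fk_graph: dict) -> bool:
    visiting, done = set(), set()

    def dfs(node):
        if node in visiting:
            return True          # back edge: a cycle exists
        if node in done:
            return False         # already fully explored
        visiting.add(node)
        if any(dfs(nxt) for nxt in fk_graph.get(node, [])):
            return True
        visiting.discard(node)
        done.add(node)
        return False

    return any(dfs(node) for node in fk_graph)

print(find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]}))  # True  (a → b → c → a)
print(find_cycle({"a": ["b"], "b": ["c"]}))              # False (simple chain)
```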
Clarification for additional cascades
Additional cascades that reverse a foreign key relationship cause an error, as it's not clear in which direction the relationship should be processed.
Additional cascades that duplicate a foreign key relationship that was already discovered (exactly the same tables and columns, regardless of the order of the columns) are ignored in favour of the discovered foreign key.
The following example additional cascades demonstrate this.
Assume that there are two tables which have been included for subsetting,
schema.customers (parent) and schema.orders (child),
with a foreign key between them: order_id in orders maps to customer_id in customers.
include:
  - table: schema.customers
    additional_cascades:
      # This cascade will be ignored as it duplicates the existing foreign key
      - source_columns: [customer_id]
        target_table: schema.orders
        target_columns: [order_id]
      # This cascade specification is an error,
      # as it reverses the direction of the existing foreign key
      # (orders -> customers instead of customers -> orders)
      - source_columns: [my_column]
        target_table: schema.orders
        target_columns: [my_column]
Cross-Schema Relationships
DataMasque automatically discovers relationships across schemas and includes necessary tables.
Example:
include:
  - glob: customer_data.*  # Contains customers table
  - glob: order_data.*     # Contains orders referencing customers
# Note: payments_data not included
If order_data.orders has a foreign key to payments_data.payments,
DataMasque will automatically include payments_data.payments
and show a message in the logs about the conflict resolution.
Conflict Resolution
When foreign key relationships are discovered,
but the child tables match a skip pattern or aren't explicitly included,
these tables will still be included in the subsetting process
to maintain referential integrity.
In the run log, you'll see:
INFO: ==== SUBSETTING CONFLICTS RESOLVED ====
INFO: Table `payments_data.payments` is not matched by include patterns,
but was automatically added to preserve FK relationship with `order_data.orders`.
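The automatic inclusion described above amounts to a transitive closure over foreign-key edges; the following sketch is illustrative, with an assumed (child, parent) edge representation rather than DataMasque's internal model:

```python
# Starting from the explicitly included tables, follow FK edges and pull
# in any referenced parent tables the include patterns missed, repeating
# until no new tables are added.
def close_over_fks(included: set, fk_edges: list) -> set:
    """fk_edges: (child, parent) pairs; parents of included children are added."""
    result = set(included)
    changed = True
    while changed:
        changed = False
        for child, parent in fk_edges:
            if child in result and parent not in result:
                result.add(parent)
                changed = True
    return result

included = {"customer_data.customers", "order_data.orders"}
edges = [("order_data.orders", "payments_data.payments")]
print(sorted(close_over_fks(included, edges)))
# ['customer_data.customers', 'order_data.orders', 'payments_data.payments']
```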
Dry Runs
By clicking the Preview Dry Run button rather than the Preview Run button on the Database Masking screen, you can execute a dry run. This allows you to validate your configuration is correct: the run log shows exactly what operations the subsetting task would perform, but DataMasque does not actually make any changes to the database.
Complete Example Rulesets
These examples show how to combine multiple configuration features to create precisely tailored subsets for different use cases.
E-commerce Platform with Multiple Concerns
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 8
    include:
      # Core customer data - keep more for realistic testing
      - table: customer_management.customers
        target_percent: 15
      - glob: customer_management.customer_*
      - regex: "customer_management\\.user_profiles$"
      # Order processing - standard subset
      - glob: order_processing.order*
      - table: order_processing.shipments
      # Financial data
      - table: financial_data.payments
      - table: financial_data.invoices
      # Analytics - minimal data
      - glob: analytics.user_events
        target_percent: 2
      - glob: analytics.product_views
        target_percent: 2
      # Custom relationships
      - table: customer_management.customers
        additional_cascades:
          - source_columns: [customer_id]
            target_table: external_crm.contact_history
            target_columns: [customer_external_id]
      - table: order_processing.orders
        additional_cascades:
          - source_columns: [order_id]
            target_table: warehouse_system.shipment_tracking
            target_columns: [order_reference]
    skip:
      - glob: customer_management.*_audit
      - glob: customer_management.*_archive
      - glob: order_processing.*_temp
      - glob: financial_data.tax_*
      - regex: "analytics\\..*_daily_.*"
    drop:
      - glob: staging.*  # Remove entire staging schema
    truncate:
      - table: analytics.raw_events        # Keep structure, clear data
      - glob: customer_management.*_cache  # Clear cache tables
Development Environment Optimization
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 5
      batch_size: 30000  # Moderate batch size
    include:
      # Core business logic tables with higher retention
      - table: application_core.users
        target_percent: 12
      - table: application_core.accounts
        target_percent: 3
      - regex: "application_core\\.(users|accounts|profiles)$"
    skip:
      # Skip session, cache, backup and test tables
      - glob: application_core.*_session*
      - glob: application_core.*_cache*
      - glob: application_core.*_backup
      - regex: "application_core\\..*_test_.*"
      # Skip all reference data tables
      - glob: reference_data.*
      # Skip all audit/logging tables
      - glob: audit_logging.*
    drop:
      - glob: temp_data.*  # Remove temp schema
      - table: application_core.old_migrations
    truncate:
      - glob: application_core.session_*  # Clear sessions
      - table: audit_logging.api_calls    # Clear logs but keep structure
Next Steps
- See Examples and Scenarios for more complete real-world configurations.
- Review Configuration Reference for detailed field documentation.
- Check Troubleshooting if you encounter issues.