Configuration Reference
Complete reference for DataMasque subsetting ruleset YAML structure, field definitions, and validation rules.
Ruleset Structure
Square brackets [] mean the value is a list.
version: "1.0" # Required: Ruleset schema version
tasks: # Required: Array of tasks
- type: subset # Required: Task type
mode: in_place # Required: Processing mode
defaults: # Required: Default settings
target_percent: integer # Required: Target percentage (1-99)
batch_size: integer # Optional: Batch size (1-1,000,000)
include: [pattern] # Optional: Tables to include
skip: [pattern] # Optional: Tables to skip
drop: [pattern] # Optional: Tables to drop entirely
truncate: [pattern] # Optional: Tables to empty
Pattern Structure
Each pattern in include, skip, drop, and truncate lists must contain exactly one of:
table: Exact fully-qualified table name (e.g.,schema.tablename)regex: Regular expression patternglob: Glob pattern with wildcards (*,?,[abc])
Patterns in include can also override default settings:
include:
- table: schema.tablename # Basic pattern
target_percent: 5 # Optional override
additional_cascades: [...] # Optional custom relationships
- glob: schema.orders_* # Pattern with override
target_percent: 15
Task Properties
The following sections describe the properties available for subsetting tasks.
Subset task (subset)
The subset task reduces the size of databases while maintaining referential integrity.
| Parameter | Type | Required? | Description |
|---|---|---|---|
type |
String | Yes | Must be "subset". |
mode |
String | Yes | Processing mode. Currently only "in_place" is supported, meaning the subset data will replace the original data. |
defaults |
Object | Yes | Default configuration for the subsetting process. See Default Configuration. |
include |
List of include patterns | No* | Tables to include in the subset. See Include Patterns. |
skip |
List of table patterns | No* | Tables to skip during subsetting. See Skip Patterns. |
drop |
List of table patterns | No* | Tables to drop entirely from the database. See Drop Patterns. |
truncate |
List of table patterns | No* | Tables to truncate (empty) after subsetting. See Truncate Patterns. |
* At least one of include, skip, drop, or truncate must be specified.
subset tasks cannot be used within a parallel block
or serial block;
they must appear only in the top level tasks section of the ruleset.
Default Configuration
The defaults object configures default behavior for the entire subsetting operation.
| Parameter | Type | Required? | Description |
|---|---|---|---|
target_percent |
Integer | Yes | Target percentage of data to retain (1 to 99 inclusive). Applies to root tables in dependency trees (tables with no parent relationships). |
batch_size |
Integer | No | Controls the batch size for subsetting operations (1 to 1,000,000). Defaults to 50,000. When specified, overrides the batch size set in run options. Use smaller values to reduce memory usage for large tables. |
The target_percent applies only to the root tables in any dependency trees
(tables that are not children in any foreign keys or configured additional cascades).
The retained data percentage in child tables is determined
by which rows relate to the parent records that are retained,
unless overridden in the include patterns.
Note: For tables with self-referential foreign keys (recursive foreign keys),
the actual retention percentage may exceed the configured subset_percent_goal
to maintain referential integrity.
The subsetting operation will complete successfully and the run will finish with a finished_with_warnings status.
The batch_size parameter controls how many rows are processed at once during subsetting operations.
When specified in the ruleset,
it overrides the batch size set in the run options.
This allows for fine-tuned control over memory usage on a per-ruleset basis.
Smaller batch sizes reduce memory consumption but may increase processing time,
while larger batch sizes can improve performance for tables with many rows.
Include Patterns
The include list specifies which tables to include in the subset.
If not specified, any tables that don't match at least one drop, truncate, or skip pattern will be included by default.
Each pattern must specify the table(s) to match using one of three pattern types, and can optionally override default settings for matched tables.
Note: Tables with foreign keys to included tables are automatically included to maintain referential integrity, even if not explicitly listed or if they match a
skippattern. This is recursive: if tableAis a parent ofBandBis a parent ofC, includingAwill automatically include bothBandC.
Pattern Types
Each pattern must contain exactly one of the following fields:
| Parameter | Type | Required? | Description |
|---|---|---|---|
table |
String | No* | Exact fully-qualified table name (must use schema.table format).For case-sensitive identifiers, use quoting: table: '"Schema"."CaseSensitiveTable"'. |
glob |
String | No* | Glob pattern match supporting wildcards (must use schema.table format with explicit schema):* (zero or more characters), ? (single character), [abc] (character set).The schema prefix must be explicit and cannot use wildcards. Case sensitivity configured via case_sensitive. |
regex |
String | No* | Regular expression pattern match (matches against schema.table format).The schema prefix must be explicit in the regex pattern. Must be a valid regular expression. Case sensitivity configured via case_sensitive. |
case_sensitive |
Boolean | No | Whether pattern matching is case-sensitive (defaults to false).Only applies to glob and regex patterns; ignored for table patterns which follow database quoting rules. |
* Exactly one of table, glob, or regex must be specified.
All patterns must use fully-qualified schema.table format.
For glob and regex patterns, the schema portion must be explicit (no wildcards or pattern matching in the schema name).
Patterns like glob: "customer_*" (missing schema) or glob: "*.temp" (wildcard schema) will result in validation errors.
Unquoted names are interpreted according to the target database engine's rules.
Note: Enclose patterns in single or double quotes (
''or"") to avoid potential misinterpretation due to YAML special characters. Backslashes must be escaped as\\because the backslash is an escape character in YAML.
Pattern Overrides
Include patterns can override default configuration for matched tables:
| Parameter | Type | Required? | Description |
|---|---|---|---|
target_percent |
Integer | No | Override the default percentage for matching tables (1 to 99). Cannot be higher than the default percentage. |
additional_cascades |
List of cascade objects | No | Table-specific custom relationships. See Additional Cascades. |
The target_percent represents the desired retention for matching tables.
Actual retention may differ: it may be lower if parent tables were heavily subsetted,
or higher if NOT NULL foreign key relationships require additional rows to maintain referential integrity.
See Retention and Foreign Key Relationships for details.
When using additional_cascades in an include pattern,
only table patterns are supported (not glob or regex).
Examples
include:
# Exact table match (case-sensitive due to quoting)
- table: '"Schema1"."Customers"'
# Exact table match (case follows database rules - typically case-insensitive)
- table: public.orders
# Exact table with target_percent override
- table: public.customers
target_percent: 5
# Glob pattern matching all tables starting with "user" in the analytics schema
- glob: analytics.user*
case_sensitive: false
# Glob pattern with override
- glob: sales.order_*
target_percent: 15
# Regex pattern matching tables ending with year suffix
- regex: "^metrics\\.(analytics|stats)_\\d{4}$"
target_percent: 1
# Table with custom relationships
- table: crm.accounts
additional_cascades:
- source_columns: [account_id]
target_table: external.contacts
target_columns: [acct_ref]
Skip Patterns
The skip list specifies tables to exclude from subsetting.
Skipped tables will still be included if they have foreign keys to included tables,
to maintain referential integrity.
Skip patterns use the same pattern types as include patterns, but do not support overrides.
skip:
# Skip specific table
- table: public.audit_log
# Skip all temp tables in public schema
- glob: "public.temp_*"
case_sensitive: false
# Skip backup tables in public schema
- regex: "public\\..*_backup$"
Drop Patterns
The drop list specifies tables to completely remove from the database (DROP TABLE).
These tables are dropped at the end of the subsetting operation.
Drop patterns use the same pattern types as skip patterns.
Warning: DROP is a destructive operation that permanently removes both table structure and data.
drop:
# Drop specific staging table
- table: staging.import_temp
# Drop all tables in staging schema
- glob: staging.*
# Drop old partition tables
- regex: "data\\.events_\\d{4}_\\d{2}$" # Matches: data.events_2023_01, etc.
Truncate Patterns
The truncate list specifies tables to empty (TRUNCATE TABLE) after subsetting is complete.
These tables will have all data removed but the table structure is preserved.
Truncate patterns use the same pattern types as skip patterns.
Note: TRUNCATE removes all rows but keeps the table structure, indexes, and constraints. It resets auto-increment counters.
truncate:
# Empty audit log table
- table: public.audit_log
# Empty all cache tables
- glob: cache.*
case_sensitive: false
# Empty session tables
- regex: "app\\.session_\\w+$"
Additional Cascades
The additional_cascades list defines custom relationships between tables.
These are relationships not represented by foreign key constraints in the database.
Additional cascades are defined within include patterns for specific tables:
include:
- table: schema.parent_table
additional_cascades:
- source_columns: [col1, col2] # Columns in this table (parent)
target_table: schema.child_table # Related table (child)
target_columns: [ref1, ref2] # Columns in related table
Additional cascades can refer to tables in schemas that aren't defined in the schemas section,
and refer to tables that have not been explicitly included.
| Parameter | Type | Required? | Description |
|---|---|---|---|
source_columns |
List of strings | Yes | List of column names in the source (parent) table (minimum 1 column). |
target_table |
String | Yes | Fully qualified target (child) table name in schema.table format.Must be a different table to the source table. |
target_columns |
List of strings | Yes | List of column names in the target table (minimum 1 column). Must contain the same number of elements as source_columns. |
Ensure the order of the target_columns matches that of source_columns according to the relationship.
For both table and column names, use quoting where required for case-sensitive identifiers.
All specified tables and columns must exist in the database. It is an error to specify an additional cascade that reverses a foreign key, as this makes the relationship direction ambiguous. If an additional cascade is identical to an existing foreign key, a warning is issued and the cascade is ignored.
Example
include:
- table: '"Orders_Schema"."Orders"'
additional_cascades:
- source_columns: [year, '"Order_REF_ID"']
target_table: '"Accounts_Schema"."Invoices"'
target_columns: [invoice_year, order_id]
Pattern Precedence Rules
When a table matches multiple patterns, the following precedence rules apply.
Important: The precedence rules are quite complex. It is strongly recommended to perform subsetting with a dry run first (click the Preview Dry Run button) to verify that the included tables and relationships are as you expect.
Precedence of table, glob and regex
When a table matches multiple patterns within the same action (include, skip, drop, or truncate):
If a table matches a
tablepattern, this has the highest precedence. Multipletablematches in the same action are considered an error.If a table matches
globorregexpatterns:- If there is exactly one matching pattern, that pattern applies.
- If there are multiple matching patterns with different
target_percentoverrides, the lowesttarget_percentis used. - Multiple matches with conflicting non-percentage overrides cause an error.
Action precedence
When a table matches patterns in different actions:
drophas highest precedence - if a table matches a drop pattern, it will be dropped regardless of other matchestruncatehas second precedence - if a table matches a truncate pattern (and no drop pattern), it will be truncated regardless of include/skipskiphas third precedence:- If a table matches both
includeandskippatterns (glob or regex),skiptakes precedence - Multiple
tablepattern matches in bothincludeandskipis an error
- If a table matches both
includehas lowest precedence - used when no other actions matchNo match - tables that match no patterns are included in the subset by default, unless they are in a schema not configured for subsetting
Foreign Key Relationship Constraints
In addition to pattern precedence, tables with foreign key relationships must follow action compatibility rules:
- Child tables (tables with foreign keys referencing parent tables) must have actions that are equally or more destructive than their parent tables
- Destructiveness order: SKIP → INCLUDE → TRUNCATE → DROP (least to most destructive)
- Circular dependencies: All tables in a circular foreign key relationship must have the same action
Note: These constraints ensure referential integrity is maintained throughout the subsetting process. See Action Limitations with Foreign Key Relationships in the advanced features guide for detailed examples and validation error messages.
Quick Examples
Basic Ruleset
version: "1.0"
tasks:
- type: subset
mode: in_place
defaults:
target_percent: 10
batch_size: 10000 # Optional: Use smaller batches
include:
- glob: main_data.* # Include all tables in main_data schema
With Filtering
version: "1.0"
tasks:
- type: subset
mode: in_place
defaults:
target_percent: 8
batch_size: 250000 # Larger batches for better performance
include:
- table: ecommerce.customers # Always include
target_percent: 15 # Override to 15%
- glob: ecommerce.customer_* # Pattern match
- glob: ecommerce.order* # Pattern match
skip:
- glob: ecommerce.*_temp # Skip temp tables
- regex: "ecommerce\\..*_backup$" # Skip backup tables
With DROP and TRUNCATE
version: "1.0"
tasks:
- type: subset
mode: in_place
defaults:
target_percent: 10
include:
- glob: production.*
skip:
- glob: production.temp_*
drop:
- glob: staging.* # Remove all staging tables
- table: production.old_data # Remove specific table
truncate:
- table: production.audit_log # Empty audit log
- glob: production.cache_* # Empty all cache tables
Case-Insensitive Patterns
include:
- glob: mixed_case_db.user*
case_sensitive: false # Matches user*, USER*, User*
skip:
- regex: "mixed_case_db\\..*_temp$"
case_sensitive: false # Matches *_temp, *_TEMP, *_Temp
For complete examples, see Examples and Scenarios.
For troubleshooting, see Troubleshooting.