Subsetting Introduction
DataMasque's subsetting feature reduces database size while maintaining referential integrity. It's designed for creating smaller development and testing environments from production data.
- What is Subsetting?
- Key Benefits
- How It Works
- Retention and Foreign Key Relationships
- Common Use Cases
- Getting Started
- Subsetting Documentation
- Best Practices
- Prerequisites
What is Subsetting?
Subsetting creates a smaller, representative sample of your database by selecting a percentage of rows while automatically preserving relationships between tables. Unlike simple data deletion, subsetting ensures that foreign key constraints remain valid and related data stays connected.
For example, suppose you have a database with a customers table,
with several child tables including orders and payments.
You can instruct DataMasque to keep 10% of customers,
and subsetting automatically keeps their related orders, payments,
and other associated records across all tables and schemas.
Subsetting does not do any masking; any rows retained are left exactly as they were, with no changes to the column values. If you need masking as well, you can include both subsetting and masking tasks in a single ruleset. Place the masking tasks after the subsetting tasks, to only mask the subset (which is much faster), rather than the full data set.
Key Benefits
Faster Development Environments: Smaller databases mean faster deployments, quicker test runs, and reduced infrastructure costs for development teams.
Maintained Data Integrity: All foreign key relationships are automatically discovered and preserved, so your subset works exactly like your full production database.
Deterministic Results: The same configuration always produces the same subset, ensuring consistency across environments and team members.
Cross-Schema Support: Automatically handles relationships that span multiple schemas, including tables not explicitly configured for subsetting.
How It Works
DataMasque supports in-place subsetting (meaning the subset replaces the original data) using a copy-truncate-copy strategy:
- Discover and map all table relationships and dependencies.
- Select data samples from parent tables based on the configured parameters, and copy to a temporary table.
- Select all related records from child tables and copy to temporary tables.
- Remove all data from the original tables while maintaining structure.
- Copy the selected subset of data back to the original tables.
The process works entirely in-place, meaning your existing database structure, permissions, and constraints remain unchanged.
Retention and Foreign Key Relationships
The target_percent you configure represents the desired retention for each table.
However, foreign key relationships may cause actual retention to differ from this target.
Understanding how DataMasque handles these relationships
helps you predict the final subset size.
Required (NOT NULL) Foreign Keys
When a foreign key column is defined as NOT NULL,
DataMasque must include the referenced parent rows to maintain referential integrity.
This can cause actual retention to exceed the target percentage.
For example, suppose you have an orders table with a NOT NULL foreign key to customers.
If you configure 10% retention and select 100 orders,
but those orders reference 150 distinct customers,
then 150 customers must be retained, even if that exceeds 10% of the customers table.
This behaviour is unavoidable: a valid subset cannot contain orders that reference non-existent customers.
Optional (Nullable) Foreign Keys
When a foreign key column allows NULL values,
DataMasque does not cascade through that relationship.
Instead, foreign key values that would reference rows outside the subset are set to NULL.
For example, consider an employees table with a nullable manager_id column
that references other employees.
When subsetting, DataMasque keeps only the employees selected by the target percentage.
Any manager_id values pointing to employees not in the subset are set to NULL,
indicating those employees have no manager in the subsetted data.
This approach achieves smaller subsets
by breaking optional relationship chains rather than following them.
The resulting data remains valid, as nullable columns can legitimately contain NULL values.
Practical Implications
Tables connected only by NOT NULL foreign keys may have retention above the target percentage due to cascading requirements.
Tables connected only by nullable foreign keys typically achieve retention at or below the target percentage, with orphaned references set to
NULL.Tables with a mix of both follow NOT NULL relationships while setting nullable references to
NULLwhen the referenced rows are not in the subset.
Note: When reviewing your subset results,
NULLvalues in nullable foreign key columns indicate that the originally referenced row was not included in the subset. This is expected behaviour and does not indicate data corruption.
Leaf Table Culling
After the initial cascade copy, leaf tables (tables that have no child tables depending on them) may have more rows than the target percentage suggests. DataMasque applies secondary culling to these leaf tables to bring them closer to the target percentage while preserving foreign key integrity.
How it works:
- After cascade copy completes, DataMasque identifies leaf tables that exceed the target percentage.
- For each such table, it ensures at least one row per foreign key reference is kept (to maintain referential integrity with parent tables).
- Additional rows beyond this minimum are randomly culled using hash-based selection to bring the table closer to the target percentage.
Important: Only leaf tables are culled. Non-leaf tables (tables that have child tables depending on them) are not culled, because removing rows from them could break foreign key integrity for their dependant tables.
For example, in a chain customers → orders → order_items:
order_items(a leaf table) can be culled to approach the target percentage.orderscannot be culled, becauseorder_itemsdepends on it.customerscannot be culled, becauseordersdepends on it.
Common Use Cases
- Create 5-10% subsets of production data for faster local development and testing.
- Use small datasets to speed up automated testing in CI pipelines without sacrificing data variety.
- Generate manageable datasets that retain production characteristics for demonstrations or training.
Getting Started
Start with the Basic Setup and Use Guide for a step-by-step walkthrough.
Subsetting tasks are defined in YAML rulesets,
in the same way as regular masking tasks (mask_table or mask_unique_key).
The full guide to all options available in the subsetting YAML
can be found in the Ruleset Reference.
Subsetting Documentation
- Basic Setup and Use
Start here for your first subsetting operation with simple examples. - Advanced Features
Complete guide to all subsetting options, from basic to advanced configurations. - Examples and Scenarios
Real-world examples and complete configurations for common use cases. - Troubleshooting
Solutions to common issues and error messages. - Configuration Reference
Complete YAML schema reference and field documentation.
Best Practices
- Take a backup of the database before subsetting (or ensure it can be recreated, for example by copying a production data source). In-place subsetting is a destructive operation.
- Be explicit with the configuration.
List all tables you want to subset,
and prefer to list tables using
tablerather than usingglobandregexwildcards to avoid inadvertent matches. DataMasque will raise an error if it needs to subset a table in a schema that has not been configured. - Use additional cascades. Additional cascades allow you to define custom logical relationships within and across schemas that are not expressed by foreign key relationships in the database.
- Use the dry run option. Review the run logs to understand which tables were auto-included and verify the subsetting run will do what you expect.
- Test with higher percentages first to validate relationships are preserved as desired (not just referential integrity which is maintained automatically, but also the reverse: avoiding creating parent records with no children).
Supported Databases
Subsetting is currently supported for the following database types:
- Oracle (11g and later)
- PostgreSQL
- Microsoft SQL Server
Note: Oracle 11g requires an additional database privilege for subsetting. See Oracle 11g Subsetting Requirements below.
Other databases supported by DataMasque (MySQL, MariaDB, DB2 LUW, Snowflake, Redshift, DynamoDB) do not currently support subsetting.
Oracle 11g Subsetting Requirements
Oracle 11g connections require the EXECUTE privilege on DBMS_OBFUSCATION_TOOLKIT
for subsetting operations. This package is used for deterministic row selection
during the subsetting process.
GRANT EXECUTE ON DBMS_OBFUSCATION_TOOLKIT TO <connection user>;
This privilege is not required for Oracle 12c and later, which use built-in functions that do not require additional grants.
If this privilege is not granted, subsetting operations on Oracle 11g will fail with an error indicating the missing permission.
Prerequisites
- DataMasque web interface access
- Database connection configured in DataMasque
- Database access with appropriate permissions for the configured connection (see Required privileges for database-specific permission requirements)
- Parent tables must have primary keys or unique indexes
- Oracle 12c+ tables can use ROWID as a fallback when no primary key or unique index exists