Basic Setup and Use
This guide walks you through creating your first subset using DataMasque's web interface. You'll learn the basic YAML structure and see how DataMasque preserves relationships automatically.
- Example Database Schema
- Creating a Simple Subsetting Ruleset
- Understanding the Configuration
- Running the Subset
- Understanding the Results
- Next Steps
Example Database Schema
For this tutorial, we'll work with a simple e-commerce database that demonstrates typical table relationships:
customersschemacustomerstable: Contains customer information (primary key:customer_id)customer_addressestable: Links to customers viacustomer_idforeign key
ordersschemaorderstable: Links to customers viacustomer_idforeign key (primary key:order_id)order_itemstable: Links to orders viaorder_idforeign key
paymentsschemapaymentstable: Links to orders viaorder_idforeign key
This creates two relationship chains: customers → orders → order_items,
and customers → orders → payments.
When you subset 10% of customers,
DataMasque automatically includes all their related orders, order items, and payments
to maintain referential integrity.
Creating a Simple Subsetting Ruleset
Let's start with the simplest possible subsetting configuration: reducing an entire database to 10% of its original size.
In the DataMasque web interface, click the Add Ruleset button, click the Skip to YAML Editor button, then create a ruleset with this YAML:
version: "1.0"
tasks:
- type: subset
mode: in_place
defaults:
target_percent: 10
include:
- glob: customers.*
- glob: orders.*
- glob: payments.*
Give the ruleset a name at the top of the page, then click the Save or Save and Exit button at the bottom right of the screen to save the ruleset.
Understanding the Configuration
This configuration will:
- Reduce all tables in the
customers,orders, andpaymentsschemas to approximately 10% of their original size. - Automatically preserve all foreign key relationships between and within schemas.
Let's break down each part of the YAML:
Version and Task Type
version: "1.0"
tasks:
- type: subset
mode: in_place
version: "1.0"specifies the configuration schema version. ("1.0"is the current, and only, schema version.)type: subsettells DataMasque this is a subsetting operation.mode: in_placemeans the subset will replace the original data. (currently the only supported mode)
Default Settings
defaults:
target_percent: 10
This sets the target retention percentage (10%) for all tables.
The target_percent must be an integer from 1 to 99.
Additional default settings are available:
defaults:
target_percent: 10
batch_size: 50000 # Optional: batch size for operations (default: 50,000)
Include Patterns
include:
- glob: customers.*
- glob: orders.*
- glob: payments.*
This tells DataMasque which tables to include in the subset.
Important: All patterns must use fully-qualified schema.table format.
For glob and regex patterns, the schema prefix is mandatory and cannot use wildcards.
Patterns like glob: "customer_*" (without schema) or glob: "*.temp" (wildcard schema) will result in an error.
You have three ways to match tables:
include:
- table: customers.customers # Exact table match
- glob: orders.order_* # Glob pattern (wildcards: *, ?, [abc])
- regex: "^payments\\.payment_.*$" # Regular expression
In this example:
table: customers.customersmatches exactly one tableglob: orders.order_*matches all tables in theordersschema starting withorder_(the schemaordersis explicit, only the table name uses wildcards)regex: "^payments\\.payment_.*$"matches all tables in thepaymentsschema starting withpayment_(the schemapaymentsis explicit, only the table name uses regex)
Running the Subset
Once you've created your ruleset in the DataMasque web interface:
- Select your database connection from the available connections.
- Select your saved subset ruleset.
- Click the Preview Run button.
- Confirm the displayed details are correct, then click the Start Run button to start the subsetting process.
You'll see output similar to:
Starting subset operation
Discovering table relationships...
Created 2 dependency trees for subsetting
Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `ORDERS.ORDER_ITEMS`: (`ORDER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `PAYMENTS.PAYMENTS`: (`ORDER_ID`)
3 relationships discovered
Processing batch 1 of 10 (10.0% of data)...
Copying selected data to temporary tables...
Truncating original tables...
Restoring data from temporary tables...
Recreating foreign key constraints...
Subset operation completed successfully
Understanding the Results
Relationship Discovery
DataMasque automatically discovered the relationships between your tables:
- Customers have orders (via
customer_id) - Orders have items and payments (via
order_id)
This creates a table dependency graph that ensures related data stays together.
How DataMasque Organizes Tables
When DataMasque analyzes your database, it builds table dependency graphs (also called dependency trees or DAGs - Directed Acyclic Graphs) from the discovered relationships. Each table dependency graph represents a group of related tables connected by foreign key relationships.
For example, in the sample schema from this tutorial:
- The table dependency graph has
customers.customersas its root table (the table with no parent dependencies) - Child tables like
orders.orders,order_items, andpaymentsare connected through foreign keys - When you configure 10% retention, DataMasque applies this percentage at the root of each graph
- All related data cascades down automatically to maintain referential integrity
The logs show these relationships as they're discovered:
Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)
This organizational structure ensures that subsetting maintains all relationships between tables while reducing the overall data volume.
Cross-Schema Discovery
DataMasque uses intelligent schema selection that goes beyond the tables you explicitly configure:
When you include tables using patterns like
schema.*, all tables matching the pattern will be included in the subsetting operation.If a table is not explicitly matched by your include patterns but has foreign key relationships to included tables, it will be automatically included to maintain referential integrity.
- This discovery process cascades across multiple schemas. For example, if Schema A references Schema B, and Schema B references Schema C, all necessary tables across all three schemas will be included even if you only configured one of them.
The subsetting percentage is applied at the root of each table dependency graph, not at each individual table. This means:
- DataMasque identifies the root tables (those with no parent dependencies)
- Applies the 10% selection to these root tables
- Cascades all related data down the dependency chain
Important Implication:
Child tables may end up with more, or significantly less, than 10% of their original data.
As an example, if customers is the parent table and orders the child table,
taking 10% of customers might result in 15% of orders being retained,
or perhaps only 2%,
depending on which rows in customers happened to be selected for retention.
Pattern overrides can be used to specify a target percentage for specific tables. However, there is no way to prevent a child table being subsetted more aggressively than the goal percentage suggests if the selected rows in the parent table happen to not match many rows in the child table. It is strongly recommended that you test with larger retention percentages first and gradually refine to smaller percentages.
Cross-Schema Discovery Example
Here's a concrete example of cross-schema discovery in action.
Consider this configuration that deliberately omits the payments schema:
version: "1.0"
tasks:
- type: subset
mode: in_place
defaults:
target_percent: 10
include:
- glob: customers.*
- glob: orders.*
# Note: payments schema is NOT configured here
Even though payments.* isn't in the include patterns,
you'll see an output like:
Starting subset operation
Discovering table relationships...
==== SUBSETTING CONFLICTS RESOLVED ====
Table `payments.transactions` is in a schema not configured for subsetting,
but was automatically added to preserve FK relationship with `orders.orders`.
Created 1 dependency tree for subsetting
Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `PAYMENTS.TRANSACTIONS`: (`ORDER_ID`)
Note: The key message here is that
payments.transactionsis in a schema not configured for subsetting, but was automatically added to preserve the FK relationship. This shows DataMasque's automatic relationship discovery in action.
DataMasque automatically discovered that orders.orders has a foreign key to payments.transactions
and included that specific table to maintain referential integrity.
However, other unrelated tables in the payments schema are not included.
Hash-Based Selection
The subsetting process uses deterministic hash-based selection, meaning:
- The same configuration with the same
run_secretandinstance_secretwill always produce the same subset. - Customer #12345 (for example) will either be included with all their data, or excluded entirely.
- The result will not have any orphaned records or broken relationships.
- If you run subsetting on an already-subsetted database,
it won't reduce the data further unless you specify a different
run_secret.
Data Integrity
After subsetting:
- All foreign key constraints remain valid.
- Table structures are unchanged.
- Indexes and permissions are preserved.
- You have approximately 10% of your original data volume. Actual retention may vary based on foreign key relationships; see Retention and Foreign Key Relationships.
Next Steps
Now that you've successfully created a basic subset, you can explore more advanced table selection options:
- Include only specific tables using
table,glob, and/orregexpatterns. - Override the default percentage for specific tables.
- Skip tables that aren't needed.
- Drop or truncate tables.
Continue with Advanced Features to learn about these advanced features, or jump to Examples and Scenarios for complete real-world configurations.