DataMasque Portal

Basic Setup and Use

This guide walks you through creating your first subset using DataMasque's web interface. You'll learn the basic YAML structure and see how DataMasque preserves relationships automatically.

Example Database Schema

For this tutorial, we'll work with a simple e-commerce database that demonstrates typical table relationships:

  • customers schema

    • customers table: Contains customer information (primary key: customer_id)
    • customer_addresses table: Links to customers via customer_id foreign key
  • orders schema

    • orders table: Links to customers via customer_id foreign key (primary key: order_id)
    • order_items table: Links to orders via order_id foreign key
  • payments schema

    • payments table: Links to orders via order_id foreign key

This creates two relationship chains: customersordersorder_items, and customersorderspayments. When you subset 10% of customers, DataMasque automatically includes all their related orders, order items, and payments to maintain referential integrity.

Creating a Simple Subsetting Ruleset

Let's start with the simplest possible subsetting configuration: reducing an entire database to 10% of its original size.

In the DataMasque web interface, click the Add Ruleset button, click the Skip to YAML Editor button, then create a ruleset with this YAML:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      - glob: customers.*
      - glob: orders.*
      - glob: payments.*

Give the ruleset a name at the top of the page, then click the Save or Save and Exit button at the bottom right of the screen to save the ruleset.

Understanding the Configuration

This configuration will:

  • Reduce all tables in the customers, orders, and payments schemas to approximately 10% of their original size.
  • Automatically preserve all foreign key relationships between and within schemas.

Let's break down each part of the YAML:

Version and Task Type

version: "1.0"
tasks:
  - type: subset
    mode: in_place
  • version: "1.0" specifies the configuration schema version. ("1.0" is the current, and only, schema version.)
  • type: subset tells DataMasque this is a subsetting operation.
  • mode: in_place means the subset will replace the original data. (currently the only supported mode)

Default Settings

defaults:
  target_percent: 10

This sets the target retention percentage (10%) for all tables. The target_percent must be an integer from 1 to 99.

Additional default settings are available:

defaults:
  target_percent: 10
  batch_size: 50000   # Optional: batch size for operations (default: 50,000)

Include Patterns

include:
  - glob: customers.*
  - glob: orders.*
  - glob: payments.*

This tells DataMasque which tables to include in the subset.

Important: All patterns must use fully-qualified schema.table format. For glob and regex patterns, the schema prefix is mandatory and cannot use wildcards. Patterns like glob: "customer_*" (without schema) or glob: "*.temp" (wildcard schema) will result in an error.

You have three ways to match tables:

include:
  - table: customers.customers        # Exact table match
  - glob: orders.order_*              # Glob pattern (wildcards: *, ?, [abc])
  - regex: "^payments\\.payment_.*$"  # Regular expression

In this example:

  • table: customers.customers matches exactly one table
  • glob: orders.order_* matches all tables in the orders schema starting with order_ (the schema orders is explicit, only the table name uses wildcards)
  • regex: "^payments\\.payment_.*$" matches all tables in the payments schema starting with payment_ (the schema payments is explicit, only the table name uses regex)

Running the Subset

Once you've created your ruleset in the DataMasque web interface:

  1. Select your database connection from the available connections.
  2. Select your saved subset ruleset.
  3. Click the Preview Run button.
  4. Confirm the displayed details are correct, then click the Start Run button to start the subsetting process.

You'll see output similar to:

Starting subset operation
Discovering table relationships...
Created 2 dependency trees for subsetting
Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `ORDERS.ORDER_ITEMS`: (`ORDER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `PAYMENTS.PAYMENTS`: (`ORDER_ID`)
3 relationships discovered
Processing batch 1 of 10 (10.0% of data)...
Copying selected data to temporary tables...
Truncating original tables...
Restoring data from temporary tables...
Recreating foreign key constraints...
Subset operation completed successfully

Understanding the Results

Relationship Discovery

DataMasque automatically discovered the relationships between your tables:

  • Customers have orders (via customer_id)
  • Orders have items and payments (via order_id)

This creates a table dependency graph that ensures related data stays together.

How DataMasque Organizes Tables

When DataMasque analyzes your database, it builds table dependency graphs (also called dependency trees or DAGs - Directed Acyclic Graphs) from the discovered relationships. Each table dependency graph represents a group of related tables connected by foreign key relationships.

For example, in the sample schema from this tutorial:

  • The table dependency graph has customers.customers as its root table (the table with no parent dependencies)
  • Child tables like orders.orders, order_items, and payments are connected through foreign keys
  • When you configure 10% retention, DataMasque applies this percentage at the root of each graph
  • All related data cascades down automatically to maintain referential integrity

The logs show these relationships as they're discovered:

Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)

This organizational structure ensures that subsetting maintains all relationships between tables while reducing the overall data volume.

Cross-Schema Discovery

DataMasque uses intelligent schema selection that goes beyond the tables you explicitly configure:

  • When you include tables using patterns like schema.*, all tables matching the pattern will be included in the subsetting operation.

  • If a table is not explicitly matched by your include patterns but has foreign key relationships to included tables, it will be automatically included to maintain referential integrity.

    • This discovery process cascades across multiple schemas. For example, if Schema A references Schema B, and Schema B references Schema C, all necessary tables across all three schemas will be included even if you only configured one of them.
  • The subsetting percentage is applied at the root of each table dependency graph, not at each individual table. This means:

    • DataMasque identifies the root tables (those with no parent dependencies)
    • Applies the 10% selection to these root tables
    • Cascades all related data down the dependency chain

Important Implication: Child tables may end up with more, or significantly less, than 10% of their original data. As an example, if customers is the parent table and orders the child table, taking 10% of customers might result in 15% of orders being retained, or perhaps only 2%, depending on which rows in customers happened to be selected for retention.

Pattern overrides can be used to specify a target percentage for specific tables. However, there is no way to prevent a child table being subsetted more aggressively than the goal percentage suggests if the selected rows in the parent table happen to not match many rows in the child table. It is strongly recommended that you test with larger retention percentages first and gradually refine to smaller percentages.

Cross-Schema Discovery Example

Here's a concrete example of cross-schema discovery in action. Consider this configuration that deliberately omits the payments schema:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      - glob: customers.*
      - glob: orders.*
      # Note: payments schema is NOT configured here

Even though payments.* isn't in the include patterns, you'll see an output like:

Starting subset operation
Discovering table relationships...
==== SUBSETTING CONFLICTS RESOLVED ====
Table `payments.transactions` is in a schema not configured for subsetting,
but was automatically added to preserve FK relationship with `orders.orders`.
Created 1 dependency tree for subsetting
Table dependency graph A1B2C3D4 root `CUSTOMERS.CUSTOMERS`
Table dependency graph A1B2C3D4 relationship `CUSTOMERS.CUSTOMERS`: (`CUSTOMER_ID`) -> `ORDERS.ORDERS`: (`CUSTOMER_ID`)
Table dependency graph A1B2C3D4 relationship `ORDERS.ORDERS`: (`ORDER_ID`) -> `PAYMENTS.TRANSACTIONS`: (`ORDER_ID`)

Note: The key message here is that payments.transactions is in a schema not configured for subsetting, but was automatically added to preserve the FK relationship. This shows DataMasque's automatic relationship discovery in action.

DataMasque automatically discovered that orders.orders has a foreign key to payments.transactions and included that specific table to maintain referential integrity. However, other unrelated tables in the payments schema are not included.

Hash-Based Selection

The subsetting process uses deterministic hash-based selection, meaning:

  • The same configuration with the same run_secret and instance_secret will always produce the same subset.
  • Customer #12345 (for example) will either be included with all their data, or excluded entirely.
  • The result will not have any orphaned records or broken relationships.
  • If you run subsetting on an already-subsetted database, it won't reduce the data further unless you specify a different run_secret.

Data Integrity

After subsetting:

  • All foreign key constraints remain valid.
  • Table structures are unchanged.
  • Indexes and permissions are preserved.
  • You have approximately 10% of your original data volume. Actual retention may vary based on foreign key relationships; see Retention and Foreign Key Relationships.

Next Steps

Now that you've successfully created a basic subset, you can explore more advanced table selection options:

  • Include only specific tables using table, glob, and/or regex patterns.
  • Override the default percentage for specific tables.
  • Skip tables that aren't needed.
  • Drop or truncate tables.

Continue with Advanced Features to learn about these advanced features, or jump to Examples and Scenarios for complete real-world configurations.