DataMasque Portal

Examples and Scenarios

This section provides complete, real-world subsetting configurations for common use cases. Each example includes the full YAML configuration and explains the technical features demonstrated.

Scenario 1: E-commerce Platform Development

Business Context: Large e-commerce platform with separate schemas for customers, orders, inventory, and analytics. Development team needs realistic data for testing checkout flows, recommendation engines, and reporting features.

Technical Features Demonstrated:

  • Table and glob patterns for precise control
  • Per-table retention percentage overrides
  • Additional cascades for external systems
  • Skip patterns for privacy and cleanup
  • Truncate patterns for clearing logs

Requirements:

  • Keep enough customers to test user journeys (15%)
  • Maintain order history but reduce granular transaction data (5%)
  • Preserve inventory relationships but clean up old data
  • Minimal analytics data for performance testing
  • Clear session and cache data

Configuration:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10

    include:
      # Customer data - need variety for testing user flows
      - table: customer_management.customers
        target_percent: 15                    # More customers for variety
      - table: customer_management.customer_addresses
        target_percent: 15
      - glob: customer_management.customer_*  # All customer-related tables
      - table: customer_management.user_profiles

      # Order processing - core business logic
      - glob: order_processing.order*         # All order tables
      - table: order_processing.shipments
      - table: order_processing.returns

      # Order items need less detail
      - table: order_processing.order_items
        target_percent: 5

      # Inventory management - product relationships important
      - table: inventory.products
        target_percent: 20                    # More products for catalog testing
      - glob: inventory.product_*
      - table: inventory.categories
      - table: inventory.suppliers
      - table: inventory.stock_levels

      # Payment processing - minimal but functional
      - table: payment_processing.transactions
      - table: payment_processing.payment_methods
      - table: payment_processing.refunds

      # Analytics - minimal data for performance testing
      - table: analytics.user_events
        target_percent: 2
      - table: analytics.product_views
        target_percent: 2
      - table: analytics.conversion_funnel
        target_percent: 2

      # External system connections
      - table: customer_management.customers
        additional_cascades:
          # Link to external CRM system
          - source_columns: [customer_id]
            target_table: external_crm.customer_interactions
            target_columns: [customer_ref]

      - table: order_processing.orders
        additional_cascades:
          # Connect to warehouse management system
          - source_columns: [order_id]
            target_table: warehouse_system.picking_lists
            target_columns: [order_reference]

      - table: inventory.products
        additional_cascades:
          # Link to recommendation engine
          - source_columns: [product_id]
            target_table: ml_system.product_embeddings
            target_columns: [product_ref]

    skip:
      # Privacy and cleanup - skip detailed PII and backups
      - glob: customer_management.*_pii_detailed
      - glob: customer_management.*_gdpr_*
      - glob: customer_management.*_backup

      # Order processing - skip old data
      - regex: "order_processing\\..*_archive_.*"
      - glob: order_processing.*_temp

      # Inventory - skip historical data
      - glob: inventory.*_historical
      - glob: inventory.*_import_*

      # Payment - skip fraud detection and audit
      - glob: payment_processing.*_fraud_*
      - glob: payment_processing.*_audit_*

      # Analytics - skip aggregations
      - regex: "analytics\\..*_daily_.*"
      - regex: "analytics\\..*_hourly_.*"

    truncate:
      # Clear session and cache data
      - glob: customer_management.*_session*
      - glob: customer_management.*_cache*
      - table: analytics.raw_events

Scenario 2: Data Migration and Cleanup

Business Context: Legacy system migration with years of accumulated data. Need to create clean subset for testing migration scripts while removing obsolete data.

Technical Features Demonstrated:

  • Extensive skip patterns for cleanup
  • Higher percentages for migration testing
  • drop patterns for removing obsolete schemas
  • Pattern precedence with exact table matches
  • Conservative batch sizes for large operations

Requirements:

  • Preserve recent data for migration testing
  • Remove deprecated tables and old backups
  • Clean up staging areas completely
  • Different retention periods for different data types

Configuration:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 20                     # Higher percentage for migration testing
      batch_size: 30000                      # Conservative for large tables

    include:
      # Core business data - clean but preserve relationships
      - table: legacy_core.customers
        target_percent: 35                   # More customers for testing
      - table: legacy_core.orders
        target_percent: 25                   # More orders for variety
      - table: legacy_core.products
        target_percent: 50                   # Comprehensive product catalog
      - regex: "^legacy_core\\.core_.*"      # Tables starting with 'core_'

      # Integration layer - selective preservation
      - glob: integration.api_*
      - glob: integration.webhook_*

    skip:
      # Legacy core - skip old versions
      - regex: "legacy_core\\..*_v[0-9]+$"   # Skip versioned tables (customers_v1, etc.)
      - glob: legacy_core.*_deprecated
      - glob: legacy_core.*_backup_*
      - regex: "legacy_core\\..*_[0-9]{4}$"  # Skip year-suffixed tables
      - glob: legacy_core.temp_*

      # Integration - skip logs and old data
      - regex: "integration\\..*_[0-9]{8}$"  # Skip date-suffixed tables
      - glob: integration.*_log*
      - glob: integration.*_queue*
      - glob: integration.*_failed*

      # Archive - skip everything
      - glob: archive.*

    drop:
      # Remove entire obsolete schemas
      - glob: staging.*                      # Remove staging schema
      - glob: old_system.*                   # Remove old system schema
      - glob: deprecated_*.*                 # Remove all deprecated schemas

      # Remove specific obsolete tables
      - table: legacy_core.migration_temp
      - regex: "legacy_core\\.import_[0-9]{8}$"

    truncate:
      # Keep structure but clear data from logs
      - glob: legacy_core.*_log
      - table: integration.error_log

Scenario 3: CI/CD Pipeline Optimization

Business Context: Fast-moving development team with frequent deployments. Need small, fast-loading subset for automated testing that still covers all code paths.

Technical Features Demonstrated:

  • Very aggressive data reduction (2% default)
  • Selective higher percentages for key tables
  • Extensive skip patterns for non-essential data
  • Additional cascades to maintain test relationships
  • Truncate for clearing test artifacts

Requirements:

  • Minimize data volume for fast test execution
  • Maintain data variety for comprehensive test coverage
  • Automated testing needs predictable data set
  • Clear test artifacts between runs

Configuration:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 2                      # Very small for fast CI/CD
      batch_size: 10000                      # Small batches

    include:
      # Application core - minimal but complete
      - table: app_core.users
        target_percent: 15                   # Need variety in users
        additional_cascades:
          # Ensure test data relationships are maintained
          - source_columns: [user_id]
            target_table: business_logic.orders
            target_columns: [created_by]

      - table: app_core.accounts
        target_percent: 8                    # Need variety in accounts
      - table: app_core.permissions
      - table: app_core.settings

      # Business logic - minimal data for all code paths
      - table: business_logic.products
        target_percent: 10                   # Products need variety
        additional_cascades:
          - source_columns: [product_id]
            target_table: external_integrations.inventory_sync
            target_columns: [product_ref]

      - table: business_logic.orders
        target_percent: 1                    # Orders can be minimal
      - table: business_logic.categories
        target_percent: 50                   # Small table, keep variety

    skip:
      # App core - skip all non-essential data
      - glob: app_core.*_session*
      - glob: app_core.*_log*
      - glob: app_core.*_audit*

      # Business logic - skip historical and analytics
      - glob: business_logic.*_history
      - glob: business_logic.*_analytics*

      # External integrations - skip all tables
      - glob: external_integrations.*

    truncate:
      # Clear test artifacts between runs
      - glob: app_core.test_*
      - glob: app_core.*_cache*
      - table: business_logic.temp_calculations

Scenario 4: Multi-Environment Data Refresh

Business Context: Production data refresh for staging/UAT environments. Need production-like data but with privacy protections and cleanup.

Technical Features Demonstrated:

  • drop patterns for removing production-only data
  • truncate patterns for clearing operational data
  • skip patterns for privacy compliance
  • Moderate retention for realistic testing

Requirements:

  • Remove production-only tables (monitoring, operations)
  • Clear sensitive logs and audit trails
  • Remove API keys and credentials
  • Maintain realistic data volume

Configuration:

version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 25                     # Moderate size for staging

    include:
      # Core application data
      - table: public.users
      - table: public.accounts
      - table: public.orders
      - table: public.products
      - table: public.transactions

      # Reference data (keep more)
      - glob: reference.*
        target_percent: 75                   # Keep most reference data

    skip:
      # Skip sensitive data
      - glob: public.*_pii
      - glob: public.*_ssn
      - table: public.credit_cards
      - glob: public.*_sensitive

      # Skip operational data
      - glob: public.*_metrics
      - glob: public.*_monitoring

    drop:
      # Remove production-only schemas
      - glob: production_ops.*               # Remove operations schema
      - glob: monitoring.*                   # Remove monitoring schema
      - glob: alerting.*                     # Remove alerting schema

      # Remove specific production tables
      - table: public.api_keys
      - table: public.service_credentials
      - glob: public.prod_*                  # Remove prod-prefixed tables

    truncate:
      # Clear operational and audit data
      - glob: public.audit_log*              # Keep structure, clear data
      - glob: public.*_session*              # Clear sessions
      - glob: public.*_cache*                # Clear caches
      - table: public.login_history          # Clear login history
      - table: public.api_call_log           # Clear API logs

Choosing the Right Configuration

When creating your own subsetting configuration, consider:

Data Volume Needs

  • Development: 5-15% of production data
  • Testing: 2-10% depending on test complexity
  • Staging/UAT: 20-30% for realistic scenarios
  • Demo/Training: 10-25% for representative demonstrations
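These targets correspond directly to the defaults block of a subset task. A minimal sketch, with percentages chosen illustratively from the ranges above:

```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10    # development: mid-range of 5-15%
      # target_percent: 5   # testing
      # target_percent: 25  # staging/UAT
      # target_percent: 15  # demo/training
```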

Pattern Selection Strategy

Use table patterns when:

  • You need exact control over specific tables
  • You want to apply overrides to specific tables
  • You need additional cascades

Use glob patterns when:

  • You want to match all tables in a schema (schema.*)
  • You need prefix/suffix matching (customer_*, *_temp)
  • Simple wildcards are sufficient

Use regex patterns when:

  • You need complex matching logic
  • Glob patterns aren't precise enough
  • You want to match multiple patterns in one rule
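The trade-offs above can be seen side by side. In this sketch the schema and table names (sales.*) are hypothetical, not taken from the scenarios:

```yaml
include:
  # table - exact match; supports per-table overrides and cascades
  - table: sales.customers
    target_percent: 20

  # glob - simple wildcard; matches sales.order_items, sales.order_history, etc.
  - glob: sales.order_*

  # regex - precise logic; matches sales.invoices_2023 but not sales.invoices_backup
  - regex: "sales\\.invoices_[0-9]{4}$"
```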

Cleanup Operations

Use skip when:

  • You want to exclude tables but preserve them if needed by FK relationships
  • You're unsure if tables are safe to remove

Use drop when:

  • You're certain tables should be completely removed
  • Removing obsolete/deprecated schemas
  • Cleaning up staging/temporary tables

Use truncate when:

  • You want to keep table structure but clear data
  • Clearing logs, caches, sessions
  • Resetting counters while preserving schema
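A single task can combine all three operations. A short sketch with hypothetical table names, following the same structure as the scenarios above:

```yaml
skip:
  - glob: sales.*_archive    # excluded, but preserved if FK relationships need them
drop:
  - glob: staging.*          # known obsolete - removed entirely
truncate:
  - glob: sales.*_log        # structure kept, rows cleared
```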

Performance Considerations

For large databases:

  • Use moderate batch sizes (25,000-50,000)
  • Start with higher percentages and reduce gradually
  • Test with dry runs first

For fast CI/CD:

  • Use very low percentages (1-5%)
  • Skip all non-essential data
  • Use small batch sizes for quick iterations

For memory-constrained environments:

  • Use smaller batch sizes (10,000-20,000)
  • Reduce target percentages
  • Skip large historical tables
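Both knobs sit in the defaults block. An illustrative sketch for a memory-constrained run (the values and the skipped table name are assumptions):

```yaml
defaults:
  target_percent: 5
  batch_size: 15000              # within the 10,000-20,000 range above
skip:
  - glob: analytics.*_history    # skip large historical tables
```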

Business Requirements

Regulatory compliance:

  • Skip or drop PII/sensitive data
  • Keep audit trails if required
  • Document data handling in compliance docs

Privacy concerns:

  • Use skip patterns for detailed PII
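In configuration terms, the compliance and privacy points above usually reduce to a handful of skip entries (pattern names here are illustrative):

```yaml
skip:
  - glob: hr.*_pii_detailed    # privacy: exclude detailed PII
  - glob: hr.*_ssn*            # compliance: exclude sensitive identifiers
# audit tables are deliberately left out of skip/drop/truncate,
# so required audit trails survive the subset
```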

Data variety:

  • Higher percentages for key business entities
  • Lower percentages for transactional detail
  • Consider relationships when setting percentages
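For example (table names hypothetical), a key business entity can keep a higher percentage than its transactional detail:

```yaml
include:
  - table: sales.customers      # key business entity - keep more for variety
    target_percent: 20
  - table: sales.order_items    # transactional detail - keep less
    target_percent: 5
```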

For more specific guidance, see Advanced Features or Configuration Reference.