Examples and Scenarios
This section provides complete, real-world subsetting configurations for common use cases. Each example includes the full YAML configuration and explains the technical features demonstrated.
- Scenario 1: E-commerce Platform Development
- Scenario 2: Data Migration and Cleanup
- Scenario 3: CI/CD Pipeline Optimization
- Scenario 4: Multi-Environment Data Refresh
- Choosing the Right Configuration
Scenario 1: E-commerce Platform Development
Business Context: Large e-commerce platform with separate schemas for customers, orders, inventory, and analytics. Development team needs realistic data for testing checkout flows, recommendation engines, and reporting features.
Technical Features Demonstrated:
- Table and glob patterns for precise control
- Per-table retention percentage overrides
- Additional cascades for external systems
- Skip patterns for privacy and cleanup
- Truncate patterns for clearing logs
Requirements:
- Keep enough customers to test user journeys (15%)
- Maintain order history but reduce granular transaction data (5%)
- Preserve inventory relationships but clean up old data
- Minimal analytics data for performance testing
- Clear session and cache data
```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 10
    include:
      # Customer data - need variety for testing user flows
      - table: customer_management.customers
        target_percent: 15 # More customers for variety
      - table: customer_management.customer_addresses
        target_percent: 15
      - glob: customer_management.customer_* # All customer-related tables
      - table: customer_management.user_profiles
      # Order processing - core business logic
      - glob: order_processing.order* # All order tables
      - table: order_processing.shipments
      - table: order_processing.returns
      # Order items need less detail
      - table: order_processing.order_items
        target_percent: 5
      # Inventory management - product relationships important
      - table: inventory.products
        target_percent: 20 # More products for catalog testing
      - glob: inventory.product_*
      - glob: inventory.categories
      - glob: inventory.suppliers
      - glob: inventory.stock_levels
      # Payment processing - minimal but functional
      - glob: payment_processing.transactions
      - glob: payment_processing.payment_methods
      - glob: payment_processing.refunds
      # Analytics - minimal data for performance testing
      - glob: analytics.user_events
        target_percent: 2
      - glob: analytics.product_views
        target_percent: 2
      - table: analytics.conversion_funnel
        target_percent: 2
      # External system connections
      - table: customer_management.customers
        additional_cascades:
          # Link to external CRM system
          - source_columns: [customer_id]
            target_table: external_crm.customer_interactions
            target_columns: [customer_ref]
      - table: order_processing.orders
        additional_cascades:
          # Connect to warehouse management system
          - source_columns: [order_id]
            target_table: warehouse_system.picking_lists
            target_columns: [order_reference]
      - table: inventory.products
        additional_cascades:
          # Link to recommendation engine
          - source_columns: [product_id]
            target_table: ml_system.product_embeddings
            target_columns: [product_ref]
    skip:
      # Privacy - skip detailed PII
      - glob: customer_management.*_pii_detailed
      - glob: customer_management.*_gdpr_*
      - glob: customer_management.*_session*
      - glob: customer_management.*_backup
      # Order processing - skip old data
      - regex: "order_processing\\..*_archive_.*"
      - glob: order_processing.*_temp
      # Inventory - skip historical data
      - glob: inventory.*_historical
      - glob: inventory.*_import_*
      # Payment - skip fraud detection and audit
      - glob: payment_processing.*_fraud_*
      - glob: payment_processing.*_audit_*
      # Analytics - skip aggregations
      - regex: "analytics\\..*_daily_.*"
      - regex: "analytics\\..*_hourly_.*"
    truncate:
      # Clear session and cache data
      - glob: customer_management.*_session*
      - glob: customer_management.*_cache*
      - glob: analytics.raw_events
```
Scenario 2: Data Migration and Cleanup
Business Context: Legacy system migration with years of accumulated data. Need to create a clean subset for testing migration scripts while removing obsolete data.
Technical Features Demonstrated:
- Extensive skip patterns for cleanup
- Higher percentages for migration testing
- drop patterns for removing obsolete schemas
- Pattern precedence with exact table matches
- Conservative batch sizes for large operations
Requirements:
- Preserve recent data for migration testing
- Remove deprecated tables and old backups
- Clean up staging areas completely
- Different retention periods for different data types
```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 20 # Higher percentage for migration testing
      batch_size: 30000 # Conservative for large tables
    include:
      # Core business data - clean but preserve relationships
      - table: legacy_core.customers
        target_percent: 35 # More customers for testing
      - table: legacy_core.orders
        target_percent: 25 # More orders for variety
      - table: legacy_core.products
        target_percent: 50 # Comprehensive product catalog
      - regex: "^legacy_core\\.core_.*" # Tables starting with 'core_'
      # Integration layer - selective preservation
      - glob: integration.api_*
      - glob: integration.webhook_*
    skip:
      # Legacy core - skip old versions
      - regex: "legacy_core\\..*_v[0-9]+$" # Skip versioned tables (customers_v1, etc.)
      - glob: legacy_core.*_deprecated
      - glob: legacy_core.*_backup_*
      - regex: "legacy_core\\..*_[0-9]{4}$" # Skip year-suffixed tables
      - glob: legacy_core.temp_*
      # Integration - skip logs and old data
      - regex: "integration\\..*_[0-9]{8}$" # Skip date-suffixed tables
      - glob: integration.*_log*
      - glob: integration.*_queue*
      - glob: integration.*_failed*
      # Archive - skip everything
      - glob: archive.*
    drop:
      # Remove entire obsolete schemas
      - glob: staging.* # Remove staging schema
      - glob: old_system.* # Remove old system schema
      - glob: deprecated_*.* # Remove all deprecated schemas
      # Remove specific obsolete tables
      - table: legacy_core.migration_temp
      - regex: "legacy_core\\.import_[0-9]{8}$"
    truncate:
      # Keep structure but clear data from logs
      - glob: legacy_core.*_log
      - table: integration.error_log
```
Scenario 3: CI/CD Pipeline Optimization
Business Context: Fast-moving development team with frequent deployments. Need a small, fast-loading subset for automated testing that still covers all code paths.
Technical Features Demonstrated:
- Very aggressive data reduction (2% default)
- Selective higher percentages for key tables
- Extensive skip patterns for non-essential data
- Additional cascades to maintain test relationships
- Truncate for clearing test artifacts
Requirements:
- Minimize data volume for fast test execution
- Maintain data variety for comprehensive test coverage
- Automated testing needs predictable data set
- Clear test artifacts between runs
```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 2 # Very small for fast CI/CD
      batch_size: 10000 # Small batches
    include:
      # Application core - minimal but complete
      - table: app_core.users
        target_percent: 15 # Need variety in users
        additional_cascades:
          # Ensure test data relationships are maintained
          - source_columns: [user_id]
            target_table: business_logic.orders
            target_columns: [created_by]
      - table: app_core.accounts
        target_percent: 8 # Need variety in accounts
      - glob: app_core.permissions
      - glob: app_core.settings
      # Business logic - minimal data for all code paths
      - table: business_logic.products
        target_percent: 10 # Products need variety
        additional_cascades:
          - source_columns: [product_id]
            target_table: external_integrations.inventory_sync
            target_columns: [product_ref]
      - table: business_logic.orders
        target_percent: 1 # Orders can be minimal
      - table: business_logic.categories
        target_percent: 50 # Small table, keep variety
    skip:
      # App core - skip all non-essential data
      - glob: app_core.*_session*
      - glob: app_core.*_cache*
      - glob: app_core.*_log*
      - glob: app_core.*_audit*
      # Business logic - skip historical and analytics
      - glob: business_logic.*_history
      - glob: business_logic.*_analytics*
      # External integrations - skip all tables
      - glob: external_integrations.*
    truncate:
      # Clear test artifacts between runs
      - glob: app_core.test_*
      - glob: app_core.*_cache*
      - table: business_logic.temp_calculations
```
Scenario 4: Multi-Environment Data Refresh
Business Context: Production data refresh for staging/UAT environments. Need production-like data but with privacy protections and cleanup.
Technical Features Demonstrated:
- drop patterns for removing production-only data
- truncate patterns for clearing operational data
- skip patterns for privacy compliance
- Moderate retention for realistic testing
Requirements:
- Remove production-only tables (monitoring, operations)
- Clear sensitive logs and audit trails
- Remove API keys and credentials
- Maintain realistic data volume
```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 25 # Moderate size for staging
    include:
      # Core application data
      - glob: public.users
      - glob: public.accounts
      - glob: public.orders
      - glob: public.products
      - glob: public.transactions
      # Reference data (keep more)
      - glob: reference.*
        target_percent: 75 # Keep most reference data
    skip:
      # Skip sensitive data
      - glob: public.*_pii
      - glob: public.*_ssn
      - glob: public.credit_cards
      - glob: public.*_sensitive
      # Skip operational data
      - glob: public.*_metrics
      - glob: public.*_monitoring
    drop:
      # Remove production-only schemas
      - glob: production_ops.* # Remove operations schema
      - glob: monitoring.* # Remove monitoring schema
      - glob: alerting.* # Remove alerting schema
      # Remove specific production tables
      - table: public.api_keys
      - table: public.service_credentials
      - glob: public.prod_* # Remove prod-prefixed tables
    truncate:
      # Clear operational and audit data
      - glob: public.audit_log* # Keep structure, clear data
      - glob: public.*_session* # Clear sessions
      - glob: public.*_cache* # Clear caches
      - table: public.login_history # Clear login history
      - table: public.api_call_log # Clear API logs
```
Choosing the Right Configuration
When creating your own subsetting configuration, consider:
Data Volume Needs
- Development: 5-15% of production data
- Testing: 2-10% depending on test complexity
- Staging/UAT: 20-30% for realistic scenarios
- Demo/Training: 10-25% for representative, presentable data
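These targets map directly onto defaults.target_percent. A minimal sketch for a staging/UAT refresh, using an illustrative public schema:

```yaml
version: "1.0"
tasks:
  - type: subset
    mode: in_place
    defaults:
      target_percent: 25 # staging/UAT: the 20-30% range
    include:
      - glob: public.* # illustrative: subset the whole application schema
```

Per-table overrides can then raise or lower individual tables from this baseline, as the scenarios above do.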
Pattern Selection Strategy
Use table patterns when:
- You need exact control over specific tables
- You want to apply overrides to specific tables
- You need additional cascades
Use glob patterns when:
- You want to match all tables in a schema (schema.*)
- You need prefix/suffix matching (customer_*, *_temp)
- Simple wildcards are sufficient
Use regex patterns when:
- You need complex matching logic
- Glob patterns aren't precise enough
- You want to match multiple patterns in one rule
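The three pattern types can be mixed in one include list, with exact table entries taking precedence over glob and regex matches (as Scenario 2 demonstrates). A short sketch contrasting them, with illustrative table names:

```yaml
include:
  # table: exact match; can carry per-table overrides and cascades
  - table: sales.orders
    target_percent: 30
  # glob: simple prefix/suffix wildcards
  - glob: sales.order_*
  # regex: complex logic, e.g. two naming variants in one rule
  - regex: "^sales\\.(invoices|receipts)_[0-9]{4}$"
```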
Cleanup Operations
Use skip when:
- You want to exclude tables but preserve them if needed by FK relationships
- You're unsure if tables are safe to remove
Use drop when:
- You're certain tables should be completely removed
- Removing obsolete/deprecated schemas
- Cleaning up staging/temporary tables
Use truncate when:
- You want to keep table structure but clear data
- Clearing logs, caches, sessions
- Resetting counters while preserving schema
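Side by side, the three operations look like this (patterns illustrative): skip leaves tables out of subsetting with their data intact, drop removes them entirely, and truncate empties them while preserving their structure.

```yaml
skip:
  - glob: app.*_history # left out of subsetting, data intact
drop:
  - glob: staging.* # schema removed entirely
truncate:
  - glob: app.*_cache* # structure kept, all rows cleared
```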
Performance Considerations
For large databases:
- Use moderate batch sizes (25,000-50,000)
- Start with higher percentages and reduce gradually
- Test with dry runs first
For fast CI/CD:
- Use very low percentages (1-5%)
- Skip all non-essential data
- Use small batch sizes for quick iterations
For memory-constrained environments:
- Use smaller batch sizes (10,000-20,000)
- Reduce target percentages
- Skip large historical tables
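Both knobs appear under defaults in the scenarios above. A sketch for a memory-constrained run, with illustrative values from the ranges just listed:

```yaml
defaults:
  target_percent: 5 # lower target to shrink the working set
  batch_size: 10000 # smaller batches to limit memory use
```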
Business Requirements
Regulatory compliance:
- Skip or drop PII/sensitive data
- Keep audit trails if required
- Document data handling in compliance docs
Privacy concerns:
- Use skip patterns for detailed PII
Data variety:
- Higher percentages for key business entities
- Lower percentages for transactional detail
- Consider relationships when setting percentages
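For example, a privacy-conscious configuration might keep key business entities at higher percentages while trimming transactional detail and skipping PII (table names illustrative):

```yaml
include:
  - table: app.customers
    target_percent: 20 # key business entity: keep variety
  - table: app.order_lines
    target_percent: 5 # transactional detail: keep less
skip:
  - glob: app.*_pii* # illustrative: exclude detailed PII tables
```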
For more specific guidance, see Advanced Features or Configuration Reference.