DataMasque Portal

Word Document Masking

The unstructured_text mask can be applied to Microsoft Word documents (.docx) via mask_file tasks, and to Word documents stored in binary columns via mask_table tasks (see File Type Detection). Text is extracted from the document, entities are detected and masked, and the document is then rebuilt with the masked text in place.

Supported Content

The following document elements are extracted and rebuilt, except where an item notes different handling:

  • Paragraphs with full run formatting (font, size, colour, bold, italic, underline, strikethrough, caps, superscript/subscript, highlight, shading, hidden text)
  • Character styles (e.g. Hyperlink, Strong)
  • Hyperlink display text (URLs are discarded to avoid PII leakage)
  • Paragraph properties (styles, alignment, indentation, spacing, line spacing, shading, borders, tab stops)
  • Pagination properties (page break before, keep with next, keep lines together)
  • Numbering and bullets
  • Tables, including nested tables, merged cells (horizontal and vertical), borders, shading, cell widths, row heights, header rows, and cell vertical alignment
  • Headers and footers (default, first-page, and even-page variants)
  • Section properties (page size, margins, orientation, columns, section break type, page numbering, page borders, form protection, text direction)
  • Inline and anchored drawings, with position, sizing, and wrapping preserved
    • Images: see image_handling below
    • Non-image drawings (charts, SmartArt, shapes, etc.) are always replaced with a black rectangle
  • Live pagination field codes (PAGE, NUMPAGES, SECTIONPAGES) that Word recalculates on open
  • Non-pagination field codes (AUTHOR, DATE, MERGEFIELD, etc.) are stripped to static display text
  • Document-level settings (default tab stop, auto-hyphenation, hyphenation zone, locale, theme font language, compatibility settings)
  • styles.xml, numbering.xml, and theme1.xml are copied from the source document unchanged
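
The field code handling in the list above amounts to a two-way classification: pagination fields stay live, everything else becomes static text. The following is a minimal sketch of that decision only, not DataMasque's implementation. The function name is_live_pagination_field is hypothetical, and the sketch assumes the field name is the first whitespace-separated token of the field instruction text.

```python
# Field names that the list above describes as remaining live.
LIVE_PAGINATION_FIELDS = {"PAGE", "NUMPAGES", "SECTIONPAGES"}


def is_live_pagination_field(instruction: str) -> bool:
    """Return True if a field instruction names a live pagination field.

    Hypothetical helper: assumes the field name is the first token of
    the instruction text, e.g. " PAGE \\* MERGEFORMAT ".
    """
    tokens = instruction.split()
    return bool(tokens) and tokens[0].upper() in LIVE_PAGINATION_FIELDS
```

Under this sketch, an instruction such as AUTHOR or MERGEFIELD returns False, meaning the field would be reduced to its static display text.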

Unsupported Content

These elements are silently removed during masking:

  • Structured document tags / content controls (including TOC wrappers)
  • Footnotes and endnotes
  • Comments and track changes
  • Equations (Office Math ML)
  • Embedded OLE objects
  • Bookmarks and cross-reference targets
  • Run-level bidirectional text properties
  • Drop caps, watermarks, page background
  • Document metadata (core.xml, app.xml), digital signatures, and macros

image_handling

Controls how images are handled in Word documents. This parameter only applies to Word document masking and has no effect on text columns.

Value             Description
redact (default)  Replaces all images with a black rectangle.
retain            Preserves original images.

Note: Image properties (alt text, title, description) are always removed regardless of this setting.

File Type Detection

Word document masking is supported in two contexts:

  • In mask_file tasks, the file extension (.docx) determines the file type.
  • In mask_table tasks, binary column values are inspected: a valid ZIP archive containing a word/document.xml entry is treated as a Word document. This allows masking of Word documents stored as blobs in database columns.
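
The binary-column check described above can be approximated with Python's standard zipfile module. This is a minimal sketch of the stated rule (a valid ZIP archive containing a word/document.xml entry), not DataMasque's actual detection code; the function name looks_like_docx is hypothetical.

```python
import io
import zipfile


def looks_like_docx(blob: bytes) -> bool:
    """Return True if blob is a ZIP archive with a word/document.xml entry.

    Hypothetical helper illustrating the detection rule only.
    """
    try:
        with zipfile.ZipFile(io.BytesIO(blob)) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Not a valid ZIP archive, so it cannot be a .docx file.
        return False
```

A check like this also explains why other ZIP-based formats are not mistaken for Word documents: they are valid archives but lack the word/document.xml entry.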

Limitations

  • Images may contain PII that text-based matchers cannot detect. Use image_handling: redact (the default) when images might contain sensitive data.
  • Masking can change the length of text, which may cause content to reflow across pages. Live pagination fields (PAGE, NUMPAGES, SECTIONPAGES) update when the document is opened in Word, but any layout that assumes text of a fixed length may break.
  • Word documents support only from_blob as a fallback mask. Other fallback mask types are not supported for binary documents.