Word Document Masking
The unstructured_text mask can be applied to Microsoft Word documents (.docx) via mask_file tasks.
Text is extracted from the document, entities are detected and masked,
then the document is rebuilt with the masked text in place.
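The extract, mask, and rebuild flow can be pictured with a small conceptual sketch. It is not the product's implementation: the regex email detector stands in for real entity detection, and `mask_document_xml`, `EMAIL`, and the `[EMAIL]` replacement are hypothetical names chosen for illustration. It only shows how text can be replaced inside `w:t` elements while the surrounding run formatting is left alone.

```python
import re
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, used by elements such as w:p, w:r and w:t.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical stand-in for entity detection: a simple email pattern.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def mask_document_xml(xml_text: str, replacement: str = "[EMAIL]") -> str:
    """Mask emails inside w:t text nodes, leaving all other markup untouched."""
    root = ET.fromstring(xml_text)
    for text_node in root.iter(f"{{{W_NS}}}t"):
        if text_node.text:
            text_node.text = EMAIL.sub(replacement, text_node.text)
    return ET.tostring(root, encoding="unicode")
```

A per-run approach like this would miss entities split across several runs, which is one reason a real implementation extracts the text first and rebuilds the document afterwards.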
Supported Content
The following document elements are extracted and rebuilt:
- Paragraphs with full run formatting (font, size, colour, bold, italic, underline, strikethrough, caps, superscript/subscript, highlight, shading, hidden text)
- Character styles (e.g. `Hyperlink`, `Strong`)
- Hyperlink display text (URLs are discarded to avoid PII leakage)
- Paragraph properties (styles, alignment, indentation, spacing, line spacing, shading, borders, tab stops)
- Pagination properties (page break before, keep with next, keep lines together)
- Numbering and bullets
- Tables, including nested tables, merged cells (horizontal and vertical), borders, shading, cell widths, row heights, header rows, and cell vertical alignment
- Headers and footers (default, first-page, and even-page variants)
- Section properties (page size, margins, orientation, columns, section break type, page numbering, page borders, form protection, text direction)
- Inline and anchored drawings, with position, sizing, and wrapping preserved
  - Images: see `image_handling` below
  - Non-image drawings (charts, SmartArt, shapes, etc.) are always replaced with a black rectangle
- Live pagination field codes (`PAGE`, `NUMPAGES`, `SECTIONPAGES`) that Word recalculates on open
- Non-pagination field codes (`AUTHOR`, `DATE`, `MERGEFIELD`, etc.) are stripped to static display text
- Document-level settings (default tab stop, auto-hyphenation, hyphenation zone, locale, theme font language, compatibility settings)
- `styles.xml`, `numbering.xml`, and `theme1.xml` are copied from the source document unchanged
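The field-code rule in the list above can be sketched as a small classifier. This is an illustration only: `LIVE_PAGINATION_FIELDS`, `is_live_pagination_field`, `rebuild_field`, and the dictionary shapes they return are hypothetical, and the sketch assumes that the field type is the first token of the field instruction.

```python
# Field types that stay live, as listed above; everything else becomes static text.
LIVE_PAGINATION_FIELDS = {"PAGE", "NUMPAGES", "SECTIONPAGES"}


def is_live_pagination_field(instruction: str) -> bool:
    """Return True if the instruction's first token names a pagination field."""
    tokens = instruction.split()
    return bool(tokens) and tokens[0].upper() in LIVE_PAGINATION_FIELDS


def rebuild_field(instruction: str, display_text: str) -> dict:
    """Keep pagination fields live; reduce any other field to its display text."""
    if is_live_pagination_field(instruction):
        return {"kind": "field", "instruction": instruction.strip()}
    return {"kind": "text", "text": display_text}
```

For example, `rebuild_field("AUTHOR", "Jane Doe")` yields a plain text item, so the author name no longer updates when the document is opened.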
Unsupported Content
These elements are silently removed during masking:
- Structured document tags / content controls (including TOC wrappers)
- Footnotes and endnotes
- Comments and track changes
- Equations (Office Math ML)
- Embedded OLE objects
- Bookmarks and cross-reference targets
- Run-level bidirectional text properties
- Drop caps, watermarks, page background
- Document metadata (`core.xml`, `app.xml`), digital signatures, and macros
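Because these elements are removed without warning, it can help to inspect a document before masking it. The sketch below is a rough pre-flight check, not part of the product: `PARTS_OF_INTEREST` and `parts_that_will_be_dropped` are hypothetical names, and the part paths are the conventional ones Word writes (a package may legally use different names). It covers only items that live in their own package part; content controls, equations, and bookmarks sit inside `word/document.xml` and are not detected here.

```python
import io
import zipfile

# Conventional part names (an assumption: the package relationships are the
# authoritative source, and other producers may name parts differently).
PARTS_OF_INTEREST = {
    "word/footnotes.xml": "footnotes",
    "word/endnotes.xml": "endnotes",
    "word/comments.xml": "comments",
    "word/vbaProject.bin": "macros",
    "docProps/core.xml": "document metadata (core.xml)",
    "docProps/app.xml": "document metadata (app.xml)",
}


def parts_that_will_be_dropped(blob: bytes) -> list[str]:
    """List labels of known parts present in the .docx package."""
    with zipfile.ZipFile(io.BytesIO(blob)) as archive:
        names = set(archive.namelist())
    return sorted(label for part, label in PARTS_OF_INTEREST.items() if part in names)
```

Note that Word often writes `word/footnotes.xml` even when the document has no visible footnotes, so a hit means "worth checking" rather than "content will be lost".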
image_handling
Controls how images are handled in Word documents. This parameter only applies to Word document masking and has no effect on text columns.
| Value | Description |
|---|---|
| `redact` (default) | Replaces all images with a black rectangle. |
| `retain` | Preserves original images. |
Note: Image properties (alt text, title, description) are always removed regardless of this setting.
File Type Detection
Word document masking is supported in two contexts:
- In `mask_file` tasks, the file extension (`.docx`) determines the file type.
- In `mask_table` tasks, binary column values are inspected: a valid ZIP archive containing a `word/document.xml` entry is treated as a Word document. This allows masking of Word documents stored as blobs in database columns.
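The content-based check described for `mask_table` can be approximated with Python's standard `zipfile` module. This is a minimal sketch of the stated rule (a valid ZIP archive that contains `word/document.xml`), not the product's actual detection code; `looks_like_docx` is a hypothetical helper name.

```python
import io
import zipfile


def looks_like_docx(blob: bytes) -> bool:
    """Return True if blob is a ZIP archive with a word/document.xml entry."""
    buffer = io.BytesIO(blob)
    if not zipfile.is_zipfile(buffer):
        return False
    try:
        with zipfile.ZipFile(buffer) as archive:
            return "word/document.xml" in archive.namelist()
    except zipfile.BadZipFile:
        # Archive had a ZIP signature but could not be read.
        return False
```

A helper like this could be used to preview which rows of a blob column would be treated as Word documents, for example `looks_like_docx(row["attachment"])`, where the column name is purely illustrative.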
Limitations
- Images may contain PII that text-based matchers cannot detect. Use `image_handling: redact` (the default) when images might contain sensitive data.
- Masking can change the length of text, which may cause content to reflow across pages. Live pagination fields (`PAGE`, `NUMPAGES`) will update when the document is opened in Word, but fixed layout assumptions may break.
- The fallback mask for Word documents only supports `from_blob`. Other fallback mask types are not supported for binary documents.