How to Automatically Redact PII from PDFs and Documents Before Sharing

Covering text with a black box doesn't erase it. The federal courts found out the hard way: thousands of Social Security numbers were left exposed in so-called "redacted" filings, as reported in an official Federal Judicial Center study. This is a technical guide to redacting PII correctly: every layer of the document, automatically, at the scale real organizations actually need.

4,525: federal court documents found with unredacted SSNs (~22,000 numbers across ~8,300 individuals)
$7.42M: average healthcare data breach cost in 2025, the highest of any industry for the 14th year running
$10.22M: average US data breach cost in 2025, an all-time record, driven by regulatory fines
5 layers: distinct content layers in a typical PDF (an ISO 32000 container format) where PII can hide: text, images, OCR, form fields, metadata

The most dangerous misconception about document security is that drawing a black box over sensitive text makes it go away. It does not. In most PDF tools a black rectangle is a graphical overlay, a shape drawn *over* the text, and the text itself stays fully intact in the data layer of the file. Anyone can select the area, copy it, paste the "redacted" text into a text editor and read it. This is not a theoretical corner case. It is the single most common cause of redaction failures, and it has burned governments, corporations and law firms repeatedly.

Redaction is the process of permanently deleting the underlying data: from the text layer, the metadata, the embedded images and every other place it can hide. Doing that reliably by hand is not realistic at the volume of documents a real organization handles. This guide explains exactly how automatic PII redaction works, why PDFs are the hardest format to redact, what a correct pipeline looks like, and how to test whether your method really works.

What a "black box" redaction actually looks like in the file
The "hidden" text under the black bars is still right there in the file:
Patient: Sarah Jenkins, SSN 123-45-6789
Diagnosis: Type 2 Diabetes, Plan ID HMO-99182
Black rectangles are simply drawn on top of the text – the characters below are left completely intact in the document's data layer. Copy-paste, "select all," or any text extractor pulls them out completely. This is how the Paul Manafort court filing leaked in 2019 – journalists copied the blacked-out text straight out of the PDF, exposing sealed information the defense had fought to keep secret.

Why PDFs Are Uniquely Hard to Redact

A PDF is not a flat document the way a text file is; it is a container format. The same file can hold five different types of content at the same time, and PII can be in any of them. A redaction tool that only works on the visible text layer will miss everything else, and in a compliance context missed PII is a breach in the making.

The five layers of a PDF, and where PII hides
1 Native text layer: selectable text
2 Embedded images: ID cards, signatures
3 Scanned pages (image-of-text): needs OCR
4 Form field values: interactive fields
5 Metadata (XMP / DocInfo): author, history, comments
A text-only redactor misses four of the five.

This layered architecture is why redaction is more difficult than it seems. Consider a mixed PDF with 50 pages of digital text and 10 scanned inserts. If the tool does not run OCR on the scanned pages, those 10 pages pass through with all their PII intact, even if the text layer was redacted completely. And a document titled "Patient Intake: Jane Doe, DOB 03/14/1972" in its metadata is just as non-compliant as an unredacted form field, no matter how clean the visible content looks.
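To make the layer problem concrete, here is a minimal sketch. It assumes the PDF has already been parsed into one list of strings per layer; the `document` dictionary, the layer names and both helper functions are hypothetical stand-ins for illustration, not any real library's API. The detector only looks for SSNs and dates of birth, which is enough to show what a text-only redactor leaves behind.

```python
import re

# Hypothetical stand-in for a parsed PDF: one list of strings per content layer.
document = {
    "text": ["Patient: Sarah Jenkins, SSN 123-45-6789"],
    "images": [],
    "scanned_pages": ["Intake form. SSN 123-45-6789"],  # what OCR would return
    "form_fields": ["123-45-6789"],
    "metadata": ["Patient Intake, Jane Doe, DOB 03/14/1972"],
}

SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
DOB = re.compile(r"\bDOB \d{2}/\d{2}/\d{4}\b")


def redact_text_layer_only(doc):
    """Mimic a text-only redactor: clean the 'text' layer, leave the rest."""
    cleaned = dict(doc)
    cleaned["text"] = [SSN.sub("[REDACTED]", s) for s in doc["text"]]
    return cleaned


def layers_with_pii(doc):
    """Return the names of layers that still contain an SSN or a DOB."""
    return sorted(
        name
        for name, chunks in doc.items()
        if any(SSN.search(c) or DOB.search(c) for c in chunks)
    )


remaining = layers_with_pii(redact_text_layer_only(document))
print(remaining)
```

In this toy document the text layer comes back clean, yet the scanned page, the form field and the metadata still carry identifiers.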

The Redaction Failures That Made Headlines

These are not hypotheticals. Each of these is a real, documented redaction failure – and each one is traceable back to a layer that was not handled.

Federal Courts PACER filings: Metadata not scrubbed

According to a recent study by the Federal Judicial Center, 4,525 federal court documents on the PACER system contained unredacted Social Security numbers: about 22,000 SSNs belonging to approximately 8,300 people. The failures included SSNs left in image layers and documents where redaction was attempted but did not succeed.

Lesson: redaction must process every layer, including image layers and metadata (Layer 5), not just visible content.
Paul Manafort court filing: Black-box overlay

In a high-profile federal case, redactions were made as black boxes drawn over text in a PDF. The text underneath was never deleted – a simple copy-and-paste revealed the supposedly hidden content, including sensitive information the court wanted to seal.

Lesson: overlays don't remove data. The text layer must be deleted, not covered.
FTC v. Microsoft (2023): Sharpie + scan failure

In the FTC's case against Microsoft's purchase of Activision, Sony's documents submitted to the FTC were redacted with a black Sharpie. Once the pages were scanned into the digital filing, the redacted numbers were still visible, revealing confidential development cost figures that the parties had fought to keep under seal.

Lesson: physical redaction can fail too — OCR and high-resolution scans recover "blacked-out" text.

"True redaction strips the underlying data from the file. A black rectangle that merely sits on top of the text is not redaction — it's a cover that anyone can lift."

— Polygraf AI, on the most common cause of redaction failures

True Redaction vs. The Methods That Fail

Not all "redaction" is the same. Here's how the popular methods actually stack up in terms of whether the data is really gone.

Method | Data actually removed? | Recoverable by | Safe to share?
Black box / shape overlay | No | Copy-paste, text extraction, moving the shape | Never
Highlight in black / font color change | No | Select-all, changing colors back | Never
Flatten to image only | Partially | OCR re-extraction; metadata often survives | Not alone
True redaction (text layer deletion) | Yes | Nothing: text is gone from the data layer | If metadata also handled
Pixel-burn + metadata scrub | Yes | Nothing: no recoverable layer remains | Yes
The Copy-Paste Test

Before you post a "redacted" file, run the simplest test there is: open the final file, select all the text (Ctrl+A or Cmd+A), copy it and paste it into a plain text editor. If any of the supposedly redacted content shows up, even once, the redaction failed and the file is not safe to post. This 10-second test detects overlay-based "redaction" immediately and would have caught the Manafort leak. It does not inspect scanned images or metadata, so check those separately: look at scanned pages directly, and use the file properties or a metadata viewer to verify that no PII remains in author, title or custom fields.
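The copy-paste test can also be scripted for batches. The sketch below assumes you already have the text pulled out of the final file by whatever extractor you use, plus the list of values that were supposed to be removed; `normalize` and `leaked_values` are hypothetical helper names. Stripping punctuation and whitespace before comparing means a value that survives with different spacing still counts as a leak, at the cost of occasional false alarms on very short values.

```python
import re


def normalize(s: str) -> str:
    """Lowercase and drop everything except letters and digits, so that
    '123-45-6789', '123 45 6789' and '123456789' all compare equal."""
    return re.sub(r"[^0-9a-z]", "", s.lower())


def leaked_values(extracted_text: str, redacted_values: list[str]) -> list[str]:
    """Return every supposedly redacted value that still appears in the
    text extracted from the final file."""
    haystack = normalize(extracted_text)
    return [
        value
        for value in redacted_values
        if normalize(value) and normalize(value) in haystack
    ]


# Text as it might come back from the "redacted" output file.
extracted = "Patient: [REDACTED]\nSSN 123 45 6789\nPlan ID [REDACTED]"
should_be_gone = ["Sarah Jenkins", "123-45-6789", "HMO-99182"]

leaks = leaked_values(extracted, should_be_gone)
if leaks:
    print("Redaction failed, still present:", leaks)
```

An empty result only says the text layer is clean. Scanned pages and metadata still need their own checks.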

⊘ Manual redaction
2–4 hrs per 100-page document, across 10–15 PII categories
  • Fatigue makes mistakes – a name in a footnote, a number in a different format
  • No two reviewers redact the same way – inconsistent and hard to justify
  • Collapses at scale: works for one file, fails for fifty
  • Metadata and scanned-page layers get forgotten
✓ Automated pipeline
minutes for the same document — with consistent rules on every page
  • Same rules for every page, every file, every time
  • Processes all layers — text, images, OCR, form fields, metadata
  • Scales to thousands of documents without additional review hours
  • Creates an audit log that makes every redaction defensible

The Automated Redaction Pipeline — Seven Stages

A production-grade automatic redaction pipeline is not a single operation. It is seven distinct stages, each responsible for a different part of the problem. Skip any one of them and you leave a gap that PII will slip through.

1. Ingestion

PDFs, Word files, Excel sheets and scanned images arrive in the pipeline. The file type is recognized and the file is prepared for layer-aware processing. Batch ingestion is what enables scale: hundreds or thousands of files queued at once rather than handled one at a time.
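One common way to recognize a file type at ingestion is to look at its leading bytes rather than trust the extension. The sketch below is an illustration under that assumption, not a description of any particular product; `SIGNATURES`, `sniff_type` and `sniff_file` are hypothetical names, and the list covers only the formats mentioned above.

```python
# Leading bytes ("magic numbers") for formats the pipeline is assumed to accept.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "zip-container"),  # .docx and .xlsx files are ZIP archives
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
]


def sniff_type(header: bytes) -> str:
    """Classify a file from its first few bytes."""
    for magic, kind in SIGNATURES:
        if header.startswith(magic):
            return kind
    return "unknown"


def sniff_file(path) -> str:
    """Read only the first 8 bytes of a file and classify it."""
    with open(path, "rb") as fh:
        return sniff_type(fh.read(8))


print(sniff_type(b"%PDF-1.7\n"))
print(sniff_type(b"PK\x03\x04\x14\x00\x06\x00"))
```

A file that comes back as "unknown" is better held for manual handling than pushed through a processor that may skip its content.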

2. Content extraction (Critical)

The system finds and extracts all the content layers (selectable text, embedded images, form-field values, metadata) separately, because each one requires a different analysis method. This is the stage that distinguishes real redaction from text-only tools.

3. OCR processing (Critical)

Scanned pages and image layers are run through OCR so that the visual content becomes machine-readable text. Without this step, PII stored as pixels is invisible to the detection stage. OCR quality directly affects downstream accuracy: enterprise-quality OCR can read handwriting, low-resolution scans and mixed-language content.

4. PII detection (ML + pattern)

The extracted text is passed through a detection stage that combines pattern matching (for structured data such as SSNs and card numbers) with ML-based named-entity recognition (for contextual PII such as names, addresses and medical record numbers that regex cannot catch). Context matters: "Jordan" can be a name, a country or a brand, and only context-aware detection can tell the difference.
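The pattern-matching half of this stage can be sketched with nothing but regular expressions and a checksum. The example below is an illustration, not a production detector: `Detection`, `luhn_ok` and `detect_structured` are hypothetical names, the SSN pattern expects the clean dashed format, and the Luhn check is used to filter out digit runs that merely look like card numbers. The ML-based entity recognition for names and addresses is outside what a sketch like this can show.

```python
import re
from typing import NamedTuple


class Detection(NamedTuple):
    kind: str
    start: int
    end: int
    text: str


SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b(?:\d[ -]?){12,18}\d\b")


def luhn_ok(digits: str) -> bool:
    """Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def detect_structured(text: str) -> list[Detection]:
    """Find dashed SSNs and Luhn-valid card numbers, in document order."""
    found = [
        Detection("ssn", m.start(), m.end(), m.group())
        for m in SSN_RE.finditer(text)
    ]
    for m in CARD_RE.finditer(text):
        if luhn_ok(re.sub(r"\D", "", m.group())):
            found.append(Detection("card", m.start(), m.end(), m.group()))
    return sorted(found, key=lambda d: d.start)


sample = "SSN 123-45-6789, card 4111 1111 1111 1111, order 1234 5678 9012 3456"
for d in detect_structured(sample):
    print(d.kind, repr(d.text))
```

Here the SSN and the test card number are flagged, while the 16-digit order number fails the checksum and is left alone. That is the kind of false positive a bare digit-count regex would produce.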

5. Redaction application (Critical)

The PII that is found is removed at the content level – text from the text layer, image areas filled with solid blocks, form-field values cleared and the page often re-rendered so no underlying data remains. This is removal, not masking. The output is black bars or replacement characters where the data was and nothing underneath is recoverable.
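At the level of a single extracted string, removal rather than masking looks like the sketch below. It assumes detections arrive as character offsets; `apply_redactions` is a hypothetical helper, and the fixed-width marker is a design choice made here so the output does not reveal how long the removed value was. In a real PDF the same operation has to happen in the page content itself, which this string-level example does not attempt.

```python
def apply_redactions(text: str, spans: list[tuple[int, int]], mark: str = "█") -> str:
    """Replace each [start, end) span with a fixed-width marker.

    The original characters are not carried into the output at all,
    so nothing sits underneath the marker to be recovered.
    """
    # Merge overlapping or touching spans so no fragment survives between them.
    merged: list[tuple[int, int]] = []
    for start, end in sorted(spans):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))

    out, cursor = [], 0
    for start, end in merged:
        out.append(text[cursor:start])
        out.append(mark * 8)
        cursor = end
    out.append(text[cursor:])
    return "".join(out)


line = "Patient: Sarah Jenkins, SSN 123-45-6789"
name_at = line.index("Sarah Jenkins")
ssn_at = line.index("123-45-6789")
redacted = apply_redactions(
    line,
    [(name_at, name_at + len("Sarah Jenkins")), (ssn_at, ssn_at + len("123-45-6789"))],
)
print(redacted)
```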

6. Metadata scrubbing (Critical)

Author names, titles, creation date, custom properties, tracked changes, comments and revision history are removed. This is the stage the PACER incident missed. A perfectly redacted body still leaks if the metadata names the patient. Every complete pipeline must process metadata along with the content.
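One way to approach scrubbing is an allowlist: keep only the fields you have decided are safe and drop everything else, so a property nobody thought of cannot slip through. The sketch below assumes the metadata has already been read into a plain dictionary; the field names, `KEEP_FIELDS` and `scrub_metadata` are hypothetical and chosen for illustration.

```python
# Hypothetical allowlist: fields assumed safe to keep after redaction.
KEEP_FIELDS = {"page_count", "pdf_version"}


def scrub_metadata(metadata: dict) -> tuple[dict, list[str]]:
    """Keep only allowlisted fields; return the kept fields and the
    names of everything that was dropped (useful for the audit log)."""
    kept = {k: v for k, v in metadata.items() if k in KEEP_FIELDS}
    removed = sorted(k for k in metadata if k not in KEEP_FIELDS)
    return kept, removed


original_metadata = {
    "title": "Patient Intake, Jane Doe, DOB 03/14/1972",
    "author": "intake-desk-3",
    "creation_date": "2025-01-15",
    "comments": ["check DOB with patient"],
    "page_count": 60,
    "pdf_version": "1.7",
}

kept, removed = scrub_metadata(original_metadata)
print(kept)
print(removed)
```

A denylist of known risky fields would be shorter to write, but it fails open: any custom property that is not on the list survives.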

7. Quality assurance + audit log

A sampling step in which a reviewer inspects a percentage of the output – even a 5% spot-check – can prevent a misconfiguration from impacting thousands of pages. Every redaction is logged: what was redacted, when, by whom, and by what rules. This audit trail is what makes the redaction defensible in court or a regulatory review.
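A spot-check and an audit entry can both be very small. The sketch below is one possible shape, with hypothetical names throughout (`sample_for_review`, `audit_record` and every field in the record). Two choices are worth noting: the sample is seeded so the same pages can be re-drawn later, and the log records what kind of value was removed and under which rule, but never the value itself.

```python
import json
import math
import random
from datetime import datetime, timezone


def sample_for_review(page_ids, rate=0.05, seed=0):
    """Pick a reproducible sample of pages for a human spot-check.
    At least one page is always chosen when there is any output."""
    pages = list(page_ids)
    if not pages:
        return []
    k = max(1, math.ceil(len(pages) * rate))
    return sorted(random.Random(seed).sample(pages, k))


def audit_record(doc_id, page, kind, rule, actor):
    """Describe one redaction: what, where, when, by whom, under which rule.
    The redacted value itself is deliberately not stored."""
    return {
        "doc_id": doc_id,
        "page": page,
        "kind": kind,
        "rule": rule,
        "actor": actor,
        "when": datetime.now(timezone.utc).isoformat(),
    }


to_review = sample_for_review(range(1, 101))
entry = audit_record("intake-0042", 7, "ssn", "pattern:ssn-dashed", "pipeline-v1")
print(to_review)
print(json.dumps(entry, sort_keys=True))
```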

Why Human-in-the-Loop Still Matters

The best automated workflows are not fully automated, but rather high-recall detection plus human confirmation. The AI finds PII candidates with high recall (almost everything, including the edge cases that a tired reviewer would miss after page 80) and a human confirms the detections and adds anything industry-specific the model does not recognize. This is faster and more accurate than a purely manual review (which is subject to fatigue-related errors) and safer than fully automated redaction with no checkpoint. For high-volume, low-sensitivity workflows, full automation with periodic sampling is appropriate; for high-sensitivity documents, confirmation before export is worth the extra minute.

The Hardest Part: Detection Accuracy on Real Documents

The redaction step itself (removing text, scrubbing metadata) is well understood. The hard problem is detection: reliably finding every piece of PII in messy real-world documents. This is where most tools fail silently, and it is worth understanding why.

Pattern matching alone is brittle. A Social Security number is easy to match when it is formatted 123-45-6789, but OCR on a scanned form might produce 123 45 6789 with errant spaces, or read a name like "Dr. O'Brien" as "Dr. O8rien." Names, addresses and contextual identifiers have no fixed pattern at all; they need ML-based named-entity recognition that understands context. And quasi-identifiers (a ZIP code, a job title, a date of birth) look harmless on their own but can re-identify a person when combined. Strong detection has to catch all three: structured patterns, contextual entities and dangerous combinations.
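Two of these problems can be sketched in a few lines. The first function loosens the SSN pattern so OCR spacing noise still matches. The second uses a k-anonymity style count to flag combinations of quasi-identifiers that point to fewer than `k` people in a dataset. Both are illustrations under stated assumptions: `find_loose_ssns` and `risky_combinations` are hypothetical names, the record fields are made up, and neither replaces contextual ML detection for names and addresses.

```python
import re
from collections import Counter

# Tolerates OCR spacing noise: '123-45-6789', '123 45 6789', '123 - 45 - 6789'.
# A separator is still required, so bare 9-digit runs are not flagged.
LOOSE_SSN = re.compile(r"(?<!\d)\d{3}[\s.-]+\d{2}[\s.-]+\d{4}(?!\d)")


def find_loose_ssns(text: str) -> list[str]:
    """Return SSN-shaped matches as bare 9-digit strings."""
    return [re.sub(r"\D", "", m.group()) for m in LOOSE_SSN.finditer(text)]


def risky_combinations(records, keys=("zip", "dob", "gender"), k=2):
    """Return quasi-identifier combinations shared by fewer than k records.
    A combination held by a single person can re-identify them without a name."""
    counts = Counter(tuple(r[key] for key in keys) for r in records)
    return [combo for combo, n in counts.items() if n < k]


ocr_text = "SSN: 123 45 6789 and 987 - 65 - 4321"
print(find_loose_ssns(ocr_text))

records = [
    {"zip": "02139", "dob": "1972-03-14", "gender": "F"},
    {"zip": "02139", "dob": "1972-03-14", "gender": "F"},
    {"zip": "94105", "dob": "1980-07-01", "gender": "M"},
]
print(risky_combinations(records))
```

The looser the pattern, the more false positives it produces, which is one reason a human confirmation step pairs well with high-recall detection.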

Where Polygraf AI Fits

Polygraf AI's detection engine was designed for this exact problem: finding PII across the full spectrum of how it actually appears, not just in clean structured formats. It combines pattern matching for structured identifiers with context-aware ML detection for the names, addresses, medical record numbers and quasi-identifiers that generic tools miss, across the categories that map to HIPAA's 18 identifiers, GDPR and financial PII. Detection runs on-premise with zero data egress (documents never leave your environment for analysis), and every detection is logged for the audit trail that compliance requires. For organizations that need to redact before sharing, the same engine that protects AI prompts protects documents.

Your Pre-Share Redaction Checklist

Before any document with PII leaves your organization, do this. This is a direct mapping to the failure modes above.

Before you hit "share"
  • The text layer is completely removed, not covered with a box. Verified with the copy-paste test.
  • Scanned pages and embedded images were OCR'd and checked, not passed through as uninspected pixels.
  • Metadata is removed: author, title, creation software, custom properties, tracked changes, and comments.
  • Form-field values were cleared, not just hidden.
  • Quasi-identifiers were considered: ZIP + DOB + gender can re-identify a person even without a name.
  • A final copy was opened in a fresh viewer, preferably on a different device, to make sure nothing was left behind.
  • The redaction is logged (what, when, by whom, on what basis) for a defensible audit trail.
