The PDF Wall: Why Your Clinical RAG Keeps Hallucinating
Your clinical RAG system is not hallucinating because the model is bad. It is hallucinating because your document pipeline is broken. Here is what clinical PDF parsing actually requires, and why Docling is the fix.
The model sounds completely confident. It cites a source, gives a dosing recommendation, references a section number. You check the original PDF. The section does not exist.
Your first instinct is the prompt. You try rephrasing, adding context, adding constraints. Nothing changes. So you start suspecting the model itself: maybe try a different temperature, a larger context window, a different provider altogether.
You spend three hours debugging the wrong layer.
The actual problem is upstream of the model. It is upstream of your prompt. It is in the document processing step you ran before any of that, probably in under ten seconds, probably with a library you found in a tutorial, probably without thinking much about it.
Not the model wall. Not the cost wall. Not even the ethics wall.
The PDF wall.
This is not a model problem. It is a document pipeline problem. It lives entirely upstream of the model, and it is invisible until you know where to look.
Where Clinical Data Lives (and Why It Hurts)
Before any RAG system can work, someone has to get the documents in. For physician-builders, the documentary reality of clinical practice is predominantly PDF.
Outside records come as scanned PDFs. Lab panels come as faxed PDFs. SMFM consult series come as PDF chapters. The EvidenceMD summary your institution licensed is a PDF. The textbook chapter you annotated is a PDF. The protocols your pathway is built on are PDFs.
The false assumption embedded in most tutorials is this: “I can just extract the text.”
What you get back from naive PDF-to-text extraction is not text. It is chaos. Reading order collapses. A two-column SMFM guideline becomes a string where left and right columns interleave randomly. Tables become streams of disconnected numbers with no row or column relationships. Headers detach from the content they introduce. Multi-column layouts destroy the logical structure of the document entirely.
You paste that into a prompt and the model sounds confused because the input is confused. The model is doing exactly what you asked. The problem is what you gave it to read.
What RAG Actually Requires (The Silent Assumption)
Most RAG tutorials skip an assumption buried in the architecture. It is worth naming explicitly before we go further.
RAG works by splitting your documents into chunks, converting each chunk into a vector, storing those vectors in a database, and retrieving the most relevant chunks at query time to pass to the model as context. If the pattern is new to you, the full architecture and its clinical implications are explained in this post.
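If it helps to see that loop as code, here is a minimal sketch. The embed() and llm() functions are hypothetical stand-ins for whatever embedding model and LLM you plug in, and a real system would precompute and store the chunk vectors rather than embedding them on every query.

import numpy as np

def cosine(a, b):
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def answer(query, chunks, embed, llm):
    # embed() and llm() are hypothetical stand-ins, not a real API
    query_vec = embed(query)
    # Rank every chunk by semantic similarity to the query
    ranked = sorted(chunks, key=lambda c: cosine(embed(c), query_vec), reverse=True)
    context = "\n\n".join(ranked[:3])  # top three chunks become the context
    return llm(f"Answer using only this context:\n\n{context}\n\nQuestion: {query}")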
The silent assumption: the chunks have to make sense.
If your PDF parser collapsed the reading order, your chunks contain fragments from different parts of the document stitched together at random. The chunk that should contain the SMFM recommendation on antenatal corticosteroids at 34 weeks instead contains half a dosing table, three unrelated words from the column to its left, and a page footer.
The retrieval step finds that chunk because the query matches some of its words. The model reads it. The response sounds authoritative and is wrong.
The retrieval logic is blameless. It found the closest semantic match it could find. The closest match was garbage.
I thought the model was hallucinating. It was. But the model was doing exactly what I asked: reasoning over corrupted input that I had built.
Fix the document layer. The model gets dramatically better without any other change.
The Anatomy of a Bad Parse: Show, Don’t Tell
Here is what this looks like in practice.
The same clinical document, processed two ways. The source is a preterm surveillance protocol: a multi-column table listing recommended monitoring frequency by gestational age for fetuses with absent end-diastolic velocity.
# Path 1: naive extraction -- what most tutorials default to
import pdfplumber

with pdfplumber.open("surveillance-protocol.pdf") as pdf:
    # extract_text() can return None on image-only pages, so guard with ""
    text = "\n".join(page.extract_text() or "" for page in pdf.pages)

# What you get back: unstructured string, reading order collapsed
# Example output fragment:
# "28 wks Twice weekly BPP NST every other visit 32 wks Three times
#  weekly NST every visit BPP weekly Doppler 34-36 wks Consider delivery
#  at 37 wks if stable [Footer: APA Protocol v2.3]"
# Path 2: Docling extraction -- structure preserved
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("surveillance-protocol.pdf")
markdown = result.document.export_to_markdown()
# What you get back: structured Markdown, table cells and relationships intact
# | Gestational Age | Frequency | BPP | Doppler |
# |-----------------|---------------------|--------------|---------|
# | 28 weeks | Twice weekly | Every visit | Weekly |
# | 32 weeks | Three times weekly | Every visit | Weekly |
# | 34-36 weeks | Three times weekly | Every visit | Weekly |
Now chunk both outputs and send the same query: “What is the recommended surveillance frequency at 32 weeks for absent end-diastolic velocity?”
The chunk from the naive extraction path contains fragments from multiple gestational age rows, interleaved with footer text. The retrieval step finds it because it matches the query semantically. The model reads it and produces a confident, incorrect answer.
The chunk from the Docling path contains a coherent table row: the gestational age, the surveillance frequency, the monitoring modalities. The model reads it and produces a correct answer.
Same query. Same model. Different input.
Docling: What It Actually Does
Docling is an open-source document intelligence library from IBM Research. MIT license. Runs entirely on your machine. No API calls to external servers. No data leaving your environment.
It was built to solve exactly the problem described above.
When you give Docling a PDF, it runs a two-model pipeline. A layout analysis model, trained on IBM's DocLayNet dataset, parses every page: it identifies reading order, recognizes headings and their structural relationships, locates figure boundaries, and marks table regions even when they span columns or pages. TableFormer then takes every detected table and reconstructs its cell relationships: row and column structure, merged cells, and header rows.
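Both models are exposed through Docling's pipeline options. The sketch below follows the configuration surface documented at the time of writing; treat the exact option names as something to verify against the current docs.

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions, TableFormerMode
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable table structure recovery and TableFormer's higher-accuracy mode
pipeline_options = PdfPipelineOptions()
pipeline_options.do_table_structure = True
pipeline_options.table_structure_options.mode = TableFormerMode.ACCURATE

converter = DocumentConverter(
    format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)}
)
result = converter.convert("surveillance-protocol.pdf")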
The output is a structured document. Not a text blob. A document.
That distinction matters because chunking a structured document produces coherent chunks. A chunk from a Docling-processed PDF contains a logically complete unit: a section with its heading intact, a table with its cells preserved, a recommendation with its context. That chunk, when retrieved, gives the model something real to reason over.
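Docling ships a chunker that works over that structure directly. A minimal sketch, assuming the docling.chunking module available in current releases:

from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("surveillance-protocol.pdf")

# HybridChunker splits along the document's own structure --
# sections, tables, headings -- not arbitrary character counts
chunker = HybridChunker()
for chunk in chunker.chunk(result.document):
    print(chunk.text[:80])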
Tables Deserve Their Own Paragraph
Clinical documents are full of tables: dosing tables, reference range tables, evidence grading tables, surveillance frequency tables. A standard OCR tool returns the text that was in the table without the structure. Columns lose their relationships. The numbers in column three are no longer connected to the row labels in column one.
Docling’s TableFormer model is specifically trained to preserve table structure. The output can be queried as a table, embedded as a table, and retrieved as a table. For clinical data, this is not an abstract benefit. The difference between “magnesium sulfate 4-6 g IV bolus then 2 g/hour” and a scrambled string of numbers is the difference between safe and unsafe information retrieval.
Performance That Actually Matters for a Corpus
Docling processes a page in about 3 seconds on a standard GPU and 1.2 seconds on Apple Silicon. The nearest open-source competitor runs at roughly 16 seconds per page. For a 200-PDF corpus, that is the difference between an overnight job and a lunch-break job.
The Local Execution Argument
This is not a compliance note. It is a design principle.
Many document processing services send your documents to external servers. The PDF goes up. The processed text comes back. Somewhere in between, your clinical documents transited infrastructure you do not control.
Docling runs locally. Your PDFs do not leave your machine. The processing happens on your hardware, inside your environment, under your governance. For physician-builders working with de-identified clinical data, local execution is not optional. It is the architecture.
The Full Pipeline, With Docling in It
Here is how these pieces connect in a physician-builder’s stack.
Your clinical documents (PDFs, DOCX, outside records, guidelines)
↓
Docling (document parsing + structure preservation)
↓
Chunking (by section, table, or heading -- not arbitrary character count)
↓
Embedding (convert chunks to vectors)
↓
Vector database (Pinecone, ChromaDB, Qdrant)
↓
Retrieval (find the relevant chunks for a given query)
↓
Context injection -> LLM -> Response
Each layer matters. Docling is the layer that determines whether everything downstream is working on real information or structured noise.
With Docling output, the chunking step improves qualitatively: you chunk by section heading or table boundary instead of arbitrary character count. A chunk aligned to a document section is semantically coherent by construction. The embedding step converts those chunks to vectors. The retrieval step finds the most relevant ones for a query. The context injection step passes them to the model.
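A minimal sketch of those downstream steps with ChromaDB running locally. The collection name is illustrative, and ChromaDB's built-in default embedding function stands in for whatever embedding model you choose:

import chromadb
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

result = DocumentConverter().convert("surveillance-protocol.pdf")
chunks = list(HybridChunker().chunk(result.document))

# Local, persistent vector store -- nothing leaves the machine
client = chromadb.PersistentClient(path="./clinical-db")
collection = client.get_or_create_collection("protocols")

collection.add(
    ids=[str(i) for i in range(len(chunks))],
    documents=[c.text for c in chunks],
)

hits = collection.query(
    query_texts=["surveillance frequency at 32 weeks for absent end-diastolic velocity"],
    n_results=3,
)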
None of those steps can compensate for bad input. Docling is where you make the input good.
Provenance: The Non-Negotiable Clinical Requirement
Here is a feature that does not appear in most RAG tutorials, and that matters more for clinical use than almost anything else.
Every element that Docling extracts carries its source location: the page number and the bounding box coordinates on that page. This is provenance data.
Consider the clinical scenario directly. Your RAG system returns: the recommended surveillance for absent end-diastolic velocity is twice weekly. You need to know where that came from. Not just the document title. The specific page and section. Because the recommended interval at 28 weeks, at 32 weeks, and at 34 weeks may not be the same. The clinical decision at each gestational age is different.
Docling’s provenance data lets you build retrieval systems that do not just answer questions. They cite their answers with precision: document title, page number, section heading, and if you build the output carefully, a pointer to the exact paragraph. The physician can verify it in under 30 seconds. The system can be audited. The answer can be traced.
That is not a nice feature for a clinical system. That is a non-negotiable requirement.
A system that gives you a confident answer without a citable source is not a clinical tool. It is a liability.
What This Unlocks: AI Agents That Work on Real Clinical Documents
Once the document pipeline is clean, you stop building on demo data and start building on your own clinical corpus. That shift is what makes agent workflows real rather than theoretical.
RAG answers questions. AI agents take actions. The difference matters more than it initially appears.
A RAG pipeline over your clinical literature can answer: what does SMFM say about magnesium sulfate dosing in preterm labor before 32 weeks? An AI agent with the same knowledge base and access to a tool set can do more.
Three workflows buildable today by a physician who understands the components:
- Incoming consult triage: receive the consult note, retrieve the relevant evidence, compare the referring plan against the evidence, identify gaps, and produce a structured briefing before you walk in the room.
- Discharge summary extraction: pull the diagnosis, delivery indication, birth weight, and neonatal outcome from each summary automatically, and populate a structured table from unstructured text (see the sketch after this list).
- Protocol delta monitoring: compare a new protocol version to the previous one, identify which clinical decision points changed, and flag the delta for physician review.
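To make the second workflow concrete, here is a sketch of its extraction loop. The extract_fields() function is a hypothetical placeholder for whatever pulls structured fields from the Markdown, whether a handful of regexes or an LLM call with a fixed schema:

from pathlib import Path
from docling.document_converter import DocumentConverter

def extract_fields(markdown: str) -> dict:
    # Hypothetical placeholder: regexes or an LLM call with a fixed
    # schema would extract diagnosis, delivery indication, birth
    # weight, and neonatal outcome here
    return {}

converter = DocumentConverter()
rows = []
for pdf in Path("./discharge-summaries").glob("*.pdf"):
    md = converter.convert(pdf).document.export_to_markdown()
    rows.append({"source": pdf.name, **extract_fields(md)})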
These are not hypothetical.
Docling is the layer that makes them operate on real clinical documents instead of clean synthetic examples. Without a working document pipeline, your agents are reasoning over noise.
Where to Start: Five Minutes to Proof of Concept
The barrier to entry is lower than you expect.
Install Docling:
pip install docling
Convert a single PDF from the command line:
docling your-clinical-guideline.pdf
Convert a folder of PDFs to Markdown and JSON:
docling ./clinical-pdfs --from pdf --to md --to json --output ./parsed
Open the output. Look at what Docling did with the tables, the headings, the reading order. Compare it to copying and pasting from the same PDF. The difference is visible in under five minutes.
From there, the progression is straightforward:
- CLI on one document: verify the output quality before writing any code.
- Python API on a folder: build the ingest pipeline for your corpus (a minimal sketch follows this list).
- Add an embedding step (ChromaDB or Qdrant locally): first working RAG prototype.
- Add provenance tracking to the retrieval output: production-grade citations.
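For the folder step, a minimal sketch, assuming a flat directory of PDFs:

from pathlib import Path
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
out_dir = Path("./parsed")
out_dir.mkdir(exist_ok=True)

# Convert every PDF in the corpus and write structured Markdown out
for pdf in Path("./clinical-pdfs").glob("*.pdf"):
    result = converter.convert(pdf)
    (out_dir / f"{pdf.stem}.md").write_text(result.document.export_to_markdown())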
Each step is independent and reversible. You are not committing to a framework or a vendor at any point. The components swap without rebuilding the pipeline.
The Physician-Builder Position
The clinical AI conversation is dominated by two groups.
One group has institutional EHR pipeline access: large organizations with data engineering teams, existing FHIR integrations, and years of negotiated data use agreements. Most individual physicians cannot touch that infrastructure.
The other group is building on synthetic demos. Clean datasets, tidy tables, documents that never arrived as faxed PDFs scanned at 150 dpi. Those demos work in controlled conditions. They do not reflect the documentary reality of clinical practice and they do not build into anything a physician would actually trust with a patient.
Physician-builders who understand Docling, RAG, and agents occupy a different position. They can build on their own clinical documents. Their own literature library. Their own protocol stack. Their own institutional guidelines, de-identified and processed locally. They are not waiting for IT. They are not waiting for a vendor.
The curation layer is physician work. Deciding which sources to include, how to handle conflicting guidelines, what an edge case actually means clinically at 28 weeks versus 34 weeks: an engineer cannot do that from first principles. The physician-builder who understands both sides is not a hobbyist. They are the person every clinical AI team should want in the room.
The model is not the product. The harness is the product.
Docling is part of the harness.
Chukwuma Onyeije, MD, FACOG is a Maternal-Fetal Medicine specialist and Medical Director at Atlanta Perinatal Associates. He writes at DoctorsWhoCode.blog about building clinical software at the intersection of medicine, engineering, and AI.