Physician Developer Tutorial

Stop Feeding Your RAG Garbage PDFs

Chukwuma Onyeije, MD, FACOG · Atlanta Perinatal Associates · April 2026

Your RAG pipeline is only as good as your parsed input. Most physicians building clinical AI tools ignore this. They plug in PyPDF2, get mangled text from scanned guidelines, and wonder why their LLM gives wrong answers. Docling fixes this at the source.

Why PDF parsing matters for clinical AI

Most physician developers building RAG systems spend their time picking LLMs, writing retrieval logic, and tuning prompts. Almost none of them look hard at the first stage of the pipeline: document ingestion.

That is a mistake. Garbage in, garbage out applies with particular force in medicine, where a single misread table — a dosing chart, a gestational age curve, a lab reference range — can propagate downstream into a wrong clinical recommendation.

Clinical reality

ACOG and SMFM guidelines are published as multi-column PDFs with complex tables, footnotes, and embedded figures. A naive PDF extractor collapses those columns into garbled text. Your LLM then ingests that garbled text and confabulates.

The fix is not a better prompt. The fix is better parsing. Docling is the tool that solves this problem in the open-source world.
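
To make the column problem concrete, here is a minimal sketch of how a two-column page gets scrambled. It assumes each text line arrives with page coordinates, which is roughly what a low-level extractor sees. The `Line` class, both ordering functions, and the sample strings are invented for illustration and are not part of any library.

```python
from dataclasses import dataclass

@dataclass
class Line:
    x: float   # left edge of the line on the page
    y: float   # distance from the top of the page
    text: str

def naive_order(lines):
    """Read strictly top-to-bottom across the full page width, ignoring columns."""
    return [ln.text for ln in sorted(lines, key=lambda ln: (ln.y, ln.x))]

def column_aware_order(lines, page_width):
    """Read the left column top-to-bottom, then the right column."""
    mid = page_width / 2
    left = sorted((ln for ln in lines if ln.x < mid), key=lambda ln: ln.y)
    right = sorted((ln for ln in lines if ln.x >= mid), key=lambda ln: ln.y)
    return [ln.text for ln in left + right]

# A toy two-column page: two lines in each column.
page = [
    Line(50, 100, "Left column begins"),
    Line(320, 100, "Right column begins"),
    Line(50, 115, "and ends here."),
    Line(320, 115, "and ends there."),
]

naive = " ".join(naive_order(page))
aware = " ".join(column_aware_order(page, page_width=600))
print(naive)  # sentences from the two columns are interleaved
print(aware)  # each column reads as a complete sentence
```

Real layout analysis is far more involved than splitting a page at its midpoint, but the interleaving in the first output is the kind of text a naive extractor hands to your LLM.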

Raw PDF → Docling Parser → Structured chunks → Vector DB → LLM + RAG

Swap out the second box in that pipeline and everything downstream improves. That is the entire argument for Docling.

What Docling actually does

Docling is an open-source Python library developed at IBM. It uses specialized AI models to understand documents, not just extract text from them.

Two core models power it, alongside an OCR engine and a unified document format:

🗺️

DocLayNet

Layout analysis model trained on the DocLayNet dataset. Identifies headers, body text, figures, tables, footnotes, and reading order across complex multi-column layouts.

📊

TableFormer

AI model for table structure recognition. Extracts table rows, columns, and cell relationships from scanned and digital PDFs.

🔍

OCR engine

Handles scanned documents and images inside PDFs. Supports both classical OCR and vision-language model (VLM) backends.

📄

DoclingDocument

A unified document representation format. Exports to Markdown, HTML, or JSON. LangChain and LlamaIndex integrations are built on top of it.

What Docling can parse

PDF, DOCX, PPTX, XLSX, HTML, images (PNG, TIFF, JPEG), WAV, MP3, LaTeX, and more. For physician developers, this means one ingestion library handles clinical PDFs, Word letters, and scanned referral documents in the same pipeline.

It runs locally. No API calls, no cloud service, no data leaving your machine. That matters for clinical workflows where patient data is involved.

Docling vs. PyPDF2 and friends

Here is a direct comparison of the common PDF parsing tools physician developers reach for and how they perform on clinically relevant document types.

Library	Multi-column layouts	Table extraction	Scanned PDFs	Local / private	RAG-ready chunks
PyPDF2 / pypdf	Fails	No	No	Yes	No
pdfplumber	Partial	Heuristic	No	Yes	No
PyMuPDF	Partial	Basic	No	Yes	No
Unstructured.io	OK	OK	Paid tier	Partial	OK
Docling	AI-powered	AI-powered	Yes	Yes	Native chunks

The benchmark studies confirm this. Independent evaluations in 2025 consistently ranked Docling as the top open-source performer on complex document structures. The only tool that consistently outperforms it is LlamaParse, which is a commercial API. Docling is free and runs locally.

Install and first parse

Requires Python 3.10 or higher. Install into a virtual environment.

bash terminal

# Create and activate a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install Docling
pip install docling

# GPU acceleration (recommended if available) comes from your PyTorch build,
# not from a Docling extra. Install a CUDA-enabled PyTorch per the PyTorch install guide.

First parse — this is all it takes to get structured output from any PDF:

python basic_parse.py

from docling.document_converter import DocumentConverter

# Initialize the converter
converter = DocumentConverter()

# Parse a clinical PDF — ACOG bulletin, SMFM consult, scanned referral, etc.
result = converter.convert("acog_gdm_guidelines.pdf")

# Export to Markdown (preserves structure, tables, headings)
markdown_output = result.document.export_to_markdown()
print(markdown_output)

# Or export to JSON for programmatic access
json_output = result.document.export_to_dict()

The difference between this and pypdf.PdfReader is not cosmetic. Docling returns a structured document model. Tables are tables. Headings are headings. Reading order is correct even in multi-column layouts. That structure is what makes downstream RAG retrieval accurate.

Inspect what Docling extracted

python inspect_output.py

from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("smfm_consult_periviability.pdf")
doc = result.document

# Iterate over all content elements
for element, level in doc.iterate_items():
    print(type(element).__name__, ":", str(element)[:80])

# Access tables directly
for table in doc.tables:
    print("\n--- TABLE ---")
    print(table.export_to_markdown(doc=doc))
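
Before reading elements one by one, it can help to see how many of each type were detected. The following is a small sketch, not part of Docling: `summarize_items` is a hypothetical helper that counts class names from any iterable of (element, level) pairs, and the two stand-in classes exist only so the example runs without Docling installed.

```python
from collections import Counter

def summarize_items(items):
    """Count element class names from an iterable of (element, level) pairs."""
    return Counter(type(element).__name__ for element, _level in items)

# Stand-in classes so the sketch runs on its own; with Docling installed
# you would pass doc.iterate_items() instead of this sample list.
class TextItem:
    pass

class TableItem:
    pass

sample = [(TextItem(), 1), (TableItem(), 1), (TextItem(), 2)]
counts = summarize_items(sample)
print(counts.most_common())
```

If a guideline you know contains six tables reports only two table elements, you have found a parsing problem before it reaches your vector store.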

Performance note

First run downloads the DocLayNet and TableFormer model weights (~1–2 GB). Subsequent runs are fast. On CPU-only hardware, plan for 30–60 seconds per complex PDF page. On GPU, it is significantly faster.
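
For planning a batch ingestion run, the per-page figure above turns into simple arithmetic. This sketch uses the 30 to 60 second range quoted in this note as defaults; `estimate_cpu_minutes` is a hypothetical helper, and your own hardware may land well outside that range, so time a few pages before trusting the estimate.

```python
def estimate_cpu_minutes(total_pages, sec_per_page_low=30, sec_per_page_high=60):
    """Return a (low, high) estimate in minutes for parsing total_pages on CPU."""
    low = total_pages * sec_per_page_low / 60
    high = total_pages * sec_per_page_high / 60
    return low, high

# Example: a 40-page corpus of complex pages.
low, high = estimate_cpu_minutes(total_pages=40)
print(f"Plan for roughly {low:.0f} to {high:.0f} minutes")
```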

The full RAG pipeline with LangChain

The LangChain integration is the easiest path to a working RAG system. The DoclingLoader class handles the entire ingestion step.

bash terminal

pip install langchain-docling langchain-core langchain langchain-community langchain-openai chromadb

Step 1: Load and chunk your clinical PDFs

DoclingLoader supports two export modes, MARKDOWN and DOC_CHUNKS. DOC_CHUNKS is the right choice for RAG: it produces semantically coherent chunks that respect document structure.

python load_documents.py

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType

# Your clinical document library
# Can be local file paths or URLs
CLINICAL_DOCS = [
    "acog_gdm_practice_bulletin.pdf",
    "smfm_periviability_consult.pdf",
    "acog_preeclampsia_task_force.pdf",
    "smfm_fgr_consult_series.pdf",
]

# DOC_CHUNKS: splits document into structured chunks
# Each chunk preserves section context (heading, page, table vs. paragraph)
loader = DoclingLoader(
    file_path=CLINICAL_DOCS,
    export_type=ExportType.DOC_CHUNKS
)

docs = loader.load()
print(f"Loaded {len(docs)} chunks from {len(CLINICAL_DOCS)} documents")
print("\nSample chunk metadata:")
print(docs[0].metadata)
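
The chunk metadata is nested, and some vector stores accept only scalar metadata values. Below is a hedged sketch of flattening it while keeping the heading trail and page numbers. It assumes the metadata holds a `source` string and a `dl_meta` dict with `headings` and `doc_items` entries (each with a `label` and a `prov` list carrying `page_no`). Print a real chunk first and adjust the keys to what you see. `flatten_chunk_metadata` is a hypothetical helper, and the example dict is hand-built, not real loader output.

```python
def flatten_chunk_metadata(metadata):
    """Reduce a nested chunk-metadata dict to scalar string fields."""
    dl_meta = metadata.get("dl_meta", {})
    doc_items = dl_meta.get("doc_items", [])
    headings = dl_meta.get("headings") or []
    # Collect every page number referenced by the chunk's source items.
    pages = sorted({
        prov["page_no"]
        for item in doc_items
        for prov in item.get("prov", [])
        if "page_no" in prov
    })
    # Collect the element labels (for example "table" or "text").
    labels = sorted({item.get("label", "unknown") for item in doc_items})
    return {
        "source": metadata.get("source", ""),
        "headings": " > ".join(headings),
        "pages": ",".join(str(p) for p in pages),
        "labels": ",".join(labels),
    }

# Hand-built example in the assumed shape.
example = {
    "source": "acog_gdm_practice_bulletin.pdf",
    "dl_meta": {
        "headings": ["Section A", "Subsection B"],
        "doc_items": [
            {"label": "table", "prov": [{"page_no": 4}]},
            {"label": "text", "prov": [{"page_no": 3}, {"page_no": 4}]},
        ],
    },
}
flat = flatten_chunk_metadata(example)
print(flat)
```

Keeping headings and pages as plain strings means they survive storage and can be shown back to the clinician as a citation.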

Step 2: Embed and store in a vector database

python vectorstore.py

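from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata

# `docs` is the list of chunks produced in Step 1

# Using OpenAI embeddings; swap for a local model if needed
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

# Build the vector store from Docling chunks
# Chroma accepts only scalar metadata values, so drop nested fields first
vectorstore = Chroma.from_documents(
    documents=filter_complex_metadata(docs),
    embedding=embeddings,
    collection_name="clinical_guidelines",
    persist_directory="./chroma_db"
)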

print("Vector store built. Documents indexed.")

Step 3: Build the retrieval chain

python rag_chain.py

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

# Load the persisted vector store
vectorstore = Chroma(
    collection_name="clinical_guidelines",
    embedding_function=OpenAIEmbeddings(model="text-embedding-3-small"),
    persist_directory="./chroma_db"
)

retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Clinical-context prompt
PROMPT_TEMPLATE = """You are a Maternal-Fetal Medicine clinical decision support tool.
Answer the question using only the retrieved guideline excerpts below.
Cite the source document when possible.
If the answer is not in the context, say so — do not speculate.

Context:
{context}

Question: {question}

Answer:"""

prompt = PromptTemplate.from_template(PROMPT_TEMPLATE)
llm = ChatOpenAI(model="gpt-4o", temperature=0)

def format_docs(docs):
    return "\n\n".join(d.page_content for d in docs)

chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | prompt
    | llm
    | StrOutputParser()
)

# Ask a clinical question
question = "What are the diagnostic criteria for gestational diabetes using the two-step approach?"
answer = chain.invoke(question)
print(answer)
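
One gap worth noting: the prompt asks the model to cite its source, but `format_docs` passes only the chunk text, so the model never sees where an excerpt came from. Here is a sketch of a variant that labels each excerpt. It assumes each retrieved document carries a `source` entry in its metadata. The `Doc` dataclass is a stand-in for a LangChain document so the example runs on its own, and `format_docs_with_sources` is a hypothetical name.

```python
from dataclasses import dataclass, field

@dataclass
class Doc:
    """Stand-in for a retrieved document: text plus a metadata dict."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def format_docs_with_sources(docs):
    """Prefix each excerpt with a numbered source label the model can cite."""
    blocks = []
    for i, d in enumerate(docs, start=1):
        source = d.metadata.get("source", "unknown source")
        blocks.append(f"[{i}] Source: {source}\n{d.page_content}")
    return "\n\n".join(blocks)

sample = [
    Doc("First excerpt.", {"source": "acog_gdm_practice_bulletin.pdf"}),
    Doc("Second excerpt.", {}),
]
context = format_docs_with_sources(sample)
print(context)
```

Swapping this in for `format_docs` in the chain gives the model something concrete to cite, and gives you something to check its citations against.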

Why this works better

Docling's DOC_CHUNKS mode produces chunks that know their structural context — whether they came from a table, a heading, a figure caption, or body text. That metadata surfaces in retrieval, so your LLM gets chunks with semantic coherence, not arbitrary character-length slices.
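
A toy comparison shows the difference. This is not Docling's chunking algorithm. It is a minimal sketch that contrasts fixed-size character slicing with a simple split at Markdown headings, using an invented snippet with a small table. All function names here are hypothetical.

```python
def fixed_size_chunks(text, size):
    """Slice text every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def heading_chunks(markdown):
    """Start a new chunk at each Markdown heading line."""
    chunks, current = [], []
    for line in markdown.splitlines():
        if line.startswith("#") and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks

def table_is_intact(chunks, rows):
    """True if at least one chunk contains every table row in full."""
    return any(all(row in chunk for row in rows) for chunk in chunks)

markdown = (
    "## Section A\n"
    "Intro sentence.\n"
    "| Test | Cutoff |\n"
    "| --- | --- |\n"
    "| X | 1 |\n"
    "| Y | 2 |\n"
    "## Section B\n"
    "Closing sentence."
)
table_rows = [ln for ln in markdown.splitlines() if ln.startswith("|")]

sliced_ok = table_is_intact(fixed_size_chunks(markdown, 40), table_rows)
structured_ok = table_is_intact(heading_chunks(markdown), table_rows)
print("fixed-size keeps table intact:", sliced_ok)
print("heading-based keeps table intact:", structured_ok)
```

In this example the 40-character slicer cuts through the table header, so no single chunk can answer a question about the cutoffs. The structural split keeps the table together with its section heading.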

LlamaIndex alternative

If you are already using LlamaIndex, the Docling integration is equally clean.

bash terminal

pip install llama-index-core llama-index-readers-docling llama-index-node-parser-docling llama-index-embeddings-openai llama-index-llms-openai

python llamaindex_rag.py

from llama_index.core import VectorStoreIndex
from llama_index.readers.docling import DoclingReader
from llama_index.node_parser.docling import DoclingNodeParser

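# DoclingReader: loads and converts the PDFs
# DoclingNodeParser expects JSON export, not the default Markdown
reader = DoclingReader(export_type=DoclingReader.ExportType.JSON)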

# Load your documents
documents = reader.load_data(
    file_path=[
        "acog_gdm_practice_bulletin.pdf",
        "smfm_fgr_consult_series.pdf",
    ]
)

# DoclingNodeParser: produces structure-aware nodes for indexing
node_parser = DoclingNodeParser()
nodes = node_parser.get_nodes_from_documents(documents)

# Build the vector index
index = VectorStoreIndex(nodes)

# Query
query_engine = index.as_query_engine()
response = query_engine.query(
    "What fetal weight percentile defines fetal growth restriction?"
)
print(response)

The DoclingNodeParser is the key addition. Standard LlamaIndex text splitters cut at character counts. The Docling parser cuts at structural boundaries — end of a section, end of a table — so each node is semantically complete.

Clinical document use cases

Here are the specific workflows where Docling makes the biggest difference for physician developers building clinical tools.

ACOG/SMFM practice bulletins and consult series

These are published in two-column journal format with numbered references, footnoted tables, and embedded figures. Standard parsers collapse the columns. Docling preserves reading order and extracts tables as structured data. A RAG system built on top can accurately answer dosing, threshold, and management questions.

Scanned prior records and referral letters

Real clinical practice involves scanned documents — old operative notes, outside records, handwritten lab results. Docling's OCR engine handles these. You can build an ingestion pipeline that processes uploaded patient documents and makes them queryable.

python scanned_doc_pipeline.py

from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import PdfPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption

# Enable full OCR pipeline for scanned documents
pipeline_options = PdfPipelineOptions()
pipeline_options.do_ocr = True
pipeline_options.do_table_structure = True

# format_options maps each InputFormat to a format option object
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(pipeline_options=pipeline_options)
    }
)

result = converter.convert("scanned_outside_records.pdf")
markdown = result.document.export_to_markdown()

Lab result tables and growth charts

Docling's TableFormer model extracts table structure including multi-row headers and merged cells. A GDM monitoring dashboard, a growth chart percentile lookup, a lab threshold reference system — all of these depend on accurate table extraction. Docling delivers it.
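
Once a table comes out as Markdown, turning it into rows you can look up is straightforward. The sketch below parses a simple pipe-delimited table into dictionaries. It assumes a single header row and no merged cells, and the analyte names and limits are placeholders, not clinical values. `parse_markdown_table` is a hypothetical helper. For real work, check whether Docling's table objects offer a direct tabular export before hand-parsing.

```python
def parse_markdown_table(table_md):
    """Parse a simple pipe-delimited Markdown table into a list of row dicts."""
    lines = [ln.strip() for ln in table_md.strip().splitlines() if ln.strip()]

    def cells(line):
        # Drop the outer pipes, then split on the inner ones.
        return [c.strip() for c in line.strip("|").split("|")]

    header = cells(lines[0])
    rows = []
    for line in lines[2:]:  # skip the header row and the --- separator row
        rows.append(dict(zip(header, cells(line))))
    return rows

# Placeholder table; values are illustrative only.
table_md = """
| Analyte | Unit | Upper limit |
| --- | --- | --- |
| Analyte A | mg/dL | 10 |
| Analyte B | mg/dL | 20 |
"""
rows = parse_markdown_table(table_md)
limits = {r["Analyte"]: float(r["Upper limit"]) for r in rows}
print(limits)
```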

Clinical trial papers and systematic reviews

Research-based clinical AI tools need to ingest academic PDFs with complex layouts, math notation, and citation structures. Docling handles all of these, including formula extraction.

HIPAA note

If you are ingesting documents that contain PHI, run Docling locally. It requires no external API calls. The local execution model is a feature, not a limitation.

Local execution for HIPAA environments

This is where Docling separates from commercial parsing APIs. The entire pipeline runs on your machine.

No data transmission

Documents never leave your server. Model weights download once at setup; inference is local.

Air-gapped deployment

Cache the model weights and run entirely offline. Valid for clinical environments with strict network policies.

Pair with a local LLM

Use Docling for parsing, Chroma for vector storage, and Ollama for inference. A completely local, zero-egress RAG pipeline.

BAA compliance path

Local deployment eliminates the BAA requirement for the parsing step. Combine with HIPAA-compliant cloud LLM APIs (OpenAI Healthcare, AWS Bedrock) for the generation step.
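
For the air-gapped case, a preflight check before the first offline run can save a failed job. This is a sketch under assumptions. It treats `DOCLING_ARTIFACTS_PATH` as the setting that points Docling at a local model directory, and `HF_HUB_OFFLINE` as the switch that stops Hugging Face tooling from reaching the network. Confirm both names against the current documentation. `configure_offline` is a hypothetical helper, and the demo uses a throwaway directory in place of a real model cache.

```python
import os
import tempfile
from pathlib import Path

def configure_offline(artifacts_dir):
    """Check that a local model cache exists, then set offline-related env vars."""
    path = Path(artifacts_dir)
    if not path.is_dir() or not any(path.iterdir()):
        raise FileNotFoundError(f"No cached model files found in {path}")
    os.environ["DOCLING_ARTIFACTS_PATH"] = str(path)  # assumed Docling setting
    os.environ["HF_HUB_OFFLINE"] = "1"                # assumed offline switch
    return path

# Demo with a temporary directory standing in for a real model cache.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "placeholder.bin").write_bytes(b"")
    configured = configure_offline(tmp)
    print("Artifacts path set:", os.environ["DOCLING_ARTIFACTS_PATH"] == str(configured))
```

Call a check like this at startup, before constructing the converter, so a missing cache fails loudly instead of triggering a blocked download.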

python local_rag_stack.py

# Fully local stack: Docling + Chroma + Ollama
# pip install docling langchain-docling langchain-community chromadb

from pathlib import Path

from langchain_docling import DoclingLoader
from langchain_docling.loader import ExportType
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_community.vectorstores.utils import filter_complex_metadata
from langchain_community.llms import Ollama

# Parse with Docling: local, no API
# The loader takes file paths, so list the PDFs in the folder explicitly
pdf_paths = sorted(str(p) for p in Path("patient_records").glob("*.pdf"))
loader = DoclingLoader(
    file_path=pdf_paths,
    export_type=ExportType.DOC_CHUNKS
)
docs = loader.load()

# Embed with local Ollama model
embeddings = OllamaEmbeddings(model="nomic-embed-text")

# Store in local Chroma DB
# Chroma accepts only scalar metadata values, so drop nested fields first
vectorstore = Chroma.from_documents(
    documents=filter_complex_metadata(docs),
    embedding=embeddings,
    persist_directory="./local_chroma_db"
)

# Query with local LLM: zero data egress
llm = Ollama(model="llama3.2")
retriever = vectorstore.as_retriever()

What to do next

The path from here is straightforward.

Audit your current parser

Take a PDF you already use in production, such as an ACOG bulletin or a local protocol, and compare raw output between PyPDF2 (or pypdf) and Docling. Look specifically at tables.
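
One way to make that comparison less subjective is to pick a handful of phrases or table rows you know are in the PDF and check whether each parser's output still contains them intact. This is a minimal sketch with invented strings. `phrase_recall` is a hypothetical helper, and exact substring matching is a crude measure, so treat the score as a smoke test rather than a benchmark.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so line breaks do not affect matching."""
    return re.sub(r"\s+", " ", text).strip().lower()

def phrase_recall(parsed_text, expected_phrases):
    """Fraction of expected phrases that appear intact in the parsed text."""
    if not expected_phrases:
        raise ValueError("expected_phrases must not be empty")
    haystack = normalize(parsed_text)
    found = [p for p in expected_phrases if normalize(p) in haystack]
    return len(found) / len(expected_phrases)

expected = ["row one value 10", "row two value 20"]
parser_a_output = "row one row two\nvalue 10 value 20"  # table cells interleaved
parser_b_output = "row one value 10\nrow two value 20"  # rows kept together

recall_a = phrase_recall(parser_a_output, expected)
recall_b = phrase_recall(parser_b_output, expected)
print(recall_a, recall_b)
```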

Build a small test corpus

Pick 5–10 clinical PDFs that represent your actual use case. Parse them with Docling. Inspect the Markdown output. That is your ground truth for downstream retrieval quality.

Implement DOC_CHUNKS retrieval

Drop DoclingLoader into your existing LangChain or LlamaIndex pipeline. Run the same test questions you use to evaluate your current system. Compare retrieval quality.
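
To compare retrieval quality with a number rather than an impression, a simple hit rate works: for each test question, record which source document should be retrieved, then check whether it shows up in the top k results from each pipeline. The sketch below assumes you have already collected the retrieved source names per question. The file names and question IDs are placeholders, and `hit_rate_at_k` is a hypothetical helper.

```python
def hit_rate_at_k(results, expected_sources, k):
    """Share of questions whose expected source appears in the top-k retrieved sources."""
    hits = 0
    for question, retrieved in results.items():
        if expected_sources[question] in retrieved[:k]:
            hits += 1
    return hits / len(results)

# Which document each test question should pull from.
expected_sources = {"q1": "doc_a.pdf", "q2": "doc_b.pdf"}

# Ranked sources returned by the current pipeline and by the Docling-based one.
baseline = {"q1": ["doc_c.pdf", "doc_a.pdf"], "q2": ["doc_c.pdf", "doc_a.pdf"]}
candidate = {"q1": ["doc_a.pdf", "doc_c.pdf"], "q2": ["doc_b.pdf", "doc_a.pdf"]}

baseline_score = hit_rate_at_k(baseline, expected_sources, k=1)
candidate_score = hit_rate_at_k(candidate, expected_sources, k=1)
print(baseline_score, candidate_score)
```

Source-level hits are a coarse measure. Once the numbers look reasonable, spot-check that the retrieved chunk actually contains the table or paragraph that answers the question.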

Build your local stack if you handle PHI

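Docling + Chroma + Ollama is a complete, zero-egress RAG stack. Deploy it on a local server and you have a HIPAA-compatible clinical AI foundation. A hosted platform such as Railway is also an option, but documents then leave your machine, so the zero-egress property no longer holds and you need a BAA with that host.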

Key resources

Docling documentation: docling-project.github.io/docling
LangChain integration: pip install langchain-docling
LlamaIndex integration: pip install llama-index-readers-docling
Source code: github.com/docling-project/docling

The physician developer who fixes their parsing layer fixes everything downstream. Better chunks mean better retrieval. Better retrieval means fewer hallucinations. Fewer hallucinations mean a tool you can actually trust in a clinical workflow.

That is the entire case for Docling. It starts with the PDF.
