The Local Advantage: Why Physician-Developers Should Build on Local LLMs Instead of Consumer AI
Consumer AI tools like ChatGPT and Claude are useful. But a physician-developer who deploys a locally fine-tuned model on controlled infrastructure has something more powerful: a clinical tool that learns your practice, respects your data, and costs less at scale.
I use Claude every day. I am not here to tell you consumer AI is bad.
But if you are a physician-developer building clinical tools, relying on consumer-grade AI as your production backbone is a strategic mistake. Not because the models are weak. Because the architecture is wrong.
There is a better approach. It starts with running the model yourself.
The Consumer AI Trap
ChatGPT, Claude, Gemini — these are extraordinary general-purpose reasoning engines. They are also designed for general-purpose users. That distinction matters more than most physician-developers realize until they hit the wall.
The wall looks like this:
- A patient note you cannot send to a third-party API without a Business Associate Agreement you may or may not have
- A fine-tuning request that requires sending your proprietary clinical data to a vendor’s servers
- A context window that resets every session, carrying no memory of your practice’s workflows
- A billing model that scales against you as your usage grows
- A system prompt someone else controls
You are building on a foundation you do not own. That is fine for prototyping. It is a liability for production.
What a Local LLM Actually Gives You
When I say local LLM, I mean a model you run on hardware you control — your own server, a clinic workstation, a Railway-deployed container, a local Mac Studio. The model weights live with you. The inference runs on your infrastructure.
The leading open models right now — Llama 3, Mistral, Qwen, Phi-4 — are close enough to GPT-4-class performance on clinical tasks that the gap is smaller than the governance gap you close by running them locally.
Here is what that unlocks:
1. Real HIPAA-compliant inference
When the model runs locally, PHI never leaves your network. No BAA negotiations. No vendor data-retention policies to audit. No compliance ambiguity. You are the data processor and the data controller. That is the cleanest legal posture available.
2. Fine-tuning on your clinical vocabulary
Consumer APIs offer fine-tuning, but you are uploading your data to their servers and getting a model you still do not own. With a local model, you can run LoRA fine-tuning on your actual clinical corpus — your APSO notes, your consultation language, your ICD-10 coding patterns — and the resulting weights belong to you. The model learns your practice.
I did this with a small MFM consultation dataset. The difference in output quality on domain-specific tasks was immediate. The model stopped hallucinating obscure periviability thresholds. It started writing in my voice.
3. Persistent system prompts you control
Consumer models let you set a system prompt per session. A locally deployed model served through Ollama, LM Studio, or a FastAPI wrapper lets you bake the system prompt into the serving layer. Every call to your clinical tool hits your prompt, your guardrails, your workflow logic. No one can update the model underneath you overnight.
4. Cost structure that favors you at scale
OpenAI and Anthropic charge per token. That is fine at low volume. At clinical scale — hundreds of notes, daily summaries, real-time coding suggestions — it becomes a significant line item. A local model running on a $500 GPU amortizes quickly: at sustained volumes in the hundreds of thousands of tokens per day, the math favors local within months.
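To make "the math favors local" concrete, here is a back-of-envelope sketch. The per-token price and daily volume below are illustrative assumptions, not current vendor pricing:

```python
# Back-of-envelope: API spend vs. a one-time local GPU purchase.
# Both numbers below are assumptions for illustration only.
API_COST_PER_1K_TOKENS = 0.01   # blended input/output price, USD (assumed)
TOKENS_PER_DAY = 300_000        # busy clinic: notes, summaries, coding
GPU_COST = 500                  # one-time hardware cost, USD

daily_api_cost = (TOKENS_PER_DAY / 1000) * API_COST_PER_1K_TOKENS
breakeven_days = GPU_COST / daily_api_cost

print(f"API cost per day: ${daily_api_cost:.2f}")
print(f"Break-even vs. a ${GPU_COST} GPU: {breakeven_days:.0f} days")
```

At those assumed numbers the GPU pays for itself in roughly half a year, before counting electricity; at lower volumes the API stays cheaper for much longer, so run your own numbers first.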
5. Offline and latency-resilient operation
A clinic does not always have perfect internet. A local model works on a LAN. Inference latency is deterministic because you control the hardware. For real-time clinical decision support, that matters.
Getting Started: Ollama in Five Minutes
The fastest path to a running local LLM is Ollama. It handles model downloads, serving, and an OpenAI-compatible API endpoint — all from a single binary.
Install on macOS or Linux:
curl -fsSL https://ollama.com/install.sh | sh
Install on Windows by downloading the installer from ollama.com/download and running the .exe. Then, on any platform, pull and run a model:
ollama pull llama3.1
ollama run llama3.1
You now have a chat interface in your terminal and a REST API running at http://localhost:11434.
Pull a clinically capable model:
For most clinical NLP tasks, start with llama3.1:8b on a machine with 16GB RAM, or llama3.1:70b with Q4 quantization on a machine with 48GB+ RAM.
# Lightweight, fast — good for note formatting, coding suggestions
ollama pull llama3.1:8b
# Higher quality, slower — better for clinical reasoning tasks
ollama pull llama3.1:70b-instruct-q4_K_M
Query the API from Python:
Ollama serves both its own native API and an OpenAI-compatible endpoint. The example below calls the native /api/chat route directly with requests:
import requests

def query_local_llm(prompt: str, system: str = "") -> str:
    """
    Send a prompt to a locally running Ollama instance.
    The model runs entirely on local hardware — no PHI leaves your network.
    """
    payload = {
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": system},
            {"role": "user", "content": prompt}
        ],
        "stream": False
    }
    response = requests.post(
        "http://localhost:11434/api/chat",
        json=payload,
        timeout=120
    )
    response.raise_for_status()
    return response.json()["message"]["content"]
# Example: format a raw dictation into structured APSO sections
system_prompt = """
You are a clinical documentation assistant for a Maternal-Fetal Medicine practice.
Format dictations into APSO structure: Assessment, Plan, Subjective, Objective.
Use concise, professional clinical language. Do not fabricate clinical details.
"""
raw_dictation = """
Patient is a 28-year-old G2P1 at 28 weeks with gestational diabetes on diet control.
Fasting glucoses running 85-95. Post-prandials under 120. Growth scan today shows
estimated fetal weight at 50th percentile. Biophysical profile 8 out of 8.
Plan: continue diet management, repeat growth scan in 4 weeks, NST weekly starting 32 weeks.
"""
result = query_local_llm(raw_dictation, system_prompt)
print(result)
Open WebUI: A ChatGPT Interface for Your Local Model
If you have clinical colleagues who need a familiar interface without the command line, Open WebUI drops a polished chat UI on top of your Ollama instance.
Deploy with Docker:
docker run -d \
-p 3000:8080 \
--add-host=host.docker.internal:host-gateway \
-v open-webui:/app/backend/data \
--name open-webui \
--restart always \
ghcr.io/open-webui/open-webui:main
Open http://localhost:3000 in a browser. Connect it to Ollama at http://host.docker.internal:11434.
You now have a self-hosted ChatGPT running entirely on your infrastructure. No accounts. No API keys. No data leaving the building.
Building a FastAPI Wrapper for Clinical Workflows
Raw Ollama is a starting point. For production clinical tools, you want a wrapper that handles authentication, logging, prompt management, and output validation. Here is a minimal FastAPI service that wraps Ollama for a consultation note task:
# main.py — Local LLM API wrapper for clinical documentation
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import requests
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="Clinical LLM API", version="0.1.0")

OLLAMA_BASE_URL = "http://localhost:11434"
DEFAULT_MODEL = "llama3.1:8b"

# System prompt baked into the serving layer — not session-dependent
MFM_SYSTEM_PROMPT = """
You are a clinical documentation assistant for Atlanta Perinatal Associates,
a Maternal-Fetal Medicine practice. Your role is to convert clinical dictations
and structured data into professional APSO consultation notes.

Rules:
- Use standard obstetric and MFM terminology
- Never fabricate clinical findings or test results not mentioned in the input
- Flag any ambiguous or missing clinical information with [CLARIFY: ...]
- Output only the note — no preamble, no explanation
"""

class ConsultRequest(BaseModel):
    dictation: str
    patient_context: str = ""

class ConsultResponse(BaseModel):
    note: str
    model_used: str
    flagged_items: list[str]

@app.post("/generate-note", response_model=ConsultResponse)
async def generate_consultation_note(request: ConsultRequest):
    """
    Generate an APSO consultation note from a dictation.
    All inference runs locally — no PHI transmitted externally.
    """
    user_content = request.dictation
    if request.patient_context:
        user_content = f"Patient context: {request.patient_context}\n\nDictation: {request.dictation}"

    payload = {
        "model": DEFAULT_MODEL,
        "messages": [
            {"role": "system", "content": MFM_SYSTEM_PROMPT},
            {"role": "user", "content": user_content}
        ],
        "stream": False
    }

    try:
        resp = requests.post(f"{OLLAMA_BASE_URL}/api/chat", json=payload, timeout=120)
        resp.raise_for_status()
    except requests.RequestException as e:
        logger.error(f"Ollama inference failed: {e}")
        raise HTTPException(status_code=503, detail="Local LLM inference unavailable")

    note_text = resp.json()["message"]["content"]

    # Extract any flagged items the model surfaced
    flagged = [
        line.strip()
        for line in note_text.splitlines()
        if line.strip().startswith("[CLARIFY:")
    ]

    return ConsultResponse(
        note=note_text,
        model_used=DEFAULT_MODEL,
        flagged_items=flagged
    )

@app.get("/health")
async def health_check():
    """Verify local Ollama is reachable."""
    try:
        resp = requests.get(f"{OLLAMA_BASE_URL}/api/tags", timeout=5)
        models = [m["name"] for m in resp.json().get("models", [])]
        return {"status": "ok", "available_models": models}
    except requests.RequestException:
        raise HTTPException(status_code=503, detail="Ollama not reachable")
Run it with:
pip install fastapi uvicorn requests pydantic
uvicorn main:app --host 0.0.0.0 --port 8000 --reload
Your clinical tool now calls http://localhost:8000/generate-note with a POST body. The system prompt is fixed at the serving layer. No session management. No API key rotation. No vendor terms to review.
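From the client side, calling the wrapper is a single POST. A minimal sketch, assuming the service above is running on port 8000; the helper names here are mine, not part of FastAPI or Ollama:

```python
import requests

def build_consult_request(dictation: str, patient_context: str = "") -> dict:
    """Assemble the POST body the /generate-note endpoint expects."""
    body = {"dictation": dictation}
    if patient_context:
        body["patient_context"] = patient_context
    return body

def request_note(dictation: str, patient_context: str = "",
                 base_url: str = "http://localhost:8000") -> dict:
    """Call the local wrapper service; the note and flags come back as JSON."""
    resp = requests.post(
        f"{base_url}/generate-note",
        json=build_consult_request(dictation, patient_context),
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()  # {"note": ..., "model_used": ..., "flagged_items": [...]}

# Usage (with the FastAPI service running):
#   result = request_note("28-year-old G2P1 at 28 weeks with diet-controlled GDM...")
#   print(result["note"])
#   print(result["flagged_items"])
```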
LoRA Fine-Tuning on Clinical Text
This is where the real differentiation happens. LoRA (Low-Rank Adaptation) lets you fine-tune a model on your specific vocabulary and documentation style without retraining the full model.
Unsloth makes this accessible. You can run a LoRA fine-tuning job on a single A100 GPU instance in under two hours for a small clinical dataset.
Setup:
pip install "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
pip install --no-deps trl peft accelerate bitsandbytes
Fine-tuning script for a clinical note corpus:
# fine_tune_clinical.py
# Fine-tune Llama 3.1 8B on anonymized MFM consultation notes using LoRA.
# Run on: single A100 40GB, ~90 minutes, ~$8 on Lambda Labs or RunPod.
# IMPORTANT: All training data must be de-identified before use.
from unsloth import FastLanguageModel
from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments
import json

MAX_SEQ_LENGTH = 2048
LORA_RANK = 16

# Load base model with 4-bit quantization
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/Meta-Llama-3.1-8B-Instruct",
    max_seq_length=MAX_SEQ_LENGTH,
    load_in_4bit=True,
)

# Apply LoRA adapters — only these layers are trained
model = FastLanguageModel.get_peft_model(
    model,
    r=LORA_RANK,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

# Load your de-identified clinical corpus
# Each record: {"instruction": "...", "input": "...", "output": "..."}
with open("data/mfm_notes_deidentified.jsonl") as f:
    records = [json.loads(line) for line in f]

def format_prompt(record):
    """Format training records into the chat template."""
    return {
        "text": tokenizer.apply_chat_template(
            [
                {"role": "system", "content": "You are a clinical documentation assistant for a Maternal-Fetal Medicine practice."},
                {"role": "user", "content": record["instruction"] + "\n\n" + record["input"]},
                {"role": "assistant", "content": record["output"]},
            ],
            tokenize=False,
            add_generation_prompt=False,
        )
    }

dataset = Dataset.from_list([format_prompt(r) for r in records])

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=MAX_SEQ_LENGTH,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=10,
        num_train_epochs=3,
        learning_rate=2e-4,
        fp16=True,
        logging_steps=10,
        output_dir="outputs/mfm-lora",
        save_strategy="epoch",
    ),
)

trainer.train()

# Save LoRA adapters (small — a few hundred MB)
model.save_pretrained("models/mfm-lora-adapters")
tokenizer.save_pretrained("models/mfm-lora-adapters")

# Export merged model for Ollama serving
model.save_pretrained_merged("models/mfm-merged", tokenizer, save_method="merged_16bit")
After training, convert the merged model to GGUF format for Ollama:
# Convert to GGUF using llama.cpp tools.
# Note: convert_hf_to_gguf.py does not produce k-quants directly —
# convert to f16 first, then quantize with llama-quantize.
python llama.cpp/convert_hf_to_gguf.py models/mfm-merged \
    --outfile models/mfm-llama3-f16.gguf \
    --outtype f16
llama.cpp/llama-quantize models/mfm-llama3-f16.gguf \
    models/mfm-llama3-q4.gguf q4_K_M
# Create a Modelfile for Ollama
cat > Modelfile <<'EOF'
FROM ./models/mfm-llama3-q4.gguf
SYSTEM """
You are a clinical documentation assistant for Atlanta Perinatal Associates.
You write APSO-format MFM consultation notes. Never fabricate clinical data.
"""
PARAMETER temperature 0.2
PARAMETER top_p 0.9
EOF
ollama create mfm-clinical -f Modelfile
ollama run mfm-clinical
You now have a model that knows your vocabulary, outputs in your format, and runs entirely on your hardware. The training data never left your environment.
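A quick way to sanity-check the fine-tune is to run the same held-out, de-identified dictation through the base model and the new mfm-clinical model side by side. A minimal sketch, assuming both models are available in the local Ollama instance; the helper names are mine:

```python
import requests

def build_chat_payload(model: str, prompt: str) -> dict:
    """Assemble a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,
    }

def query_model(model: str, prompt: str,
                base_url: str = "http://localhost:11434") -> str:
    """Run one prompt through a locally served model and return the reply text."""
    resp = requests.post(f"{base_url}/api/chat",
                         json=build_chat_payload(model, prompt), timeout=120)
    resp.raise_for_status()
    return resp.json()["message"]["content"]

# Usage (with Ollama running and both models pulled/created):
#   dictation = "...held-out, de-identified dictation..."
#   for model in ("llama3.1:8b", "mfm-clinical"):
#       print(f"--- {model} ---")
#       print(query_model(model, dictation))
```

Eyeballing the two outputs against each other is crude, but it makes the vocabulary and formatting gains from the LoRA pass obvious immediately.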
The Objection: Local Models Are Harder
True. I will not argue otherwise.
You need to manage model serving. You need to handle quantization decisions (Q4 versus Q8 versus full precision). You need to build the API wrapper. You need to think about memory and compute requirements. Fine-tuning requires a training pipeline.
This is exactly why physician-developers have an edge. We already understand the clinical domain deeply enough to supervise the model’s behavior. We can tell when the output is wrong in ways a general-purpose developer cannot. That domain knowledge, combined with the engineering skills to deploy and customize a local model, is a rare combination. It is also a moat.
The stack is accessible. The barrier is willingness to operate it, not technical impossibility.
What Consumer AI Is Still Good For
I want to be precise here. I am not arguing for abandoning Claude or GPT-4.
Use consumer AI for:
- Rapid prototyping before you know what you are building
- Tasks involving publicly available data with no PHI
- High-level reasoning tasks where general capability matters more than domain specificity
- Workflows where you need the absolute frontier of capability and latency is not a constraint
- Your own personal productivity — this is where Claude excels
I use Claude as a thought partner and development accelerator. It helps me write code faster, structure arguments, and pressure-test clinical reasoning. That workflow involves no patient data and no production dependency on the API.
The distinction is: use consumer AI for your thinking, build clinical tools on infrastructure you control.
A Concrete Architecture
Here is what this looks like in practice for a physician-developer building an MFM consultation tool:
Patient transcript (local capture)
|
Preprocessing pipeline (Python, runs locally)
|
Local LLM inference via Ollama API
- Model: Llama 3.1 70B, Q4 quantized
- System prompt: baked into serving layer
- Fine-tuned on anonymized MFM corpus
|
Structured output: APSO note + ICD-10 + CPT
|
DOCX generation (python-docx, local)
|
EHR paste or direct upload via HL7 FHIR
No PHI leaves the clinic network at any step. The model knows MFM vocabulary. The output matches my documentation style. The whole stack runs on a single machine.
This is not science fiction. I have built versions of each component of this pipeline. The local LLM piece is the one that took the longest to commit to. Once I did, the reliability and control justified the setup cost immediately.
Hardware Guidance
You do not need a data center.
For a solo practice or a pilot project, a Mac Studio M2 Ultra (192GB unified memory) will run llama3.1:70b at a usable inference speed. If you want dedicated GPU hardware, an NVIDIA RTX 4090 with 24GB VRAM comfortably runs Q4-quantized models up to roughly 30B parameters entirely in VRAM; a Q4-quantized 70B needs about 40GB for the weights alone, so on a single 4090 it requires partial CPU offload, which is slower but still workable for batch documentation tasks.
For cloud GPU access during fine-tuning without the capital expense, RunPod and Lambda Labs offer A100 instances by the hour. A fine-tuning run for a small clinical dataset costs under $10.
The local workstation handles daily inference. Cloud GPU time handles periodic retraining. That split keeps costs low and keeps PHI local.
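The sizing figures above are mostly arithmetic. A rough weight-only estimate (my ~4.5 bits per weight for Q4_K_M is an approximation, and KV cache plus activations add real overhead on top):

```python
def weight_memory_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate GB needed just to hold the model weights.

    params_billions * 1e9 weights * (bits/8) bytes each, expressed in GB.
    Ignores KV cache, activations, and runtime overhead.
    """
    return params_billions * bits_per_weight / 8

for size, label in [(8, "8B"), (70, "70B")]:
    for bits, quant in [(16, "FP16"), (8, "Q8_0"), (4.5, "Q4_K_M")]:
        print(f"{label} @ {quant}: ~{weight_memory_gb(size, bits):.1f} GB")
```

This is where the 16GB-for-8B and 48GB-plus-for-Q4-70B rules of thumb come from: an 8B model at FP16 is about 16GB of weights, and a 70B at roughly 4.5 bits per weight is just under 40GB before any runtime overhead.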
The Broader Point
The physician-developer community is at an inflection point. We are moving from building tools that wrap consumer APIs to building tools that are genuinely infrastructure-independent clinical software.
That shift requires us to think about model ownership the way we think about data ownership. Your clinical data belongs to your practice. Your clinical AI should too.
Consumer AI will continue to improve. The frontier models will keep getting better. But the governance gap between “model I run” and “model someone else runs” is structural. It does not close just because the model gets smarter.
Build smart tools. Run them yourself.
Chukwuma Onyeije, MD, FACOG is a Maternal-Fetal Medicine specialist and Medical Director at Atlanta Perinatal Associates. He is the founder of CodeCraftMD and OpenMFM.org. He writes at DoctorsWhoCode.blog about building clinical software at the intersection of medicine, engineering, and AI.