Clinical + Code · 8 min read

AI Just Outperformed Physicians at Clinical Writing. This Should Not Surprise You.

HealthBench Professional shows AI has already crossed the threshold in clinical writing and documentation. The real lesson is not replacement. It is that physician-developers need to build the harness.


On April 22, 2026, OpenAI released ChatGPT for Clinicians and published HealthBench Professional, a benchmark built around real clinician chat tasks.

The result that mattered most was not subtle.

GPT-5.4 inside ChatGPT for Clinicians outperformed human physicians across care consult, writing and documentation, and medical research.

That should not surprise anyone who has been paying attention.

I have been arguing on DoctorsWhoCode.blog that the threshold was never the real question. The real question was when evaluation would catch up to what these systems were already becoming good at in practice.

HealthBench Professional looks a lot like that moment.

This Is Not Benchmark Theater

The methodology matters here.

HealthBench Professional is not built on board-style multiple choice questions, synthetic vignettes, or narrow knowledge recall. It evaluates clinician-facing work that already lives inside real workflows: care consult, writing and documentation, and medical research.

According to the paper, 190 physicians from 50 countries and 26 specialties contributed to a candidate pool of 15,079 examples. The final benchmark contains 525 physician-authored tasks, selected for quality, representativeness, and difficulty. Every conversation and every rubric went through three stages of physician authorship, review, and adjudication. Difficult cases were deliberately overrepresented by about 3.5x, and about one-third of the benchmark comes from red-teaming.

The human baseline was not weak.

Specialty-matched physicians wrote responses with unbounded time and web access. That is a serious comparison target.

GPT-5.4 in ChatGPT for Clinicians still cleared it.

Overall score: 59.0. Human physicians: 43.7. The reported p-value is 3.7 x 10^-10.

That gap is not noise. It is signal.

The Row That Matters

If you build or evaluate tools in this space, this is the table to study:

| Use case                  | ChatGPT for Clinicians | Human physicians | Base GPT-5.4 | p-value vs. physicians |
|---------------------------|------------------------|------------------|--------------|------------------------|
| Care consult              | 51.0                   | 42.7             | ~51.0        | 0.025                  |
| Writing and documentation | 64.1                   | 32.1             | 34.6         | 1.2 x 10^-8            |
| Medical research          | 67.0                   | 56.3             | 58.1         | 0.0015                 |

Read the writing and documentation row again.

64.1 versus 32.1.

Nearly double the physician baseline on note generation, summarization, coding, patient messaging, and structured documentation. The paper also reports that its primary score is adjusted for response length, specifically to reduce the chance that verbosity inflates performance.
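The paper does not publish the exact adjustment formula, but the idea behind a length-adjusted score is easy to illustrate. Here is a minimal Python sketch, assuming a simple per-excess-word penalty; the function names and the penalty constant are illustrative, not from HealthBench Professional.

```python
# Hypothetical sketch of length-adjusted rubric scoring.
# The actual HealthBench Professional adjustment is not reproduced here;
# this only illustrates why verbosity alone cannot inflate the score.

def rubric_score(points_earned: float, points_possible: float) -> float:
    """Raw rubric score in [0, 1]."""
    return points_earned / points_possible

def length_adjusted_score(points_earned: float,
                          points_possible: float,
                          response_words: int,
                          reference_words: int,
                          penalty_per_excess_word: float = 0.0002) -> float:
    """Subtract a small penalty for each word beyond the reference length.

    penalty_per_excess_word is an illustrative constant, not from the paper.
    """
    raw = rubric_score(points_earned, points_possible)
    excess = max(0, response_words - reference_words)
    return max(0.0, raw - penalty_per_excess_word * excess)

# A 900-word response earning the same rubric points as a 400-word response
# ends up with a lower adjusted score against a 400-word reference.
concise = length_adjusted_score(7, 10, 400, 400)
verbose = length_adjusted_score(7, 10, 900, 400)
```

Under any adjustment of this shape, a model cannot buy points by padding the note. Whatever the paper's exact method, that is the property the adjustment exists to guarantee.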

This is not a reason for clinicians to feel replaced.

It is a reason for physician-builders to stop pretending this category is still hypothetical.

The Framing Is Wrong

People will hear a result like this and ask whether AI is replacing doctors.

That is the wrong question.

Clinical writing has always contained at least two very different kinds of work: meaning and assembly.

Meaning is interpretation, prioritization, clinical judgment, and accountability. Assembly is reconstruction, formatting, summarization, coding support, and translation of the encounter into the structures the system requires.

Those two kinds of work have been fused together for years because our software is bad.

That does not mean they are inseparable.

I made this argument recently in Clinical Documentation Automation Should Remove Friction, Not Replace the Doctor. The goal is not to erase the physician from the note. The goal is to stop wasting physician judgment on tasks that mostly consume time.

HealthBench Professional does not prove that doctors are obsolete.

It proves that documentation friction was always more automatable than many institutions wanted to admit.

Why This Result Feels Familiar

I built the signout-to-APSO pipeline for exactly this reason.

Not because I could not write the note myself. Because the cognitive load at the end of a complex inpatient day is real. Documentation debt accumulates. It pulls attention away from the patient in front of you and drags judgment into clerical cleanup after the meaningful thinking is already done.

When a system can reliably compress signout, structure the active problems, and draft a clean clinical scaffold, that is not an attack on physician identity.

It is restored bandwidth.
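The assembly half of that pipeline is mostly structure, not judgment. A minimal sketch, assuming signout arrives as structured problem entries (the schema and field names here are illustrative, not the actual pipeline's):

```python
# Illustrative sketch of a signout-to-APSO scaffold.
# APSO ordering puts Assessment and Plan first, then Subjective and
# Objective, so the judgment-bearing content leads the note.

from dataclasses import dataclass

@dataclass
class Problem:
    name: str
    assessment: str
    plan: list[str]

def draft_apso(problems: list[Problem], subjective: str, objective: str) -> str:
    """Assemble an APSO-ordered draft from structured signout data."""
    lines = ["ASSESSMENT / PLAN"]
    for p in problems:
        lines.append(f"# {p.name}: {p.assessment}")
        lines.extend(f"  - {step}" for step in p.plan)
    lines += ["", "SUBJECTIVE", subjective, "", "OBJECTIVE", objective]
    return "\n".join(lines)

note = draft_apso(
    [Problem("AKI", "improving on fluids", ["trend creatinine", "renally dose meds"])],
    subjective="No overnight events.",
    objective="Afebrile, hemodynamically stable.",
)
```

Everything in that function is assembly. The assessment text, the plan items, the prioritization of problems: that is the meaning layer, and it stays with the physician.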

That is why this benchmark feels less like a surprise than a formal confirmation.

The Most Important Technical Insight In The Paper

There is another result in the paper that matters just as much as the headline score.

Base GPT-5.4 scored 34.6 on writing and documentation. GPT-5.4 inside ChatGPT for Clinicians scored 64.1.

Same underlying model family. Very different outcome.

That is the builder’s lesson.

The product layer matters.

The harness matters.

The retrieval path, the prompt structure, the workflow design, the use of tools, the domain constraints, the citation behavior, the interaction model, and the surrounding interface architecture matter as much as the raw model. I made a version of this argument in Doctors Who Code: Build Systems, Not Just Models. HealthBench Professional gives that argument numbers.

Physician-developers should read this paper as validation of a design principle:

The model is not the product.

The clinical system around the model is the product.
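To make the base-versus-harnessed gap concrete, here is a minimal Python sketch of what a product layer adds around an identical model call. `call_model` and `retrieve_guidelines` are stand-ins, not any real API; the point is the shape of the wrapper, not the implementation.

```python
# Illustrative harness sketch: the same underlying model call, with and
# without a product layer. Both helper functions below are stand-ins.

def call_model(prompt: str) -> str:
    """Stand-in for a frontier model call."""
    return f"[model response to {len(prompt)}-char prompt]"

def retrieve_guidelines(query: str) -> list[str]:
    """Stand-in for the retrieval path."""
    return [f"guideline snippet relevant to: {query}"]

def bare_model(task: str) -> str:
    """Base model: the task goes in raw, nothing else."""
    return call_model(task)

def harnessed_model(task: str, specialty: str) -> str:
    """Product layer: retrieval, role framing, citation and refusal rules.

    The model is identical; only the system around it changes.
    """
    context = "\n".join(retrieve_guidelines(task))
    prompt = (
        f"You are assisting a {specialty} clinician.\n"
        f"Relevant references:\n{context}\n"
        "Cite each reference you rely on. "
        "If information is missing, say so rather than guessing.\n"
        f"Task: {task}"
    )
    return call_model(prompt)
```

Retrieval, role framing, citation discipline, and an explicit instruction not to guess: none of that is the model, and on this benchmark that layer is worth roughly 30 points.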

For Clinicians Who Do Not Code

You do not need to write TypeScript to use this result well.

You do need to ask better questions.

If your institution is evaluating an AI vendor, do not settle for aggregate claims. Ask how the system performs on the specific categories your department actually cares about. Ask whether the benchmark used real clinician tasks or synthetic stand-ins. Ask how the rubrics were authored and adjudicated. Ask whether the product layer is disclosed, because this paper makes clear that the harness can create a 30-point swing.

That is not implementation trivia.

That is the system.

For Physicians Who Code

Build something.

The opportunity is not gone because a frontier model got better. The opportunity is clearer because the core substrate now works well enough to be worth shaping.

The specialty gaps are still real. The paper notes that several specialty comparisons, including OB/GYN, were statistically mixed. I noticed that immediately.

That is where physician-developers still matter most.

We know where the context is thin, where the stakes are asymmetric, where the workflow breaks, and where general-purpose tooling still does not fit the lived structure of care. The next wave of useful clinical software will not come from generic capability alone. It will come from people who understand both the medicine and the system design.

That remains the DoctorsWhoCode thesis.

Physicians who build will shape the harness.

Everyone else will inherit it.

Read the Sources

The threshold was never the story.

The story is who builds what comes next.


Related Posts

Clinical + Code Featured

Clinical Documentation Automation Should Remove Friction, Not Replace the Doctor

Documentation automation is not a typing solution. It is a workflow design problem. For physicians and physician-developers, the real goal is protecting clinical judgment from administrative drag.

· 8 min read
clinical-ai · doctors-who-code · medical-software
Clinical + Code Featured

Doctors Who Code: Build Systems, Not Just Models

A TEDx pitch says physicians should build AI. I agree. But the work that matters is governance, validation, and delivery, not one-afternoon demos.

· 8 min read
ai · doctors-who-code · medical-ai
AI in Medicine Featured

Your Patients Are Already Using ChatGPT to Decide Whether to Call You

They are not asking it for fun. They are asking it because the triage line puts them on hold for 45 minutes. The threat is not the technology. The threat is the system that made the technology necessary.

· 10 min read
patient-safety · clinical-ai · physician-developer
Chukwuma Onyeije, MD, FACOG

Maternal-Fetal Medicine Specialist

MFM specialist at Atlanta Perinatal Associates. Founder of CodeCraftMD and OpenMFM.org. I write about building physician-owned AI tools, clinical software, and the case for doctors who code.