The PERFORM Study: What Happens When AI and OB-GYN Residents Go Head-to-Head?

Artificial intelligence is no longer just a buzzword in healthcare—it is becoming a real, measurable force in medical education and clinical decision-making. The PERFORM Study, recently published in Mayo Clinic Proceedings: Digital Health, takes this conversation to the next level by directly comparing AI large language models (LLMs) with human OB-GYN residents across standardized clinical scenarios.

The results? Both promising and provocative.


Why This Study Matters

We’ve all seen headlines about AI passing medical licensing exams with flying colors. But standardized tests are one thing—real-world decision-making under stress, language barriers, and variable training levels is another. That’s what the PERFORM Study set out to explore.

Researchers examined 8 different AI LLMs alongside 24 OB-GYN residents (years 1–5). They tested performance on 60 clinical cases, in both English and Italian, under timed and untimed conditions. The focus wasn’t just on “who scored higher,” but also on how error patterns differed and how AI might be integrated into medical training.


Key Findings

1. AI Outperforms on Accuracy and Stability

  • AI LLMs averaged 73.8% accuracy, compared to 65.4% for residents.
  • The top models (ChatGPT o1-preview, GPT-4o, and Claude 3.5 Sonnet) scored above 80%, with remarkable consistency across languages and time pressures.
  • Human residents, by contrast, saw their accuracy drop from 73.2% to 49.6% when placed under time constraints.

Takeaway: AI doesn’t sweat under the clock or stumble over language—two very human vulnerabilities.


2. Different Brains, Different Blind Spots

Error analysis showed only a moderate correlation between AI and human mistakes (r = 0.666): the two groups tended to go wrong on different cases.

That means:

  • AI can catch what humans miss.
  • Humans can catch what AI misses.

This complementary pattern points to a future where AI is less competitor and more safety net.
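For the coders in the audience: a correlation like that r = 0.666 can be computed by encoding each case as 1 (error) or 0 (correct) for each group and taking the Pearson correlation across cases. Here is a minimal Python sketch of that calculation. The error vectors below are simulated stand-ins, not the study's actual per-case data.

```python
import numpy as np

# Hypothetical per-case error indicators for the same 60 cases:
# 1 = answered incorrectly, 0 = answered correctly.
# Simulated stand-ins, NOT the study's actual data.
rng = np.random.default_rng(0)
ai_errors = rng.integers(0, 2, size=60)
# Let resident errors agree with AI errors ~70% of the time,
# mimicking a moderate (not perfect) overlap in error patterns.
resident_errors = np.where(
    rng.random(60) < 0.7, ai_errors, rng.integers(0, 2, size=60)
)

# Pearson correlation between the two 0/1 error vectors
r = np.corrcoef(ai_errors, resident_errors)[0, 1]
print(f"error-pattern correlation r = {r:.3f}")

# The complementary zone: cases one group gets wrong and the other gets right
ai_only = int(np.sum((ai_errors == 1) & (resident_errors == 0)))
resident_only = int(np.sum((ai_errors == 0) & (resident_errors == 1)))
print(f"AI-only errors: {ai_only}, resident-only errors: {resident_only}")
```

On 0/1 vectors, Pearson's r is the phi coefficient, and the two "only" counts are exactly the complementary zone where each side can serve as the other's safety net.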


3. Training-Level Impact: A Tale of Two Residents

The biggest winners from AI augmentation were early-career residents:

  • Year 1 residents improved by +29.7% when assisted by AI.
  • Year 2 residents saw +28.1% gains.

By Years 4–5, the benefits dwindled and even reversed slightly, suggesting that AI support should be stage-specific. For juniors, AI scaffolds learning. For seniors, over-reliance risks crowding out the independent judgment residency is meant to build.
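What might "stage-specific" look like in software? Here is a hypothetical design sketch, not anything from the study itself: a simple policy that maps training year to an assistance mode. The modes and cutoffs are invented purely to illustrate the idea.

```python
from enum import Enum

class AssistMode(Enum):
    FULL_SUPPORT = "proactive suggestions with rationale on every case"
    SECOND_OPINION = "AI weighs in only after the trainee commits to an answer"
    AUDIT_ONLY = "AI silently logs disagreements for later debrief"

def assist_mode(pgy_year: int) -> AssistMode:
    """Map training year to an AI assistance level.

    The cutoffs are hypothetical, motivated by the PERFORM pattern of
    large gains at Years 1-2 fading (or reversing) by Years 4-5.
    """
    if pgy_year <= 2:
        return AssistMode.FULL_SUPPORT
    if pgy_year == 3:
        return AssistMode.SECOND_OPINION
    return AssistMode.AUDIT_ONLY

for year in range(1, 6):
    print(f"Year {year}: {assist_mode(year).value}")
```

The point of the second-opinion mode is to preserve the trainee's first commitment to a decision, so the AI sharpens independent reasoning rather than supplanting it.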


What This Means for Medical Education

This study underscores a crucial truth: AI is not one-size-fits-all.

  • For novice clinicians, AI could act like a senior resident at your side, double-checking and filling in gaps.
  • For experienced trainees, AI is best framed as a “second opinion” tool—helpful under stress, but not a crutch.
  • For attendings, AI may help validate intuition, provide consistency under fatigue, and even challenge unconscious biases.

As medical educators, the challenge will be weaving AI into curricula in a way that enhances growth without eroding independent reasoning.


The Risks We Can’t Ignore

The authors rightly caution that AI is not infallible:

  • Hallucinations: AI can generate wrong but convincing answers.
  • Bias: Training data may not represent diverse populations.
  • Explainability: Without explainable AI (XAI), it's hard to know why a model recommended a particular course of action.

For doctors who code (and teach), this is the next frontier: building transparent, bias-aware, explainable AI systems that clinicians can trust.


Reflections for Doctors Who Code

Reading the PERFORM Study, I was struck by how it crystallizes the role of AI in medicine—not as a replacement, but as a silent partner.

Imagine:

  • A first-year resident in the delivery room, nerves high, time short—AI could act as a stabilizer, quietly checking the logic of their decision-making.
  • A senior fellow, burned out at 2am, facing a complex case—AI could provide a steady, consistent “second set of eyes.”
  • A program director designing curricula—AI could be woven into simulation labs, helping identify where trainees struggle most.

For those of us exploring the intersection of AI, coding, and medicine, the PERFORM Study isn’t just a data point. It’s a call to build thoughtfully, to design tools that augment human care without erasing the very qualities—empathy, intuition, creativity—that make us physicians.


Final Word

The PERFORM Study shows us both the promise and perils of AI in OB-GYN. The path forward isn’t about choosing between AI and humans—it’s about designing systems where each strengthens the other.

The future of OB-GYN training (and medicine at large) may well rest in this balance: AI’s data-driven consistency + the human clinician’s contextual judgment.

What do you think? Should AI become part of residency training? Or should it stay a tool reserved for practicing physicians?


🔖 Citation: Martinelli C, Giordano A, Carnevale V, et al. The PERFORM Study: Artificial Intelligence Versus Human Residents in Cross-Sectional Obstetrics-Gynecology Scenarios Across Languages and Time Constraints. Mayo Clin Proc Digital Health. 2025;3(2):100206.