The AI check-up: we pitted BMJ Best Practice against Gen-AI. The results weren't even close.
Picture this: It’s 2 AM, you’re covering the ICU, and a complex patient lands on your doorstep. A 68-year-old with newly diagnosed atrial fibrillation, diabetes, and mild kidney disease needs a decision about anticoagulation – but which agent, what dose, and how do you navigate the contraindications? In the old days, you’d call your most trusted colleague for a curbside consult. Today, you might be tempted to ask ChatGPT or Grok.
The question keeping many of us awake (beyond those 2 AM consults) is whether generative AI can truly serve as that reliable clinical advisor. Can these impressive language models deliver the nuanced, evidence-based guidance we need for high-stakes decisions? Or should we stick with the clinical decision support tools we’ve trusted for years?
To find out, we conducted an informal head-to-head evaluation that might surprise you.
The clinical scenarios: where the rubber meets the road
We selected seven clinical questions that represent the bread and butter of modern practice—the scenarios where we most need reliable guidance. These ranged from diagnostic challenges like determining the right tests for a patient with suspected multiple sclerosis, to complex management decisions like treating a COPD exacerbation in a patient with diabetes, to navigating anticoagulation in a patient with permanent atrial fibrillation.
Each question was designed to test not just basic medical knowledge, but the kind of contextual, evidence-based reasoning that separates good clinical advice from dangerous oversimplification. Think about managing a patient with suspected celiac disease—it’s not enough to know the tests; you need to understand the sequence, the interpretation, and the clinical context that guides decision-making.
The experiment: David vs. Goliath (times two)
We posed identical clinical questions to three systems: BMJ Best Practice, our established evidence-based champion, and two of the most sophisticated generative AI models available – Grok 3.0 and ChatGPT 4.0.
The evaluation criteria were straightforward but rigorous: clinical validity, evidence-based approach, comprehensiveness, contextual relevance, and patient safety considerations. Essentially, we asked: “If we were residents seeking guidance, which response would actually help us provide better patient care?”
The results: a clear winner emerges
The results weren’t even close. BMJ Best Practice delivered consistently superior performance across every clinical scenario and evaluation dimension. But the real story lies in the details of how each system approached clinical reasoning.
| Performance dimension | BMJ Best Practice | Grok 3.0 | ChatGPT 4.0 |
| --- | --- | --- | --- |
| Evidence-based approach | Superior – Extensive current citations | Moderate – Some evidence basis | Limited – Minimal evidence references |
| Clinical comprehensiveness | Superior – Comprehensive coverage | Moderate – Adequate coverage | Limited – Basic coverage |
| Contextual relevance | Superior – Detailed contextual guidance | Moderate – Some context provided | Limited – Minimal contextual adaptation |
| Clinical organization | Superior – Clear structured hierarchy | Good – Well-organized | Moderate – Basic organization |
| Practical utility | Superior – Actionable clinical guidance | Moderate – Generally practical | Limited – Basic utility |
| Patient safety considerations | Superior – Extensive safety guidance | Moderate – Some safety considerations | Limited – Basic safety awareness |
BMJ Best Practice consistently provided what clinicians actually need: current evidence citations, clear hierarchies of first-line versus second-line interventions, detailed considerations for complex patients with comorbidities, and explicit safety warnings. When discussing pulmonary embolism management, for instance, it provided nuanced guidance for patients with renal impairment, obesity, and cancer—the real-world complexity we face daily.
The LLMs, while impressive in their breadth of knowledge, fell short in critical ways. They provided plausible-sounding answers that often lacked the depth and evidence base necessary for clinical decision-making. ChatGPT 4.0 was particularly concerning, offering superficial responses that might satisfy a curious patient but would leave a clinician wanting. Grok 3.0 performed better but still couldn’t match the systematic, evidence-based approach of a purpose-built clinical tool.
Implications for modern practice: the difference between plausible and proven
Here’s what this means for those of us at the bedside: There’s a crucial difference between information that sounds right and guidance that is right. Generative AI excels at producing coherent, convincing text, but clinical decision-making requires something more—systematic evidence evaluation, expert clinical judgment, and the kind of rigorous validation processes that ensure patient safety.
BMJ Best Practice represents years of systematic literature review, expert editorial oversight, and continuous updating by clinical specialists. It’s not just a database of medical facts; it’s a curated clinical reasoning system designed specifically for healthcare decision-making. The LLMs, impressive as they are, operate fundamentally differently—they’re pattern-matching systems trained on diverse text corpora without the clinical validation and evidence hierarchy that healthcare decisions demand.
This doesn’t mean generative AI has no place in medicine. These tools show remarkable promise for administrative tasks, clinical documentation, and making patient education materials more accessible. But when it comes to direct clinical decision support – the moments when patient outcomes hang in the balance – we need systems built specifically for healthcare, with the evidence base and validation processes that patient safety requires.
The lesson here isn’t to fear AI, but to be discerning consumers of it. As the healthcare landscape evolves, our obligation to patients remains constant: to base our decisions on the strongest available evidence, delivered through systems designed and validated for clinical use. That’s not just good medicine—it’s the foundation of professional integrity in an age of artificial intelligence.
In the end, there’s still no substitute for rigorous, evidence-based clinical decision support. The AI revolution in healthcare is real, but it’s not here yet—at least not for the decisions that matter most.
About the authors
Dr Blackford Middleton is a leading expert in clinical informatics and healthcare technology. At BMJ Group, he serves as a consultant for Digital Knowledge Products.
Dr Kieran Walsh is Clinical Director at BMJ Group. He is a general physician with extensive experience in medical education and evidence-based practice.
Competing interests
Dr Blackford Middleton acts as a consultant at BMJ Group. Dr Kieran Walsh works for BMJ Group.

Appendix
The clinical case questions:
- What tests should I order for a patient with heart failure with reduced ejection fraction?
- How should I treat a patient with acute exacerbation of COPD and diabetes?
- You suspect a patient has multiple sclerosis – what tests should you order?
- What tests should I order for an adult patient with hypertension?
- How should I manage a patient with permanent atrial fibrillation with no contraindications to long-term anticoagulation?
- What tests should I order for a patient with suspected celiac disease?
- What drug management should I advise for a patient with pulmonary embolism with an intermediate-low risk or low risk PESI/sPESI score with no contraindication to anticoagulation?