BRIDGE Benchmark Exposes Clinical LLM Performance Gaps

There is a version of an AI demo that almost every clinician has seen by now: a frontier model walks through a medical vignette, nails the diagnosis, cites the guideline, and the audience is duly impressed. The demo is real. The vignette, however, is not. Real clinical text looks nothing like a multiple-choice question. It looks like an emergency department note typed at 2 a.m. by a resident who abbreviates everything, switches between shorthand and full sentences mid-paragraph, and occasionally records the date in three different formats within the same record. BRIDGE was built to test models on that second document, not the first.

## The Benchmark Problem Nobody Wanted to Talk About

Most LLM evaluations in healthcare have leaned on two sources: medical licensing exam questions and PubMed abstracts. Both are clean, well-structured, and written to be read. According to the BRIDGE paper published in Nature Biomedical Engineering, this is exactly the problem: existing benchmarks "rely on medical examination-style questions or PubMed-derived text, failing to capture the complexity of real-world electronic health record data."

The structural flaw goes deeper than data cleanliness. David Talby, writing about two clinical AI deployments he worked on directly, put it plainly: "GPT-4 passes the medical exam" became shorthand for "GPT-4 is ready for clinical text," and those two claims have almost nothing to do with each other. One is a closed-book multiple-choice test. The other is a live pipeline processing notes from a dozen specialties, in multiple languages, under time pressure.

A broader systematic review of 39 clinical LLM benchmarks, published on PubMed Central, named this the "knowledge-practice performance gap": the consistent finding that benchmark scores on medical knowledge questions do not reliably predict performance on clinical practice tasks.
That review reached the same conclusion across all 39 benchmarks it examined: the leaderboard number and the deployment reality measure different things. BRIDGE was designed specifically to close that gap.

## What BRIDGE Actually Measures

BRIDGE, developed with involvement from Harvard Medical School, Mass General Brigham, the Broad Institute, and YLab, is a multilingual benchmark comprising 87 tasks sourced from real-world EHR data, according to the BRIDGE leaderboard documentation on Hugging Face. The benchmark covers multiple languages, clinical specialties, and task types, everything from named entity recognition to clinical reasoning over patient timelines. Mass General Brigham's press release describes its intent as evaluating AI performance on "everyday patient care" text rather than idealized scenarios, which is a more honest framing than most benchmark launches manage.

The scale of evaluation has grown since the original arXiv preprint. The Nature Biomedical Engineering publication evaluated 95 LLMs across those 87 tasks, and the live leaderboard on Hugging Face had reached 107 models evaluated as of its most recent update, according to the leaderboard documentation. That breadth matters: comparing 107 models across 87 tasks spanning real clinical text gives you a very different signal than comparing five models on 50 USMLE questions.

## Why EHR Text Is a Different Beast

The reason standard benchmarks miss this gap is not mysterious: it is architectural. Clinical notes introduce abbreviation sets that vary by institution, inconsistent formatting, implicit temporal reasoning ("symptoms worsening since last Tuesday" requires knowing when Tuesday was relative to the note date), and cross-lingual complexity in health systems that serve multilingual populations.
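To make the temporal-reasoning point concrete, here is a minimal Python sketch of what "last Tuesday" demands of a system. The function name `resolve_last_weekday` and the example dates are hypothetical, not taken from BRIDGE, and the sketch assumes that "last Tuesday" means the most recent Tuesday strictly before the note date; real notes can be more ambiguous than that.

```python
from datetime import date, timedelta

# Weekday names in the order used by date.weekday() (Monday == 0).
WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_last_weekday(note_date: date, weekday_name: str) -> date:
    """Resolve 'last <weekday>' to an absolute date.

    Assumption (illustrative only): the phrase refers to the most recent
    occurrence of that weekday strictly before the note date.
    """
    target = WEEKDAYS.index(weekday_name.lower())
    days_back = (note_date.weekday() - target) % 7
    if days_back == 0:
        # The note was written on that weekday, so go back a full week.
        days_back = 7
    return note_date - timedelta(days=days_back)


# The same phrase resolves differently depending on when the note was written.
friday_note = date(2024, 3, 15)   # a Friday
tuesday_note = date(2024, 3, 12)  # a Tuesday
print(resolve_last_weekday(friday_note, "Tuesday"))   # 2024-03-12
print(resolve_last_weekday(tuesday_note, "Tuesday"))  # 2024-03-05
```

The phrase on its own carries no absolute date. Without the note's timestamp, or with a timestamp recorded in an unexpected format, the onset date cannot be recovered, and exam-style questions never force a model to do this step.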
According to the BRIDGE paper in Nature Biomedical Engineering, the benchmark was specifically designed to capture performance differences across models, languages, tasks, and specialties, dimensions that exam-style benchmarks collapse into a single accuracy score.

Talby's analysis of two specific deployment failures (one involving adverse-event extraction from opioid progress notes for an FDA Sentinel program, and another involving drug-causality reasoning over patient timelines) illustrates what the gap looks like in practice. In both cases, models that performed well on standard evaluations struggled on the actual clinical text pipeline. The benchmark score had predicted confidence; the deployment revealed the limits of that confidence. These are exactly the failure modes BRIDGE was designed to make visible before a system goes anywhere near a patient record.

## What This Means for Builders and Evaluators

If you are building or evaluating any AI system that will touch clinical text, BRIDGE gives you a concrete alternative to the usual evaluation theater. The leaderboard is live and public on Hugging Face, which means you can compare how specific models perform across specific task types rather than relying on a single aggregate score. The multilingual scope is also worth noting: if your deployment environment includes non-English clinical text, a benchmark that only scores English USMLE questions is telling you almost nothing useful.

The broader lesson here extends well past healthcare. Every domain has its version of this problem: the clean benchmark that measures a proxy for the real task rather than the real task itself. Clinical NLP just happens to be a domain where the cost of that mismatch is high enough that researchers finally built a benchmark rigorous enough to expose it.
The knowledge-practice performance gap review on PubMed Central found this pattern across 39 separate evaluations; BRIDGE is the most comprehensive attempt yet to instrument the gap directly. For anyone serious about deploying AI in high-stakes settings, understanding how your model performs on BRIDGE-style evaluation is now table stakes, not a nice-to-have.

The BRIDGE leaderboard will keep updating as new models are submitted, which means the comparison set only gets richer over time. Watch for how domain-specific fine-tuned models perform relative to frontier general-purpose models across the multilingual tasks specifically. That is where the most instructive performance differences are likely to emerge.

A model that aces the exam and fumbles the chart note is not a clinical AI tool. It is a very expensive study partner.

## Sources

- BRIDGE: Benchmarking Large Language Models for Understanding Real-world Clinical Practice Text