How Lyrebird measures what matters - with the same rigour we expect from clinical practice itself.
At Lyrebird Health, evaluation and quality monitoring are foundational to how we build and improve our AI systems.
As a clinician-first organisation, we believe that trustworthy AI in healthcare demands the same rigour we expect from clinical practice: evidence-based, continuously reviewed, and transparent about limitations.
Our approach is informed by real-world application at scale. Lyrebird is the most widely used medical AI scribe for Australian GPs, supporting millions of consultations every year. That scale and depth give us both the data to identify issues early and the clinical relationships to resolve them meaningfully.
We believe that human judgement remains the gold standard for assessing the quality of complex, free-text clinical notes and documentation.
Our evaluation framework reflects this by inviting clinicians into every stage, from early model development through to post-deployment monitoring. We pair this with automated methods to guide development more efficiently, ensuring human reviewer time is always spent where it matters most.
A clinical note is more than a simple record.
It informs downstream care decisions, supports billing and compliance, and reflects the clinical reasoning of the treating clinician. A clinical note that is inaccurate, incomplete, or poorly structured can create genuine clinical risk.
This is why we treat evaluation as a core function, not a periodic check. The goal is not just to catch errors, but to build a precise and shared understanding of what a good clinical note actually looks like, and to measure systematically how close our system gets.
Lyrebird's documentation engine is composed of two core components working in sequence.
Each component is evaluated both individually and as part of the end-to-end pipeline. An issue at either stage can affect the quality of the final note, so both must be held to a high standard.
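To make the per-stage and end-to-end distinction concrete, here is a minimal sketch. It assumes, for illustration only, that the two components are a transcription stage and a note-generation stage (the article does not name them), and that each stage's review produces a countable error total; all names and metrics are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class StageResult:
    """Review outcome for one pipeline stage over a batch of notes.
    Stage names here are assumptions, not Lyrebird's actual components."""
    stage: str
    error_count: int      # countable errors attributed to this stage
    notes_reviewed: int

    @property
    def error_rate(self) -> float:
        return self.error_count / self.notes_reviewed

def end_to_end_error_rate(stages: list[StageResult]) -> float:
    """An issue at either stage affects the final note, so the
    end-to-end rate pools errors from every stage over the same batch."""
    total_errors = sum(s.error_count for s in stages)
    total_notes = stages[0].notes_reviewed  # the same notes flow through each stage
    return total_errors / total_notes
```

For example, 3 transcription errors and 5 generation errors over the same 100 notes give per-stage rates of 0.03 and 0.05, but an end-to-end rate of 0.08, which is why both stages must be held to a high standard individually.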
Our framework is built around a simple but important design choice: binary and countable metrics over subjective numeric scales.
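The difference between countable classifications and subjective scales can be sketched as follows. The field names below are illustrative, not Lyrebird's actual schema; the point is that each check is a binary flag or a count ("did this error occur, and how many times") rather than a 1-to-5 quality score, so results aggregate into interpretable rates.

```python
from dataclasses import dataclass

@dataclass
class NoteEvaluation:
    """One reviewed note. Fields are hypothetical examples of
    binary/countable checks, not a documented Lyrebird schema."""
    hallucination_count: int   # statements unsupported by the transcript
    omission_count: int        # clinically relevant content left out
    structure_error: bool      # required note sections missing or misplaced

def aggregate(evals: list[NoteEvaluation]) -> dict[str, float]:
    """Countable metrics roll up into rates that are directly
    actionable, unlike an averaged subjective score."""
    n = len(evals)
    return {
        "hallucinations_per_note": sum(e.hallucination_count for e in evals) / n,
        "omissions_per_note": sum(e.omission_count for e in evals) / n,
        "structure_error_rate": sum(e.structure_error for e in evals) / n,
    }
```

A rate like "0.5 hallucinations per note" tells a team exactly what to fix; a drop from 4.2 to 4.0 on a subjective scale does not.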
Our framework evaluates clinical notes across five dimensions.
For hallucinations specifically, we use a support-level classification system that grades the severity of the error based on how much of the generated content reflects what was actually said or could reasonably be inferred from the transcript.
This distinction matters because not all hallucinations carry the same risk - reasonable clinical inference differs meaningfully from complete fabrication - and our framework is designed to reflect that gradation.
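A support-level grading of this kind might be sketched as below. The category names and the numeric severity mapping are assumptions for illustration; the article does not publish Lyrebird's actual labels.

```python
from enum import Enum

class SupportLevel(Enum):
    """Hypothetical support levels for generated content,
    ordered from most to least grounded in the transcript."""
    STATED = "explicitly supported by the transcript"
    INFERRED = "reasonable clinical inference from what was said"
    UNSUPPORTED = "no basis in the transcript"

def hallucination_severity(level: SupportLevel) -> int:
    """Higher scores mean higher-risk content: complete fabrication
    outranks reasonable inference, which outranks content the
    transcript directly supports."""
    return {
        SupportLevel.STATED: 0,
        SupportLevel.INFERRED: 1,
        SupportLevel.UNSUPPORTED: 2,
    }[level]
```

Grading on a spectrum like this, rather than a single "hallucination: yes/no" flag, is what lets the framework treat reasonable inference and outright fabrication differently.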
We grade each error by its potential clinical impact:

Critical: Could reasonably lead to patient harm if the note were used without correction - for example, a wrong medication in a patient with a documented allergy, a missing red flag symptom, or a dosage error of 10x or more.

Moderate: An error with a plausible but lower-probability path to patient harm. Clinically meaningful, but unlikely to cause serious harm in most scenarios if left uncorrected.

Minor: Little to no impact on clinical care - for example, missing documentation of social chit-chat, minor formatting issues, or a slight variation in routine follow-up timing.
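The three tiers above could be triaged with a sketch like this one. The tier names and the two boolean decision inputs are illustrative simplifications, not Lyrebird's actual triage logic.

```python
from enum import Enum

class Severity(Enum):
    """Hypothetical three-tier grading of a documentation error."""
    CRITICAL = 3   # plausible path to serious patient harm if uncorrected
    MODERATE = 2   # clinically meaningful, lower-probability path to harm
    MINOR = 1      # little to no impact on clinical care

def grade(harm_possible: bool, harm_probable: bool) -> Severity:
    """Assumed decision rule: an error with a probable path to harm is
    critical; a possible but improbable path is moderate; otherwise minor."""
    if harm_possible and harm_probable:
        return Severity.CRITICAL
    if harm_possible:
        return Severity.MODERATE
    return Severity.MINOR
```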
Automated metrics are efficient and essential - but they're not sufficient.
While our framework uses binary, countable error classifications rather than subjective numeric scales, applying those classifications to complex clinical notes requires human judgement - determining whether content represents reasonable clinical inference or unsupported fabrication, whether an omission is clinically relevant, or whether documented reasoning appropriately reflects the encounter.
Human review validates whether quantitative improvements translate to meaningful gains in clinical utility, and flags issues that metrics alone might miss.
In our framework, human judgement operates at two levels.
Automated evaluation establishes a consistent baseline we track over time, detecting performance changes as models evolve - while human judgement validates that quantitative improvements translate to meaningful clinical utility.
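One simple way to operationalise that division of labour - assumed here for illustration, with thresholds and window chosen arbitrarily - is an automated check that compares each new model version's error rate against a historical baseline and escalates possible regressions to human reviewers rather than signing off on them automatically.

```python
def flag_for_review(history: list[float], current: float,
                    tolerance: float = 0.02) -> bool:
    """Return True when the current error rate exceeds the mean of the
    historical baseline by more than `tolerance` - a possible regression
    that automated metrics alone should not adjudicate. The tolerance
    value is an illustrative placeholder, not a published threshold."""
    baseline = sum(history) / len(history)
    return current > baseline + tolerance
```

The automated check runs on every release and keeps the longitudinal record consistent; human judgement is reserved for the cases the check flags, which is where reviewer time matters most.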
Clinical note quality cannot be reduced to a single metric, which is why our framework evaluates across five distinct dimensions using binary, countable classifications that provide actionable insight rather than subjective scores.
Setting standards for clinical documentation quality is not something any single company should do alone. Lyrebird is committed to participating in the broader conversation through our research, our clinical partnerships, and our engagement with the wider community as we work toward evaluation frameworks that allow for meaningful progress across the field.
This article was reviewed by the clinical and research leadership team at Lyrebird Health, who are committed to objective interpretation of research findings and transparent discussion of both benefits and limitations. We'll continue publishing what we learn — and we welcome feedback from clinicians, researchers, and healthcare leaders.