Clinical Research · Quality & Monitoring

A Scientific Framework for Evaluating AI-Generated Clinical Note Quality

How Lyrebird measures what matters - with the same rigour we expect from clinical practice itself.

Organisation: Lyrebird Health
Reviewed by: Clinical & Research Leadership
Focus: AI Scribe Quality Evaluation
01 /Introduction

Commitment and expertise

At Lyrebird Health, evaluation and quality monitoring are foundational to how we build and improve our AI systems.

As a clinician-first organisation, we believe that trustworthy AI in healthcare demands the same rigour we expect from clinical practice: evidence-based, continuously reviewed, and transparent about limitations.

Our approach is informed by real-world application at scale. Lyrebird is the most widely used medical AI scribe for Australian GPs, supporting millions of consultations every year. That scale and depth give us both the data to identify issues early and the clinical relationships to resolve them meaningfully.

We believe that human judgement remains the gold standard for assessing the quality of complex, free-text clinical notes and documentation.

Our evaluation framework reflects this by inviting clinicians into every stage, from early model development through to post-deployment monitoring. We pair this with automated methods to guide development more efficiently, ensuring human reviewer time is always spent where it matters most.

Lyrebird Health clinical team
02 /Quality & Monitoring

Why clinical note quality evaluation matters

A clinical note is more than a simple record.

It informs downstream care decisions, supports billing and compliance, and reflects the clinical reasoning of the treating clinician. A clinical note that is inaccurate, incomplete, or poorly structured can create genuine clinical risk.

This is why we treat evaluation as a core function, not a periodic check. The goal is not just to catch errors, but to build a precise and shared understanding of what a good clinical note actually looks like, and to measure systematically how close our system gets.

How the Lyrebird clinical documentation engine works

Lyrebird's documentation engine is composed of two core components working in sequence.

Component 01
Speech Recognition
A speech recognition system optimised for the language of clinical encounters, capturing the terminology, cadence, and complexity of real consultations.
Component 02
Note Generation
A note-generation system that takes the resulting transcript and produces a structured clinical note, shaped by the context and requirements of the encounter.

Each component is evaluated both individually and as part of the end-to-end pipeline. An issue at either stage can affect the quality of the final note, so both must be held to a high standard.
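To make the two-stage structure concrete, here is a minimal sketch of how such a pipeline could be composed. Every name in it - transcribe_audio, generate_note, EncounterContext - is a hypothetical placeholder for illustration, not Lyrebird's actual interface.

```python
from dataclasses import dataclass


@dataclass
class EncounterContext:
    """Hypothetical consultation context (illustrative only)."""
    specialty: str
    note_template: str  # e.g. a SOAP-style template identifier


def transcribe_audio(audio: bytes) -> str:
    """Stage 1 (placeholder): speech recognition tuned to clinical language."""
    raise NotImplementedError


def generate_note(transcript: str, context: EncounterContext) -> str:
    """Stage 2 (placeholder): structured note generation from the transcript."""
    raise NotImplementedError


def document_encounter(audio: bytes, context: EncounterContext) -> str:
    # The two components run in sequence, so an error at either stage
    # propagates into the final note - which is why each is evaluated
    # both individually and end-to-end.
    transcript = transcribe_audio(audio)
    return generate_note(transcript, context)
```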

03 /Evaluation Design

The design principle behind our evaluation framework

Our framework is built around a simple but important design choice: binary and countable metrics over subjective numeric scales.

The problem with numeric scales
"Rate this note: 1 to 5"
Different clinicians may interpret subjective numeric scales (such as 1-5 or 0-10) in different ways. This variability makes it difficult to determine whether a difference in scores reflects a genuine quality issue, and provides limited insight into what specifically went wrong or how the system should be improved.
The Lyrebird approach
Count discrete, observable errors
Did this note contain a hallucination?
Was a clinically relevant problem omitted?
Is there an inaccuracy in the documented history?
These questions have more clearly defined answers, which helps improve consistency across reviewers and makes it easier to link evaluation findings to specific system improvements.
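To illustrate why discrete questions aggregate cleanly, the sketch below records one reviewer's binary answers per note and rolls a sample up into per-question error rates that different reviewers can meaningfully compare. The field names are illustrative assumptions, not Lyrebird's internal schema.

```python
from dataclasses import dataclass


@dataclass
class NoteReview:
    """One reviewer's binary answers for a single note (illustrative)."""
    contains_hallucination: bool
    omitted_relevant_problem: bool
    history_inaccuracy: bool


def error_rates(reviews: list[NoteReview]) -> dict[str, float]:
    """Fraction of reviewed notes flagged for each question."""
    if not reviews:
        return {}
    n = len(reviews)
    return {
        "hallucination": sum(r.contains_hallucination for r in reviews) / n,
        "omission": sum(r.omitted_relevant_problem for r in reviews) / n,
        "history_inaccuracy": sum(r.history_inaccuracy for r in reviews) / n,
    }
```

Because each answer is a yes/no observation rather than a point on a scale, the aggregate is directly interpretable: a 4% hallucination rate means 4% of notes contained a hallucination.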
04 /Framework

Five dimensions of evaluation

Our framework evaluates clinical notes across five dimensions.

01
Hallucinations: Content in the note not present in or inferable from the transcript
Counted
02
Accuracy errors: Content discussed but captured incorrectly - graded by clinical safety impact
Graded
03
Omissions: Clinically relevant content discussed but absent from the note
Counted
04
Irrelevant inclusions: Content present in the note that was not relevant to the clinical encounter
Counted
05
Formatting issues: Structural or presentation problems that reduce clinical utility
Counted
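One way a taxonomy like this could be encoded is as a per-note error record: plain counts for the four counted dimensions, and a severity grade (the grades are defined under Safety-Critical Grading below) attached to each accuracy error. All names here are illustrative assumptions rather than Lyrebird's schema.

```python
from dataclasses import dataclass, field
from enum import Enum


class Severity(Enum):
    """Clinical safety impact grades, as described in the grading section below."""
    SAFETY_CRITICAL = "safety-critical"
    MODERATE = "moderate"
    MINIMAL = "minimal"


@dataclass
class NoteErrorRecord:
    """Tally for a single note across the five dimensions (illustrative)."""
    hallucinations: int = 0                                        # counted
    accuracy_errors: list[Severity] = field(default_factory=list)  # graded
    omissions: int = 0                                             # counted
    irrelevant_inclusions: int = 0                                 # counted
    formatting_issues: int = 0                                     # counted

    def has_safety_critical_error(self) -> bool:
        return Severity.SAFETY_CRITICAL in self.accuracy_errors
```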
05 /Hallucination Classification

Not all hallucinations carry the same risk

For hallucinations specifically, we use a support-level classification system that grades the severity of the error based on how much of the generated content reflects what was actually said or could reasonably be inferred from the transcript.

This distinction matters because not all hallucinations carry the same risk - reasonable clinical inference differs meaningfully from complete fabrication - and our framework is designed to reflect that gradation.

Support-Level Classification
Level 01: Reasonable Inference
Content not explicitly stated, but defensible from clinical context. 90% or more of clinicians in this specialty would make the same inference - typically guideline-driven or standard-of-care reasoning.
Level 02: Questionable Inference
A plausible clinical link exists between transcript and note, but clinical opinion would be split. The inference is speculative, not clearly supported by guidelines, or jumps to a conclusion without sufficient basis.
Level 03: Unsupported
No reasonable link between transcript and note content. The information was never discussed and cannot be inferred from any clinical context - complete fabrication.
Level 04: Contradiction
The note directly conflicts with what was said in the transcript - the opposite of what was documented, discussed, or decided during the encounter.
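An ordered enum is a natural way to represent this gradation, since questions like "unsupported or worse?" then fall out of the ordering. The sketch below is illustrative, not Lyrebird's internal representation.

```python
from enum import IntEnum


class SupportLevel(IntEnum):
    """How well generated content is supported by the transcript (illustrative)."""
    REASONABLE_INFERENCE = 1    # defensible from clinical context
    QUESTIONABLE_INFERENCE = 2  # plausible link, but clinical opinion would be split
    UNSUPPORTED = 3             # never discussed and not inferable - fabrication
    CONTRADICTION = 4           # directly conflicts with the transcript


def warrants_escalation(level: SupportLevel) -> bool:
    # Fabrications and contradictions carry more risk than defensible
    # inference, so they form a natural threshold for closer review.
    return level >= SupportLevel.UNSUPPORTED
```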
Safety-Critical Grading
Safety-Critical

Could reasonably lead to patient harm if the note were used without correction - for example, a wrong medication in a patient with a documented allergy, a missing red flag symptom, or a dosage error of 10x or more.

Moderate

An error with a plausible but lower-probability path to patient harm. Clinically meaningful, but unlikely to cause serious harm in most scenarios if left uncorrected.

Minimal

Little to no impact on clinical care - for example, missing documentation of social chit-chat, minor formatting issues, or a slight variation in routine follow-up timing.

06 /Evaluation Process

Where human judgement fits

Automated metrics are efficient and essential - but they're not sufficient.

While our framework uses binary, countable error classifications rather than subjective numeric scales, applying those classifications to complex clinical notes requires human judgement - determining whether content represents reasonable clinical inference or unsupported fabrication, whether an omission is clinically relevant, or whether documented reasoning appropriately reflects the encounter.

Human review validates whether quantitative improvements translate to meaningful gains in clinical utility, and flags issues that metrics alone might miss.

In our framework, human judgement operates at two levels.

Before deployment
Blinded head-to-head evaluation
Before any significant model update goes live, licensed clinicians review notes from the current and candidate systems side by side, with no knowledge of which system produced which note. This provides direct evidence that a new model is genuinely better, not just different.
Ongoing
Clinical input during real-world use
This includes structured feedback during staged rollouts, qualitative input from our clinician user base, and de-identified and aggregated data from the edits clinicians make when finalising notes in their daily workflow. Together, they provide a picture of real-world performance that no benchmark dataset alone can capture.
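Of the two levels, the pre-deployment comparison is the most mechanical, and its blinding step is easy to picture: randomise which system's note appears on which side before the reviewer sees the pair, and keep the answer key separate until preferences are recorded. The sketch below is a toy illustration, not Lyrebird's review tooling.

```python
import random


def blinded_pair(current_note: str, candidate_note: str) -> tuple[dict, dict]:
    """Randomise presentation order; keep the unblinding key separate."""
    notes = [("current", current_note), ("candidate", candidate_note)]
    random.shuffle(notes)
    presentation = {"A": notes[0][1], "B": notes[1][1]}  # what the reviewer sees
    key = {"A": notes[0][0], "B": notes[1][0]}           # consulted only after review
    return presentation, key


def unblind(key: dict, preferred_side: str) -> str:
    """Map a blind preference ("A" or "B") back to the system that produced it."""
    return key[preferred_side]
```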

Automated evaluation establishes a consistent baseline we track over time, detecting performance changes as models evolve - while human judgement validates that quantitative improvements translate to meaningful clinical utility.
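A baseline of this kind can act as a simple regression gate: compare a candidate system's per-dimension error rates against the tracked baseline and flag any dimension that regresses beyond a tolerance, routing those cases to human review. The thresholds and numbers below are illustrative assumptions.

```python
def regressions(baseline: dict[str, float],
                candidate: dict[str, float],
                tolerance: float = 0.01) -> list[str]:
    """Dimensions where the candidate's error rate exceeds baseline + tolerance."""
    return [dim for dim, rate in candidate.items()
            if rate > baseline.get(dim, 0.0) + tolerance]


# Hypothetical numbers: a rise in the omission rate is flagged for
# human review before the candidate ships; hallucinations are unchanged.
flagged = regressions(
    baseline={"hallucination": 0.02, "omission": 0.05},
    candidate={"hallucination": 0.02, "omission": 0.08},
)
assert flagged == ["omission"]
```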

07 /Conclusion

A continuous, multi-dimensional approach

Clinical note quality cannot be reduced to a single metric, which is why our framework evaluates across five distinct dimensions using binary, countable classifications that provide actionable insight rather than subjective scores.

Five evaluation dimensions
Hallucinations, accuracy errors, omissions, irrelevant inclusions, and formatting - each measured distinctly with countable classifications.
Two layers of human review
Blinded pre-deployment comparison and ongoing real-world clinical feedback - together validating that numbers translate to outcomes.
Continuous improvement
Every model update, prompt change, or improvement is evaluated against the current system; every deployment is monitored in production; and feedback from clinicians guides where we focus next.
Shared standards for the industry

Setting standards for clinical documentation quality is not something any single company should do alone. Lyrebird is committed to participating in the broader conversation through our research, our clinical partnerships, and our engagement with the wider community as we work toward evaluation frameworks that allow for meaningful progress across the field.

Continue the conversation

This article was reviewed by the clinical and research leadership team at Lyrebird Health, who are committed to objective interpretation of research findings and transparent discussion of both benefits and limitations. We'll continue publishing what we learn - and we welcome feedback from clinicians, researchers, and healthcare leaders.