What a large-scale, independently conducted evaluation of ambient AI documentation found, and what it means for teams implementing these tools.
Gold Coast Hospital and Health Service (GCHHS), one of Australia's largest public health services, published peer-reviewed findings from a 16-week evaluation of ambient documentation across 19 specialties and 7,499 consultations.
Clinical documentation has become one of the most significant threats to sustainable healthcare delivery. Clinicians spend twice as much time on documentation as on direct patient care, contributing to burnout and fundamentally altering the therapeutic encounter.
Ambient AI documentation is one proposed solution that's moving from pilots into routine care. The published evidence base is still limited, but real-world findings are starting to emerge on impact, reliability, and what safe implementation requires.
Four outcome areas were assessed across the 16-week evaluation. The benefits were meaningful, and so were the implementation considerations.
Collectively, these results suggest meaningful benefits in clinician experience, workflow, note quality, and patient experience, alongside implementation considerations that require structured attention.
The lessons below reflect Lyrebird Health's interpretation of the GCHHS findings, informed by implementation experience across thousands of clinicians.
Documentation quality and structure vary across clinicians, specialties, and time pressure. The right evaluation question is comparative: does the tool improve note quality and reduce effort compared to your baseline, with acceptable risks and appropriate safeguards?
In the GCHHS evaluation, ambient-generated notes scored higher on the PDQI-9 than standard clinician notes (37.06/40 vs 34.56/40), while being produced faster and with reported workflow and patient-experience benefits. The PDQI-9 is a validated instrument that scores how "good" a clinical note is: whether it's clear and well organised, includes the important information, and would be useful to another clinician reading it later.
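The headline quality improvement quoted later in this piece follows directly from those two means:

```latex
\frac{37.06 - 34.56}{34.56} \approx 0.072 \quad \text{(a 7.2\% relative improvement)}
```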
A tool that isn't "perfect" may still represent a meaningful improvement in a high-pressure setting. The more useful question is whether it improves quality and reduces burden relative to what actually happens today.
Ambient AI documentation is intended as a draft that reduces effort while keeping clinician judgement firmly in the loop. In the GCHHS evaluation, 58% of outputs were accepted without modification on average; in the remaining cases, clinicians amended the draft before finalising.
The goal isn't zero edits; it's making review faster, more reliable, and harder to miss. Key facts should be easy to verify and easy to correct: especially medications, numbers, diagnoses, procedures, laterality, allergies, and safety-critical negatives.
What good looks like: the tool makes it easy for clinicians to quickly check the details that matter, spot anything that looks off, and amend without friction. Safe review should feel built in, not bolted on.
One safety risk highlighted in the evaluation is incorrect or unsupported content making it into the clinical record. Two findings in the paper quantify that risk but may seem hard to reconcile: 47% of survey respondents reported observing hallucinations at least once, while structured quality assessment of reviewed output pairs rated freedom from hallucination at 4.83/5.
These results capture different things. The survey reflects whether clinicians ever noticed something concerning across many consultations; the structured assessment rates the quality of a sample of reviewed outputs. Both signals matter.
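A back-of-envelope calculation shows how the two findings can coexist. Suppose, purely for illustration (this rate is an assumption, not a figure from the paper), that a small fraction p of notes contains a hallucination. The chance a clinician observes at least one across n consultations is:

```latex
P(\text{at least one observed}) = 1 - (1 - p)^n
```

With an assumed p = 1% and n = 60 consultations, that gives 1 − 0.99⁶⁰ ≈ 0.45: a per-note rate low enough to score well on structured review can still mean nearly half of clinicians see a hallucination at least once.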
What looks like a "hallucination" can arise from gaps or ambiguity in captured audio, missing context, or errors in attribution or synthesis. A reported hallucination is a frontline safety signal: something looked wrong, a clinician noticed it, and they reported it. That signal on its own doesn't explain why the concern appeared.
That's why it's imperative that ambient documentation tools make it easy for clinicians to flag concerns in the moment, and that teams have a structured quality assurance process to triage reports, investigate patterns, and close the loop with clinicians on what happened and why.
What ends up in the clinical record is shaped by two things: the tool itself, and the everyday habits that develop around it. As trust in the tool grows, complacency may increase over time, potentially perpetuating or amplifying inaccuracies if review quality drops.
It can be as simple as agreeing a small set of shared defaults: which details always get verified (such as medications, doses, allergies, and laterality), how amendments are made, and how concerns get flagged and reported.
That kind of light calibration tends to support consistent practice over time, especially as the workflow becomes familiar and speed naturally increases.
The evaluation showed meaningful variation in how different outpatient specialties used the tool across the trial period. The authors highlight orthopaedics as an example where participation was lower, likely reflecting how documentation works in that setting, where consult notes are often written by junior doctors with a preference for brevity.
The paper also describes a key implementation insight: the more clinicians invested in the tool, the more they got out of it. Template customisation and familiarisation were central to usability but sometimes difficult to prioritise in time-poor outpatient clinics. Clinicians who invested that upfront effort reported better outcomes.
Variation is expected across specialties, clinics, and individual clinicians. Rather than treating mixed experiences as a sign something is wrong, treat them as early feedback: where is this saving time, where is it improving notes, and where does it need adjustment?
It's easy to focus on efficiency when evaluating ambient documentation, but the GCHHS findings suggest the impact shows up in the room as well: 68% of patients said their clinician spent more time speaking directly with them, and 59% felt the technology had a positive effect on their visit.
Patient experience isn't just a side benefit; it's part of what implementation changes. And because it's relational, it can be sensitive to how the tool is introduced and how the workflow feels in practice.
Time saved is one measure. But clinician attention, patient rapport, and health literacy are also clinical outcomes. Implementation should track both.
The GCHHS evaluation adds what the field needs: early, real-world, peer-reviewed data on impact, usability, and the reliability issues that show up in routine outpatient workflows. These learnings can also shape how you evaluate vendors. Read the full vendor checklist →
How do you define and classify quality issues: capture problems, mis-structuring, hallucinations, bias? How are issues detected, automatically and via user reporting? Can clinicians flag an issue in seconds? (A sketch of one possible classification scheme follows this checklist.)
Does the interface make it easy to review the draft and verify key details: medications, doses, diagnoses, procedures? Are edits obvious, trackable, and fast, or easy to miss?
Can you track quality trends by specialty, template, and model version? How do you test updates before release? What happens when an update makes something worse?
What happens after a clinician flags an issue? How quickly do you respond? What reporting do customers get, and how often?
Will you share quality metrics and respond to independent evaluation? What does support look like post go-live?
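As a minimal sketch of what such a classification scheme might look like in practice (all names and fields below are illustrative assumptions, not taken from the paper or from any vendor's actual system), a flagged issue could be tagged with both a category and the dimensions needed to track trends by specialty, template, and model version:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class IssueType(Enum):
    """Quality-issue categories named in the checklist above."""
    CAPTURE = "capture_problem"      # gaps or ambiguity in recorded audio
    STRUCTURE = "mis_structuring"    # right content, wrong place in the note
    HALLUCINATION = "hallucination"  # content unsupported by the encounter
    BIAS = "bias"                    # systematic skew in wording or emphasis


@dataclass
class QualityIssue:
    """A single clinician-flagged report, tagged with the dimensions
    needed to track quality trends over time."""
    issue_type: IssueType
    specialty: str
    template_id: str
    model_version: str
    description: str
    flagged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Example: a clinician flags a suspected hallucination in seconds.
report = QualityIssue(
    issue_type=IssueType.HALLUCINATION,
    specialty="orthopaedics",
    template_id="outpatient-followup-v2",
    model_version="2024-12",
    description="Note lists a medication not mentioned in the consult.",
)
```

The point of a structure like this is that every report becomes queryable: a QA team can triage by category, spot patterns by specialty or template, and compare rates across model versions before and after updates.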
This evaluation advances the evidence base meaningfully, but it also surfaces questions the field has not yet answered at scale.
Ambient documentation is a rapidly evolving category. As more real-world evidence emerges, it will help clarify what matters when these tools are used in day-to-day practice. The GCHHS evaluation is an important contribution, but the conversation is ongoing.
Because Lyrebird was the ambient scribe used in this evaluation, the team came away with a rich set of practical learnings spanning product design, implementation, and the systems needed to monitor quality and support safe clinician review over time.
The tool has evolved significantly since the trial period (July to December 2024). We plan to share our learnings openly, both for transparency and to add to the broader body of understanding about ambient documentation tools in practice.
The first piece we will share is our framework for clinical note quality evaluation: the internal process we use to assess note quality in a structured way, make sense of issues that arise in real use (including hallucinations), and track improvements over time.
A large-scale, independently conducted, peer-reviewed evaluation found meaningful benefits and meaningful considerations for teams implementing ambient AI documentation.
Clinicians across 19 specialties and 7,499 consultations reported a positive impact on workflow and administrative burden.
37.06/40 mean PDQI-9 score for AI-generated notes vs 34.56/40 for clinician-written notes: a 7.2% quality improvement validated against a standardised clinical documentation metric.
68% of patients reported their clinician spent more time speaking directly with them during the consultation.
From baseline interpretation to patient experience measurement: practical guidance for teams implementing or evaluating ambient documentation.
We welcome feedback from clinicians, researchers, and healthcare leaders. Contact our team with questions about the evaluation, implementation, or what we've learned.