What a large-scale, independently conducted evaluation of ambient AI documentation found, and what it means for teams implementing these tools.
Gold Coast Hospital and Health Service (GCHHS), one of Australia's largest public health services, published peer-reviewed findings from a 16-week evaluation of ambient documentation across 19 specialties and 7,499 consultations.
Clinical documentation has become one of the most significant threats to sustainable healthcare delivery. Clinicians spend twice as much time on documentation as on direct patient care, contributing to burnout and fundamentally altering the therapeutic encounter.
Ambient AI documentation is one proposed solution that's moving from pilots into routine care. The published evidence base is still limited, but real-world findings are starting to emerge on impact, reliability, and what safe implementation requires.
Four outcome areas were assessed across the 16-week evaluation. The benefits were meaningful, and so were the implementation considerations.
Collectively, these results suggest meaningful benefits in clinician experience, workflow, note quality, and patient experience, alongside implementation considerations that require structured attention.
The lessons below reflect Lyrebird Health's interpretation of the GCHHS findings, informed by implementation experience across thousands of clinicians.
Documentation quality and structure vary across clinicians, specialties, and time pressure. The right evaluation question is comparative: does the tool improve note quality and reduce effort compared to your baseline, with acceptable risks and appropriate safeguards?
In the GCHHS evaluation, ambient-generated notes scored higher on the PDQI-9 than standard clinician notes (37.06/40 vs 34.56/40), while being produced faster and with reported workflow and patient-experience benefits. The PDQI-9 is a validated instrument that scores how "good" a clinical note is: whether it's clear and well organised, includes the important information, and would be useful to another clinician reading it later.
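The headline quality improvement quoted later in this piece follows directly from those two means:

```latex
\frac{37.06 - 34.56}{34.56} \approx 0.072 \quad \text{(a 7.2\% relative improvement)}
```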
A tool that isn't "perfect" may still represent a meaningful improvement in a high-pressure setting. The more useful question is whether it improves quality and reduces burden relative to what actually happens today.
Ambient AI documentation is intended as a draft that reduces effort while keeping clinician judgement firmly in the loop. In the GCHHS evaluation, 58% of outputs were accepted without modification on average; in the remaining cases, clinicians amended the draft before finalising.
The goal isn't zero edits; it's making review faster, more reliable, and harder to miss. Key facts should be easy to verify and easy to correct: especially medications, numbers, diagnoses, procedures, laterality, allergies, and safety-critical negatives.
What good looks like: the tool makes it easy for clinicians to quickly check the details that matter, spot anything that looks off, and amend without friction. Safe review should feel built in, not bolted on.
One safety risk highlighted in the evaluation is incorrect or unsupported content making it into the clinical record. Two findings in the paper quantify that risk but may seem hard to reconcile: 47% of survey respondents reported observing hallucinations at least once, while structured quality assessment of reviewed output pairs rated freedom from hallucination at 4.83/5.
These results capture different things. The survey reflects whether clinicians ever noticed something concerning across many consultations; the structured assessment rates the quality of a sample of reviewed outputs. Both signals matter.
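A back-of-envelope calculation shows how the two findings can coexist. Suppose, purely for illustration (this rate is an assumption, not a figure from the paper), that a small fraction p of notes contains a hallucination. The chance a clinician observes at least one across n consultations is:

```latex
P(\text{at least one observed}) = 1 - (1 - p)^n
```

With an assumed p = 1% and n = 60 consultations, that gives 1 − 0.99⁶⁰ ≈ 0.45: a per-note rate low enough to score well on structured review can still mean nearly half of clinicians see a hallucination at least once.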
What looks like a "hallucination" can arise from gaps or ambiguity in captured audio, missing context, or errors in attribution or synthesis. A reported hallucination is a frontline safety signal: something looked wrong, a clinician noticed it, and they reported it. That signal on its own doesn't explain why the concern appeared.
That's why it's imperative that ambient documentation tools make it easy for clinicians to flag concerns in the moment, and that teams have a structured quality assurance process to triage reports, investigate patterns, and close the loop with clinicians on what happened and why.
What ends up in the clinical record is shaped by two things: the tool itself, and the everyday habits that develop around it. As trust in the tool grows, complacency may increase over time, potentially perpetuating or amplifying inaccuracies if review quality drops.
It can be as simple as agreeing a small set of shared defaults: which details always get verified (such as medications, doses, allergies, and laterality), how amendments are made, and how concerns get flagged and reported.
That kind of light calibration tends to support consistent practice over time, especially as the workflow becomes familiar and speed naturally increases.
The evaluation showed meaningful variation in how different outpatient specialties used the tool across the trial period. The authors highlight orthopaedics as an example where participation was lower, likely reflecting how documentation works in that setting, where consult notes are often written by junior doctors with a preference for brevity.
The paper also describes a key implementation insight: the more clinicians invested in the tool, the more they got out of it. Template customisation and familiarisation were central to usability but sometimes difficult to prioritise in time-poor outpatient clinics. Clinicians who invested that upfront effort reported better outcomes.
Variation is expected across specialties, clinics, and individual clinicians. Rather than treating mixed experiences as a sign something is wrong, treat them as early feedback: where is this saving time, where is it improving notes, and where does it need adjustment?
It's easy to focus on efficiency when evaluating ambient documentation, but the GCHHS findings suggest the impact shows up in the room as well: 68% of patients said their clinician spent more time speaking directly with them, and 59% felt the technology had a positive effect on their visit.
Patient experience isn't just a side benefit; it's part of what implementation changes. And because it's relational, it can be sensitive to how the tool is introduced and how the workflow feels in practice.
Time saved is one measure. But clinician attention, patient rapport, and health literacy are also clinical outcomes. Implementation should track both.
The GCHHS evaluation adds what the field needs: early, real-world, peer-reviewed data on impact, usability, and the reliability issues that show up in routine outpatient workflows. These learnings can also shape how you evaluate vendors. Read the full vendor checklist →
How do you define and classify quality issues: capture problems, mis-structuring, hallucinations, bias? How are issues detected, automatically and via user reporting? Can clinicians flag an issue in seconds? (A sketch of one possible classification scheme follows this checklist.)
Does the interface make it easy to review the draft and verify key details: medications, doses, diagnoses, procedures? Are edits obvious, trackable, and fast, or easy to miss?
Can you track quality trends by specialty, template, and model version? How do you test updates before release? What happens when an update makes something worse?
What happens after a clinician flags an issue? How quickly do you respond? What reporting do customers get, and how often?
Will you share quality metrics and respond to independent evaluation? What does support look like post go-live?
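As a minimal sketch of what such a classification scheme might look like in practice (all names and fields below are illustrative assumptions, not taken from the paper or from any vendor's actual system), a flagged issue could be tagged with both a category and the dimensions needed to track trends by specialty, template, and model version:

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class IssueType(Enum):
    """Quality-issue categories named in the checklist above."""
    CAPTURE = "capture_problem"      # gaps or ambiguity in recorded audio
    STRUCTURE = "mis_structuring"    # right content, wrong place in the note
    HALLUCINATION = "hallucination"  # content unsupported by the encounter
    BIAS = "bias"                    # systematic skew in wording or emphasis


@dataclass
class QualityIssue:
    """A single clinician-flagged report, tagged with the dimensions
    needed to track quality trends over time."""
    issue_type: IssueType
    specialty: str
    template_id: str
    model_version: str
    description: str
    flagged_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


# Example: a clinician flags a suspected hallucination in seconds.
report = QualityIssue(
    issue_type=IssueType.HALLUCINATION,
    specialty="orthopaedics",
    template_id="outpatient-followup-v2",
    model_version="2024-12",
    description="Note lists a medication not mentioned in the consult.",
)
```

The point of a structure like this is that every report becomes queryable: a QA team can triage by category, spot patterns by specialty or template, and compare rates across model versions before and after updates.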
This evaluation advances the evidence base meaningfully, but it also surfaces questions the field has not yet answered at scale.
Ambient documentation is a rapidly evolving category. As more real-world evidence emerges, it will help clarify what matters when these tools are used in day-to-day practice. The GCHHS evaluation is an important contribution, but the conversation is ongoing.
Because Lyrebird was the ambient scribe used in this evaluation, the team came away with a rich set of practical learnings spanning product design, implementation, and the systems needed to monitor quality and support safe clinician review over time.
The tool has evolved significantly since the trial period (July to December 2024). We plan to share our learnings openly, both for transparency and to add to the broader body of understanding about ambient documentation tools in practice.
The first piece we will share is our framework for clinical note quality evaluation: the internal process we use to assess note quality in a structured way, make sense of issues that arise in real use (including hallucinations), and track improvements over time.
A large-scale, independently conducted, peer-reviewed evaluation found meaningful benefits and meaningful considerations for teams implementing ambient AI documentation.
Clinicians across 19 specialties and 7,499 consultations reported a positive impact on workflow and administrative burden.
37.06/40 mean PDQI-9 score for AI-generated notes vs 34.56/40 for clinician-written notes: a 7.2% quality improvement validated against a standardised clinical documentation metric.
68% of patients reported their clinician spent more time speaking directly with them during the consultation.
From baseline interpretation to patient experience measurement: practical guidance for teams implementing or evaluating ambient documentation.
We welcome feedback from clinicians, researchers, and healthcare leaders. Contact our team with questions about the evaluation, implementation, or what we've learned.