Peer-Reviewed Findings and Implementation Lessons from a 16-Week Outpatient Evaluation

Ambient AI Documentation in Practice at Gold Coast Hospital & Health Service
Disclosure
The tool evaluated in this study was developed by Lyrebird Health. Researchers from the Gold Coast Hospital and Health Service (GCHHS) designed, conducted, and evaluated the study, and submitted the article for publication independently.

Clinical documentation has become one of the most significant threats to sustainable healthcare delivery. Studies have found clinicians spending roughly twice as much time on documentation as on direct patient care, contributing to burnout and fundamentally altering the therapeutic encounter.
Ambient AI documentation is one proposed solution that’s moving from pilots into routine care. The published evidence base is still limited, but real-world findings are starting to emerge on impact, reliability, and what safe implementation requires.
GCHHS, one of Australia's largest public health services, recently published peer-reviewed findings from a 16-week evaluation of ambient documentation across 19 specialties and 7,499 consultations.
This article summarises those findings and extracts practical learnings for teams considering or implementing these tools.
Study snapshot
- Setting: Tertiary public hospital outpatient clinics at GCHHS, across 19 specialties including paediatrics, orthopaedics, cardiology, and mental health.
- Scale: 100 clinicians, 7,499 consultations, 16 weeks (July to December 2024).
- What was assessed: Tool performance (quality, utility, reliability) and impact on clinician and patient experience in routine practice.
- Methods: Mixed methods including staff and patient surveys, interviews, scribe outputs, and EMR review (PDQI-9, ROUGE; a short ROUGE illustration follows below).
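For readers unfamiliar with the metrics: the PDQI-9 is a rater-scored note quality instrument, while ROUGE measures textual overlap between two documents, here usable to compare a scribe draft against the finalised EMR note. Below is a minimal sketch of a ROUGE comparison using the open-source rouge-score package; the example texts are invented, and the study's exact configuration is not specified here.

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Invented example texts, not from the study.
final_note = ("Patient reports intermittent chest pain on exertion. "
              "ECG normal. Plan: outpatient stress test.")
scribe_draft = ("Patient reports chest pain on exertion. ECG was normal. "
                "Plan for an outpatient stress test.")

# ROUGE-1 scores unigram overlap; ROUGE-L scores the longest common subsequence.
scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"], use_stemmer=True)
scores = scorer.score(final_note, scribe_draft)  # score(target, prediction)

for metric, result in scores.items():
    print(f"{metric}: precision={result.precision:.2f}, "
          f"recall={result.recall:.2f}, f1={result.fmeasure:.2f}")
```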
Key outcomes
(1) Efficiency and workflow
- 84% of clinicians reported a positive impact on efficiency, alleviating administrative burden and freeing time for high-value tasks.
- 58% of ambient-generated content was accepted without modification into final clinical notes.
- Staff interviews revealed relief from the "dread" of post-clinic documentation.
(2) Note quality
- Ambient-generated notes scored 37.06/40 on the PDQI-9 versus 34.56/40 for clinician-written notes (blinded assessment of 18 matched pairs).
- AI outputs scored 4.83/5 for freedom from hallucination and 5.0/5 for freedom from bias.
- Notes were consistently rated as more thorough, better organised, and more useful than standard practice.
(3) Patient experience
- 68% of patients reported their clinician spent more time speaking directly with them.
- 59% reported the technology had a positive effect on their visit.
- Clinicians described more therapeutic conversations and better eye contact during sensitive discussions.
(4) Reliability and safety
- 47% of clinicians reported observing hallucinations at least once during the 16-week trial (self-reported, 43% response rate).
- 16% observed potential bias in outputs.
Collectively, these results suggest benefits across clinician experience and workflow, note quality, and patient experience, alongside implementation considerations that require attention.
Implementation lessons
The results highlight several considerations for implementing ambient documentation tools safely in practice.
These include how teams monitor quality over time, how they identify and respond to reliability issues, and how workflows support safe clinician review - echoing similar themes explored in recent randomised trials of ambient documentation.
The lessons below reflect Lyrebird Health’s interpretation of the GCHHS findings, informed by implementation experience.
(1) Interpret impact relative to baseline documentation quality
Documentation quality and structure vary across clinicians, specialties, and time pressure.
The right evaluation question is comparative: does the tool improve note quality and reduce effort compared to your baseline, with acceptable risks and appropriate safeguards?
In the GCHHS evaluation, ambient-generated notes scored higher on PDQI-9 than standard clinician notes (37.06/40 vs 34.56/40), while being produced faster with reported workflow and patient-experience benefits. The PDQI-9 is a validated instrument that scores how "good" a clinical note is: whether it's clear and well organised, includes the important information, and would be useful to another clinician reading it later.
For teams considering implementation, it may be useful to measure that baseline explicitly, because in routine care notes are created under real constraints: clinics run behind, and documentation workflows rely on a mix of templates, dictation, copy-forward, and end-of-day catch-up. "Perfect notes" isn't the right benchmark: the more useful question is whether a tool improves quality and reduces burden relative to what actually happens today.
Beyond what "typical" documentation looks like, it is important to measure the pressures shaping it: whether notes are completed in-clinic or after-hours, how often clinics run behind, and how much structure already exists through templates and workflows.
A tool that isn't “perfect” may still represent meaningful improvement in a high-pressure setting.
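As a minimal sketch of what measuring the baseline could look like: score a sample of current notes and a sample of tool-assisted notes on the same instrument, then compare the means. The scores below are invented for illustration; a real evaluation would use blinded raters and matched note pairs, as GCHHS did.

```python
from statistics import mean

# Invented instrument totals (e.g. summed PDQI-9 item ratings) for sampled notes.
baseline_totals = [31, 35, 33, 36, 34]   # clinician-written notes, current workflow
assisted_totals = [36, 38, 35, 37, 39]   # ambient-generated notes after review

def summarise(label: str, totals: list[int]) -> None:
    """Print the mean total quality score for a sample of notes."""
    print(f"{label}: mean={mean(totals):.2f} (n={len(totals)})")

summarise("Baseline", baseline_totals)
summarise("Assisted", assisted_totals)
print(f"Difference in means: {mean(assisted_totals) - mean(baseline_totals):+.2f}")
```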
(2) Design for safe review: optimise for reviewability, not perfection
Ambient AI documentation is intended as a draft that reduces effort while keeping clinician judgement firmly in the loop.
In the GCHHS evaluation, 58% of outputs were accepted without modification; in the remaining cases, clinicians amended the draft before finalising.
The goal isn't zero edits: it's making reviews faster, more reliable, and harder to miss.
Key facts should be easy to verify and easy to correct, especially medications, numbers, diagnoses, procedures, laterality, allergies, and safety-critical negatives.
What good looks like: the tool makes it easy for clinicians to quickly check the details that matter, spot anything that looks off, and amend without friction. Safe review should feel built in, not bolted on.
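As one illustration of reviewability in practice, here is a minimal sketch that surfaces safety-critical details in a draft for explicit verification. The patterns, helper, and note text are assumptions for illustration, not Lyrebird's implementation; a real product would rely on structured fields and clinical NLP rather than regex alone.

```python
import re

# Illustrative patterns for details a reviewer should confirm before sign-off.
SAFETY_CRITICAL_PATTERNS = {
    "medication/dose": r"\b\d+(?:\.\d+)?\s?(?:mg|mcg|g|mL|units)\b",
    "laterality": r"\b(?:left|right|bilateral)\b",
    "allergy": r"\ballerg(?:y|ies|ic)\b",
    "safety-critical negative": r"\b(?:no|denies|nil)\b",
}

def review_checklist(draft_note: str) -> list[tuple[str, str]]:
    """Return (category, matched text) pairs for the clinician to verify."""
    findings = []
    for category, pattern in SAFETY_CRITICAL_PATTERNS.items():
        for match in re.finditer(pattern, draft_note, flags=re.IGNORECASE):
            findings.append((category, match.group(0)))
    return findings

draft = "Right knee pain. Denies allergies. Commenced ibuprofen 400 mg TDS."
for category, text in review_checklist(draft):
    print(f"verify [{category}]: {text}")
```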
(3) Build a quality assurance loop to make errors hard to miss
One safety risk highlighted in the evaluation is incorrect or unsupported content making it into the clinical record. Two findings in the paper help quantify that risk, but on the surface, they may seem hard to reconcile.
- In staff surveys, 47% of respondents reported observing hallucinations at least once during the trial, and 16% reported observing potential bias (43% response rate).
- In a separate structured quality assessment of a small sample of note pairs, outputs were rated highly on "freedom from hallucination" (4.83/5) and "freedom from bias" (5.0/5).
These results capture different things.
The survey reflects whether clinicians ever noticed something concerning during routine use across many consultations. These survey findings may be subject to self-selection bias, as individuals with personal experience of hallucinations would likely be more motivated to participate, potentially leading to an inflated hallucination rate.
The structured assessment rates the quality of a small sample of reviewed outputs.
Both signals matter.
It also helps to unpack what “hallucination” means in practice. Clinicians may use the term whenever something in the draft note doesn't match their recollection of the encounter, doesn't fit the clinical context, or doesn't appear supported by what was said.
In that sense, a reported hallucination is a frontline safety signal: a clinician noticed something that didn't look right and reported it.
However, that signal on its own doesn’t explain why the concern appeared in the output.
What looks like a "hallucination" can arise for different reasons: from gaps or ambiguity in what was captured, to missing context, to errors in attribution or synthesis. Sometimes more than one factor is involved.
That’s why it’s imperative that ambient documentation tools make it easy for clinicians to flag concerns in the moment, and that teams have a structured quality assurance process to triage reports, investigate patterns, and minimise repeat issues over time.
Closing that loop matters too - sharing outcomes with clinicians so they understand what happened, why it happened, and how similar issues will be handled in future.
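One way to operationalise that loop, sketched below: record each flagged concern with a probable-cause category mirroring the failure modes above, then aggregate to surface recurring patterns. The data model and categories are illustrative assumptions, not the GCHHS or Lyrebird process.

```python
from collections import Counter
from dataclasses import dataclass

@dataclass
class FlaggedIssue:
    note_id: str
    specialty: str
    description: str
    probable_cause: str  # assigned at triage: "capture_gap", "missing_context",
                         # "attribution_error", or "synthesis_error"

def recurring_patterns(issues: list[FlaggedIssue]) -> Counter:
    """Count issues by (specialty, probable cause) to surface repeat problems."""
    return Counter((i.specialty, i.probable_cause) for i in issues)

issues = [
    FlaggedIssue("n1", "cardiology", "dose absent from transcript", "capture_gap"),
    FlaggedIssue("n2", "cardiology", "symptom assigned to wrong speaker", "attribution_error"),
    FlaggedIssue("n3", "cardiology", "medication never discussed", "capture_gap"),
]
for (specialty, cause), count in recurring_patterns(issues).most_common():
    print(f"{specialty} / {cause}: {count}")
```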
(4) Create shared norms to support consistent, safe use
What ends up in the clinical record is shaped by two things: the tool itself, and the everyday habits that develop around it. Those habits determine which details get a quick scan, which get a closer look, and what makes someone pause before accepting an output.
The authors note that hallucinations and bias pose risk particularly if clinicians don't thoroughly review and edit generated notes, and they raise a longer-term question: as trust in the tool grows, complacency may increase over time, potentially perpetuating or amplifying inaccuracies.
That's why the authors emphasise human oversight and frequent quality checks as these tools are used more widely. It can be helpful for teams to make "safe use defaults" explicit, so safety doesn't rely on each person independently reinventing a checking approach.
In practice, this is a shared responsibility: healthcare teams bringing clinical judgement, and vendors supporting safe habits through product design and implementation support. It doesn't need to be heavy-handed.
It can be as simple as agreeing a small set of shared defaults:
- Consent is obtained before the recording starts
- The clinician reviews the entire note before signing off
- Any uncertainty or inconsistency triggers a pause, not a quick correction
- Issues are flagged immediately through a standard pathway
That kind of light calibration tends to support consistent practice over time, especially as the workflow becomes familiar and speed naturally increases.
(5) Value is contextual: expect different starting points
The evaluation showed meaningful variation in how different outpatient specialties used the tool across the trial period.
The authors highlight orthopaedics as an example where participation was lower, suggesting this likely reflects how documentation works in that setting: consult notes are often written by junior doctors, with a stronger preference for brevity.
The paper also describes a key implementation insight: the more clinicians invested in the tool, the more they got out of it. Template customisation and familiarisation were central to usability but sometimes difficult to prioritise in time-poor outpatient clinics. Whether clinicians persisted depended on whether the output suited their documentation needs and felt worth the upfront effort.
Variation is expected at multiple levels: between specialties, between clinics, and between individual clinicians. Rather than treating mixed experiences as a sign something is wrong, it's more useful to treat them as early feedback.
Where is this saving time? Where is it improving notes? Where does it need adjustment to better match the way that clinic documents? Those early learnings make broader implementation smoother over time.
(6) Look beyond just time saved: measure the patient experience
It's easy to focus on efficiency when evaluating ambient documentation, but the GCHHS findings suggest the impact shows up in the room as well.
In patient surveys, 68% said their clinician spent more time speaking directly with them, and 59% felt the technology had a positive effect on their visit. Clinicians described similar shifts: more direct conversation and better eye contact, including during sensitive discussions.
Patient experience isn't just a side benefit. It's part of what implementation changes. And because it's relational, it can be sensitive to how the tool is introduced and how the workflow feels in practice.
Time saved is one measure. But clinician attention, patient rapport, and health literacy are also clinical outcomes, and implementations should track them alongside efficiency.

Evaluating vendors
Ambient documentation is a rapidly evolving category, and as more real-world evidence emerges, it clarifies what matters when these tools are used in day-to-day practice.
The GCHHS evaluation adds something the field needs: early, real-world, peer-reviewed data on impact, usability, and the kinds of reliability issues that show up in routine outpatient workflows.
These learnings can also shape how you evaluate vendors in practice: look past feature checklists, focus on how the tool performs in real clinic workflows, and monitor how the vendor manages quality over time.
Here are practical questions to ask:
- Quality definitions and detection: How do you define and classify quality issues (capture problems, mis-structuring, hallucinations, bias)? How are issues detected, both automatically and via user reporting? Can clinicians flag an issue in seconds?
- Review workflow: Does the interface make it easy to review the draft and verify key details (medications, doses, diagnoses, procedures)? Are edits obvious, trackable, and fast, or easy to miss?
- Monitoring and governance: Can you track quality trends by specialty, template, and model version (a minimal sketch follows this list)? How do you test updates before release? What happens when an update makes something worse?
- Feedback loops: What happens after a clinician flags an issue? How quickly do you respond? What reporting do customers get, and how often?
- Transparency and partnership: Will you share quality metrics and respond to independent evaluation? What does support look like post go-live?
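And as a sketch of what tracking quality trends could look like in practice (column names and figures are illustrative assumptions, not a vendor's actual schema):

```python
import pandas as pd

# Hypothetical per-note quality log, populated from clinician flags and audits.
log = pd.DataFrame({
    "specialty":     ["cardiology", "cardiology", "orthopaedics", "orthopaedics"],
    "model_version": ["v1", "v2", "v1", "v2"],
    "issue_flagged": [1, 0, 0, 1],  # 1 = an issue was flagged for this note
})

# Issue rate by specialty and model version: a simple trend to watch across releases.
trend = log.groupby(["specialty", "model_version"])["issue_flagged"].mean()
print(trend)
```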
This evaluation also raises broader questions the field still needs to answer:
- Do benefits persist over longer time horizons, and does complacency develop?
- How do outcomes compare across vendors and settings?
- What implementation strategies work best by specialty?
- How do we measure and mitigate bias reliably?
Since the evaluation
Because Lyrebird was the ambient scribe used in this evaluation, we came away with a rich set of practical learnings spanning product design, implementation, and the systems needed to monitor quality and support safe clinician review over time.
We plan to share these learnings openly, both for transparency and to add to the broader body of understanding about ambient documentation tools in practice. The first piece we will share is our framework for clinical note quality evaluation: the internal process we use to assess note quality in a structured way, make sense of issues that arise in real use (including hallucinations), and track improvements over time.
The tool has evolved significantly since the trial period (July to December 2024).

This article was reviewed by the clinical and research leadership team at Lyrebird Health, who are committed to objective interpretation of research findings and transparent discussion of both benefits and limitations. We'll continue publishing what we learn.
Continue the conversation:
We welcome feedback from clinicians, researchers, and healthcare leaders - contact our team at clinical@lyrebirdhealth.com.
Read the full study:
Performance, acceptability, and impact of ambient listening scribe technology in an outpatient context: a mixed methods trial evaluation. BMC Health Serv Res (2025).
