Clinical AI is moving into bigger, more consequential parts of healthcare. In this keynote, Lyrebird Health sets out why purpose-built, deliberately scaled, and independently evaluated clinical AI is the only appropriate response to what the stakes demand.
DHF 2026 Keynote · Building responsibly as the capability of AI expands · Kai Van Lieshout & Dr. Ray Boyapati
Full keynote transcript · DHF 2026
Hi, I'm Kai, the founder and CEO of Lyrebird Health. We're on a mission to extend the human healthspan by building AI that helps healthcare work better. And as you can see, I'm not alone.
Ray, do you want to introduce yourself?
Thanks Kai, I'm Ray, Chief Clinical Officer at Lyrebird, and I'm also a practicing gastroenterologist, so just like many of you, I'm in clinic every week seeing patients. My job is to ensure what we build actually works in real clinical practice, not just in theory, not just on average, but in the actual consultation room, with a real clinician, a real patient, and real impact on care.
So, this is our fourth year at DHF. In 2023, our first year, we were in the back corner of Startup Alley in a one metre booth. There was only one AI scribe at DHF, and we spent the majority of our conversations explaining that AI could in fact draft the paperwork for you.
And a few things have changed since then. AI has quickly moved into bigger and more complex parts of healthcare, like decision support, workflow automation, and agents.
By comparison, the Clinical AI Tool that just writes your notes can sound pretty simple. But looking back, I don't think it ever was. Even a seemingly simple clinical note has a meaningful impact on the way care is delivered. It becomes part of the patient record and the basis for what gets followed up, referred, or actioned.
So even the simplest version of this technology was already meaningfully impacting care. And now these tools are moving further upstream, closer to how care is prioritised and delivered.
So from a product perspective, these steps can seem incremental: a better note, a new document, a workflow, a suggested action. But clinically, the downstream impact is exponential: a faster referral, a quicker diagnosis, earlier treatment, even a better outcome. And so the stakes of getting it right, or wrong, grow exponentially at every step.
From where I sit in the clinic, these AI tools feel categorically different to anything I've seen before in medicine. My EMR stores information. My referral system moves clinical details from A to B. In comparison, AI tools generate clinical content, in real time, at scale and across different environments. And that content shapes what a clinician sees and how they act.
Think about what that means in practice. We all know that a very common point of failure in our existing system is on discharge from hospital. Patients often don't get enough information, don't know who to call, or when to call, and even if they do have a phone number, the phone's usually engaged or the clinical team is unavailable. Patients regularly fall through the cracks. It seems like a great use case for an AI communication system.
So a patient is discharged from hospital and the hospital engages an AI system to triage queries and manage communication. Two days later, the patient calls back, short of breath. The system triages it as routine recovery and books a follow-up for next week. But it wasn't deeply connected to the clinical record and didn't know this patient had surgery during the admission. And the system doesn't have the clinical judgement to understand that surgery puts them at high risk of a pulmonary embolism or clots in the lungs, a potentially catastrophic event, or that the shortness of breath is a symptom of this condition.
It didn't know because it was never built to know. It was built to handle calls, not to understand the clinical context behind them. The opportunities here are enormous, but so are the risks if you get it wrong.
In medicine, when the stakes rise, the standard rises with them. That's how we protect patients. The same has to be true here, and that's something Lyrebird learned early on.
Lyrebird started with writing clinical notes. But we pretty quickly realised how connected notes are to everything else, like referrals, care plans and billing, and how much that varies by specialty, setting, region and patient population.
So something generic can't safely support clinical work, and may even cause harm, because you need to understand the specific context the tool is operating in. One example stands out: early on, we found that in some rural settings the tool wasn't reliably documenting social history screening for Indigenous patients. When we looked into this and spoke to the doctors, they told us that in certain patient populations they screened alcohol intake with a more casual framing: "What's a good night of drinking for you?" Lyrebird was interpreting this as social chat, rather than as an intentional clinical data point that's simply screened differently depending on the setting.
This was just one example, but it exposed something incredibly important about the vastness of what "correct" looks like for even just a simple subsection of the notes based on patient population. We were able to catch that because we had two things in place: first, enough real-world use for this kind of variation to surface, and second, the clinical feedback loops to recognise why it mattered and make changes.
That example has stayed with us because it shows how easy it is to miss context in healthcare. The clinician had already done the clinical work. The AI only had to recognise what mattered and document it properly. But these tools are moving upstream: into triage, follow-up, care planning, guideline support, billing, and the systems that determine what happens after the patient leaves the room.
At that point, the risks multiply again. It is no longer just, "Did the AI capture what the clinician said?" It becomes, "Did the AI understand the clinical context well enough to help shape what happens next, safely?"
The future of clinical AI will be defined by those who earn trust in the clinical settings they serve, and keep earning it as the technology takes on more responsibility. So purpose-built AI is not something you build once and declare finished. It is something you keep earning as the product is used in the real world.
The sheer breadth and variation of how medicine is practised across the world is something most people outside of healthcare never fully appreciate. A GP managing chronic disease in rural Queensland is different to a surgical outpatient clinic in London or a paediatric ED at three in the morning. These aren't minor variations. They are completely different clinical environments. Different ways of working. Different terminology. Different patient populations. Different clinical risks. An AI tool that is built generically and gets it right in one can get it dangerously wrong in another, and the clinician using it may never know.
Scaling without intent doesn't solve this. It makes it worse: more contexts you don't understand, more patients you're getting wrong without knowing it. You have to be deliberate about what you do and where you go. The solution is to scale deliberately.
In medicine, we understand this intuitively. A doctor doesn't specialise on day one. But they don't train as nurses and pharmacists either. They go broad within their discipline, enough settings and enough patients to build real clinical judgment, and then they go deep. The breadth makes the depth possible, but it has to be intentional. That's how we think about building clinical AI. You need enough exposure across enough clinical contexts to understand the variation. But you don't try to be everything to everyone. You learn insights from scale then you go deep where it matters most.
This is also core to how we think about deployments. In May already, Lyrebird has helped make paperwork easier for over a million consults in Australian General Practice. In the NHS, we're rolling out the UK's largest deployment of clinical AI across 20,000 clinicians in the South West London Integrated Care Board. That scale reflects a level of confidence in the product. But it also increases the responsibility around it, because more clinicians means more contexts, patients, edge cases, and places where the product needs to keep understanding the reality of care.
Scale matters, but only if you use it properly. More use gives you more signals, but it can also make you overconfident. So part of building this responsibly is deliberately looking for where the product falls down. Not waiting for those problems to find you. Looking for them, taking them seriously, and then checking what you think you know with people outside your organisation.
One example is our deployment at Gold Coast Hospital, the largest deployment of ambient AI in an Australian hospital setting, now rolled out to over 1,500 active clinicians. They wanted to run an independent, peer-reviewed evaluation. We said yes. Not because we knew exactly what it would find, but because we wanted to know. And we wanted it published either way.
Ray, when the findings came back, what stood out to you?
Honestly, my first reaction was excitement. This is the kind of rigorous, real world evaluation we need much more of in this space. And the findings were really interesting. Strong efficiency gains for clinicians. Patients reporting more direct time with their doctors. And Lyrebird notes actually scored higher on average than clinician-written notes on a validated quality instrument. Sounds great, right? But let's look a bit closer.
Yes, the Lyrebird notes scored higher on average on the PDQI-9. But it doesn't mean it's better all the time. And in medicine, you can't average away a clinical signal like that. You have to understand it.
Now, Lyrebird has changed a lot since that research was done. But that doesn't make the findings less valuable. They showed us insights into how clinicians actually use Lyrebird in real conditions that simply could not be seen from the inside. That's exactly what independent real-world research is for. We understand this well in medicine. It's why we have trials. Each phase surfaces things the previous one couldn't. Some things only become visible when an intervention is embedded in real practice, across enough contexts, over enough time. But evidence alone isn't enough. You have to do something with it.
In medicine, you don't get to decide the standard on your own. There are guidelines, professional bodies, and peer review. That infrastructure exists because the stakes are so high. In clinical AI, that infrastructure is still being built. But someone has to start, and you can only set a meaningful standard if you've done the work: built the scale, done the research, and seen what you're missing.
We'd done both. So we asked: what does good actually need to look like? There's no industry standard for this yet, so we built our own. The Clinical Note Evaluation Framework is one example of what that looks like in practice. The framework asks clinicians not to rate notes but to find specific, countable errors. Did this note contain a hallucination? Was there an important omission? A score of four out of five doesn't tell you what went wrong, whereas countable errors give you something you can actually act on.
For hallucinations specifically, not all of them carry the same risk. A patient mentions they've got a sore knee, they've had to take time off work, and they've been resting it at home. The note records "limited mobility due to knee pain." The patient never said that exactly, but most clinicians would consider that reasonable. Now compare that to a note that says "patient denies allergies," when the patient actually reported a penicillin allergy. That's a contradiction and dangerous. A framework that treats these two in the same way is not measuring quality. And what we've learned is that for most of these errors, it's not a model problem. It's a context problem. It's the same lesson as the example Kai shared earlier. If you're not close enough to the clinical context, you can't see the error, let alone fix it.
And it's easy to say your product is safe, responsible or high quality, but ultimately those are just words. At some point, you need to turn that into something specific and operational. This framework is one way of doing that. It is not a final answer, but it gives us a way to be specific about quality, risk, and what happens when the product gets something wrong.
And that brings us back to the standard we should set for clinical AI. If clinical AI is going to shape what gets captured, trusted, or acted on, generic is not good enough. Trust is something a tool has to keep earning, by proving it belongs in the clinical context it's used in. For us, that means building with clinical depth. Deliberate scale. Feedback loops that surface what's missing. Independent evidence that challenges what we think we know. And setting a standard that keeps rising as the role of the technology becomes more consequential.
This is not an abstract question about the future. This year alone, an estimated 42 million patient consultations will involve AI, and that number is only going to increase. So we're not talking about a standard we might accept one day. We're talking about the standard being normalised now. And that matters for everyone in this room, whether you are building, buying, deploying, governing or using these tools in care. We all have a role in shaping the future clinical AI will create. And we all have a stake in that future, because one day, we or someone we love will be on the other side of these tools as a patient. And when that happens, we'll want to know it has been built with the depth and care that clinical work deserves.
For me, it comes down to this. Next week, I'll be in clinic with a patient sitting across from me. And when I use an AI tool in that consultation, I need to know, really know, that it belongs there. If it belongs, it will make me a better doctor for my patients. But if that tool has been built too fast, too broadly, too generically, then it's not helping my patient. It's a risk to my patient.
That's not something I'm willing to accept as a doctor. And it's not what you'd accept if you or someone you love was the patient sitting across from me. If you're in this room, you're part of this conversation. Come find us. We would love to chat. And let's build this exciting future together, responsibly.
An EMR stores information. A referral system moves clinical details from A to B. Clinical AI generates content in real time, at scale, across different environments. And that content shapes what a clinician sees and how they act.
From a product perspective, the steps can seem incremental: a better note, a new document, a workflow, a suggested action. But clinically, the downstream impact is exponential. A faster referral, a quicker diagnosis, earlier treatment, a better outcome. The stakes of getting it right, or wrong, grow at every step.
These tools are also moving upstream, into triage, follow-up, care planning, guideline support, billing, and the systems that determine what happens after the patient leaves the room. At that point, the question changes. It is no longer simply whether the AI captured what the clinician said. It becomes whether the AI understood the clinical context well enough to help shape what happens next, safely.
"In medicine, when the stakes rise, the standard rises with them. That's how we protect patients. The same has to be true here."
Dr. Ray Boyapati, Chief Clinical Officer
The sheer breadth and variation of how medicine is practised across the world is something most people outside of healthcare never fully appreciate. A GP managing chronic disease in rural Queensland operates in a completely different clinical environment to a surgical outpatient clinic in London or a paediatric ED at three in the morning. An AI tool that is built generically and performs well in one can get it dangerously wrong in another, and the clinician using it may never know.
In some rural settings, Lyrebird discovered early that the tool wasn't reliably documenting social history screening for First Nations patients, because it was reading a clinical question framed in a casual, culturally appropriate way as social conversation rather than intentional data collection. The tool hadn't been built badly. It hadn't been built specifically enough.
Generic tools don't fail loudly. They fail quietly, in ways that only become visible if you are close enough to the clinical context to recognise them. The goal is not just to capture what was said. It is to understand what it means in the specific clinical environment it was said in.
Breadth is essential for building real clinical judgment. But breadth without intent makes the problem worse: more contexts you don't understand, more patients you're getting wrong without knowing it.
A doctor doesn't specialise on day one. But they don't train as nurses and pharmacists either. They go broad within their discipline, enough settings and enough patients to build real clinical judgment, then they go deep. The breadth makes the depth possible, but it has to be intentional.
That is how Lyrebird thinks about building clinical AI. Enough exposure across enough clinical contexts to understand the variation. Not everything for everyone, but genuine depth where it matters most.
Scale reflects confidence in the product. But it also increases the responsibility around it. More clinicians means more contexts, more patients, more edge cases, and more places where the product needs to keep understanding the reality of care. Lyrebird treats each deployment not as a finish line, but as an obligation to keep learning.
More use gives you more signals, but it can also make you overconfident. The complexity of clinical AI demands that you go looking for failure, not wait for it to surface. Independent scrutiny reveals what no internal team can see on its own.
At Gold Coast Hospital and Health Service, Lyrebird supported the largest deployment of ambient Clinical AI in an Australian hospital setting, now active across more than 1,500 clinicians. When GCHHS wanted to run an independent, peer-reviewed evaluation, Lyrebird said yes. Not because the outcome was certain, but because knowing the full picture mattered.
The largest peer-reviewed, independent evaluation of ambient Clinical AI in an Australian hospital. The evaluation used the PDQI-9 validated quality instrument over 16 weeks of outpatient practice across multiple specialties.
"Better on average doesn't mean better always. In medicine, you can't average away a clinical signal like that. You have to understand it."
Dr. Ray Boyapati, Chief Clinical Officer
The PDQI-9 distributions showed something important: while Lyrebird notes scored higher on average, there remained cases where the clinician note outperformed the AI-generated one. In medicine, that overlap is not a footnote. It is a signal that demands investigation, not reassurance.
The evaluation also showed that across 16 weeks, many clinicians flagged something as off at least once. That finding is not a failure of the product. It is exactly the kind of real-world signal that no internal testing environment can reliably produce. Each phase of evidence surfaces things the previous one could not. Some things only become visible when an intervention is embedded in real practice, across enough contexts, over enough time.
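The gap between "better on average" and "better always" can be illustrated with a small sketch. The totals below are invented for illustration and are not data from the GCHHS study. The sketch assumes the usual PDQI-9 scoring of nine items rated 1 to 5, and assumes, hypothetically, that each AI-generated note can be paired with a clinician-written note for the same encounter. All variable names are illustrative.

```python
from statistics import mean

# Invented PDQI-9-style totals (nine items scored 1-5, so totals run 9-45).
# These are NOT results from the GCHHS evaluation.
# Each tuple is (ai_note_total, clinician_note_total) for one encounter.
paired_scores = [
    (40, 36),
    (38, 35),
    (41, 37),
    (30, 39),  # the clinician note is clearly better here
    (39, 34),
]

ai_mean = mean(ai for ai, _ in paired_scores)
clinician_mean = mean(clinician for _, clinician in paired_scores)

# The average hides this list; listing the cases is what makes them reviewable.
clinician_better = [
    (ai, clinician) for ai, clinician in paired_scores if clinician > ai
]

print(f"AI mean: {ai_mean:.1f}, clinician mean: {clinician_mean:.1f}")
print(f"Encounters where the clinician note scored higher: {len(clinician_better)}")
```

In this toy data the AI notes win on the mean, yet one encounter shows a large gap in the other direction. That single row is the kind of case the paragraph above says should be investigated rather than averaged away.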
The GCHHS peer-reviewed study, six implementation lessons from the deployment, and a practical vendor evaluation checklist are all available on Lyrebird's research library.
In medicine, you don't get to decide the standard on your own. There are guidelines, professional bodies, and peer review. That infrastructure exists because the stakes are too high to trust self-assessment alone. The same principle applies to clinical AI.
Lyrebird built its Clinical Note Evaluation Framework because there was no industry standard to adopt. The framework asks clinicians not to rate notes on a scale, but to find specific, countable errors. A score of four out of five doesn't tell you what went wrong. Countable errors give you something you can actually act on.
A note that records "limited mobility due to knee pain" when the patient described a sore knee and difficulty getting around is a reasonable clinical inference. A note that states "patient denies allergies" when the patient reported a penicillin allergy is a dangerous contradiction. A framework that treats these the same way is not measuring quality. It is not measuring anything meaningful.
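As a rough sketch of the difference between a rating and countable errors, the two examples above could be recorded as typed findings and tallied. This is not Lyrebird's actual framework: the category names, the `NoteError` record, and the `needs_urgent_review` rule are hypothetical, chosen only to mirror the distinctions the text describes.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    # Hypothetical categories named after the distinctions in the text.
    HALLUCINATION_REASONABLE_INFERENCE = "hallucination (reasonable inference)"
    HALLUCINATION_CONTRADICTION = "hallucination (contradiction)"
    IMPORTANT_OMISSION = "important omission"


@dataclass(frozen=True)
class NoteError:
    """One specific, countable finding made by a clinician reviewer."""
    error_type: ErrorType
    excerpt: str


def count_errors(errors):
    """Tally findings by type instead of collapsing them into one score."""
    return Counter(e.error_type for e in errors)


def needs_urgent_review(errors):
    """Assumed rule: any contradiction of what the patient said is urgent."""
    return any(
        e.error_type is ErrorType.HALLUCINATION_CONTRADICTION for e in errors
    )


# The two examples from the text, recorded against a single reviewed note.
findings = [
    NoteError(ErrorType.HALLUCINATION_REASONABLE_INFERENCE,
              "limited mobility due to knee pain"),
    NoteError(ErrorType.HALLUCINATION_CONTRADICTION,
              "patient denies allergies"),
]

counts = count_errors(findings)
for error_type, n in counts.items():
    print(f"{error_type.value}: {n}")
print("urgent review:", needs_urgent_review(findings))
```

A single four-out-of-five rating would hide which of these two findings occurred. Keeping them as separate counts means the contradiction can be escalated on its own.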
It is easy to say a product is safe, responsible, or high quality. At some point, those words need to become something specific and operational. The framework is one way of doing that. It is not a final answer, but it makes quality measurable, risk gradable, and error recovery possible.
This year alone, an estimated 42 million patient consultations in Australia will involve AI. The standard being normalised in those consultations will shape care for years to come.
An estimated 42 million patient consultations in Australia this year will involve AI. What that means for patients depends on whether the tools behind them were built with the depth and care that clinical work deserves.
Whether you are building, buying, deploying, governing, or using these tools in care, everyone in healthcare has a role in shaping what this future looks like. And everyone has a personal stake in it, because one day, you or someone you love will be on the other side of these tools as a patient.
Purpose-built clinical AI is not a feature. It is a discipline. One that has to keep being earned, through genuine clinical depth, deliberate scale, feedback loops that surface what is missing, and independent evidence that challenges what you think you know.
Purpose-built clinical AI is not something you build once and declare finished. It is something you keep earning as the product is used in the real world, and as the technology takes on more responsibility in care.