Keynote  ·  Digital Health Festival 2026  ·  Melbourne

Building responsibly as the capability of AI expands

Clinical AI is moving into bigger, more consequential parts of healthcare. In this keynote, Lyrebird Health sets out why purpose-built, deliberately scaled, and independently evaluated clinical AI is the only appropriate response to what the stakes demand.

Kai Van Lieshout, Founder and CEO of Lyrebird Health
Dr. Ray Boyapati, Chief Clinical Officer at Lyrebird Health

About this keynote
  • Event: Digital Health Festival 2026
  • Location: Melbourne, Australia
  • Date: May 2026
  • Speakers: Kai Van Lieshout & Dr. Ray Boyapati
  • Format: Watch · Listen · Read

Topics covered
Purpose-built AI · Scale deliberately · Independent research · Clinical standards · GCHHS evaluation

Full keynote transcript  ·  DHF 2026

Kai Van Lieshout

Hi, I'm Kai, the founder and CEO of Lyrebird Health. We're on a mission to extend the human healthspan by building AI that helps healthcare work better. And as you can see, I'm not alone.

Ray, do you want to introduce yourself?

Dr. Ray Boyapati

Thanks Kai, I'm Ray, Chief Clinical Officer at Lyrebird, and I'm also a practicing gastroenterologist, so just like many of you, I'm in clinic every week seeing patients. My job is to ensure what we build actually works in real clinical practice, not just in theory, not just on average, but in the actual consultation room, with a real clinician, a real patient, and real impact on care.

Kai Van Lieshout

So, this is our fourth year at DHF. In 2023, our first year, we were in the back corner of Startup Alley in a one-metre booth. There was only one AI scribe at DHF, and we spent the majority of our conversations explaining that AI could in fact draft the paperwork for you.

And a few things have changed since then. AI has quickly moved into bigger and more complex parts of healthcare, like decision support, workflow automation, and agents.

Slide: AI tools are categorically different from anything clinicians have used before
Kai Van Lieshout

By comparison, the clinical AI tool that just writes your notes can sound pretty simple. But looking back, I don't think it ever was. Even a seemingly simple clinical note has a meaningful impact on the way care is delivered. It becomes part of the patient record and the basis for what gets followed up, referred, or actioned.

So even the simplest version of this technology was already meaningfully impacting care. And now these tools are moving further upstream, closer to how care is prioritised and delivered.

So from a product perspective, these steps can seem incremental: a better note, a new document, a workflow, a suggested action. But clinically, the downstream impact is exponential: a faster referral, a quicker diagnosis, earlier treatment, even a better outcome. And so the stakes of getting it right, or wrong, grow exponentially at every step.

Dr. Ray Boyapati

From where I sit in the clinic, these AI tools feel categorically different to anything I've seen before in medicine. My EMR stores information. My referral system moves clinical details from A to B. In comparison, AI tools generate clinical content, in real time, at scale and across different environments. And that content shapes what a clinician sees and how they act.

Think about what that means in practice. We all know that a very common point of failure in our existing system is on discharge from hospital. Patients often don't get enough information, don't know who to call, or when to call, and even if they do have a phone number, the phone's usually engaged or the clinical team is unavailable. Patients regularly fall through the cracks. It seems like a great use case for an AI communication system.

So a patient is discharged from hospital and the hospital engages an AI system to triage queries and manage communication. Two days later, the patient calls back, short of breath. The system triages it as routine recovery and books a follow-up for next week. But it wasn't deeply connected to the clinical record and didn't know this patient had surgery during the admission. And the system doesn't have the clinical judgement to understand that surgery puts them at high risk of a pulmonary embolism or clots in the lungs, a potentially catastrophic event, or that the shortness of breath is a symptom of this condition.

It didn't know because it was never built to know. It was built to handle calls, not to understand the clinical context behind them. The opportunities here are enormous, but so are the risks if you get it wrong.

In medicine, when the stakes rise, the standard rises with them. That's how we protect patients. The same has to be true here, and that's something Lyrebird learned early on.

Slide: The vastness of clinical variation heightens clinical risk
Kai Van Lieshout

Lyrebird started with writing clinical notes. But we pretty quickly realised how connected notes are to everything else (referrals, care plans, billing) and how much that varies by specialty, setting, region and patient population.

So something generic can't safely support clinical work, and may even cause harm, because you need to understand the specific context the tool is operating in. One example stands out: early on, we found that in some rural settings, the tool wasn't reliably documenting social history screening for Indigenous patients. When we looked into this and spoke to the doctors, their exact words were that in certain patient populations, they screened alcohol intake with a more casual framing: "What's a good night of drinking for you?" Lyrebird was interpreting this as social chat, rather than an intentional clinical datapoint that's simply screened differently depending on the setting.

This was just one example, but it exposed something incredibly important about the vastness of what "correct" looks like for even just a simple subsection of the notes based on patient population. We were able to catch that because we had two things in place: first, enough real-world use for this kind of variation to surface, and second, the clinical feedback loops to recognise why it mattered and make changes.

Slide: The stakes are too great
Kai Van Lieshout

That example has stayed with us because it shows how easy it is to miss context in healthcare. The clinician had already done the clinical work. The AI only had to recognise what mattered and document it properly. But these tools are moving upstream: into triage, follow-up, care planning, guideline support, billing, and the systems that determine what happens after the patient leaves the room.

At that point, the risks multiply again. It is no longer just, "Did the AI capture what the clinician said?" It becomes, "Did the AI understand the clinical context well enough to help shape what happens next, safely?"

The future of clinical AI will be defined by those who earn trust in the clinical settings they serve, and keep earning it as the technology takes on more responsibility. So purpose-built AI is not something you build once and declare finished. It is something you keep earning as the product is used in the real world.

Slide: Our operating principle: Scale Deliberately · Gather Insights · Define Standard
Dr. Ray Boyapati

The sheer breadth and variation of how medicine is practised across the world is something most people outside of healthcare never fully appreciate. A GP managing chronic disease in rural Queensland is different to a surgical outpatient clinic in London or a paediatric ED at three in the morning. These aren't minor variations. They are completely different clinical environments. Different ways of working. Different terminology. Different patient populations. Different clinical risks. An AI tool that is built generically and gets it right in one can get it dangerously wrong in another, and the clinician using it may never know.

Scaling without intent doesn't solve this. It makes it worse: more contexts you don't understand, more patients you're getting wrong without knowing it. You have to be deliberate about what you do and where you go. The solution is to scale deliberately.

Slide: Scale deliberately
Dr. Ray Boyapati

In medicine, we understand this intuitively. A doctor doesn't specialise on day one. But they don't train as nurses and pharmacists either. They go broad within their discipline, enough settings and enough patients to build real clinical judgment, and then they go deep. The breadth makes the depth possible, but it has to be intentional. That's how we think about building clinical AI. You need enough exposure across enough clinical contexts to understand the variation. But you don't try to be everything to everyone. You gather insights from scale, then you go deep where it matters most.

Kai Van Lieshout

This is also core to how we think about deployments. In May already, Lyrebird has helped make paperwork easier for over a million consults in Australian General Practice. In the NHS, we're rolling out the UK's largest deployment of clinical AI across 20,000 clinicians in the South West London Integrated Care Board. That scale reflects a level of confidence in the product. But it also increases the responsibility around it, because more clinicians means more contexts, patients, edge cases, and places where the product needs to keep understanding the reality of care.

Slide: Augment internal insights with independent research
Kai Van Lieshout

Scale matters, but only if you use it properly. More use gives you more signals, but it can also make you overconfident. So part of building this responsibly is deliberately looking for where the product falls down. Not waiting for those problems to find you. Looking for them, taking them seriously, and then checking what you think you know with people outside your organisation.

One example is our deployment at Gold Coast Hospital, the largest deployment of ambient AI in an Australian hospital setting, now rolled out to over 1,500 active clinicians. They wanted to run an independent, peer-reviewed evaluation. We said yes. Not because we knew exactly what it would find, but because we wanted to know. And we wanted it published either way.

Ray, when the findings came back, what stood out to you?

Dr. Ray Boyapati

Honestly, my first reaction was excitement. This is the kind of rigorous, real-world evaluation we need much more of in this space. And the findings were really interesting. Strong efficiency gains for clinicians. Patients reporting more direct time with their doctors. And Lyrebird notes actually scored higher on average than clinician-written notes on a validated quality instrument. Sounds great, right? But let's look a bit closer.

Yes, the Lyrebird notes scored higher on average on the PDQI-9. But it doesn't mean it's better all the time. And in medicine, you can't average away a clinical signal like that. You have to understand it.

Now, Lyrebird has changed a lot since that research was done. But that doesn't make the findings less valuable. They showed us insights into how clinicians actually use Lyrebird in real conditions that simply could not be seen from the inside. That's exactly what independent real-world research is for. We understand this well in medicine. It's why we have trials. Each phase surfaces things the previous one couldn't. Some things only become visible when an intervention is embedded in real practice, across enough contexts, over enough time. But evidence alone isn't enough. You have to do something with it.

Slide: Define the standard that healthcare deserves
Dr. Ray Boyapati

In medicine, you don't get to decide the standard on your own. There are guidelines, professional bodies, and peer review. That infrastructure exists because the stakes are so high. In clinical AI, that infrastructure is still being built. But someone has to start, and you can only set a meaningful standard if you've done the work: built the scale, done the research, and seen what you're missing.

We'd done both. So we asked: what does good actually need to look like? There's no industry standard for this yet, so we built our own. The Clinical Note Evaluation Framework is one example of what that looks like in practice. The framework asks clinicians not to rate notes but to find specific, countable errors. Did this note contain a hallucination? Was there an important omission? A score of four out of five doesn't tell you what went wrong, whereas countable errors give you something you can actually act on.

For hallucinations specifically, not all of them carry the same risk. A patient mentions they've got a sore knee, they've had to take time off work, and they've been resting it at home. The note records "limited mobility due to knee pain." The patient never said that exactly, but most clinicians would consider that reasonable. Now compare that to a note that says "patient denies allergies," when the patient actually reported a penicillin allergy. That's a contradiction and dangerous. A framework that treats these two in the same way is not measuring quality. And what we've learned is that for most of these errors, it's not a model problem. It's a context problem. It's the same lesson as the example Kai shared earlier. If you're not close enough to the clinical context, you can't see the error, let alone fix it.

Kai Van Lieshout

And it's easy to say your product is safe, responsible or high quality, but ultimately those are just words. At some point, you need to turn that into something specific and operational. This framework is one way of doing that. It is not a final answer, but it gives us a way to be specific about quality, risk, and what happens when the product gets something wrong.

Slide: 42 million patient consultations · Closing
Kai Van Lieshout

And that brings us back to the standard we should set for clinical AI. If clinical AI is going to shape what gets captured, trusted, or acted on, generic is not good enough. It's something a tool has to keep earning by proving it belongs in the clinical context it's used in. For us, that means building with clinical depth. Deliberate scale. Feedback loops that surface what's missing. Independent evidence that challenges what we think we know. And setting a standard that keeps rising as the role of the technology becomes more consequential.

This is not an abstract question about the future. This year alone, an estimated 42 million patient consultations will involve AI, and that number is only going to increase. So we're not talking about a standard we might accept one day. We're talking about the standard being normalised now. And that matters for everyone in this room, whether you are building, buying, deploying, governing or using these tools in care. We all have a role in shaping the future clinical AI will create. And we all have a stake in that future, because one day, we or someone we love will be on the other side of these tools as a patient. And when that happens, we'll want to know it has been built with the depth and care that clinical work deserves.

Dr. Ray Boyapati

For me, it comes down to this. Next week, I'll be in clinic with a patient sitting across from me. And when I use an AI tool in that consultation, I need to know, really know, that it belongs there. If it belongs, it will make me a better doctor for my patients. But if that tool has been built too fast, too broadly, too generically, then it's not helping my patient. It's a risk to my patient.

That's not something I'm willing to accept as a doctor. And it's not what you'd accept if you or someone you love was the patient sitting across from me. If you're in this room, you're part of this conversation. Come find us. We would love to chat. And let's build this exciting future together, responsibly.

1 · The stakes

Clinical AI is categorically different from anything clinicians have used before

An EMR stores information. A referral system moves clinical details from A to B. Clinical AI generates content in real time, at scale, across different environments. And that content shapes what a clinician sees and how they act.

From a product perspective, the steps can seem incremental: a better note, a new document, a workflow, a suggested action. But clinically, the downstream impact is exponential. A faster referral, a quicker diagnosis, earlier treatment, a better outcome. The stakes of getting it right, or wrong, grow at every step.

These tools are also moving upstream, into triage, follow-up, care planning, guideline support, billing, and the systems that determine what happens after the patient leaves the room. At that point, the question changes. It is no longer simply whether the AI captured what the clinician said. It becomes whether the AI understood the clinical context well enough to help shape what happens next, safely.

"In medicine, when the stakes rise, the standard rises with them. That's how we protect patients. The same has to be true here."

Dr. Ray Boyapati, Chief Clinical Officer

The sheer breadth and variation of how medicine is practised across the world is something most people outside of healthcare never fully appreciate. A GP managing chronic disease in rural Queensland operates in a completely different clinical environment to a surgical outpatient clinic in London or a paediatric ED at three in the morning. A generically built AI tool that performs well in one can get it dangerously wrong in another, and the clinician using it may never know.

In some rural settings, Lyrebird discovered early that the tool wasn't reliably documenting social history screening for First Nations patients, because it was reading a clinical question framed in a casual, culturally appropriate way as social conversation rather than intentional data collection. The tool hadn't been built badly. It hadn't been built specifically enough.

Why generic is not good enough

Generic tools don't fail loudly. They fail quietly, in ways that only become visible if you are close enough to the clinical context to recognise them. The goal is not just to capture what was said. It is to understand what it means in the specific clinical environment it was said in.

2 · Our operating principle, part one

Scale deliberately, not carelessly

Breadth is essential for building real clinical judgment. But breadth without intent makes the problem worse: more contexts you don't understand, more patients you're getting wrong without knowing it.

A doctor doesn't specialise on day one. But they don't train as nurses and pharmacists either. They go broad within their discipline, enough settings and enough patients to build real clinical judgment, then they go deep. The breadth makes the depth possible, but it has to be intentional.

That is how Lyrebird thinks about building clinical AI. Enough exposure across enough clinical contexts to understand the variation. Not everything for everyone, but genuine depth where it matters most.

  • 1M+ consultations supported in Australian General Practice in May 2026 alone
  • 20K NHS clinicians in the South West London Integrated Care Board rollout
  • 24+ specialties supported, each built with specialty-specific clinical depth

Scale and responsibility

Scale reflects confidence in the product. But it also increases the responsibility around it. More clinicians means more contexts, more patients, more edge cases, and more places where the product needs to keep understanding the reality of care. Lyrebird treats each deployment not as a finish line, but as an obligation to keep learning.

3 · Our operating principle, part two

Search for what's broken, not what you already know

More use gives you more signals, but it can also make you overconfident. The complexity of clinical AI demands that you go looking for failure, not wait for it to surface. Independent scrutiny reveals what no internal team can see on its own.

At Gold Coast Hospital and Health Service, Lyrebird supported the largest deployment of ambient Clinical AI in an Australian hospital setting, now active across more than 1,500 clinicians. When GCHHS wanted to run an independent, peer-reviewed evaluation, Lyrebird said yes. Not because the outcome was certain, but because knowing the full picture mattered.

GCHHS Evaluation · Memon et al. BMC Health Services Research, 2025

The largest peer-reviewed, independent evaluation of ambient Clinical AI in an Australian hospital. The evaluation used the PDQI-9 validated quality instrument across 16 weeks of outpatient practice across multiple specialties.

  • 34.6/40: average PDQI-9 score for clinician-written notes
  • 37.1/40: average PDQI-9 score for Lyrebird-generated notes
  • 80% of clinicians reported time savings in documentation

"Better on average doesn't mean better always. In medicine, you can't average away a clinical signal like that. You have to understand it."

Dr. Ray Boyapati, Chief Clinical Officer

The PDQI-9 distributions showed something important: while Lyrebird notes scored higher on average, there remained cases where the clinician note outperformed the AI-generated one. In medicine, that overlap is not a footnote. It is a signal that demands investigation, not reassurance.
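To make the point about averages concrete, here is a minimal Python sketch. The scores are invented for illustration and are not data from the GCHHS evaluation; the paired, per-consultation comparison and the name `compare_paired` are likewise assumptions made for the example, not a description of how the study was analysed.

```python
from statistics import mean

# Hypothetical quality scores for the same five consultations.
# Illustrative values only, NOT data from the GCHHS evaluation.
clinician_scores = [30, 34, 36, 33, 32]
ai_scores = [36, 35, 34, 37, 38]


def compare_paired(clinician, ai):
    """Return the AI's mean advantage and the indices of the
    consultations where the clinician-written note scored higher."""
    if len(clinician) != len(ai):
        raise ValueError("score lists must be the same length")
    mean_diff = mean(ai) - mean(clinician)
    clinician_better = [
        i for i, (c, a) in enumerate(zip(clinician, ai)) if c > a
    ]
    return mean_diff, clinician_better


mean_diff, clinician_better = compare_paired(clinician_scores, ai_scores)
print(f"AI mean advantage: {mean_diff:.1f}")
print(f"Consultations where the clinician note scored higher: {clinician_better}")
```

In this toy data the AI notes lead on average, yet one consultation still favours the clinician note. The average alone would hide that case, which is the one worth investigating.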

The evaluation also showed that across 16 weeks, many clinicians flagged something as off at least once. That finding is not a failure of the product. It is exactly the kind of real-world signal that no internal testing environment can reliably produce. Each phase of evidence surfaces things the previous one could not. Some things only become visible when an intervention is embedded in real practice, across enough contexts, over enough time.

Read the full evaluation

The GCHHS peer-reviewed study, six implementation lessons from the deployment, and a practical vendor evaluation checklist are all available on Lyrebird's research library.

4 · Our operating principle, part three

Define the standard that healthcare deserves

In medicine, you don't get to decide the standard on your own. There are guidelines, professional bodies, and peer review. That infrastructure exists because the stakes are too high to trust self-assessment alone. The same principle applies to clinical AI.

Lyrebird built its Clinical Note Evaluation Framework because there was no industry standard to adopt. The framework asks clinicians not to rate notes on a scale, but to find specific, countable errors. A score of four out of five doesn't tell you what went wrong. Countable errors give you something you can actually act on.

  • 01 · Hallucinations (counted): content in the note not present in or inferable from the transcript.
  • 02 · Accuracy errors (graded): content discussed but captured incorrectly, graded by clinical safety impact.
  • 03 · Omissions (counted): clinically relevant content discussed but absent from the note.
  • 04 · Irrelevant inclusions (counted): content in the note that wasn't relevant to the clinical encounter.
  • 05 · Formatting issues (counted): structural or presentation problems that reduce clinical utility.

Not all hallucinations carry the same risk

A note that records "limited mobility due to knee pain" when the patient described a sore knee and difficulty getting around is a reasonable clinical inference. A note that states "patient denies allergies" when the patient reported a penicillin allergy is a dangerous contradiction. A framework that treats these the same way is not measuring quality. It is not measuring anything meaningful.
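One way to picture countable errors, and the gap in risk between the two examples above, is the following minimal Python sketch. It is not Lyrebird's implementation: the class names, the `summarise` helper, and the two-level `Risk` scale are hypothetical, introduced only to show how findings could be tallied by category while keeping dangerous ones separate.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    # The five error categories named in the framework.
    HALLUCINATION = "hallucination"
    ACCURACY_ERROR = "accuracy error"
    OMISSION = "omission"
    IRRELEVANT_INCLUSION = "irrelevant inclusion"
    FORMATTING_ISSUE = "formatting issue"


class Risk(Enum):
    # Hypothetical two-level scale, not the framework's actual grading.
    LOW = 1
    HIGH = 2


@dataclass(frozen=True)
class Finding:
    category: Category
    detail: str
    risk: Risk


def summarise(findings):
    """Count findings per category and collect the high-risk ones."""
    counts = Counter(f.category for f in findings)
    high_risk = [f for f in findings if f.risk is Risk.HIGH]
    return counts, high_risk


# The two examples from the keynote, recorded as reviewer findings.
findings = [
    Finding(Category.HALLUCINATION,
            "'limited mobility due to knee pain' inferred from context",
            Risk.LOW),
    Finding(Category.HALLUCINATION,
            "'patient denies allergies' contradicts a reported penicillin allergy",
            Risk.HIGH),
]

counts, high_risk = summarise(findings)
print(counts[Category.HALLUCINATION])
print([f.detail for f in high_risk])
```

A single rating would fold both findings into one number. Keeping the count and the risk level separate lets a reviewer see that there were two findings and that only one of them is dangerous.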

It is easy to say a product is safe, responsible, or high quality. At some point, those words need to become something specific and operational. The framework is one way of doing that. It is not a final answer, but it makes quality measurable, risk gradable, and error recovery possible.

5 · What this means now

The standard isn't being set one day. It is being set now.

This year alone, an estimated 42 million patient consultations in Australia will involve AI. The standard being normalised in those consultations will shape care for years to come.

42 million patient consultations

in Australia this year will involve AI. What that means for patients depends on whether the tools behind them were built with the depth and care that clinical work deserves.

Whether you are building, buying, deploying, governing, or using these tools in care, everyone in healthcare has a role in shaping what this future looks like. And everyone has a personal stake in it, because one day, you or someone you love will be on the other side of these tools as a patient.

Purpose-built clinical AI is not a feature. It is a discipline. One that has to keep being earned, through genuine clinical depth, deliberate scale, feedback loops that surface what is missing, and independent evidence that challenges what you think you know.

  • 01 · Scale Deliberately
  • 02 · Gather Insights
  • 03 · Define Standard

A discipline, not a declaration

Purpose-built clinical AI is not something you build once and declare finished. It is something you keep earning as the product is used in the real world, and as the technology takes on more responsibility in care.