DeepScore: Measuring the performance of ambient AI clinical documentation

One of the greatest and most universal benefits of ambient AI medical scribing is reflected in a single word: time. As ambient AI captures every moment of a conversation with patients, clinicians simply have more of it. Time to get home and see family. Time to care for more patients. Time free from the pressures of documentation.

But ambient AI medical documentation is only as good as its last note. How do we measure the quality and accuracy of that note before it's entered into the EHR?

Anyone developing and running AI models for healthcare should be able to answer that question, using key performance indicators (KPIs) as a standard of measurement. They should also be transparent about the metrics related to the quality of their generative AI output.

At DeepScribe, we take trust and transparency seriously. It’s even built right into our ambient AI technology: Clinicians can highlight any text snippet within the AI-generated note to see the origin of that information in the transcribed conversation.

With accountability and clarity in mind, we set out to create a methodology to analyze the quality of ambient AI for clinical documentation and then share the results of the DeepScribe solution specifically. This is the focus of our new report, DeepScore: A Comprehensive Approach to Measuring Quality in AI-Generated Clinical Documentation.

What is DeepScore?

Led by DeepScribe Senior Data Scientist Jon Oleson, the DeepScore report identifies six crucial elements of ambient AI output for clinical documentation and establishes a score for each one. In addition to illustrating DeepScribe’s AI performance, this methodology offers a framework that both ambient AI consumers and companies can refer to.

The name "DeepScore" refers to a composite score of these six elements. As Jon explains in the report, the DeepScore metric represents "the overall quality of autonomous transcription and scribing, guiding continuous improvement in patient care documentation."

In short, we see DeepScore as a report card of sorts, a foundation for ambient AI excellence – and a guiding light to continually elevate AI standards and performance.

Tracking the value of AI medical documentation

Despite the critical role documentation plays in care delivery – especially chronic and complex care – there are very few studies that examine the accuracy and quality of AI scribes. As AI adoption continues at a relatively rapid pace, however, this measurement is indispensable. It provides timely insights into performance and can aid the decision-making process for healthcare organizations choosing an ambient AI solution.

When ambient AI scribing is examined, it continues to emerge as far more efficient than other charting options – especially when its accuracy and data capture are weighed alongside the proven time savings that clinical teams are experiencing.

Human scribes have provided welcome relief to overworked clinicians but aren’t able to apply the level of insight that an AI model can tap into and deliver within minutes, creating accurate, clinically valuable notes. As an example, the DeepScribe platform has data from more than five million conversations with patients. (Related: one study has shown that four out of seven clinicians using human scribes will spend "significantly more" time on documentation post-visit.)

In a 2023 article in the journal Perspectives in Health Information Management, author George A. Gellert, MD, discusses the value of AI amid the massive volume of new medical information that emerges during a physician's career, saying, "AI can potentially integrate and apply this expanding evidence and knowledge base in near real time at the patient level to inform specific episodic patient care delivery, while also informed by what will soon be decades of individual patient (and populational) past medical history EHR data."

Further, a 2024 article in the journal Nature Medicine determined that large language models (LLMs) generated better clinical summaries than medical experts, with more than one-third of summaries defined as "superior."

Summary: The DeepScore AI clinical documentation quality report

Here are the six metrics identified within the report, a definition of each, and the score for the DeepScribe ambient AI model. For complete details, see the full report.

Category: Frequency of significant errors

Major Defect-Free Rate: 95.9%

Percentage of medically relevant content that’s free of major defects (i.e. would require immediate correction to avoid risk)

Critical Defect-Free Rate: 100.0%

Percentage of medically relevant content that’s free of critical defects (i.e. could lead to serious adverse patient outcomes)

Category: Relevance and precision of captured medical information

Captured Entity Rate (Recall): 90.2%

Percentage of medically relevant information captured in the AI-generated note

Accurate Entity Rate (Precision): 96.2%

Percentage of medically relevant information correctly categorized and summarized
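
To make the two rates in this category concrete, here is a minimal Python sketch of recall and precision over medical entities. The `entity_rates` helper and both entity sets are hypothetical, and exact-match set comparison is a simplifying assumption; the annotation and matching process used in the report may differ.

```python
def entity_rates(reference: set[str], note: set[str]) -> tuple[float, float]:
    """Return (captured entity rate, accurate entity rate) as fractions.

    reference: entities a reviewer marked as medically relevant in the visit
    note:      entities that appear in the AI-generated note
    Both sets and the exact-match comparison are illustrative assumptions.
    """
    if not reference or not note:
        raise ValueError("both entity sets must be non-empty")
    matched = reference & note
    recall = len(matched) / len(reference)   # share of relevant info captured
    precision = len(matched) / len(note)     # share of note content that is correct
    return recall, precision


# Hypothetical example data, not taken from the report.
reference = {
    "hypertension",
    "lisinopril 10 mg",
    "dry cough",
    "no known drug allergies",
    "follow-up in 4 weeks",
}
note = {"hypertension", "lisinopril 10 mg", "dry cough", "type 2 diabetes"}

recall, precision = entity_rates(reference, note)
print(f"Captured Entity Rate (recall): {recall:.1%}")       # 60.0%
print(f"Accurate Entity Rate (precision): {precision:.1%}")  # 75.0%
```

In this toy case the note misses two relevant entities, which lowers recall, and includes one unsupported entity, which lowers precision.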

Category: User Acceptance

Minimally Edited Note Rate: 95.0%

Percentage of notes in which the clinician substituted fewer than 10% of the words
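
One way to picture this metric is a word-level diff between the AI draft and the note the clinician signs. The sketch below is only an illustration under stated assumptions: the function names are hypothetical, and counting words the clinician replaced or inserted, divided by the length of the final note, is our reading of the definition, not a formula confirmed by the report.

```python
from difflib import SequenceMatcher


def substituted_fraction(ai_draft: str, final_note: str) -> float:
    """Fraction of words in the final note that the clinician replaced or inserted."""
    draft_words = ai_draft.split()
    final_words = final_note.split()
    if not final_words:
        return 0.0
    matcher = SequenceMatcher(a=draft_words, b=final_words, autojunk=False)
    changed = sum(
        j2 - j1
        for tag, _i1, _i2, j1, j2 in matcher.get_opcodes()
        if tag in ("replace", "insert")
    )
    return changed / len(final_words)


def minimally_edited_note_rate(pairs, threshold=0.10):
    """Share of (draft, final) pairs whose substituted fraction is below threshold."""
    if not pairs:
        raise ValueError("at least one (draft, final) pair is required")
    minimally_edited = sum(
        substituted_fraction(draft, final) < threshold for draft, final in pairs
    )
    return minimally_edited / len(pairs)


# Hypothetical example notes, not taken from the report.
pairs = [
    (
        "Patient reports mild headache for three days without nausea or vomiting today",
        "Patient reports moderate headache for three days without nausea or vomiting today",
    ),
    ("Continue current medication", "Stop lisinopril and start losartan"),
]
rate = minimally_edited_note_rate(pairs)
print(f"Minimally Edited Note Rate: {rate:.1%}")
```

The first pair changes one word out of twelve, so it counts as minimally edited; the second is rewritten entirely and does not.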

Category: Transcription Quality Control

Medical Word Hit Rate: 95.3%

Percentage of medical terms correctly transcribed from clearly audible source audio

Overall DeepScore: 95.4%

Condenses the individual metrics into a single score, for a quick understanding of overall quality
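
This post does not spell out how the six metrics are combined, so the Python sketch below is only an illustration. It assumes an unweighted arithmetic mean, which happens to round to the published 95.4% when applied to the six scores listed above; the full report may use a different aggregation or weighting.

```python
from statistics import mean

# The six scores as published in this post.
published_scores = {
    "Major Defect-Free Rate": 95.9,
    "Critical Defect-Free Rate": 100.0,
    "Captured Entity Rate (Recall)": 90.2,
    "Accurate Entity Rate (Precision)": 96.2,
    "Minimally Edited Note Rate": 95.0,
    "Medical Word Hit Rate": 95.3,
}

# Assumption: equal weights. This is not confirmed as the report's formula.
overall = mean(published_scores.values())
print(f"Overall (unweighted mean): {overall:.1f}%")  # 95.4%
```

Worked by hand, the six values sum to 572.6, and 572.6 / 6 ≈ 95.43.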

Conclusion and Recommendations

As the care delivery workflow continues to evolve, and a growing number of healthcare organizations implement ambient AI scribing, metrics like those in the DeepScore methodology become more valuable.

For clinicians and healthcare leaders exploring ambient AI options, part of the preparation must be understanding the quality of the technology they’re bringing into their practices. We recommend taking the following steps:

Check out the DeepScore report. You’ll see the details behind the six metrics summarized in this post.

As you consider any ambient AI solution, ask how the vendor measures quality and accuracy. See how those measurements compare, and be sure that they meet your standards.

Remember that your organization should see real results from high-quality notes, including:

a. Each clinician, including specialists, spends minimal time reviewing and closing charts.

b. The accuracy and efficiency translate to high rates of clinician adoption of the technology.

With the DeepScore report, we aim to promote transparency and accountability for the quality of AI scribing – and, toward that end, share our own AI clinical documentation performance by the numbers.

To learn more: Download the full DeepScore report here.

See how DeepScribe can transform your clinical documentation process, improving the care experience at your organization.
