The Hallucination Hazard: When AI Scribes Fail in the Clinic

Recent audits in Ontario have revealed a troubling trend in the adoption of AI-powered medical scribes: these systems are routinely failing to capture basic facts, sometimes with dangerous consequences. According to reports, over 60% of evaluated AI scribe systems mixed up prescribed drugs in patient notes, and a significant number of systems fabricated information or suggested treatment plans that were never discussed.

This failure highlights a critical tension in modern healthcare: the desire to reduce administrative burnout for clinicians versus the absolute necessity of medical accuracy. While the promise of AI is to free doctors from the keyboard, the reality is that these tools may be introducing a new, more insidious type of error into the patient record.

The Nature of the Failure: Transcription vs. Interpretation

A core point of confusion—and a primary source of risk—is the distinction between speech-to-text transcription and LLM-based summarization. Traditional transcription software aims to create a literal record of what was said. In contrast, many modern "AI Scribes" use Large Language Models (LLMs) to interpret the conversation and synthesize it into a structured medical note.

This interpretive layer is where the danger lies. LLMs are probabilistic, not deterministic. They don't "understand" medical facts; they predict the next likely token in a sequence. When a conversation is non-linear or nuanced, the AI may fill in gaps with "plausible" but incorrect information based on its training data rather than the actual conversation.

One user shared a harrowing example of this phenomenon:

Diagnosed with Runner's Knee. AI summary said I was diagnosed with osteoporosis, and had hip pain and walking difficulty, though literally none of that was ever said or implied.
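
The gap-filling behavior described above can be illustrated with a toy sampler. This is only a sketch: the probability table is invented for illustration and comes from no real model, and the names `NEXT_TOKEN_PROBS`, `sample_next_token`, and `sample_counts` are hypothetical. A real LLM also conditions on the transcript, but it still picks a continuation by likelihood rather than by checking it against what was said.

```python
import random

# Invented next-token probabilities for the prefix "Diagnosed with".
# These numbers are assumptions for illustration, not measurements.
NEXT_TOKEN_PROBS = {
    "osteoporosis": 0.5,
    "arthritis": 0.3,
    "runner's knee": 0.2,
}


def sample_next_token(probs, rng):
    """Draw one continuation in proportion to its probability."""
    tokens = list(probs)
    weights = [probs[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def sample_counts(probs, n, seed=0):
    """Sample n continuations with a fixed seed and tally them."""
    rng = random.Random(seed)
    counts = {token: 0 for token in probs}
    for _ in range(n):
        counts[sample_next_token(probs, rng)] += 1
    return counts


if __name__ == "__main__":
    print(sample_counts(NEXT_TOKEN_PROBS, 1000))
```

In this toy setup the most probable continuation is chosen most often regardless of what the patient was actually told, which is the failure mode the quoted summary shows.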

The "Almost Right" Trap

Technical observers note that AI often fails in a uniquely deceptive way. Because the output is grammatically correct and professionally phrased, it can easily pass a cursory human review. This is sometimes referred to as the "zombie" effect: the output looks nearly right, but a small, critical detail is fundamentally wrong.

This is not limited to healthcare. In corporate environments, LLM note-takers have been known to misrepresent promises made during vendor negotiations or miss the nuance of technical back-and-forth discussions. The danger is that when a tool is right, say, 90% of the time, users come to trust it, and that trust makes them less likely to catch the remaining 10% of cases, where an error could be catastrophic in a medical context.

The Human Element: Error Rates and Oversight

Some argue that human doctors are also prone to error, and that the AI figures might be comparable to the historical rate of human error in medical records. The comparison is not straightforward: the 60% figure describes the share of evaluated systems that mixed up drugs, not the share of notes containing an error. The nature of the error also differs. A human error is often an omission; an AI error is frequently a fabrication (a hallucination).

Furthermore, the burden of correction is shifting. Instead of writing a note, doctors are now spending time auditing an AI-generated note. As one patient in Toronto observed:

My doctor always asks me if they can use the AI note taker... At the end of the consultation she goes over the notes and corrects it, often complaining to me about having to talk more to the computer than to me.

Mitigating the Risk

To prevent AI scribes from becoming a liability, several safeguards have been proposed by practitioners and users:

Timestamped Links: Integrating summaries with the original audio recording, allowing clinicians to click a specific note and hear exactly what was said at that moment.
Strict Transcription First: Moving away from immediate summarization and relying on high-fidelity transcripts as the primary record, using AI only for optional organization.
Explicit Disclaimers: Clearly stating at the beginning of meetings that AI is being used and that its interpretations may be inaccurate.
Patient Verification: Encouraging patients to review their own transcripts and summaries to ensure their medical history is not being corrupted by hallucinations.
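
The first two safeguards can be sketched together: keep the timestamped transcript as the primary record and attach each sentence of the note to the transcript segments that might support it. This is a minimal sketch under stated assumptions, not a description of any real scribe product. The `Segment` type, `link_claim`, `audit_note`, the stopword list, and the word-overlap threshold are all hypothetical choices made for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One timestamped span of the transcript (times in seconds)."""
    start_s: float
    end_s: float
    text: str


# Small illustrative stopword list; a real system would need a better one.
STOPWORDS = {"a", "an", "the", "with", "in", "of", "for", "and", "so", "it"}


def _content_words(text):
    """Lowercase the text and return its words, minus stopwords."""
    return set(re.findall(r"[a-z']+", text.lower())) - STOPWORDS


def link_claim(claim, segments, min_overlap=2):
    """Return segments sharing at least min_overlap content words with claim."""
    claim_words = _content_words(claim)
    return [
        seg for seg in segments
        if len(claim_words & _content_words(seg.text)) >= min_overlap
    ]


def audit_note(claims, segments):
    """Map each note sentence to candidate segments; an empty list needs review."""
    return {claim: link_claim(claim, segments) for claim in claims}


if __name__ == "__main__":
    transcript = [
        Segment(12.0, 18.5, "It looks like runner's knee, so rest the knee for two weeks."),
        Segment(40.0, 44.0, "Any pain in the hip? No, none."),
    ]
    note = ["Diagnosed with runner's knee.", "Diagnosed with osteoporosis."]
    for claim, support in audit_note(note, transcript).items():
        spans = [(s.start_s, s.end_s) for s in support]
        print(claim, "->", spans if spans else "NO SUPPORT FOUND")
```

A sentence with no linked segment is a candidate fabrication. A link is not proof of correctness, though: word overlap cannot see negation, so a note sentence such as "Patient has hip pain." would still link to the segment in which the patient denies hip pain. The link only tells the clinician which moment of the recording to replay.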

Conclusion

The Ontario audit serves as a warning that AI is being deployed in high-stakes environments before the technology can deliver the factual accuracy that medical records require. While the efficiency gains are tempting, no healthcare system can afford the risk of a fabricated diagnosis or a misidentified medication. Until these models can move from probabilistic guessing to deterministic accuracy, the human must remain the primary author, and the most skeptical editor, of the medical record.
