AIPatient: LLM-Powered Simulated Patient for Clinical Education & Research

High-fidelity simulations that blend Reasoning Retrieval-Augmented Generation (Reasoning RAG), a six-agent workflow, and a patient Knowledge Graph built from de-identified MIMIC-III data.

Reasoning RAG
AIPatient KG
Six LLM Agents
AIPatient overview: Data & Model → AIPatient System → Potential Applications
Figure 1. Data & Model → AIPatient System → Potential Applications. (Tip: Click any image/GIF to zoom.)

Medical Student User Study

A paired crossover study compared medical students' interactions with AIPatient against human-simulated patients (H-SPs). Below are example recordings showing conversation flow and interaction fidelity, as well as group usage clips from class sessions.

Demo screen recording
Demo screen recording (GIF)
Group recording
Group recording (GIF)

Overview

Paper Abstract (verbatim)

Simulated patient systems play an important role in modern medical education and research, providing safe, integrative medical training environments and supporting clinical decision-making simulations. Progressive Artificial Intelligence (AI) technologies, such as Large Language Models (LLM), could advance simulated patient systems by replicating medical conditions and patient-doctor interactions with high fidelity and low cost. However, effectiveness and trustworthiness of these systems remain challenging. Here, we developed AIPatient, a simulated patient system powered by LLM-based AI agents. The system incorporates the Retrieval Augmented Generation (RAG) framework, powered by six task-specific LLM-based AI agents for complex reasoning. For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data from the Medical Information Mart for Intensive Care (MIMIC)-III database. Primary outcomes demonstrate the system’s performance, including the system’s accuracy in Electronic Health Record (EHR)-based medical Question Answering (QA), readability, robustness, and stability. The system achieved a QA accuracy of 94.15% when all six AI agents present, surpassing benchmarks with partial or no agent integration. Its knowledgebase demonstrated high validity (F1 score=0.89). Readability scores showed median Flesch Reading Ease at 77.23 and median Flesch Kincaid Grade at 5.6, indicating accessibility to all medical professionals. Robustness and stability were confirmed with non-significant variance (ANOVA F-value=0.6126, p > 0.1; F-value=0.782, p > 0.1). A user study with medical students showed that AIPatient delivers high fidelity, usability, and educational value, performing on par with or better than human-simulated patients in history-taking. These results highlight AIPatient’s potential to support medical education, AI model testing, and healthcare system improvement.

Key Capabilities

AIPatient blends Reasoning RAG, multi-agent orchestration, and a clinically grounded knowledge graph to deliver realistic, adaptable training experiences.

Reasoning RAG

Three stages — retrieval, reasoning, and generation — with iterative checks to reduce hallucinations.

Multi-Agent Workflow

Six task-specific LLM agents collaborate to support history-taking and QA.

AIPatient KG (MIMIC-III)

NER-constructed KG from de-identified EHR notes (Neo4j AuraDB) for structured retrieval.

Conversation Memory

Summaries maintain context and patient personality across turns.

Evaluation Framework

Effectiveness (KB validity, QA accuracy, readability) and trustworthiness (robustness, stability).

Ethics & Compliance

PhysioNet-compliant model usage; IRB-approved user study.

AI Agents

Six task-specific LLM agents collaborate in a tool-use workflow.

“It first retrieves relevant information from the knowledge graph (Retrieval Agent and KG Query Generation Agent), then applies contextual reasoning to reduce hallucinations (Abstraction Agent and Checker Agent), and finally generates natural language responses … (Rewrite Agent and Summarization Agent).”

Retrieval Agent

Selects relevant KG nodes/edges given a user query.

KG Query Generation Agent

Builds Cypher to retrieve structured facts.

Abstraction Agent

Generalizes queries for robust retrieval.

Checker Agent

Approves/iterates retrieval up to three times.

Rewrite Agent

Translates facts to patient-style language/personality.

Summarization Agent

Maintains conversation memory.
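The agent hand-offs described above can be sketched as a simple control loop. The following is an illustrative Python sketch only, not the paper's implementation: every function name and return value is a hypothetical stand-in for the corresponding agent.

```python
# Illustrative sketch of the six-agent Reasoning RAG loop.
# All functions below are hypothetical stand-ins, not the paper's code.

MAX_RETRIEVAL_ATTEMPTS = 3  # the Checker Agent iterates retrieval up to three times

def retrieve(query: str, attempt: int) -> list[str]:
    """Retrieval + KG Query Generation Agents: fetch KG facts (stubbed)."""
    return [f"fact for '{query}' (attempt {attempt})"]

def abstract(query: str) -> str:
    """Abstraction Agent: generalize the query for more robust retrieval."""
    return query.lower().strip("?")

def check(facts: list[str]) -> bool:
    """Checker Agent: approve the retrieval or request another pass (stubbed)."""
    return len(facts) > 0

def rewrite(facts: list[str], personality: str) -> str:
    """Rewrite Agent: turn retrieved facts into patient-style language."""
    return f"[{personality}] " + "; ".join(facts)

def answer(query: str, memory: list[str], personality: str = "anxious") -> str:
    generalized = abstract(query)
    facts: list[str] = []
    for attempt in range(1, MAX_RETRIEVAL_ATTEMPTS + 1):
        facts = retrieve(generalized, attempt)
        if check(facts):
            break  # Checker approves; stop iterating
    reply = rewrite(facts, personality)
    # Summarization Agent: keep a running summary as conversation memory.
    memory.append(f"Q: {query} | A: {reply}")
    return reply

memory: list[str] = []
print(answer("Do you have chest pain?", memory))
```

The key structural point is the bounded retrieve-and-check loop: the Checker Agent caps retries at three, so a failed retrieval degrades gracefully instead of looping forever.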

System Architecture

Reasoning RAG grounds free-text interaction with verifiable KG facts. The agent controller orchestrates retrieval, abstraction, checking, and generation with conversation memory.

Reasoning RAG pipeline with six agents: Retrieval, Reasoning, Generation
Figure 2. System Architecture: Retrieval → Reasoning → Generation with six agents.
  • “AIPatient KG has 1,500 patient-admission records, with a total of 15,441 nodes and 26,882 edges.”
  • Neo4j AuraDB stores entities and relations for efficient KG queries.
  • Conversation history maintains continuity and personality.
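As a concrete illustration of the kind of structured query the KG Query Generation Agent might emit against a Neo4j-style graph, the sketch below builds a parameterized Cypher string. The node labels (Patient, Symptom), relationship type (HAS_SYMPTOM), properties, and admission ID are all invented for illustration; they are not the actual AIPatient KG schema.

```python
# Hypothetical Cypher generation for a Neo4j-style patient KG.
# Labels, relationship, and properties are illustrative only,
# not the published AIPatient KG schema.

def build_symptom_query(admission_id: str) -> tuple[str, dict]:
    """Return a parameterized Cypher query plus its parameter map."""
    cypher = (
        "MATCH (p:Patient {admission_id: $admission_id})"
        "-[:HAS_SYMPTOM]->(s:Symptom) "
        "RETURN s.name AS symptom, s.onset AS onset"
    )
    return cypher, {"admission_id": admission_id}

query, params = build_symptom_query("HADM_000123")
print(query)
print(params)
```

Using query parameters (the `$admission_id` placeholder) rather than string interpolation is standard Cypher practice: it avoids injection issues and lets the database cache the query plan. In a live system the string and parameter map would be passed to the Neo4j driver for execution.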

Results

Primary outcomes include QA accuracy, knowledgebase validity (NER), readability, robustness, and stability. Key findings below quote the paper verbatim.

EHR-QA Accuracy

94.15%

All agents + few-shot

KB Validity (NER)

F1 = 0.89

GPT-4-Turbo

Readability (median)

FRE 68.77 · FK 6.4

Medians from the Results section

Robustness & Stability

Robustness: F = 0.6126, p = 0.5420

Stability: F = 0.7820, p = 0.7990

Ablations & Model Comparison

“The setup with all agents and few-shot learning achieves the highest accuracy … with 94.15% overall accuracy. The baseline without the AIPatient KG and Reasoning RAG performs worse … accuracy drops to 13.33%.”
“GPT-4-Turbo achieved the highest overall accuracy (94.15%), followed by Claude-4-Opus (90.80%) and GPT-4o (89.02%).”

Readability

“Flesch Reading Ease scores ranging from 10.91 to 99.23 (median 68.77) and Flesch-Kincaid Grade Level … (median grade level 6.4).”

Note: the paper's abstract reports medians of FRE 77.23 and FK 5.6.
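The readability metrics cited above are simple closed-form formulas over word, sentence, and syllable counts. A minimal Python implementation of the standard published Flesch formulas (syllable counting itself is out of scope here; the counts are passed in, and the example numbers are made up):

```python
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher = easier to read (standard formula)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Illustrative counts: 100 words, 10 sentences, 130 syllables
print(round(flesch_reading_ease(100, 10, 130), 3))   # → 86.705
print(round(flesch_kincaid_grade(100, 10, 130), 2))  # → 3.65
```

For context, a median FRE near 69 with an FK grade near 6 means AIPatient's replies read at roughly a sixth-grade level, which is what makes them accessible layperson-style patient speech rather than clinical prose.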

Robustness

“No significant effect of QA conversation paraphrasing on overall response accuracy (F = 0.6126, p = 0.5420) … In the Medical History category … (F = 5.3038, p = 0.00589).”

Stability

“Across 32 personality groups, the median data loss is 2% (0%–5.88%) … Overall (F = 0.7820, p = 0.7990).”
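Both the robustness and stability checks are one-way ANOVAs over grouped accuracy scores: a non-significant F statistic means that paraphrasing (robustness) or personality assignment (stability) does not materially shift accuracy between groups. A minimal pure-Python F statistic, demonstrated on made-up toy data rather than the paper's measurements:

```python
def one_way_anova_f(groups: list[list[float]]) -> float:
    """One-way ANOVA F statistic: between-group vs. within-group variance."""
    k = len(groups)                      # number of groups
    n = sum(len(g) for g in groups)      # total observations
    grand_mean = sum(sum(g) for g in groups) / n
    # Between-group sum of squares (df = k - 1)
    ssb = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ssw = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

# Toy data only (not the paper's measurements):
print(round(one_way_anova_f([[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]), 4))  # → 1.5
```

An F near 1, as in the paper's 0.6126 and 0.7820, indicates the between-group variance is no larger than the ordinary within-group scatter, which is exactly the "no significant effect" conclusion quoted above.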

User Study Results (Figures)

User study result A
User Study Result A
User study result B
User Study Result B
User test results overview
User Test Results


Resources

Ethics (verbatim)

The user study … was approved by the Institutional Review Board of Qilu Hospital of Shandong University (IRB Protocol Number: KYLL-202505-005). All participants provided informed consent.