High-fidelity simulations that blend Reasoning Retrieval-Augmented Generation (Reasoning RAG), a six-agent workflow, and a patient Knowledge Graph built from de-identified MIMIC-III data.
Paired crossover with medical students vs. human-simulated patients (H-SPs). Below are example recordings showing conversation flow and interaction fidelity, as well as group usage clips from class sessions.
Paper Abstract (verbatim)
Simulated patient systems play an important role in modern medical education and research, providing safe, integrative medical training environments and supporting clinical decision-making simulations. Progressive Artificial Intelligence (AI) technologies, such as Large Language Models (LLM), could advance simulated patient systems by replicating medical conditions and patient-doctor interactions with high fidelity and low cost. However, effectiveness and trustworthiness of these systems remain challenging. Here, we developed AIPatient, a simulated patient system powered by LLM-based AI agents. The system incorporates the Retrieval Augmented Generation (RAG) framework, powered by six task-specific LLM-based AI agents for complex reasoning. For simulation reality, the system is also powered by the AIPatient KG (Knowledge Graph), built with de-identified real patient data from the Medical Information Mart for Intensive Care (MIMIC)-III database. Primary outcomes demonstrate the system’s performance, including the system’s accuracy in Electronic Health Record (EHR)-based medical Question Answering (QA), readability, robustness, and stability. The system achieved a QA accuracy of 94.15% when all six AI agents present, surpassing benchmarks with partial or no agent integration. Its knowledgebase demonstrated high validity (F1 score=0.89). Readability scores showed median Flesch Reading Ease at 77.23 and median Flesch Kincaid Grade at 5.6, indicating accessibility to all medical professionals. Robustness and stability were confirmed with non-significant variance (ANOVA F-value=0.6126, p > 0.1; F-value=0.782, p > 0.1). A user study with medical students showed that AIPatient delivers high fidelity, usability, and educational value, performing on par with or better than human-simulated patients in history-taking. These results highlight AIPatient’s potential to support medical education, AI model testing, and healthcare system improvement.
AIPatient blends Reasoning RAG, multi-agent orchestration, and a clinically grounded knowledge graph to deliver realistic, adaptable training experiences.
Three stages — retrieval, reasoning, and generation — with iterative checks to reduce hallucinations.
Six task-specific LLM agents collaborate to support history-taking and QA.
NER-constructed KG from de-identified EHR notes (Neo4j AuraDB) for structured retrieval.
Summaries maintain context and patient personality across turns.
Effectiveness (KB validity, QA accuracy, readability) and trustworthiness (robustness, stability).
PhysioNet-compliant model usage; IRB-approved user study.
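The three-stage loop described above (retrieval, reasoning, generation, with iterative checks) can be sketched as follows. This is a minimal illustration under assumptions, not the published implementation: the toy knowledge graph, the agent stand-ins, and the retry-by-rewriting step are all invented for demonstration.

```python
# Sketch of the three-stage Reasoning RAG loop. Every function here is an
# illustrative stand-in for the corresponding AIPatient agent, not its code.

def retrieve(kg, query):
    """Retrieval stage: look up facts matching the query in a toy KG dict."""
    return list(kg.get(query, []))

def check(facts):
    """Reasoning stage: a trivial checker that rejects empty retrievals."""
    return len(facts) > 0

def generate(facts):
    """Generation stage: render retrieved facts as a patient-style reply."""
    return f"Well, as far as I know: {'; '.join(facts)}."

def reasoning_rag(kg, query, max_retries=3):
    """Retrieve -> check -> generate, re-retrieving up to three times
    (mirroring the Checker Agent's iteration budget described below)."""
    for _ in range(max_retries):
        facts = retrieve(kg, query)
        if check(facts):
            return generate(facts)
        query = query.lower()  # stand-in for query abstraction/rewriting
    return "I'm not sure, could you rephrase that?"

toy_kg = {"chest pain": ["it started two days ago", "it worsens on exertion"]}
reply = reasoning_rag(toy_kg, "chest pain")
```

The key property the sketch preserves is that generation only ever sees facts that passed the check, which is how the pipeline bounds hallucination.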
Six task-specific LLM agents collaborate in a tool-use workflow.
“It first retrieves relevant information from the knowledge graph (Retrieval Agent and KG Query Generation Agent), then applies contextual reasoning to reduce hallucinations (Abstraction Agent and Checker Agent), and finally generates natural language responses … (Rewrite Agent and Summarization Agent).”
Retrieval Agent: selects relevant KG nodes/edges given a user query.
KG Query Generation Agent: builds Cypher queries to retrieve structured facts.
Abstraction Agent: generalizes queries for robust retrieval.
Checker Agent: approves or iterates retrieval, up to three times.
Rewrite Agent: translates retrieved facts into patient-style language and personality.
Summarization Agent: maintains conversation memory.
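As a concrete illustration of the Cypher-building step, here is a small sketch of the kind of query a KG query generator might emit against a Neo4j patient graph. The node label, property name, and relationship types are assumptions for illustration, not the published AIPatient KG schema.

```python
# Illustrative sketch of what a KG Query Generation Agent might emit.
# Labels (Patient), properties (subject_id), and relationship types are
# hypothetical, not the actual AIPatient KG schema.

def build_cypher(relation: str) -> str:
    """Construct a parameterized Cypher query for one patient relation."""
    allowed = {"HAS_SYMPTOM", "HAS_ADMISSION", "HAS_MEDICATION"}
    if relation not in allowed:
        raise ValueError(f"unknown relation: {relation}")
    return (
        "MATCH (p:Patient {subject_id: $pid})-[:%s]->(x) "
        "RETURN x.name AS fact" % relation
    )

query = build_cypher("HAS_SYMPTOM")
# Against a live Neo4j instance this string would be executed with the
# official Python driver, e.g. session.run(query, pid="10006").
```

Whitelisting relationship types before interpolation, and passing the patient id as a `$pid` parameter rather than splicing it into the string, keeps LLM-generated queries from injecting arbitrary Cypher.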
Reasoning RAG grounds free-text interaction with verifiable KG facts. The agent controller orchestrates retrieval, abstraction, checking, and generation with conversation memory.
Primary outcomes include QA accuracy, knowledgebase validity (NER), readability, robustness, and stability. Key findings below quote the paper verbatim.
EHR-QA Accuracy: 94.15% (all six agents + few-shot prompting)
KB Validity (NER): F1 = 0.89 (GPT-4-Turbo)
Readability (median, Results section): FRE 68.77 · FK Grade 6.4
Robustness & Stability: F = 0.6126 (p = 0.5420); F = 0.7820 (p = 0.7990)
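The two readability metrics reported here are standard published formulas; a quick sketch follows, where the word, sentence, and syllable counts are invented example inputs rather than the paper's data.

```python
# Standard Flesch readability formulas, as reported in the results.
# The counts passed in at the bottom are made-up example inputs.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch Reading Ease: higher scores mean easier text."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid Grade Level: approximate U.S. school grade."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

fre = flesch_reading_ease(100, 10, 130)   # 86.705
fk = flesch_kincaid_grade(100, 10, 130)   # 3.65
```

A median FRE near 69 with a grade level around 6 means the system's replies read at roughly a middle-school level, i.e. plain patient-style speech rather than clinical prose.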
“The setup with all agents and few-shot learning achieves the highest accuracy … with 94.15% overall accuracy. The baseline without the AIPatient KG and Reasoning RAG performs worse … accuracy drops to 13.33%.”
“GPT-4-Turbo achieved the highest overall accuracy (94.15%), followed by Claude-4-Opus (90.80%) and GPT-4o (89.02%).”
“Flesch Reading Ease scores ranging from 10.91 to 99.23 (median 68.77) and Flesch-Kincaid Grade Level … (median grade level 6.4).”
For comparison, the abstract reports medians of FRE 77.23 and FK Grade 5.6.
“No significant effect of QA conversation paraphrasing on overall response accuracy (F = 0.6126, p = 0.5420) … In the Medical History category … (F = 5.3038, p = 0.00589).”
“Across 32 personality groups, the median data loss is 2% (0%–5.88%) … Overall (F = 0.7820, p = 0.7990).”
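The robustness and stability checks quoted above rest on one-way ANOVA F statistics. A minimal from-scratch version is sketched below on made-up toy groups (not the paper's measurements), just to make the reported F values concrete.

```python
# One-way ANOVA F statistic computed from scratch; the groups at the
# bottom are invented toy data, not the paper's measurements.

def anova_f(groups):
    """Return the one-way ANOVA F statistic for a list of numeric groups."""
    all_vals = [v for g in groups for v in g]
    n, k = len(all_vals), len(groups)
    grand = sum(all_vals) / n
    # Between-group sum of squares (df = k - 1)
    ssb = sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)
    # Within-group sum of squares (df = n - k)
    ssw = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    return (ssb / (k - 1)) / (ssw / (n - k))

f_stat = anova_f([[1, 2, 3], [2, 3, 4], [3, 4, 5]])  # F = 3.0
```

A small F (like the 0.6126 and 0.7820 above) means the between-group variation is no larger than the within-group noise, which is why paraphrasing and personality swaps leave accuracy statistically unchanged.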
Paired crossover with medical students vs. human-simulated patients (H-SPs). Example recordings are shown below.
PhysioNet (Data)
AIPatient KG & KG-CORAL — DOI: 10.13026/vjrq-9328
GitHub (Latest Code)
huiziy/AIPatient
Zenodo (Archive)
10.5281/zenodo.14583946
Ethics (verbatim)
The user study … was approved by the Institutional Review Board of Qilu Hospital of Shandong University (IRB Protocol Number: KYLL-202505-005). All participants provided informed consent.