TY - JOUR
T1 - Artificial intelligence-driven clinical guideline recommendations in maternal care
T2 - How trustworthy are they?
AU - Pérez, Jairo J.
AU - Giraldo-Forero, Andrés F.
AU - Rúa, Santiago
AU - Betancur, Daniel
AU - Urquina, Zuliany
AU - Castañeda, Pablo
AU - Arango-Valencia, Sara
AU - Barrientos-Gómez, Juan Guillermo
AU - Torres-Silva, Ever A.
AU - Orozco-Duque, Andrés
PY - 2025/12/10
Y1 - 2025/12/10
N2 - INTRODUCTION: Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines. OBJECTIVES: To assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics. MATERIAL AND METHODS: A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer-concept ranking and retrieval-augmented generation assessment metrics, judged by two independent large language models. RESULTS: Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4o evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest across both judges (0.94 and 0.86). CONCLUSIONS: Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.
UR - https://www.scopus.com/pages/publications/105025171112
U2 - 10.7705/biomedica.7902
DO - 10.7705/biomedica.7902
M3 - Article in an indexed scientific journal
C2 - 41410329
AN - SCOPUS:105025171112
SN - 0120-4157
VL - 45
SP - 37
EP - 51
JO - Biomedica
JF - Biomedica
IS - Sp 3
ER -