Artificial intelligence-driven clinical guideline recommendations in maternal care: How trustworthy are they?

  • Jairo J. Pérez
  • Andrés F. Giraldo-Forero
  • Santiago Rúa
  • Daniel Betancur
  • Zuliany Urquina
  • Pablo Castañeda
  • Sara Arango-Valencia
  • Juan Guillermo Barrientos-Gómez
  • Ever A. Torres-Silva
  • Andrés Orozco-Duque

Research output: Contribution to journal; article in indexed scientific journal; peer-reviewed

Abstract

INTRODUCTION: Medical staff often face difficulties in consulting and applying clinical guidelines in practice. Large language models, especially when combined with retrieval-augmented generation, may help overcome these challenges by producing context-specific outputs with improved adherence to medical guidelines.

OBJECTIVES: To assess the performance of commercial large language models in answering maternal health questions within retrieval-augmented generation systems, using both human and automated evaluation metrics.

MATERIAL AND METHODS: A controlled experiment was designed to obtain accurate, consistent answers from a retrieval-augmented generation system based on Colombian maternal care guidelines. A physician formulated ten questions and defined the ground-truth answers. Various large language models were tested with a standardized prompt and evaluated through binary answer-concept ranking and retrieval-augmented generation assessment metrics judged by two independent large language models.

RESULTS: Generative pre-trained transformer 3.5 (GPT-3.5) achieved the highest physician-assessed accuracy (0.90). Claude 3.5 obtained the top faithfulness score (0.78) under GPT-4o evaluation, while Mistral ranked highest (0.84) under Claude 3.5 evaluation. Regarding answer relevance, GPT-3.5 scored highest across both judges (0.94 and 0.86).

CONCLUSIONS: Integrating retrieval-augmented generation into obstetric care has the potential to enhance evidence-based practices and improve patient outcomes. However, rigorous validation of accuracy and context-specific reliability is essential before clinical deployment. The findings of this study indicate that large-scale models (e.g., GPT-3.5, Claude, Llama 70B) consistently outperform lighter models such as Llama 8B.

Original language: English
Pages (from-to): 37-51
Number of pages: 15
Journal: Biomedica
Volume: 45
Issue number: Sp 3
DOI
Status: Published - 10 Dec 2025
Externally published
