Large language models for disease diagnosis: a scoping review

  • Liu, Y. et al. A deep learning system for differential diagnosis of skin diseases. Nat. Med. 26, 900–908 (2020).

  • Mei, X. et al. Artificial intelligence–enabled rapid diagnosis of patients with COVID-19. Nat. Med. 26, 1224–1228 (2020).

  • Li, X. et al. Artificial intelligence-assisted reduction in patients’ waiting time for outpatient process: a retrospective cohort study. BMC Health Serv. Res. 21, 1–11 (2021).

  • Li, B. et al. The performance of a deep learning system in assisting junior ophthalmologists in diagnosing 13 major fundus diseases: a prospective multi-center clinical trial. npj Digit. Med. 7, 8 (2024).

  • Qiu, S. et al. Development and validation of an interpretable deep learning framework for Alzheimer’s disease classification. Brain 143, 1920–1933 (2020).

  • Barnett, G. O., Cimino, J. J., Hupp, J. A. & Hoffer, E. P. DXplain: an evolving diagnostic decision-support system. JAMA 258, 67–74 (1987).

  • Su, C., Xu, Z., Pathak, J. & Wang, F. Deep learning in mental health outcome research: a scoping review. Transl. Psychiatry 10, 116 (2020).

  • Gkotsis, G. et al. Characterisation of mental health conditions in social media using informed deep learning. Sci. Rep. 7, 1–11 (2017).

  • Du, J. et al. Extracting psychiatric stressors for suicide from social media using deep learning. BMC Med. Inform. Decis. Mak. 18, 77–87 (2018).

  • Caraballo, P. J. et al. Trustworthiness of a machine learning early warning model in medical and surgical inpatients. JAMIA Open 8, ooae156 (2025).

  • Sajda, P. Machine learning for detection and diagnosis of disease. Annu. Rev. Biomed. Eng. 8, 537–565 (2006).

  • Stafford, I. S. et al. A systematic review of the applications of artificial intelligence and machine learning in autoimmune diseases. NPJ Digit. Med. 3, 30 (2020).

  • Kline, A. et al. Multimodal machine learning in precision health: A scoping review. npj Digit. Med. 5, 171 (2022).

  • Aggarwal, R. et al. Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit. Med. 4, 65 (2021).

  • Myszczynska, M. A. et al. Applications of machine learning to diagnosis and treatment of neurodegenerative diseases. Nat. Rev. Neurol. 16, 440–456 (2020).

  • Fatima, M. & Pasha, M. Survey of machine learning algorithms for disease diagnostic. J. Intell. Learn. Syst. Appl. 9, 1–16 (2017).

  • Choy, S. P. et al. Systematic review of deep learning image analyses for the diagnosis and monitoring of skin disease. NPJ Digit. Med. 6, 180 (2023).

  • Mei, X. et al. Interstitial lung disease diagnosis and prognosis using an AI system integrating longitudinal data. Nat. Commun. 14, 2272 (2023).

  • Zhou, S. et al. Open-world electrocardiogram classification via domain knowledge-driven contrastive learning. Neural Netw. 179, 106551 (2024).

  • Zhou, Q. et al. A machine and human reader study on AI diagnosis model safety under attacks of adversarial images. Nat. Commun. 12, 7281 (2021).

  • Hannun, A. Y. et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat. Med. 25, 65–69 (2019).

  • Brown, T. et al. Language models are few-shot learners. In Advances in Neural Information Processing Systems, vol. 33 (2020).

  • Touvron, H. et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).

  • Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023).

  • Yang, Z., Mitra, A., Kwon, S. & Yu, H. ClinicalMamba: A generative clinical language model on longitudinal clinical notes. In Proceedings of the 6th Clinical Natural Language Processing Workshop, 54–63 (Association for Computational Linguistics, 2024).

  • Peng, L. et al. An in-depth evaluation of federated learning on biomedical natural language processing for information extraction. NPJ Digit. Med. 7, 127 (2024).

  • Zhan, Z., Zhou, S., Li, M. & Zhang, R. Ramie: retrieval-augmented multi-task information extraction with large language models on dietary supplements. J. Am. Med. Inform. Assoc. ocaf002 (2025).

  • Lu, M. Y. et al. A multimodal generative AI copilot for human pathology. Nature 1–3 (2024).

  • Kim, J. et al. Large language models outperform mental and medical health care professionals in identifying obsessive-compulsive disorder. NPJ Digit. Med. 7, 193 (2024).

  • Thirunavukarasu, A. J. et al. Large language models in medicine. Nat. Med. 29, 1930–1940 (2023).

  • Zhou, H. et al. A survey of large language models in medicine: Progress, application, and challenge. arXiv preprint arXiv:2311.05112 (2023).

  • Meng, X. et al. The application of large language models in medicine: A scoping review. iScience 27 (2024).

  • Zhang, Y. et al. Data-centric foundation models in computational healthcare: A survey. arXiv preprint arXiv:2401.02458 (2024).

  • Du, X. et al. Generative large language models in electronic health records for patient care since 2023: A systematic review. medRxiv 2024–08 (2024).

  • Wang, C. et al. A survey for large language models in biomedicine. arXiv preprint arXiv:2409.00133 (2024).

  • Li, L. et al. A scoping review of using large language models (LLMs) to investigate electronic health records (EHRs). arXiv preprint arXiv:2405.03066 (2024).

  • He, K. et al. A survey of large language models for healthcare: from data, technology, and applications to accountability and ethics. Inf. Fusion 102963 (2025).

  • Pressman, S. M. et al. Clinical and surgical applications of large language models: A systematic review. J. Clin. Med. 13, 3041 (2024).

  • Omar, M., Brin, D., Glicksberg, B. & Klang, E. Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: A systematic review. Am. J. Infect. Control 52, 992–1001 (2024).

  • Giuffrè, M. et al. Systematic review: The use of large language models as medical chatbots in digestive diseases. Aliment. Pharmacol. Ther. 60, 144–166 (2024).

  • Mai, A. S., Adnan, K. & Mohammad, Y. Medpromptx: Grounded multimodal prompting for chest x-ray diagnosis. arXiv preprint arXiv:2403.15585 (2024).

  • Kraljevic, Z. et al. Foresight—a generative pretrained transformer for modelling of patient timelines using electronic health records: a retrospective modelling study. Lancet Digital Health 6, e281–e290 (2024).

  • GLM, T. et al. Chatglm: A family of large language models from glm-130b to glm-4 all tools. arXiv preprint arXiv:2406.12793 (2024).

  • Busch, F. et al. Integrating text and image analysis: Exploring GPT-4V’s capabilities in advanced radiological applications across subspecialties. J. Med. Internet Res. 26, e54948 (2024).

  • Kim, Y., Xu, X., McDuff, D., Breazeal, C. & Park, H. W. Health-llm: Large language models for health prediction via wearable sensor data. In Conference on Health, Inference, and Learning (2024).

  • Gao, Z., Hu, Y., Tan, C. & Li, S. Z. Prefixmol: Target- and chemistry-aware molecule design via prefix embedding. arXiv preprint arXiv:2302.07120 (2023).

  • Niu, S. et al. Ehr-knowgen: Knowledge-enhanced multimodal learning for disease diagnosis generation. Inf. Fusion 102, 102069 (2024).

  • Chung, P. et al. Large language model capabilities in perioperative risk prediction and prognostication. JAMA Surg. 159, 928–937 (2024).

  • Delsoz, M. et al. Performance of ChatGPT in diagnosis of corneal eye diseases. Cornea 43, 664–670 (2024).

  • Fink, M. A. et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology 308, e231362 (2023).

  • Moallem, G., Gonzalez, A. D. L. M., Desai, A. & Rusu, M. Automated labeling of spondylolisthesis cases through spinal MRI radiology report interpretation using ChatGPT. In Medical Imaging 2024: Computer-Aided Diagnosis, vol. 12927, 702–706 (SPIE, 2024).

  • Benary, M. et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw. Open 6, e2343689–e2343689 (2023).

  • Reese, J. T. et al. On the limitations of large language models in clinical diagnosis. medRxiv 2023-07 (2024).

  • Sarangi, P. K., Irodi, A., Panda, S., Nayak, D. S. K. & Mondal, H. Radiological differential diagnoses based on cardiovascular and thoracic imaging patterns: perspectives of four large language models. Indian J. Radiol. Imaging 34, 269–275 (2024).

  • Wang, J. et al. Augmented risk prediction for the onset of Alzheimer’s disease from electronic health records with large language models. arXiv preprint arXiv:2405.16413 (2024).

  • Du, X. et al. Enhancing early detection of cognitive decline in the elderly: a comparative study utilizing large language models in clinical notes. eBioMedicine 109, 105401 (2024).

  • Haider, S. A. et al. Evaluating large language model (LLM) performance on established breast classification systems. Diagnostics 14, 1491 (2024).

  • Siepmann, R. et al. The virtual reference radiologist: comprehensive AI assistance for clinical image reading and interpretation. Eur. Radiol. 34, 6652–6666 (2024).

  • Peng, L. et al. Mmgpl: Multimodal medical data analysis with graph prompt learning. Med. Image Anal. 97, 103225 (2024).

  • Xu, S. et al. Elixr: Towards a general purpose x-ray artificial intelligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317 (2023).

  • Gertz, R. J. et al. Potential of GPT-4 for detecting errors in radiology reports: Implications for reporting accuracy. Radiology 311, e232714 (2024).

  • Ono, D., Dickson, D. W. & Koga, S. Evaluating the efficacy of few-shot learning for GPT-4vision in neurodegenerative disease histopathology: A comparative analysis with convolutional neural network model. Neuropathol. Appl. Neurobiol. 50, e12997 (2024).

  • Dai, Y., Gao, Y. & Liu, F. Transmed: Transformers advance multi-modal medical image classification. Diagnostics 11, 1384 (2021).

  • Upadhyaya, D. P. et al. A 360° View for Large Language Models: Early Detection of Amblyopia in Children Using Multi-view Eye Movement Recordings. In International Conference on Artificial Intelligence in Medicine, 165–175 (Springer Nature Switzerland, Cham, 2024).

  • Noda, M. et al. Feasibility of multimodal artificial intelligence using GPT-4 vision for the classification of middle ear disease: Qualitative study and validation. JMIR AI 3, e58342 (2024).

  • Antaki, F., Chopra, R. & Keane, P. A. Vision-language models for feature detection of macular diseases on optical coherence tomography. JAMA Ophthalmol. 142, 573–576 (2024).

  • Peng, Z. et al. Development and evaluation of multimodal AI for diagnosis and triage of ophthalmic diseases using ChatGPT and anterior segment images: protocol for a two-stage cross-sectional study. Front. Artif. Intell. 6, 1323924 (2023).

  • Suh, P. S. et al. Comparing diagnostic accuracy of radiologists versus GPT-4V and Gemini Pro Vision using image inputs from Diagnosis Please cases. Radiology 312, e240273 (2024).

  • Pugliese, G. et al. Are artificial intelligence large language models a reliable tool for difficult differential diagnosis? An a posteriori analysis of a peculiar case of necrotizing otitis externa. Clin. Case Rep. 11, e7933 (2023).

  • Hu, C. et al. Exploiting ChatGPT for diagnosing autism-associated language disorders and identifying distinct features. arXiv preprint arXiv:2405.01799 (2024).

  • Deng, S. et al. Hear me, see me, understand me: Audio-visual autism behavior recognition. IEEE Trans. Multimedia (2024).

  • Rezaii, N. et al. Artificial intelligence classifies primary progressive aphasia from connected speech. Brain 147, 3070–3082 (2024).

  • Liu, C., Ma, Y., Kothur, K., Nikpour, A. & Kavehei, O. Biosignal copilot: Leveraging the power of LLMs in drafting reports for biomedical signals. medRxiv 2023.06.28.23291916 (2023).

  • Yu, H., Guo, P. & Sano, A. Zero-shot ECG diagnosis with large language models and retrieval-augmented generation. In ML4H@NeurIPS (2023).

  • Wu, D. et al. Multimodal machine learning combining facial images and clinical texts improves the diagnosis of rare genetic diseases. arXiv preprint arXiv:2312.15320 (2023).

  • Feng, Y., Xu, X., Zhuang, Y. & Zhang, M. Large language models improve Alzheimer’s disease diagnosis using multi-modality data. In 2023 IEEE International Conference on Medical Artificial Intelligence (MedAI), 61–66 (IEEE, 2023).

  • Ma, M. D. et al. Clibench: Multifaceted evaluation of large language models in clinical decisions on diagnoses, procedures, lab tests orders and prescriptions. arXiv preprint arXiv:2406.09923 (2024).

  • Liang, L. et al. Genetic transformer: An innovative large language model driven approach for rapid and accurate identification of causative variants in rare genetic diseases. medRxiv 2024-07 (2024).

  • Thompson, W. et al. Large language models with retrieval-augmented generation for zero-shot disease phenotyping. In Deep Generative Models for Health Workshop NeurIPS 2023 (2023).

  • Shi, W. et al. Retrieval-augmented large language models for adolescent idiopathic scoliosis patients in shared decision-making. In Proceedings of the 14th ACM International Conference on Bioinformatics, Computational Biology, and Health Informatics, 1–10 (2023).

  • Wen, Y., Wang, Z. & Sun, J. MindMap: Knowledge graph prompting sparks graph of thoughts in large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 10370–10388 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  • Kresevic, S. et al. Optimization of hepatological clinical guidelines interpretation by large language models: a retrieval augmented generation-based framework. NPJ Digit. Med. 7, 102 (2024).

  • Yu, H., Guo, P. & Sano, A. ECG semantic integrator (ESI): A foundation ECG model pretrained with LLM-enhanced cardiological text. Trans. Mach. Learn. Res. (2024).

  • Ghersin, I. et al. Comparative evaluation of a language model and human specialists in the application of European guidelines for the management of inflammatory bowel diseases and malignancies. Endoscopy 56, 706–709 (2024).

  • Ge, J. et al. Development of a liver disease–specific large language model chat interface using retrieval-augmented generation. Hepatology 80, 1158–1168 (2024).

  • Xia, P. et al. RULE: Reliable multimodal RAG for factuality in medical vision language models. In Al-Onaizan, Y., Bansal, M. & Chen, Y.-N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 1081–1093 (Association for Computational Linguistics, Miami, Florida, USA, 2024).

  • Ranjit, M., Ganapathy, G., Manuel, R. & Ganu, T. Retrieval augmented chest x-ray report generation using OpenAI GPT models. In Deshpande, K. et al. (eds.) Proceedings of the 8th Machine Learning for Healthcare Conference, vol. 219 of Proceedings of Machine Learning Research, 650–666 (PMLR, 2023).

  • Li, Z., Zhang, J., Zhou, W., Zheng, J. & Xia, Y. Gpt-agents based on medical guidelines can improve the responsiveness and explainability of outcomes for traumatic brain injury rehabilitation. Sci. Rep. 14, 7626 (2024).

  • Abdullahi, T. et al. Learning to make rare and complex diagnoses with generative ai assistance: qualitative study of popular large language models. JMIR Med. Educ. 10, e51391 (2024).

  • Rifat Ahmmad Rashid, M. et al. A respiratory disease management framework by combining large language models and convolutional neural networks for effective diagnosis. Int. J. Comput. Digit. Syst. 16, 189–202 (2024).

  • Ferber, D. et al. Autonomous artificial intelligence agents for clinical decision making in oncology. arXiv preprint arXiv:2404.04667 (2024).

  • Soong, D. et al. Improving accuracy of GPT-3/4 results on biomedical data using a retrieval-augmented language model. PLOS Digit. Health 3, e0000568 (2024).

  • Rau, A. et al. A context-based chatbot surpasses trained radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology 308, e230970 (2023).

  • Zhu, Y. et al. Emerge: Integrating RAG for improved multimodal EHR predictive modeling. arXiv preprint arXiv:2406.00036 (2024).

  • Chen, C. et al. Large Language Model-Informed ECG Dual Attention Network for Heart Failure Risk Prediction. IEEE Transactions on Big Data 11, 948–960 (2024).

  • Askell, A. et al. A general language assistant as a laboratory for alignment. arXiv preprint arXiv:2112.00861 (2021).

  • Liu, H. et al. Visual instruction tuning. Advances in Neural Information Processing Systems 36, 34892–34916 (2023).

  • Toma, A. et al. Clinical camel: An open expert-level medical language model with dialogue-based knowledge encoding. arXiv preprint arXiv:2305.12031 (2023).

  • Wu, J., Wu, X., Zheng, Y. & Yang, J. Medkp: Medical dialogue with knowledge enhancement and clinical pathway encoding. arXiv preprint arXiv:2403.06611 (2024).

  • He, Y., Zhang, Y., He, S. & Wan, J. BP4ER: Bootstrap Prompting for Explicit Reasoning in Medical Dialogue Generation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2480–2492 (ELRA and ICCL, Torino, Italia, 2024).

  • Xu, K., Cheng, Y., Hou, W., Tan, Q. & Li, W. Reasoning Like a Doctor: Improving Medical Dialogue Systems via Diagnostic Reasoning Process Alignment. In Findings of the Association for Computational Linguistics: ACL 2024, pages 6796–6814 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  • Yang, L. et al. Advancing multimodal medical capabilities of Gemini. arXiv preprint arXiv:2405.03162 (2024).

  • He, S. et al. Meddr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv e-prints (2024): arXiv-2404.

  • Chen, Z. et al. Dia-LLaMA: Towards large language model-driven ct report generation. arXiv preprint arXiv:2403.16386 (2024).

  • Liu, Z. et al. Radiology-llama2: Best-in-class large language model for radiology. arXiv preprint arXiv:2309.06419 (2023).

  • Alkhaldi, A. et al. Minigpt-med: Large language model as a general interface for radiology diagnosis. arXiv preprint arXiv:2407.04106 (2024).

  • Lee, S., Youn, J., Kim, H., Kim, M. & Yoon, S. H. CXR-LLAVA: a multimodal large language model for interpreting chest X-ray images. arXiv (2023).

  • Kwon, T. et al. Large language models are clinical reasoners: Reasoning-aware diagnosis framework with prompt-generated rationales. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 18417–18425 (2024).

  • Xu, H. et al. A whole-slide foundation model for digital pathology from real-world data. Nature 630, 181–188 (2024).

  • Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 15, 5649 (2024).

  • Sun, Y. et al. Pathasst: A generative foundation ai assistant towards artificial general intelligence of pathology. In Proceedings of the AAAI Conference on Artificial Intelligence, 38, 5034–5042 (2024).

  • Zhang, X. et al. When LLMs Meets Acoustic Landmarks: An Efficient Approach to Integrate Speech into Large Language Models for Depression Detection. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 146–158 (Association for Computational Linguistics, Miami, Florida, USA, 2024).

  • Ouyang, L. et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems 35, 27730–27744 (2022).

  • Schulman, J. et al. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347 (2017).

  • Zhou, Z. et al. Large model driven radiology report generation with clinical quality reinforcement learning. arXiv preprint arXiv:2403.06728 (2024).

  • Wang, G., Yang, G., Du, Z., Fan, L. & Li, X. Clinicalgpt: large language models finetuned with diverse medical data and comprehensive evaluation. arXiv preprint arXiv:2306.09968 (2023).

  • Zhang, H. et al. HuatuoGPT, Towards Taming Language Model to Be a Doctor. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 10859–10885 (Association for Computational Linguistics, Singapore, 2023).

  • Zeng, G. et al. MedDialog: Large-scale Medical Dialogue Datasets. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 9241–9250 (Association for Computational Linguistics, Online, 2020).

  • Gao, L., Schulman, J. & Hilton, J. Scaling laws for reward model overoptimization. In International Conference on Machine Learning, pages 10835–10866 (PMLR, 2023).

  • Ye, Z. et al. Beyond Scalar Reward Model: Learning Generative Judge from Preference Data. arXiv preprint arXiv:2410.03742 (2024).

  • Henderson, P. et al. Deep reinforcement learning that matters. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018).

  • Rafailov, R. et al. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems 36, 53728–53741 (2023).

  • Ye, Q. et al. Qilin-med: Multi-stage knowledge injection advanced medical large language model. arXiv preprint arXiv:2310.09089 (2023).

  • Yang, D. et al. Pediatricsgpt: Large language models as Chinese medical assistants for pediatric applications. Advances in Neural Information Processing Systems 37, 138632–138662 (2024).

  • Guo, C., Pleiss, G., Sun, Y. & Weinberger, K. Q. On calibration of modern neural networks. In Precup, D. & Teh, Y. W. (eds.) Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017, vol. 70 of Proceedings of Machine Learning Research, 1321–1330 (PMLR, 2017).

  • Tajwar, F. et al. Preference fine-tuning of LLMs should leverage suboptimal, on-policy data. In Proceedings of the 41st International Conference on Machine Learning (ICML’24), Vol. 235, 47441–47474 (JMLR.org, 2024).

  • Bai, Y. et al. Constitutional AI: Harmlessness from AI feedback. arXiv preprint arXiv:2212.08073 (2022).

  • Hu, E. J. et al. Lora: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022 (OpenReview.net, 2022).

  • Rajashekar, N. C. et al. Human-algorithmic interaction using a large language model-augmented artificial intelligence clinical decision support system. In Proceedings of the CHI Conference on Human Factors in Computing Systems, vol. 37, 1–20 (ACM, New York, NY, USA, 2024).

  • Yang, X. et al. A large language model for electronic health records. NPJ Digit. Med. 5, 194 (2022).

  • Labrak, Y. et al. BioMistral: A Collection of Open-Source Pretrained Large Language Models for Medical Domains. In Findings of the Association for Computational Linguistics: ACL 2024, pages 5848–5864 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  • Wang, J., Seng, K. P., Shen, Y., Ang, L.-M. & Huang, D. Image to label to answer: An efficient framework for enhanced clinical applications in medical visual question answering. Electronics 13, 2273 (2024).

  • Liu, F. et al. A medical multimodal large language model for future pandemics. NPJ Digit. Med. 6, 226 (2023).

  • Wu, C. et al. Towards generalist foundation model for radiology by leveraging web-scale 2D&3D medical data. arXiv preprint arXiv:2308.02463 (2023).

  • Ding, J.-E. et al. Large language multimodal models for new-onset type 2 diabetes prediction using five-year cohort electronic health records. Sci. Rep. 14, 20774 (2024).

  • Phan, V. M. H. et al. Decomposing disease descriptions for enhanced pathology detection: A multi-aspect vision-language pre-training framework. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11492–11501 (2024).

  • Chen, J. et al. Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7346–7370 (Association for Computational Linguistics, Miami, Florida, USA, 2024).

  • Lu, Z. et al. Large language models in biomedicine and health: current research landscape and future directions. J. Am. Med. Inform. Assoc. 31, 1801–1811 (2024).

  • Li, H. et al. Llms-as-judges: a comprehensive survey on llm-based evaluation methods. arXiv preprint arXiv:2412.05579 (2024).

  • Tu, T. et al. Towards conversational diagnostic artificial intelligence. Nature 1–9 (2025).

  • Safranek, C. W. et al. Automated heart score determination via chatgpt: Honing a framework for iterative prompt development. J. Am. Coll. Emerg. Phys. Open 5, e13133 (2024).

  • Zhang, T. et al. Incorporating Clinical Guidelines Through Adapting Multi-modal Large Language Model for Prostate Cancer PI-RADS Scoring. In International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer Nature Switzerland, Cham, 2024).

  • Zhou, S. et al. Explainable differential diagnosis with dual-inference large language models. npj Health Systems 2, 12 (2025).

  • Chen, X. et al. EyeGPT: Ophthalmic assistant with large language models. arXiv preprint arXiv:2403.00840 (2024).

  • Savage, T., Nayak, A., Gallo, R., Rangan, E. & Chen, J. H. Diagnostic reasoning prompts reveal the potential for large language model interpretability in medicine. NPJ Digit. Med. 7, 20 (2024).

  • Li, S. S. et al. MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning. In Neural Information Processing Systems (2024).

  • Fansi Tchango, A., Goel, R., Wen, Z., Martel, J. & Ghosn, J. Ddxplus: A new dataset for automatic medical diagnosis. Adv. Neural Inf. Process. Syst. 35, 31306–31318 (2022).

  • Xie, Q. et al. Medical foundation large language models for comprehensive text analysis and beyond. npj Digit. Med. 8, 141 (2025).

  • Yang, S. et al. Zhongjing: Enhancing the Chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 17 (2024).

  • Mohammadi, S. S. & Nguyen, Q. D. A user-friendly approach for the diagnosis of diabetic retinopathy using ChatGPT and automated machine learning. Ophthalmol. Sci. 4, 100495 (2024).

  • Tank, C. et al. Depression detection and analysis using large language models on textual and audio-visual modalities. arXiv preprint arXiv:2407.06125 (2024).

  • Sandmann, S., Riepenhausen, S., Plagwitz, L. & Varghese, J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat. Commun. 15, 2050 (2024).

  • Bae, S. et al. Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems 36, 3867–3880 (2023).

  • Hu, J. et al. Designing scaffolding strategies for conversational agents in dialog task of neurocognitive disorders screening. In Proceedings of the CHI Conference on Human Factors in Computing Systems, 1–21 (2024).

  • Englhardt, Z. et al. From classification to clinical insights: Towards analyzing and reasoning about mobile and behavioral health data with large language models. Proc. ACM Interact., Mob. Wearable Ubiquitous Technol. 8, 1–25 (2024).

  • Smith, P. C. et al. Missing clinical information during primary care visits. JAMA 293, 565–571 (2005).

  • McInerney, D. et al. Towards reducing diagnostic errors with interpretable risk prediction. In Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics, vol. 2024 (2024).

  • Adler-Milstein, J., Chen, J. H. & Dhaliwal, G. Next-generation artificial intelligence for diagnosis: from predicting diagnostic labels to “wayfinding”. JAMA 326, 2467–2468 (2021).

  • Shi, X. et al. Medical dialogue system: A survey of categories, methods, evaluation and challenges. In Findings of the Association for Computational Linguistics: ACL 2024 (2024).

  • Sun, Z., Luo, C. & Huang, Z. Conversational disease diagnosis via external planner-controlled large language models. arXiv preprint arXiv:2404.04292 (2024).

  • Zou, X. et al. AI-driven diagnostic assistance in medical inquiry: Reinforcement learning algorithm development and validation. J. Med. Internet Res. 26, e54616 (2024).

  • Zhang, R. et al. Making shiny objects illuminating: the promise and challenges of large language models in US health systems. npj Health Syst. 2, 8 (2025).

  • Cameron, S. & Turtle-Song, I. Learning to write case notes using the SOAP format. J. Counsel. Dev. 80, 286–292 (2002).

  • Oniani, D. et al. Enhancing large language models for clinical decision support by incorporating clinical practice guidelines. In 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI), 694–702 (IEEE, 2024).

  • Sallam, M., Al-Salahat, K. & Al-Ajlouni, E. ChatGPT performance in diagnostic clinical microbiology laboratory-oriented case scenarios. Cureus 15, e50629 (2023).

  • Bhasuran, B. et al. Preliminary analysis of the impact of lab results on large language model generated differential diagnoses. npj Digit. Med. 8, 166 (2025).

  • Yi, Z. et al. A survey on recent advances in LM-based multi-turn dialogue systems. arXiv preprint arXiv:2402.18013 (2024).

  • McDuff, D. et al. Towards accurate differential diagnosis with large language models. Nature 1–7 (2025).

  • Haltaufderheide, J. & Ranisch, R. The ethics of ChatGPT in medicine and healthcare: a systematic review on large language models (LLMs). npj Digit. Med. 7, 183 (2024).

  • Dou, C. et al. Detection, diagnosis, and explanation: A benchmark for Chinese medical hallucination evaluation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), 4784–4794 (2024).

  • Tran, H., Wang, J., Ting, Y., Huang, W. & Chen, T. Leaf: Learning and evaluation augmented by fact-checking to improve factualness in large language models. arXiv preprint arXiv:2410.23526 (2024).

  • Yue, X. & Zhou, S. Phicon: Improving generalization of clinical text de-identification models via data augmentation. In Clinical Natural Language Processing Workshop (2020).

  • Zhou, J. et al. Pre-trained multimodal large language model enhances dermatological diagnosis using SkinGPT-4. Nat. Commun. 15, 5649 (2024).

  • Spitale, M., Cheong, J. & Gunes, H. Underneath the Numbers: Quantitative and Qualitative Gender Fairness in LLMs for Depression Prediction. arXiv preprint arXiv:2406.08183 (2024).

  • Chen, Z. et al. Meditron-70b: Scaling medical pretraining for large language models. arXiv preprint arXiv:2311.16079 (2023).

  • Yang, K. et al. Mentallama: interpretable mental health analysis on social media with large language models. In Proceedings of the ACM on Web Conference 2024, 4489–4500 (2024).

  • Peng, J. et al. Continually evolved multimodal foundation models for cancer prognosis. arXiv preprint arXiv:2501.18170 (2025).

  • Yi, H. et al. Towards general purpose medical AI: Continual learning medical foundation model. arXiv preprint arXiv:2303.06580 (2023).

  • Kim, Y. et al. Adaptive collaboration strategy for LLMs in medical decision making. NeurIPS (2024).

  • Jiang, S. et al. Med-MoE: Mixture of domain-specific experts for lightweight medical vision-language models. In Findings of the Association for Computational Linguistics: EMNLP 2024, 3843–3860 (Association for Computational Linguistics, Miami, Florida, USA, 2024).

  • Xu, D. et al. Editing factual knowledge and explanatory ability of medical large language models. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (2024).

  • Hager, P. et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat. Med. 1–10 (2024).

  • Kuratov, Y. et al. Babilong: Testing the limits of LLMs with long context reasoning-in-a-haystack. Adv. Neural Inf. Process. Syst. 37, 106519–106554 (2024).

  • Yang, Z., Mitra, A., Liu, W., Berlowitz, D. & Yu, H. TransformEHR: transformer-based encoder-decoder generative model to enhance prediction of disease outcomes using electronic health records. Nat. Commun. 14, 7857 (2023).

  • Huang, L. et al. Machine learning of serum metabolic patterns encodes early-stage lung adenocarcinoma. Nat. Commun. 11, 3556 (2020).

  • Cui, H. et al. Timer: Temporal instruction modeling and evaluation for longitudinal clinical records. arXiv preprint arXiv:2503.04176 (2025).

  • Dou, C. et al. PlugMed: Improving Specificity in Patient-Centered Medical Dialogue Generation using In-Context Learning. Conference on Empirical Methods in Natural Language Processing (2023).

  • Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In North American Chapter of the Association for Computational Linguistics (2019).

  • Chen, C. et al. Clinicalbench: Can LLMs beat traditional ML models in clinical prediction? arXiv preprint arXiv:2411.06469 (2024).

  • Zhong, T. et al. Chatradio-valuer: A chat large language model for generalizable radiology report generation based on multi-institution and multi-system data. arXiv preprint arXiv:2310.05242 (2023).

  • Zhan, Z., Zhou, S., Zhou, H., Liu, Z. & Zhang, R. Epee: Towards efficient and effective foundation models in biomedicine. arXiv preprint arXiv:2503.02053 (2025).

  • Ferrara, E. Large language models for wearable sensor-based human activity recognition, health monitoring, and behavioral modeling: a survey of early trends, datasets, and challenges. Sensors 24, 5045 (2024).

  • Hulstaert, F. et al. Gaps in the evidence underpinning high-risk medical devices in Europe at market entry, and potential solutions. Orphanet J. Rare Dis. 18, 212 (2023).

  • Tam, T. Y. C. et al. A framework for human evaluation of large language models in healthcare derived from literature review. npj Digit. Med. 7, 258 (2024).

  • Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019).

  • Wu, D. et al. GestaltMML: Enhancing Rare Genetic Disease Diagnosis through Multimodal Machine Learning Combining Facial Images and Clinical Texts. ArXiv (2024): arXiv-2312.

  • Mizuta, K., Hirosawa, T., Harada, Y. & Shimizu, T. Can ChatGPT-4 evaluate whether a differential diagnosis list contains the correct diagnosis as accurately as a physician? Diagnosis 11, 321–324 (2024).

  • Olesen, A. S. O. et al. How does ChatGPT-4 match radiologists in detecting pulmonary congestion on chest X-ray? J. Med. Artif. Intell. 7 (2024).

  • Liu, X. et al. Large language models are few-shot health learners. arXiv preprint arXiv:2305.15525 (2023).

  • Slack, D. & Singh, S. Tablet: Learning from instructions for tabular data. arXiv preprint arXiv:2304.13188 (2023).

  • Xia, P. et al. Cares: A comprehensive benchmark of trustworthiness in medical vision language models. Adv. Neural Inf. Process. Syst. 37, 140334–140365 (2024).

  • Wada, A. et al. Optimizing GPT-4 turbo diagnostic accuracy in neuroradiology through prompt engineering and confidence thresholds. Diagnostics 14, 1541 (2024).

  • Chen, Z., Lu, Y. & Wang, W. Empowering Psychotherapy with Large Language Models: Cognitive Distortion Detection through Diagnosis of Thought Prompting. In Findings of the Association for Computational Linguistics: EMNLP 2023, 4295–4304 (Association for Computational Linguistics, Singapore, 2023).

  • Vashisht, P. et al. UMass-BioNLP at MEDIQA-M3G 2024: DermPrompt – A Systematic Exploration of Prompt Engineering with GPT-4V for Dermatological Diagnosis. In Proceedings of the 6th Clinical Natural Language Processing Workshop, 502–525 (Association for Computational Linguistics, Mexico City, Mexico, 2024).

  • Lim, S., Kim, Y., Choi, C.-H., Sohn, J.-Y. & Kim, B.-H. ERD: A Framework for Improving LLM Reasoning for Cognitive Distortion Classification. In Proceedings of the 6th Clinical Natural Language Processing Workshop, 292–300 (Association for Computational Linguistics, Mexico City, Mexico, 2024).

  • Peng, C. et al. Improving generalizability of extracting social determinants of health using large language models through prompt-tuning. arXiv preprint arXiv:2403.12374 (2024).

  • Zhou, W. et al. Transferring Pre-Trained Large Language-Image Model for Medical Image Captioning. In CLEF (Working Notes), 1776–1784 (2023).

  • Belyaeva, A. et al. Multimodal LLMs for health grounded in individual-specific data. In Workshop on Machine Learning for Multimodal Healthcare Data, 86–102 (Springer, 2023).

  • Ong, J. C. L. et al. Development and testing of a novel large language model-based clinical decision support systems for medication safety in 12 clinical specialties. arXiv preprint arXiv:2402.01741 (2024).

  • Vithanage, D. et al. Evaluating machine learning approaches for multi-label classification of unstructured electronic health records with a generative large language model. medRxiv (2024).

  • Liu, J. et al. Large language model locally fine-tuning (LLMLF) on Chinese medical imaging reports. In Proceedings of the 2023 6th International Conference on Big Data Technologies (ACM, New York, NY, USA, 2023).

  • Song, M. et al. PneumoLLM: Harnessing the power of large language model for pneumoconiosis diagnosis. Med. Image Anal. 97, 103248 (2024).

  • Liu, W. & Zuo, Y. Stone needle: A general multimodal large-scale model framework towards healthcare. arXiv preprint arXiv:2306.16034 (2023).

  • Dou, C. et al. Integrating Physician Diagnostic Logic into Large Language Models: Preference Learning from Process Feedback. In Findings of the Association for Computational Linguistics: ACL 2024, 2453–2473 (Association for Computational Linguistics, Bangkok, Thailand, 2024).

  • Sun, M. LlamaCare: A Large Medical Language Model for Enhancing Healthcare Knowledge Sharing. arXiv preprint arXiv:2406.02350 (2024).

  • Zhang, K. et al. A generalist vision–language foundation model for diverse biomedical tasks. Nat. Med. 30, 3129–3141 (2024).

  • Wu, C.-K., Chen, W.-L. & Chen, H.-H. Large language models perform diagnostic reasoning. Tiny Papers @ ICLR 2023.

  • Yang, Z. et al. Unveiling GPT-4V’s hidden challenges behind high accuracy on USMLE questions: Observational Study. J. Med. Internet Res. 27, e65146 (2025).

  • Chen, Z. et al. Narrative Feature or Structured Feature? A Study of Large Language Models to Identify Cancer Patients at Risk of Heart Failure. arXiv preprint arXiv:2403.11425 (2024).

  • Hayati, M. F. M., Ali, M. A. M. & Rosli, A. N. M. Depression detection on Malay dialects using GPT-3. In 2022 IEEE-EMBS Conference on Biomedical Engineering and Sciences (IECBES), 360–364 (IEEE, 2022).

  • Liu, S. et al. Leveraging large language models for generating responses to patient messages-a subjective analysis. J. Am. Med. Inform. Assoc. 31, 1367–1379 (2024).

  • Gao, Y. et al. Large language models and medical knowledge grounding for diagnosis prediction. medRxiv 2023-11 (2023).

  • Sushil, M. et al. A comparative study of zero-shot inference with large language models and supervised modeling in breast cancer pathology classification. Research Square (2024).

  • Zhang, X., Wu, C., Zhang, Y., Xie, W. & Wang, Y. Knowledge-enhanced visual-language pre-training on chest radiology images. Nat. Commun. 14, 4542 (2023).

  • Kotelanski, M., Gallo, R., Nayak, A. & Savage, T. Methods to estimate large language model confidence. arXiv preprint arXiv:2312.03733 (2023).

  • Qu, L. et al. The rise of AI language pathologists: Exploring two-level prompt learning for few-shot weakly-supervised whole slide image classification. Adv. Neural Inf. Process. Syst. 36, 67551–67564 (2023).

  • Dekel, S. et al. ChatGPT Demonstrates Potential for Identifying Psychiatric Disorders: Application to Childbirth-Related Post-Traumatic Stress Disorder. Research Square (2023).

  • Du, J. et al. Ret-clip: A retinal image foundation model pre-trained with clinical diagnostic reports. In International Conference on Medical Image Computing and Computer-Assisted Intervention (Springer Nature Switzerland, Cham, 2024).

  • Blankemeier, L. et al. Merlin: A vision language foundation model for 3d computed tomography. Research Square (2024).

  • Acharya, A. et al. Clinical risk prediction using language models: benefits and considerations. J. Am. Med. Inform. Assoc. 31, 1856–1864 (2024).

  • Chen, P.-F. et al. Automatic ICD-10 coding and training system: deep neural network based on supervised learning. JMIR Med. Inform. 9, e23230 (2021).

  • Pedro, T. et al. Exploring the use of ChatGPT in predicting anterior circulation stroke functional outcomes after mechanical thrombectomy: a pilot study. Journal of NeuroInterventional Surgery (2024).

  • Ren, X. et al. ChatASD: LLM-based AI therapist for ASD. In Communications in Computer and Information Science, 312–324 (Springer Nature Singapore, Singapore, 2024).

  • Weng, Y. et al. Large language models need holistically thought in medical conversational qa. arXiv preprint arXiv:2305.05410 (2023).

  • Panagoulias, D. P., Virvou, M. & Tsihrintzis, G. A. Evaluating LLM-generated multimodal diagnosis from medical images and symptom analysis. arXiv preprint arXiv:2402.01730 (2024).

  • Liu, Y. et al. A systematic evaluation of GPT-4V’s multimodal capability for chest X-ray image analysis. Meta-Radiol. 2, 100099 (2024).

  • Chen, X. et al. Ffa-gpt: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. npj Digit. Med. 7, 111 (2024).

  • Hill, B. L. et al. Chiron: A generative foundation model for structured sequential medical data. In Deep Generative Models for Health Workshop NeurIPS 2023.

  • Kottlors, J. et al. Feasibility of differential diagnosis based on imaging patterns using a large language model. Radiology 308, e231167 (2023).

  • Nair, V. et al. DERA: Enhancing Large Language Model Completions with Dialog-Enabled Resolving Agents. Clinical Natural Language Processing Workshop (2023).

  • Umerenkov, D., Zubkova, G. & Nesterov, A. Deciphering diagnoses: how large language models explanations influence clinical decision making. arXiv preprint arXiv:2310.01708 (2023).

  • Chen, X. et al. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br. J. Ophthalmol. 108, 1450–1456 (2024).

  • Lyu, Q. et al. Translating radiology reports into plain language using ChatGPT and GPT-4 with prompt learning: results, limitations, and potential. Vis. Comput. Ind. Biomed. Art. 6, 9 (2023).

  • Jo, E. et al. Assessing GPT-4’s performance in delivering medical advice: comparative analysis with human experts. JMIR Med. Educ. 10, e51282 (2024).

  • Guo, S. et al. Comparing ChatGPT’s and Surgeon’s Responses to Thyroid-related Questions From Patients. J. Clin. Endocrinol. Metab. 110, e841–e850 (2025).

  • Kang, S. et al. WoLF: Wide-scope Large Language Model Framework for CXR Understanding. arXiv preprint arXiv:2403.15456 (2024).

  • He, Y. et al. BP4ER: Bootstrap Prompting for Explicit Reasoning in Medical Dialogue Generation. International Conference on Language Resources and Evaluation (2024).
