LLM in HEOR: An evaluation framework

Health economics and outcomes research (HEOR) has already started to use AI tools such as large language models (LLMs) across a diverse set of study types, including systematic literature reviews (SLRs), health economic modelling (HEM), and real-world evidence (RWE).

In SLRs, LLMs can assist with abstract and full-text screening, risk-of-bias assessment, data extraction, and the generation of meta-analysis code, accelerating evidence synthesis. For HEM, foundation models can replicate existing frameworks, generate de novo models, validate assumptions, and adapt outputs for diverse populations or platforms, improving efficiency and scalability. In RWE generation, LLMs facilitate the conversion of unstructured electronic health record (EHR) data into analyzable datasets, enabling researchers to draw insights from multimodal sources such as genomics and imaging.
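To make the SLR use case concrete, below is a minimal sketch of what LLM-assisted title/abstract screening could look like. It is illustrative only: the model name, prompt wording, and inclusion criteria are assumptions rather than the protocol of any cited study, and the cited evaluations make clear that a human reviewer must still verify every decision.

```python
# Minimal sketch of LLM-assisted title/abstract screening for an SLR.
# Assumptions: the `openai` Python package, a hypothetical model choice,
# and invented inclusion criteria. Every vote should still be checked by
# a human reviewer; the cited studies report imperfect sensitivity.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CRITERIA = (
    "Include only randomized controlled trials in adults that report "
    "health-related quality of life as an outcome."
)

def screen_abstract(title: str, abstract: str) -> str:
    """Ask the model for an INCLUDE/EXCLUDE vote with a one-line reason."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumption: any capable chat model
        temperature=0,   # reduce run-to-run variation for reproducibility
        messages=[
            {"role": "system",
             "content": "You are screening studies for a systematic review. "
                        f"Inclusion criteria: {CRITERIA} "
                        "Answer INCLUDE or EXCLUDE, then give a brief reason."},
            {"role": "user", "content": f"Title: {title}\nAbstract: {abstract}"},
        ],
    )
    return response.choices[0].message.content

# Example usage: log each vote alongside the human reviewer's decision.
print(screen_abstract("Example RCT of drug X", "We randomized 200 adults..."))
```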

How can we make sure that AI-based HEOR research is high quality? A paper by Fleurence et al. (2025) aims to answer this question through the development of the ELEVATE-AI LLMs framework. The framework draws on existing evaluation approaches such as the Holistic Evaluation of Language Models (HELM) and the PALISADE checklist for machine learning, as well as emerging AI-specific reporting guidelines such as PRISMA-AI, TRIPOD+AI, and RAISE.

The specific domains evaluated in the ELEVATE-AI LLMs framework are:

| Domain Name | Domain Description |
| --- | --- |
| Model Characteristics | Describes the model’s foundational characteristics, such as name, version, developer, model access, license, release date, architecture, training data, and fine-tuning performed for specific tasks. |
| Accuracy Assessment | Measures how closely the model’s output aligns with the correct or expected answer, evaluating precision, relevance, and correctness. |
| Comprehensiveness Assessment | Assesses how thoroughly the model’s output addresses all aspects of the task, ensuring completeness, coherence, and critical coverage. |
| Factuality Verification | Evaluates whether the model’s output is accurate and based on verifiable sources, identifying hallucinated or non-existent citations. |
| Reproducibility Protocols and Generalizability | Ensures methods and outputs can be independently verified by documenting workflows, sharing code, and specifying hyperparameters; also evaluates the generalizability of the proposed approach. |
| Robustness Checks | Evaluates the model’s resilience to input variations (e.g., typographical errors, ambiguous phrasing) and reports performance changes under test conditions. |
| Fairness and Bias | Evaluates whether the model’s output is equitable and free from harmful biases or stereotypes across diverse groups and contexts. |
| Deployment Context and Efficiency Metrics | Examines technical deployment aspects, including hardware/software setup, processing time, scalability, and resource efficiency metrics. |
| Calibration and Uncertainty | Measures how well the model conveys uncertainty in its outputs through confidence levels and handling of ambiguity; includes metrics such as Expected Calibration Error (ECE). |
| Security and Privacy | Assesses adherence to security standards (encryption, anonymization) and regulations such as GDPR/HIPAA where appropriate; documents intellectual property/copyright protections. |
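The Calibration and Uncertainty domain above mentions Expected Calibration Error (ECE). For readers unfamiliar with the metric, here is a minimal NumPy sketch of the standard binned ECE computation: predictions are grouped into confidence bins, and the gap between each bin's mean confidence and its observed accuracy is averaged, weighted by bin size. This is a generic illustration (the bin count and example numbers are arbitrary assumptions), not code from the ELEVATE-AI paper.

```python
# Generic sketch of Expected Calibration Error (ECE):
# ECE = sum over bins of (n_bin / N) * |accuracy(bin) - confidence(bin)|.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """confidences: predicted probabilities in [0, 1];
    correct: 1 if the prediction was right, else 0."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # in_bin.mean() equals n_bin / N
        # empty bins contribute nothing
    return ece

# Example: a model that is 90% confident but only 60% accurate
# over these five answers has an ECE of 0.3.
print(expected_calibration_error([0.9, 0.9, 0.9, 0.9, 0.9], [1, 1, 1, 0, 0]))
```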


The ELEVATE-AI framework also produces an overall score: each domain receives a 3 if it is clearly reported, a 2 if the reporting is ambiguous, and a 1 if it is not reported. More detail on each of these framework dimensions can be found in the full paper.
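To illustrate the rubric, the short sketch below tallies an overall score across the ten domains. The domain ratings are invented example values, and summing across domains is an assumption on my part; consult the paper for the exact aggregation approach.

```python
# Hedged sketch of the ELEVATE-AI 3/2/1 scoring rubric.
# 3 = clearly reported, 2 = ambiguous, 1 = not reported.
# The ratings below are invented, and the simple sum is an assumption;
# see Fleurence et al. (2025) for the exact aggregation rules.
RUBRIC = {3: "clearly reported", 2: "ambiguous", 1: "not reported"}

ratings = {
    "Model Characteristics": 3,
    "Accuracy Assessment": 3,
    "Comprehensiveness Assessment": 2,
    "Factuality Verification": 3,
    "Reproducibility Protocols and Generalizability": 2,
    "Robustness Checks": 1,
    "Fairness and Bias": 2,
    "Deployment Context and Efficiency Metrics": 1,
    "Calibration and Uncertainty": 1,
    "Security and Privacy": 2,
}

total = sum(ratings.values())
print(f"Overall score: {total} / {3 * len(ratings)}")
for domain, score in ratings.items():
    print(f"  {domain}: {score} ({RUBRIC[score]})")
```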

Sources

SLR Studies:

Khraisha Q, Put S, Kappenberg J, Warraitch A, Hadfield K. Can large language models replace humans in systematic reviews? Evaluating GPT-4’s efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages. Res Synth Methods. Mar 14 2024. doi:10.1002/jrsm.1715
Gartlehner G, Kahwati L, Hilscher R, et al. Data extraction for evidence synthesis using a large language model: A proof-of-concept study. Res Synth Methods. Mar 3 2024. doi:10.1002/jrsm.1710
Guo E, Gupta M, Deng J, Park YJ, Paget M, Naugler C. Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study. J Med Internet Res. Jan 12 2024;26:e48996. doi:10.2196/48996
Hasan B, Saadi S, Rajjoub NS, et al. Integrating large language models in systematic reviews: a framework and case study using ROBINS-I for risk of bias assessment. BMJ Evid Based Med. Feb 21 2024. doi:10.1136/bmjebm-2023-112597
Lai H, Ge L, Sun M, et al. Assessing the Risk of Bias in Randomized Clinical Trials With Large Language Models. JAMA Netw Open. May 1 2024;7(5):e2412687. doi:10.1001/jamanetworkopen.2024.12687
Landschaft A, Antweiler D, Mackay S, et al. Implementation and evaluation of an additional GPT-4-based reviewer in PRISMA-based medical systematic literature reviews. Int J Med Inform. Sep 2024;189:105531. doi:10.1016/j.ijmedinf.2024.105531
Reason T, Benbow E, Langham J, Gimblett A, Klijn SL, Malcolm B. Artificial Intelligence to Automate Network Meta-Analyses: Four Case Studies to Evaluate the Potential Application of Large Language Models. Pharmacoecon Open. Mar 2024;8(2):205-220. doi:10.1007/s41669-024-00476-9
Robinson A, Thorne W, Wu BP, et al. Bio-SIEVE: Exploring instruction tuning large language models for systematic review automation. arXiv preprint arXiv:2308.06610. 2023.
Schopow N, Osterhoff G, Baur D. Applications of the Natural Language Processing Tool ChatGPT in Clinical Practice: Comparative Study and Augmented Systematic Review. JMIR Med Inform. Nov 28 2023;11:e48933. doi:10.2196/48933
Tran VT, Gartlehner G, Yaacoub S, et al. Sensitivity and Specificity of Using GPT-3.5 Turbo Models for Title and Abstract Screening in Systematic Reviews and Meta-analyses. Ann Intern Med. Jun 2024;177(6):791-799. doi:10.7326/M23-3389
Jin Q, Leaman R, Lu Z. Retrieve, Summarize, and Verify: How Will ChatGPT Affect Information Seeking from the Medical Literature? J Am Soc Nephrol. Aug 1 2023;34(8):1302. doi:10.1681/ASN.0000000000000166


Health Economic Modelling

Ayer T, Samur S, Yildirim IF, Bayraktar E, Ermis T, Chhatwal J. Fully Replicating Published Health Economic Models Using Generative AI. Presented at: Annual Meeting of the Society for Medical Decision Making; 2024; Boston, MA.
Chhatwal J, Yildirim IF, Samur S, Bayraktar E, Ermis T, Ayer T. Development of De Novo Health Economic Models Using Generative AI. Presented at: ISPOR Europe 2024; 2024; Barcelona, Spain.
Chhatwal J, Yildirim IF, Balta D, et al. Can Large Language Models Generate Conceptual Health Economic Models? Presented at: ISPOR 2024; 2024; Atlanta, GA. https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3898/139128
Reason T, Rawlinson W, Langham J, Gimblett A, Malcolm B, Klijn S. Artificial Intelligence to Automate Health Economic Modelling: A Case Study to Evaluate the Potential Application of Large Language Models. Pharmacoecon Open. Mar 2024;8(2):191-203. doi:10.1007/s41669-024-00477-8

Real World Data

Cohen AB, Waskom M, Adamson B, Kelly J, G A. Using Large Language Models To Extract PD-L1 Testing Details From Electronic Health Records. Presented at: ISPOR 2024; 2024; Atlanta, GA. https://www.ispor.org/heor-resources/presentations-database/presentation/intl2024-3898/136019
Guo LL, Fries J, Steinberg E, et al. A multi-center study on the adaptability of a shared foundation model for electronic health records. npj Digital Medicine. 2024;7(1):171. doi:10.1038/s41746-024-01166-w
Jiang LY, Liu XC, Nejatian NP, et al. Health system-scale language models are all-purpose prediction engines. Nature. 2023;619(7969):357-362. doi:10.1038/s41586-023-06160-y
Lee K, Liu Z, Chandran U, et al. Detecting Ground Glass Opacity Features in Patients With Lung Cancer: Automated Extraction and Longitudinal Analysis via Deep Learning–Based Natural Language Processing. JMIR AI. 2023;2:e44537. doi:10.2196/44537
Peng C, Yang X, Chen A, et al. A study of generative large language model for medical research and healthcare. npj Digital Medicine. 2023;6(1):210. doi:10.1038/s41746-023-00958-w
Soroush A. Large Language Models Are Poor Medical Coders — Benchmarking of Medical Code Querying. NEJM AI. 2024;1(5).
Xie Q, Chen Q, Chen A, et al. Me-LLaMA: Foundation Large Language Models for Medical Applications. Res Sq. May 22 2024. doi:10.21203/rs.3.rs-4240043/v1
Yang X, Chen A, PourNejatian N, et al. A large language model for electronic health records. NPJ Digit Med. Dec 26 2022;5(1):194. doi:10.1038/s41746-022-00742-2


AI-based frameworks

Liang P, Bommasani R, Lee T, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110. 2022.
Padula WV, Kreif N, Vanness DJ, et al. Machine Learning Methods in Health Economics and Outcomes Research-The PALISADE Checklist: A Good Practices Report of an ISPOR Task Force. Value Health. Jul 2022;25(7):1063-1080. doi:10.1016/j.jval.2022.03.022
Cacciamani GE, Chu TN, Sanford DI, et al. PRISMA AI reporting guidelines for systematic reviews and meta-analyses on AI in healthcare. Nat Med. Jan 2023;29(1):14-15. doi:10.1038/s41591-022-02139-w
Collins GS, Moons KGM, Dhiman P, et al. TRIPOD+AI statement: updated guidance for reporting clinical prediction models that use regression or machine learning methods. BMJ. 2024;385:e078378. doi:10.1136/bmj-2023-078378
Thomas J, Flemyng E, Noel-Storr A, et al. Responsible AI in Evidence SynthEsis (RAISE): guidance and recommendations. Accessed November 26, 2024. https://osf.io/cn7x4