Evaluation of performance measures in predictive artificial intelligence models to support medical decisions: overview and guidance

Van Calster, Ben; Collins, Gary S.; Vickers, Andrew J.; Wynants, Laure; Kerr, Kathleen F.; Barreñada, Lasai; Varoquaux, Gael; Singh, Karandeep; Moons, Karel G. M.; Hernandez-Boussard, Tina; Timmerman, Dirk; McLernon, David J.; van Smeden, Maarten; Steyerberg, Ewout W.; Topic Group 6 of the STRATOS initiative

doi:https://doi.org/10.1016/j.landig.2025.100916

Files

PIIS2589750025000986.pdf (664.82 KB)

Publication date

2025-12-01

Authors

Van Calster, Ben

Collins, Gary S.

Vickers, Andrew J.

Wynants, Laure

Kerr, Kathleen F.

Barreñada, Lasai

Varoquaux, Gael

Singh, Karandeep

Moons, Karel G. M.

Hernandez-Boussard, Tina

DOI

https://doi.org/10.1016/j.landig.2025.100916

Document Type

Article

Collections

UMC Repository

License

cc_by

Abstract

Numerous measures have been proposed to illustrate the performance of predictive artificial intelligence (AI) models. Selecting appropriate performance measures is essential for predictive AI models intended for use in medical practice. Poorly performing models are misleading and may lead to wrong clinical decisions that can be detrimental to patients and increase financial costs. In this Viewpoint, we assess the merits of classic and contemporary performance measures when validating predictive AI models for medical practice, focusing on models that estimate probabilities for a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall performance, classification, and clinical utility) along with corresponding graphical assessments. The first four domains address statistical performance, whereas the fifth domain covers decision–analytical performance. We discuss two key characteristics when selecting a performance measure and explain why these characteristics are important: (1) whether the measure’s expected value is optimised when calculated using the correct probabilities (ie, whether it is a proper measure) and (2) whether the measure solely reflects statistical performance or decision–analytical performance by properly accounting for misclassification costs. 17 measures showed both characteristics, 14 showed one, and one (F1 score) showed neither. All classification measures were improper for clinically relevant decision thresholds other than when the threshold was 0·5 or equal to the true prevalence. We illustrate these measures and characteristics using the ADNEX model, which predicts the probability of malignancy in women with an ovarian tumour.
We recommend the following measures and plots as essential to report: area under the receiver operating characteristic curve, calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot showing probability distributions by outcome category.
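
As an illustration only (not taken from the article), the sketch below shows how two of the recommended measures are commonly computed and what "proper" means for a measure. It assumes the standard definitions: the area under the receiver operating characteristic curve as the probability that a randomly chosen event receives a higher predicted probability than a randomly chosen non-event, and net benefit at decision threshold t as TP/n - (FP/n) * t/(1 - t). The Brier score is used as an example of a proper measure. The function names and the toy data are hypothetical and invented for this example.

```python
def auroc(y, p):
    """Probability that a random event outranks a random non-event (ties count 0.5)."""
    events = [pi for yi, pi in zip(y, p) if yi == 1]
    nonevents = [pi for yi, pi in zip(y, p) if yi == 0]
    wins = sum((e > n) + 0.5 * (e == n) for e in events for n in nonevents)
    return wins / (len(events) * len(nonevents))


def net_benefit(y, p, threshold):
    """Net benefit of acting on everyone with predicted probability >= threshold."""
    n = len(y)
    tp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 1)
    fp = sum(1 for yi, pi in zip(y, p) if pi >= threshold and yi == 0)
    # False positives are weighted by the odds of the threshold.
    return tp / n - fp / n * threshold / (1 - threshold)


def net_benefit_treat_all(y, threshold):
    """Reference strategy for a decision curve: act on every patient."""
    prevalence = sum(y) / len(y)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


def expected_brier(true_prob, reported_prob):
    """Expected squared error when the outcome occurs with probability true_prob."""
    return (true_prob * (1 - reported_prob) ** 2
            + (1 - true_prob) * reported_prob ** 2)


# Invented toy data: 10 patients, 4 events (not from the ADNEX case study).
y = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
p = [0.05, 0.10, 0.20, 0.30, 0.40, 0.60, 0.35, 0.70, 0.80, 0.90]

print("AUROC:", round(auroc(y, p), 3))
print("Net benefit, model, t=0.25:", round(net_benefit(y, p, 0.25), 3))
print("Net benefit, treat all, t=0.25:", round(net_benefit_treat_all(y, 0.25), 3))

# Properness: for a true risk of 0.3, reporting 0.3 gives the lowest expected score.
for reported in (0.1, 0.3, 0.5):
    print("Expected Brier, reported", reported, ":",
          round(expected_brier(0.3, reported), 3))
```

In a decision curve analysis, the model's net benefit would be plotted over a range of clinically relevant thresholds alongside the "treat all" and "treat none" (net benefit 0) strategies; the single threshold used here is only for illustration.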

Keywords

Medicine (miscellaneous), Health Informatics, Decision Sciences (miscellaneous), Health Information Management

Citation

Van Calster, B, Collins, G S, Vickers, A J, Wynants, L, Kerr, K F, Barreñada, L, Varoquaux, G, Singh, K, Moons, K G, Hernandez-Boussard, T, Timmerman, D, McLernon, D J, van Smeden, M, Steyerberg, E W & Topic Group 6 of the STRATOS initiative 2025, 'Evaluation of performance measures in predictive artificial intelligence models to support medical decisions : overview and guidance', The Lancet. Digital health, vol. 7, no. 12, 100916. https://doi.org/10.1016/j.landig.2025.100916

URI

https://dspace.library.uu.nl/handle/1874/469051
