Tasks for AI๐Ÿ‘ฉโ€๐Ÿ’ป๐Ÿ‘จโ€๐Ÿ’ป


The PANORAMA: AI Study (grand challenge) aims to evaluate the performance of modern AI algorithms at patient-level diagnosis and lesion-level detection of PDAC in abdominal CECT scans. Similar to radiologists, the AI developed in this study must read CECT exams and produce both an overall patient-level score for PDAC diagnosis and the lesion location, as depicted in the Figure below.

Figure. (top) Patient-level PDAC diagnosis (modeled by 'AI'): For a given patient case, using the CECT exam, compute a single floating point value between 0-1, representing that patient's overall likelihood of PDAC. (bottom) Lesion location (modeled by 'AI'): For a given patient case, using the CECT exam (and optionally all clinical/acquisition variables), predict a 3D detection map of the PDAC lesion (with the same dimensions and resolution as the CECT image). In the predicted lesion, all voxels must share a single floating point value between 0-1, representing that lesion's likelihood of harboring PDAC.

We require detection maps as the model output (rather than softmax predictions), so that we can definitively evaluate object/lesion-level detection performance using precision-recall (PR) and free-response receiver operating characteristic (FROC) curves. Volumes of softmax predictions leave considerable ambiguity in how evaluation should proceed: e.g., what is the overall single likelihood of PDAC in a predicted lesion, what constitutes the spatial boundaries of each predicted lesion, and, in turn, what constitutes an object-level hit (TP) or miss (FN) under any given hit criterion?
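To illustrate, a detection map can be derived from a softmax volume by thresholding, labeling connected components, and assigning each candidate lesion a single likelihood. The sketch below is one possible post-processing scheme, not the challenge's prescribed method: the 0.5 threshold, the peak-softmax aggregation rule, and the function name are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage


def softmax_to_detection_map(softmax: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a voxel-wise softmax volume into a detection map in which every
    voxel of a candidate lesion carries one likelihood value (here: the
    lesion's peak softmax). Threshold and aggregation rule are assumptions."""
    detection_map = np.zeros_like(softmax, dtype=np.float32)
    # Binarize, then split into connected components (26-connectivity in 3D)
    labels, num = ndimage.label(softmax >= threshold, structure=np.ones((3, 3, 3)))
    for lesion_id in range(1, num + 1):
        mask = labels == lesion_id
        detection_map[mask] = softmax[mask].max()  # one likelihood per lesion
    return detection_map
```

Any such scheme bakes in exactly the ambiguous choices listed above, which is why the challenge asks participants to resolve them inside their own pipeline and submit the detection map directly.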

Similar to clinical practice, PANORAMA mandates coupling the tasks of lesion detection and patient diagnosis to promote interpretability and to disincentivize AI solutions that produce inconsistent outputs (e.g. a high patient-level PDAC likelihood score without any significant PDAC detections, and vice versa). Organizers will provide end-to-end baseline solutions, adapted from, for example, the nnU-Net (Isensee et al., 2021) model, in a GitHub repo.

Baseline AI models for 3D PDAC detection/diagnosis in CECT:
(link to GitHub repo PANORAMA baseline)


Evaluation ๐Ÿ“Š


Performance Metrics

Patient-level diagnosis performance is evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC) metric. Lesion-level detection performance is evaluated using the Average Precision (AP) metric. The overall score used to rank each AI algorithm is the average of both task-specific rankings.
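As a toy illustration of the two primary metrics, computed here with scikit-learn on made-up labels and scores (in the challenge, AP is computed at the lesion level via the evaluation utilities, not from flat score lists like this):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical patient-level ground truth (1 = PDAC) and AI likelihood scores
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

auroc = roc_auc_score(y_true, y_score)          # ranking quality of diagnosis scores
ap = average_precision_score(y_true, y_score)   # area under the PR curve
```

For this toy data, AUROC is 8/9 (8 of 9 positive-negative pairs are correctly ranked) and AP is 11/12.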


Free-Response Receiver Operating Characteristic (FROC) curves are used for secondary analysis of AI detection sensitivity at 0.01, 0.001, and 0.0001 false positives per patient, assessing the potential for opportunistic screening. Intersection over Union (IoU) is also used for secondary (spatial congruence) analysis of positive AI detections, but not for the evaluation of detection or diagnosis performance (given that IoU is ill-suited to accurately validating these tasks (Reinke et al., 2022)).
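For reference, 3D IoU between a binary prediction mask and a ground-truth mask can be computed as follows (a minimal sketch; the function name is our own):

```python
import numpy as np


def iou_3d(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return float(np.logical_and(pred, gt).sum() / union)
```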


Hit Criterion for Lesion Detection

โ€œhit criterionโ€ is a condition that must be satisfied for each predicted lesion to count as a hit or true positive. Hit criteria have been typically fulfilled by achieving a minimum degree of prediction-ground truth overlap, by localizing predictions within a maximum distance from the ground-truth, or on the basis of localizing predictions to a specific region (as defined by sector maps).

For the 3D detections predicted by AI, we opt for a hit criterion based on object overlap:

  • True Positives: For a predicted PDAC lesion detection to be counted as a true positive, it must share a minimum overlap of 0.15 IoU in 3D with the ground-truth annotation. 
  • False Positives: Predictions with no or insufficient overlap count towards false positives, regardless of their size or location.
  • Edge Cases: When there are multiple predicted lesions with sufficient overlap (≥ 0.15 IoU), only the prediction with the largest overlap is counted, while all other overlapping predictions are discarded. Predictions with sufficient overlap that are discarded in this manner do not count towards false positives, to account for split-merge scenarios.
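The three rules above can be sketched as follows, under simplifying assumptions: a single ground-truth lesion, a binarized prediction map, and face-connectivity for splitting predictions into components. Function and variable names are our own, not from the official evaluation code.

```python
import numpy as np
from scipy import ndimage

IOU_THRESHOLD = 0.15  # minimum 3D overlap per the challenge's hit criterion


def match_lesions(pred_map: np.ndarray, gt_mask: np.ndarray):
    """Classify connected components of a binary prediction map against one
    ground-truth lesion mask. Returns (hit, n_false_positives)."""
    labels, num = ndimage.label(pred_map > 0)
    gt = gt_mask.astype(bool)
    overlapping, false_positives = [], 0
    for lesion_id in range(1, num + 1):
        pred = labels == lesion_id
        union = np.logical_or(pred, gt).sum()
        iou = np.logical_and(pred, gt).sum() / union if union else 0.0
        if iou >= IOU_THRESHOLD:
            overlapping.append(iou)   # candidate hit (sufficient overlap)
        else:
            false_positives += 1      # no/insufficient overlap -> FP
    # Edge case: only the largest-overlap candidate counts as the hit; the
    # remaining overlapping candidates are discarded, but NOT counted as FPs.
    hit = len(overlapping) > 0
    return hit, false_positives
```

A multi-lesion evaluation would repeat this matching per ground-truth lesion; for the official implementation, see the evaluation utilities linked below.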

Performance evaluation utilities for PDAC detection/diagnosis in CECT:
github.com/DIAGNijmegen/picai_eval