Tasks for AI๐Ÿ‘ฉโ€๐Ÿ’ป๐Ÿ‘จโ€๐Ÿ’ป


The PANORAMA: AI Study (grand challenge) aims to evaluate the performance of modern AI algorithms at patient-level diagnosis and lesion-level detection of PDAC in abdominal CECT scans. Similar to radiologists, the AI developed in this study must read CECT exams and produce both an overall patient-level score for PDAC diagnosis and the lesion location, as depicted in the Figure below.

Figure. (top) Patient-level PDAC diagnosis (modeled by 'AI'): For a given patient case, using the CECT exam, compute a single floating point value between 0-1, representing that patient's overall likelihood of PDAC. (bottom) Lesion location (modeled by 'AI'): For a given patient case, using the CECT exam (and optionally all clinical/acquisition variables), predict a 3D detection map of the PDAC lesion (with the same dimensions and resolution as the CECT image). In the predicted lesion, all voxels must share a single floating point value between 0-1, representing that lesion's likelihood of harboring PDAC.

We require detection maps as the model output (rather than softmax predictions), so that we can definitively evaluate object/lesion-level detection performance using precision-recall (PR) and free-response receiver operating characteristic (FROC) curves. Volumes of softmax predictions leave considerable ambiguity in how evaluation should proceed: e.g., what is the overall single likelihood of PDAC in a predicted lesion, what constitutes the spatial boundaries of each predicted lesion, and, in turn, what constitutes an object-level hit (TP) or miss (FN) under any given hit criterion?
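To illustrate, a detection map can be derived from a softmax volume by thresholding, labeling connected components, and assigning each candidate lesion a single likelihood. The sketch below is one possible post-processing scheme, not the challenge's prescribed method: the 0.5 threshold, the peak-softmax aggregation rule, and the function name are all illustrative assumptions.

```python
import numpy as np
from scipy import ndimage


def softmax_to_detection_map(softmax: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert a voxel-wise softmax volume into a detection map in which every
    voxel of a candidate lesion carries one likelihood value (here: the
    lesion's peak softmax). Threshold and aggregation rule are assumptions."""
    detection_map = np.zeros_like(softmax, dtype=np.float32)
    # Binarize, then split into connected components (26-connectivity in 3D)
    labels, num = ndimage.label(softmax >= threshold, structure=np.ones((3, 3, 3)))
    for lesion_id in range(1, num + 1):
        mask = labels == lesion_id
        detection_map[mask] = softmax[mask].max()  # one likelihood per lesion
    return detection_map
```

Any such scheme bakes in exactly the ambiguous choices listed above, which is why the challenge asks participants to resolve them inside their own pipeline and submit the detection map directly.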

Similar to clinical practice, PANORAMA mandates coupling the tasks of lesion detection and patient diagnosis to promote interpretability and to disincentivize AI solutions that produce inconsistent outputs (e.g. a high patient-level PDAC likelihood score without any significant PDAC detections, and vice versa). Organizers will provide end-to-end baseline solutions, adapted from, for example, the nnU-Net (Isensee et al., 2021) model, in a GitHub repo.

Baseline AI models for 3D PDAC detection/diagnosis in CECT:
(link to GitHub repo PANORAMA baseline)


Evaluation ๐Ÿ“Š


Performance Metrics

Patient-level diagnosis performance is evaluated using the Area Under the Receiver Operating Characteristic curve (AUROC) metric. Lesion-level detection performance is evaluated using the Average Precision (AP) metric. The overall score used to rank each AI algorithm is the average of both task-specific rankings.
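As a toy illustration of the two primary metrics, computed here with scikit-learn on made-up labels and scores (in the challenge, AP is computed at the lesion level via the evaluation utilities, not from flat score lists like this):

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Hypothetical patient-level ground truth (1 = PDAC) and AI likelihood scores
y_true = [0, 0, 1, 1, 1, 0]
y_score = [0.1, 0.4, 0.35, 0.8, 0.9, 0.2]

auroc = roc_auc_score(y_true, y_score)          # ranking quality of diagnosis scores
ap = average_precision_score(y_true, y_score)   # area under the PR curve
```

For this toy data, AUROC is 8/9 (8 of 9 positive-negative pairs are correctly ranked) and AP is 11/12.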


Free-Response Receiver Operating Characteristic (FROC) curves are used for secondary analysis of AI detection sensitivity at 0.01, 0.001, and 0.0001 false positives per patient, assessing the potential for opportunistic screening. Intersection over Union (IoU) is also used for secondary (spatial congruence) analysis of positive AI detections, but not for the evaluation of detection or diagnosis performance (given that IoU is ill-suited to accurately validating these tasks (Reinke et al., 2022)).
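For reference, 3D IoU between a binary prediction mask and a ground-truth mask can be computed as follows (a minimal sketch; the function name is our own):

```python
import numpy as np


def iou_3d(pred: np.ndarray, gt: np.ndarray) -> float:
    """Intersection over Union between two binary 3D masks."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty: define IoU as 0
    return float(np.logical_and(pred, gt).sum() / union)
```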


Hit Criterion for Lesion Detection

โ€œhit criterionโ€ is a condition that must be satisfied for each predicted lesion to count as a hit or true positive. Hit criteria have been typically fulfilled by achieving a minimum degree of prediction-ground truth overlap, by localizing predictions within a maximum distance from the ground-truth, or on the basis of localizing predictions to a specific region (as defined by sector maps).

For the 3D detections predicted by AI, we opt for a hit criterion based on object overlap:

  • True Positives: For a predicted PDAC lesion detection to be counted as a true positive, it must share a minimum overlap of 0.15 IoU in 3D with the ground-truth annotation. 
  • False Positives: Predictions with no or insufficient overlap count towards false positives, regardless of their size or location.
  • Edge Cases: When there are multiple predicted lesions with sufficient overlap (≥ 0.15 IoU), only the prediction with the largest overlap is counted, while all other overlapping predictions are discarded. Predictions with sufficient overlap that are discarded in this manner do not count towards false positives, to account for split-merge scenarios.
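The three rules above can be sketched as follows, under simplifying assumptions: a single ground-truth lesion, a binarized prediction map, and face-connectivity for splitting predictions into components. Function and variable names are our own, not from the official evaluation code.

```python
import numpy as np
from scipy import ndimage

IOU_THRESHOLD = 0.15  # minimum 3D overlap per the challenge's hit criterion


def match_lesions(pred_map: np.ndarray, gt_mask: np.ndarray):
    """Classify connected components of a binary prediction map against one
    ground-truth lesion mask. Returns (hit, n_false_positives)."""
    labels, num = ndimage.label(pred_map > 0)
    gt = gt_mask.astype(bool)
    overlapping, false_positives = [], 0
    for lesion_id in range(1, num + 1):
        pred = labels == lesion_id
        union = np.logical_or(pred, gt).sum()
        iou = np.logical_and(pred, gt).sum() / union if union else 0.0
        if iou >= IOU_THRESHOLD:
            overlapping.append(iou)   # candidate hit (sufficient overlap)
        else:
            false_positives += 1      # no/insufficient overlap -> FP
    # Edge case: only the largest-overlap candidate counts as the hit; the
    # remaining overlapping candidates are discarded, but NOT counted as FPs.
    hit = len(overlapping) > 0
    return hit, false_positives
```

A multi-lesion evaluation would repeat this matching per ground-truth lesion; for the official implementation, see the evaluation utilities linked below.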

Performance evaluation utilities for PDAC detection/diagnosis in CECT:
github.com/DIAGNijmegen/picai_eval