
Reliability assessment of tissue classification algorithms for multi-center and multi-scanner data

Updated: Nov 18, 2020

Gray and white matter volumes are important imaging markers of pathology and disease progression. Such measures are usually estimated from tissue segmentation maps produced by image processing pipelines. However, the reliability of these segmentations on multi-center and multi-scanner data remains understudied. Here, we assess the robustness of six publicly available tissue classification pipelines across images acquired on different scanner models and at different sites.


Mahsa Dadar, Simon Duchesne

For the CCNA Group and the CIM-Q Group


Data included 90 T1-weighted images of a single individual, scanned in 73 sessions across 27 sites (Duchesne et al., 2019). An average T1-weighted image template was created from all scans (Fonov et al., 2009). For each algorithm, its segmentation of this template was used as a silver standard (Figure 1). We assessed Dice Kappa similarity scores between these silver standards and the individual segmentations, as well as the variability in tissue volumes across segmentations, for Atropos (Avants et al., 2011); BISON (Dadar and Collins, 2019); Classify_Clean (Cocosco et al., 2003); FAST5.0 (Zhang et al., 2001); FreeSurfer6.0.0 (Fischl, 2012); and SPM12 (Penny et al., 2011). We also estimated the sample size necessary to detect a significant 1% volume reduction based on the variability of the estimates from each method within and across scanner models (80% power, 2-tailed significance).
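As an illustration of the Dice Kappa similarity used above, the following minimal sketch computes the per-tissue overlap between a silver-standard label map and an individual segmentation. It assumes both maps are already in the same space and share an integer label encoding; the function name `dice_kappa` and the encoding (1 = CSF, 2 = GM, 3 = WM) are hypothetical choices made for this example, not the pipelines' actual conventions.

```python
import numpy as np


def dice_kappa(seg_a, seg_b, label):
    """Dice overlap 2|A ∩ B| / (|A| + |B|) for one tissue label.

    seg_a, seg_b: integer label maps of identical shape.
    label: the tissue label to compare (hypothetical encoding).
    Returns NaN when the label is absent from both maps.
    """
    a = np.asarray(seg_a) == label
    b = np.asarray(seg_b) == label
    if a.shape != b.shape:
        raise ValueError("label maps must have the same shape")
    denom = int(a.sum()) + int(b.sum())
    if denom == 0:
        return float("nan")
    return 2.0 * int(np.logical_and(a, b).sum()) / denom


# Toy 2D "images"; the label values are illustrative only.
silver = np.array([[1, 1, 2],
                   [2, 3, 3]])
individual = np.array([[1, 2, 2],
                       [2, 3, 1]])

csf_dice = dice_kappa(silver, individual, 1)  # 0.5
gm_dice = dice_kappa(silver, individual, 2)   # 0.8
```

In practice the two arrays would be the voxel data of the silver-standard and individual segmentations, loaded with whatever image I/O library the pipeline uses.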


Across tissue types, BISON had the lowest overall variability in Dice Kappa, followed by SPM12 (Figure 2). For GM, BISON had the highest overall Dice Kappa (0.94±0.01), followed by SPM12 (0.93±0.01) and Atropos (0.92±0.03). For WM, Atropos had the highest overall Dice Kappa (0.94±0.02), followed by BISON (0.93±0.01) and SPM12 (0.93±0.01). BISON had the lowest overall variability in its volumetric estimates (GM: 57.60±0.4, WM: 34.49±0.4, CSF: 7.90±0.3), followed by FreeSurfer (GM: 49.80±0.7, WM: 39.52±0.8, CSF: 10.71±1.2) and SPM12 (GM: 57.76±1.3, WM: 33.08±1.1, CSF: 9.12±0.9; Figure 3). BISON also had the smallest sample size requirement across all scanners and tissue types, followed by FreeSurfer and SPM12 (Table 1). As expected, the necessary sample sizes decreased when using data from a single scanner model.
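To show how measurement variability translates into a sample size requirement, the sketch below uses the standard normal-approximation formula for a two-group comparison of means, n = 2((z_{1-α/2} + z_{power})·σ/δ)², with δ set to 1% of the mean volume. This formula is an assumption made for illustration; the exact procedure behind Table 1 is not described here. The function name and the input values in the example are hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_group(mean_volume, sd_volume, reduction=0.01,
                          power=0.80, alpha=0.05):
    """Approximate subjects per group needed to detect a relative
    volume reduction with a 2-tailed test (normal approximation).

    Assumes two independent groups with equal variance; this is an
    illustrative formula, not necessarily the one used for Table 1.
    """
    delta = reduction * mean_volume                # absolute difference to detect
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 2-tailed critical value
    z_power = NormalDist().inv_cdf(power)          # quantile for desired power
    n = 2 * ((z_alpha + z_power) * sd_volume / delta) ** 2
    return math.ceil(n)


# Hypothetical inputs: mean volume 100 (arbitrary units), SD 2.
n_required = sample_size_per_group(100.0, 2.0)  # 63 per group
```

Because the required n scales with (σ/δ)², halving the variability of a method's volume estimates cuts the required sample size roughly fourfold, which is why lower-variability pipelines need fewer subjects.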

Figure 1. Axial slices showing the template created from the original T1w MRIs, and the silver standard segmentations of this template from Atropos, BISON, Classify_Clean, FAST, FreeSurfer, and SPM12.


Our comparisons provide a benchmark on the reliability of currently used tissue classification techniques and the amount of variability that can be expected when using large multi-center and multi-scanner databases.
