Automatic Collateral Scoring from 3D CTA Images

—The collateral score is an important biomarker in 1 decision making for endovascular treatment (EVT) of patients 2 with ischemic stroke. The existing collateral grading systems are 3 based on visual inspection and prone to subjective interpretation 4 and interobserver variation. The purpose of our work is the devel-5 opment of an automatic collateral scoring method. In this work, 6 we present a method that is inspired by human collateral scoring. 7 Firstly, we deﬁne an anatomical region by atlas-based registration 8 and extract vessel structures using a deep convolutional neural 9 network. From this, high-level features based on the ratios of 10 vessel length and volume of the occluded and the contralateral 11 side are deﬁned. Multi-class classiﬁcation models are used to 12 map the feature space to a four-grade collateral score and a 13 quantitative score. The dataset used for training, validation and 14 testing is from a registry of images acquired in clinical routine 15 at multiple medical centers. The model performance is tested on 16 269 subjects, achieving an accuracy of 0.8. The dichotomized 17 collateral score accuracy is 0.9. The error is comparable to 18 the interobserver variation, the results are comparable to the 19 performance of two radiologists with 10 to 30 years of experience. 20

provides blood flow to brain tissue when the principal conduits fail to meet demands (Liebeskind [3]).
The MR CLEAN trial, a multicenter, randomized trial of EVT versus no EVT, showed that baseline computed tomographic angiography (CTA) collateral status modified the treatment effect (Berkhemer et al. [4]): patients with higher collateral score will most likely have better treatment outcome.A clinical decision tool based on multiple baseline clinical and imaging characteristics for individualized predictions of the effect of EVT has been developed and includes grade of collateral circulation as prognostic and predictive marker (Venema et al. [2]).Several collateral status grading systems exist, all based on visual scoring using coarse classification criteria (Seker et al. [5]).Such visual scoring systems suffer from subjective interpretations leading to inter-and also intra-observer variation.An automated scoring system could facilitate an objective and reproducible assessment of the cerebral collateral status.In our work, we use the four-grade score that was proposed by Tan et al. [6] as it has proven correlation with outcome and effect of EVT.The definition of this 4-grade score system is: • 0: absent collaterals (0% filling in occluded territory) • 1: poor collaterals (>0% and ≤50% filling in occluded territory) • 2: moderate collaterals (>50% and <100% in occluded territory) • 3: good collaterals (100% filling in occluded territory) Fig. 1 shows example images for four different collateral scores in Maximum Intensity Projection (MIP).

B. Related Work
Collateral status scoring relies on the difference between arterial trees in the middle cerebral artery (MCA) territory of the occluded side and its contralateral side.Therefore, vessel segmentation is an essential step in our application.Kirbas et al. [7] and Lesage et al. [8] provide a review of many vessel segmentation approaches that have been developed over the past decades.For cerebral blood vessel segmentation, Meijs et al. [9] summarized non-convolutional neural network based algorithms with respect to methods, image modality, and cerebral vessel segments.For example, Manniesing et al. [10] utilized a level set approach to detect the circle of Willis (CoW) in 3D CTA images, Schaap et al. [11] utilized a Bayesian tracking approach to segment the internal carotid artery (ICA) in 3D CTA images, Robben et al. [12] utilized graph connectivity in combination with a tracking approach to obtain the label and vessel structure of the CoW in Magnetic Resonance Angiography (MRA) images, and Meijs et al. [9]  To the best of our knowledge, only Boers et al. [19] published an automatic collateral scoring method.The region of interest (ROI) is defined by a probability density map that was generated from an atlas build from lesion segmentation from follow-up CT images.The 3D Frangi filter (Frangi et al.

II. METHOD 125
The proposed method starts with a pre-processing step to 126 define the anatomical regions of interest: we use an atlas-127 based approach which was developed by Peter et al. [21] 128 to obtain a 3D CTA brain image I b , an MCA probability 129 density map M and a hemisphere map H.This method takes 130 the CTA image and an atlas image as input.This step is 131 followed by a deep learning based segmentation of the brain 132 vasculature (centerlines).In the final step, the output of the 133 previous stages is transformed into a quantitative score, and 134 a collateral class score.The algorithm overview is shown in 135 Fig. 2. Each of the steps is detailed below.The purpose of the first step is to define the relevant 138 anatomical regions, i.e. the brain, the MCA region and both 139 hemispheres (both left and right side of the brain).For the 140 brain region, we use a CT atlas that was constructed from 141 averaging high-resolution 3D CTA images of 30 healthy 142 subjects (Friston [22]) with a corresponding binary brain mask.143 This CT atlas was part of the symmetric CT-MR template 144 type-1 block in Fig. 3, including two cascaded 3x3x3 3D 199 convolutional layers and the identity short connection (dashed 200 lines in Fig. 3).This combination is similar to the context module described in Isensee et al. [26], however, in our cases, we didn't use dropout layer in between.Two cascaded 3x3x3 3D convolutional layers with a stride of 2 (type-2 block in Fig. 3) are added in front of residual blocks in order to obtain more abstract feature maps as the encoder goes deeper.Leaky rectified linear units (ReLu) (Maas et al. [27]) are used as the activation function.In the concatenated decoder path, deconvolutional layers are constructed with an extra deep supervision (Kayalibay et al. [28]) path.This can avoid information loss and vanishing gradients in each convolutional layer.The proposed model uses a sigmoid activation layer in the final step to output a 3D voxel-wise vessel probability map.
2) Data Preparation: For each brain, we construct ground truth vessel trees.The vessel tree is a 3D binary mask resulting from the vessel centerline annotation process described in Section III-C.Then we split the whole brain and vessel tree into 3D cubes of 128x128x128 voxels for both training data and validation data.During the training process, we extract 64voxel 3D cubes out of 128x128x128 voxels.In the validation process during training, we use 128x128x128 voxels as input data size for convenience.Data augmentation is applied to obtain different training images in each iteration.In the data preparation stage, we did not resample the image into common space as the training data is representative for the whole dataset and exhibits little variation in slice spacing and pixel size.
3) Deep Learning Post Processing: The proposed CNN model outputs a vessel probability map with values between 0 and 1.We threshold the probability map to obtain a binary vessel map B 0 (Fig. 4b).The threshold value is found by optimizing a Dice cost function on the deep learning validation dataset.There are some small isolated parts (mostly false positive parts) in the predicted vessel map B 0 and connectivity based noise removal (Risser et al. [29]) is applied to remove these small isolated parts.This results in a binary vessel tree map denoted by B (Fig. 4c).

C. Quantification
The purpose of the quantification step is to compute a collateral score from the results of the previous processing steps.Human collateral scoring is based on comparing the amount of vessels visible in the affected and non-affected side, and we follow a similar strategy: we compare the affected and non-affected hemisphere of subjects using a combination of the binary vessel structure from deep learning model output B, the corresponding MCA probability density map M and hemisphere map H as shown in Fig. 2. We assume that it is known a priori (from clinical symptoms) which hemisphere (left or right) is affected.Based on this information and the hemisphere map, we generate an affected side binary map H A and a non-affected side binary map H N .With this information, we compute four different ratios, each representing a different aspect of the vasculature, as detailed below.
1) Volume: An obvious quantification is difference in the number of vessels between the affected and the non-affected hemisphere.Assuming there are more vessels visible in the  with its intensity r v and r vi respectively: where B is the set of voxels that are 1 in isolated part removed 264 binary vessel tree B , p is a voxel and I is the image intensity.Subsequently those vessel segments were fitted by 3D spline curves and further smoothed (Garcia [31], Garcia [32]) and interpolated (Fig. 5 right), yielding a set of world coordinates S. For all points p from S we obtain the corresponding MCA probability value M (p) from the 3D MCA probability density map.In same way, we obtain intensity values H A (p), H N (p), and I(p).The ratios r l and r li are accumulated values over all points (weighted with the mean distance w between the point and its neighbours) in the affected hemisphere and the ones in the non-affected hemisphere: 3) Multi-label Classification: In the last step, multi-class classification is used to predict collateral score (0,1,2,3) from an input feature vector r = [r v , r vi , r l , r li ].We start from the baseline model, in which we take the median of feature vector r as input and define the threshold value by utilizing the clinical definition of collateral score.We then define our second model by using a support vector classifier (SVC) with linear kernel to find the optimal threshold value for the median of feature vector r.For the third method, we use the complete feature vector r = [r v , r vi , r l , r li ] as input to a random forest classifier.Finally, we use ordinal regression with the complete feature vector for ordered categorical prediction.

A. Data Overview
The images used for training and assessing the methods were obtained from the MR CLEAN Registry (Jansen et al. [33]) and MR CLEAN trial (Berkhemer et al. [34]).The MR CLEAN Registry is an on-going registry that contains patients who underwent endovascular treatment at a stroke intervention center in the Netherlands since March, 2014.The CTA images are acquired in clinical routine at several different sites, and thus there is large variation in image quality, as well as in imaging equipment and acquisition protocols (contrast phase, brain coverage).
Hence, in order to get a representative set of images, the following selection criteria have been used to select images in this study: • Spatial resolution: The average diameter of the M1 segment of the MCA region is 3.1 ± 0.4 mm and of the M2 segment 2.4 ± 0.4 mm according to Rai et al. [35].
The spatial resolution should be sufficient to visualize the major arterial tree.Therefore, we only select images of which the slice thickness is smaller than 1.5 mm; additionally we require the slice spacing to be smaller than or equal to the slice thickness.From this set of 585 subjects, 49 subjects were manually selected for annotation and training of the CNN.From the remaining 536 subjects, we randomly selected around half of these subjects (270 images) for our study.This number was assumed to be sufficient for our evaluation, and also reduced the amount of work for obtaining a consensus score compared to using the full set.Those 270 subjects were originally from 14 intervention centers with different vendors (mainly from Philips, Siemens, Toshiba and GE).For all cases, we obtained 353 the occlusion side and occlusion position and initial collateral 354 score from the registry information.From those 270 subjects, 355 one subject was additionally excluded which was considered 356 to have insufficient image quality by the expert readers.We 357 did not have additional selection criteria on occlusion location.358 However, only 6 subjects with A1 or A2 occlusions were found 359 among the 1574 MR CLEAN registry subjects, and our final 360 set of images did not contain any A1 or A2 occlusions.

361
In addition, for training we added five collateral score 362 0 cases from the MR CLEAN Trial.Collateral score 0 is 363 rare, and due to the random selection, only two images with 364 collateral score 0 were in our initial selection.The additional 365 five subjects were only used in the classifier training process.366 The accuracy of the algorithm was thus evaluated on the 367 randomly selected 269 cases.

369
To get a consistent and reliable collateral score reference 370 standard for this study, collateral scoring was performed by 371 three radiologists from different medical centers with 10 to 30 372 years of experience.Two radiologists rated the collateral score 373 independently and the third radiologist independently rated 374 the cases in which there was disagreement by the first two 375 radiologists.The radiologists were asked to rate the collateral 376 status according to the criteria of Tan et al. [6].

377
In this 269 subjects, the two independent raters had an 378 interobserver agreement of 0.64, and their scores compared 379 to the consensus score were 0.81 and 0.82.The details of the 380 269 test subjects are listed in the Table I.Fig. 6 shows the 381 location of occlusion in the vessel segments that are listed in 382 Table I.Registration of the CT atlas to the CTA image was performed with ANTs (Avants et al. [38], Klein et al. [39]), following a previously described CT atlas based registration [21] that consists of a two-step approach: an initial rigid registration followed by a diffeomorphic non-rigid registration.
In the deep learning model training process, the network was trained with a batch size of 8 and 50 steps per epoch.In each epoch, we iterate twice over all 200 training images of 128x128x128 voxels: once with random shift and flipping along each axis and once with elastic deformation.We also introduced additive Gaussian noise at the input layer for regularization.The standard Dice score served as the loss function.We chose the root mean square propagation (RM-SProp) (Tieleman and Hinton [40]) optimizer with an initial learning rate of 0.1, and halved it every 10 epochs.We stopped training after 300 epochs, as the learning rate was approaching zero.
In the multi-label classification part, the baseline model has fixed threshold values.The three threshold values(θ 1 , θ 2 , θ 3 ) were determined by the clinical definition of collateral score (Tan et al. [6]).We introduced a small margin to the collateral 0 case, since, in practice, there were always some vessels in the occluded side.Therefore, we define the collateral 0 case as less than 10%.The other three threshold values follow the clinical definition, i.e. θ 1 = 10%; θ 2 = 50%; θ 3 = 100%.For the random forest classifier, we used five fold cross validation to optimize the maximal depth of the trees, the number of trees in forest and the number of features used.

B. Vessel Extraction Model Training and Hyperparameter Optimization
In the first experiment, we trained the convolutional neural network for vessel extraction.In this experiment, we first investigated the performance of 3D U-Net model with different hyperparameters and configurations.Compared to a standard 3D U-Net (Isensee et al. [25]) (our baseline model), we first assessed the model performance enhanced by deep supervision and context modules, and subsequently also assessed the added value of varying the number of filters at convolutional layers.
The annotation dataset was randomly divided into a training and validation dataset.The training dataset consisted of 7 whole annotation brains and 20 cubes.The validation dataset consisted of the other 2 brains and 20 cubes.In order to guarantee the continuity and completeness of the centerline structure in the training process, we dilated every singlepixel centerline from manual annotation with a 3x3x3 square structuring element.The resulting ground truth image are shown in Figure .7. The Dice score on the validation dataset was used to measure model performance.In the first experiment, we assessed the added value of the various configurations.For this experiment, 28 input filters were used, as this was the maximal

C. Collateral Score Quantification
Next, we evaluated the proposed collateral scoring method.First, we assessed the accuracy of the three proposed models.Then, we assessed the accuracy of proposed methods applied to a dichotomized decision based on collateral score.
In the accuracy test for collateral scoring we evaluated the baseline model, a linear support vector classifier (SVC) with single feature, ordinal regression with four features, and a random forest model as describe in Section II-C3.We used the consensus score as ground truth label.
In the baseline model, we applied threshold values θ 1 , θ 2 , θ 3 to the median of r = [r v , r vi , r l , r li ] and derived the collateral scores 0, 1, 2, and 3. Similarly, we use the median of r = [r v , r vi , r l , r li ] with linear SVC model to find another set of threshold values that maximize the collateral score test accuracy on 269 subjects.The average threshold values for the SVC model were: θ 1 = 7%; θ 2 = 55%; θ 3 = 99%.We use ordinal regression with r = [r v , r vi , r l , r li ].The averaged odds ratios of four features are [1.17,0.9, 0.92, 1.15].We further evaluated the added value of random forest in terms of accuracy.We explored the feature vector r starting from a single feature towards combined features and tested all 16 possible combinations.We performed a nested cross-validation with 20 splits at the outer level and 5-fold cross-validation for parameter tuning.More specifically, we first randomly splitted the 274 subjects into 20 subsets with a stratified sampler.Then we trained a model with data from 19 subsets and tested on 1 left-over subset.During training, we use 5-fold cross validation for hyper-parameter optimization of feature vector r.In the end, we have 20 models with similar performance but with different parameter settings.In this way, we could fully utilize 269 subjects for testing and reduce the possible bias caused by a smaller test dataset.
Feature vector [r v , r li ] shows the highest prediction accuracy on test subjects.The results of baseline model, linear SVC, ordinal regression and the random forest with optimal feature vector are shown in Table III.The proposed baseline model (median with fixed threshold value) and ordinal regression model have the same performance (accuracy=0.75),marginally different from SVC (accuracy = 0.76).In terms of accuracy, the random forest classifier outperforms the baseline model by 5%.
In the study of Tan et al. [6], a dichotomized collateral score that was proposed by Schramm et al. [41] was used to assess the correlation with infarct volume before recanalization.

575
The computation time for one brain with 0.5 mm slice 576 thickness is about 15 min.Of this, the atlas to CTA space registration, MCA map, hemisphere map transformation with binary skull dilation takes 12 min, of which most time is spent on the Ants registration.It is likely that this can be optimized for clinical applicability.The vessel extraction is around 0.5 min.The quantification part in total is around 2.5 min.Those times were obtained using an implementation that was not optimized for computation time.

V. DISCUSSION
In this work, we proposed an automatic collateral scoring method that is inspired by human visual collateral scoring.In the method, we compute the collateral score by comparing the difference of vessel structures in the occluded hemisphere and contralateral hemisphere.Vessel structure extraction is important for collateral status quantification.Therefore, we first investigated the performance of CNN and found a positive effect of context module and deep supervision on the performance of the 3D U-net vessel centerline segmentation model.We further evaluated network performance with an increased number of filters at convolutional layers.On average, we achieve a Dice score of 0.56 on the validation dataset.Whereas this might seem low, a value of 0.56 is reasonable for thin linear structures in 3D, as they contain a large proportion of boundary voxels.For such long thin structures, a single pixel shift in a direction orthogonal to the structure have a major impact on the Dice score.
In the collateral scoring assessment, the baseline model with using the median of feature vector r = [r v , r vi , r l , r li ] and fixed threshold values achieved an accuracy of 0.75 and a dichotomized accuracy of 0.87.This demonstrates that the features are relevant.We observed that the misclassified subjects are mostly at the border of decision boundary.The average error distance is 0.25.The error is computed from the floating point score, and represents the distance to the closest value of interval of the correct collateral score.For example, [0,1] corresponding to collateral score 0. On this scale, the error made in classification is the distance to the closest border of the correct class.The random forest model on average achieves an accuracy of 0.8, which outperforms the baseline model on the border cases.With two principle features r = [r v , r li ], the random forest model achieves a dichotomized accuracy of 0.9, which is comparable to two clinicians (0.91 and 0.9).less accurate for more distal occlusions.This can be explained with 3D patches.Such a training would be challenging to commonly available GPUs.Thirdly, the result of such endto-end training is difficult to interpret, whereas our approach also gives insight in the vessel segmentation and quantification on which the scoring is based.Finaly, we aim for a more quantitative analysis of collaterals with a tool that provides a continuous output instead of a semiquantitative scale with 4 items.
In comparison with Boers et al. [19], the dataset and the level of ground truth are different: Boers et al. [19] used the data from MR CLEAN Trial (Berkhemer et al. [34]) and we used dataset from MR CLEAN Registry (Jansen et al. [33]).Their collateral score was derived by direct use of feature r vi , and there was no direct assessment over collateral score; instead they perform a correlation test between their single feature and the manually obtained collateral score with a Spearman correlation test.The result on 59 subjects was a Spearmen ρ = 0.68, p < 0.001.For our method, a Spearmen correlation ρ = 0.80 was obtained on a test set of 269 subjects.
During this study, a collateral scoring product (e-Stroke Suite) became available from Brainomix.This software was evaluated by Grunwald et al. [42] recently.Ninety-eight subjects were used in their work.Their selection criteria were more restrictive than ours (1 mm slice thickness), and no information is provided on the occlusion location or number of excluded scans.Also, their consensus score may be biased towards the software performance, as the consensus score was determined after knowing the software score, which makes it hard to directly compare their result with ours.Reported accuracy of software compared to their reference standard is 90%, and they also demonstrate, similar to our work, that the errors of the software are within the interobserver variation.
All patients from the MR CLEAN registry that were included in the study had an acute large vessel occlusion, which was assessed on CTA.In all patients a large vessel occlusion was detected in one hemisphere only.Although we cannot rule out that small peripheral emboli were present in the contralateral hemisphere (which were not visible on CTA) in patients with a cardioembolic etiology, in general, the symptoms caused by a large vessel occlusion are more prominent than the symptoms caused by potential small peripheral emboli.Based on that we assume that the symptoms

[ 20 ]
) with visually tuned parameters and a threshold of 200 Hounsfield units (HU) was applied to extract vessel structure in a pre-defined region of interest.The computed feature is the ratio of vessel volume with intensity between the occluded side and the contralateral side.The method was assessed on 59 subjects from which their follow-up non-contrast CT scan was used to construct the probability density map.The method was assessed on CTA images of patients with an occlusion in the M1 segment (for the detailed vessel segments in MCA territory please refer to Fig. 6) of the MCA territory, a maximal slice thickness of 1 mm and full coverage of the intracranial region.C. Contributions and Organization of Our Work In this work, we propose a three-stage algorithm to compute a collateral score.The collateral scoring method was assessed on 269 subjects.Data preprocessing and vessel centerline segmentation are explained in Section II, followed by feature design and a multi-label classification model for collateral scoring.The data set, collateral score reference standard and annotation strategy 121 are described in Section III, the experiments and results are 122 detailed in Section IV, followed by discussion in Section V, 123 conclusions are drawn in Section VI. 124 136

Fig. 4 :
Fig. 4: The example image of deep learning post processing; a: shows the cube (100-pixel size) location in coronal view; b: shows an example binary vessel tree with some isolated parts B0 in 100-pixel size cube; c: shows an example image of binary vessel tree B.

265 2 )Fig. 5 :
Fig. 5: An example of vessel length computation.Left: the vessel skeleton; Right: the smoothed and interpolated vessel segments, the vessel length computation are based on those segments.Different segments are shown in different colors.

•
Contrast phase: Five different contrast phases have been previously defined by Rodriguez-Luna et al. [36]: early arterial, peak arterial, equilibrium, peak venous, and late venous, based on the image intensities in the contralateral ICA and the transverse sinus.In the early arterial phase, collateral vessels are likely not enhanced yet.In the late venous phase, the venous structures are more pronounced than arterial structures.Therefore, in this study we only select images with peak arterial, equilibrium or the early venous phase.• Image quality: Image quality in MR CLEAN Registry is rated by the core lab into the following categories: good image quality, moderate image quality and bad image quality.A good quality image implies that the image is sufficiently informative for radiologist to rate.We similarly included good quality images in our analyses.• Brain coverage: Brain coverage, the image should cover at least half of the vertical distance between skull base and vertex.At the start of the current study, baseline CTA data of 1594 subjects had been collected in the MR CLEAN Registry.These data was acquired from 16 March 2014 till 15 June 2016.Of these 1594 subjects, the images of 1058 subjects had good image quality (based on MR CLEAN Registry core lab readings).736 subjects fulfill both image quality and contrast phase criteria.At the end, 585 subjects fulfill all selection criteria.

Fig. 6 :
Fig. 6: An example of vessel segments in right hemisphere.The vessel segments include the ICA segment, M1, M2 and the more distal part of the MCA territory, for simplicity, we have combined M2 and the more distal part into M2.

395 3 experiencedFig. 7 :
Fig. 7: The image in the left is the result of whole brain annotation; The image in the right is the ground truth image.Different vessel segments are shown in different colors

Furthermore, inFig. 10 :
Fig. 10: The ROC curve of baseline model, linear SVC model and random forest model

TABLE III :
The accuracy, dichotomized accuracy of three proposed methods.

TABLE IV :
Subgroup analysis.In TableIV, a subgroup analysis is presented for the ac-620 quisition phase, collateral score, slice thickness and occlusion 621 position.The accuracy does not seem to depend on acquisition 622 phase or collateral score.It also shows that collateral scoring is