Membrane connectivity estimated by digital image analysis of HER2 immunohistochemistry is concordant with visual scoring and fluorescence in situ hybridization results: algorithm evaluation on breast cancer tissue microarrays

Introduction: The human epidermal growth factor receptor 2 (HER2) is an established biomarker for management of patients with breast cancer. While conventional testing of HER2 protein expression is based on semi-quantitative visual scoring of the immunohistochemistry (IHC) result, efforts to reduce inter-observer variation and to produce continuous estimates of the IHC data are potentiated by digital image analysis technologies.
Methods: HER2 IHC was performed on the tissue microarrays (TMAs) of 195 patients with an early ductal carcinoma of the breast. Digital images of the IHC slides were obtained by Aperio ScanScope GL Slide Scanner. Membrane connectivity algorithm (HER2-CONNECT™, Visiopharm) was used for digital image analysis (DA). A pathologist evaluated the images on the screen twice (visual evaluations: VE1 and VE2). HER2 fluorescence in situ hybridization (FISH) was performed on the corresponding sections of the TMAs. The agreement between the IHC HER2 scores, obtained by VE1, VE2, and DA was tested for individual TMA spots and patient's maximum TMA spot values (VE1max, VE2max, DAmax). The latter were compared with the FISH data. Correlation of the continuous variable of the membrane connectivity estimate with the FISH data was tested.
Results: The pathologist intra-observer agreement (VE1 and VE2) on HER2 IHC score was almost perfect: kappa 0.91 (by spot) and 0.88 (by patient). The agreement between visual evaluation and digital image analysis was almost perfect at the spot level (kappa 0.86 and 0.87, with VE1 and VE2 respectively) and at the patient level (kappa 0.80 and 0.86, with VE1max and VE2max, respectively). The DA was more accurate than VE in detection of FISH-positive patients by recruiting 3 or 2 additional FISH-positive patients to the IHC score 2+ category from the IHC 0/1+ category by VE1max or VE2max, respectively. The DA continuous variable of the membrane connectivity correlated with the FISH data (HER2 and CEP17 copy numbers, and HER2/CEP17 ratio).
Conclusion: HER2 IHC digital image analysis based on membrane connectivity estimate was in almost perfect agreement with the visual evaluation of the pathologist and more accurate in detection of HER2 FISH-positive patients. Most immediate benefit of integrating the DA algorithm into the routine pathology HER2 testing may be obtained by alerting/reassuring pathologists of potentially misinterpreted IHC 0/1+ versus 2+ cases.


Introduction
Recent progress of virtual microscopy and digital image analysis technologies opens new perspectives for the development of more reliable tools of tissue-based biomarker measurement [1][2][3][4]. This would enable high-throughput research, quality assurance, and decision-support measures to control for observer variability. Not surprisingly, the dawn of digital pathology is marked by the efforts to optimise image analysis algorithms for HER2 expression in breast cancer tissue [4][5][6][7]. They all aim at ensuring accurate and reproducible measurement of HER2 expression, which correlates with pathologist's evaluation, amplification of the gene and clinical outcomes. In the absence of a true "gold standard", the objectivity of image analysis tools can also be tested by inter-algorithm variation studies [8]. Some studies have compared outputs of various tools for HER2 IHC analysis [9,10]. Computer-aided digital microscopy has been shown to reduce observer variability in HER2 IHC evaluation [11].
We designed our study to test the performance of HER2 IHC scoring based on a novel membrane connectivity estimate in tissue microarrays (TMAs) of breast cancer tissue. The digital analysis (DA) results were compared with the data of visual evaluation (VE) of HER2 by IHC and HER2 FISH test results on the same TMAs.

Tumour Specimens
Tumour samples were obtained from a prospectively collected series of 195 patients with an early invasive ductal carcinoma of the breast treated at the Oncology Institute of Vilnius University and investigated at the National Center of Pathology during the period of 2007 to 2009. The median age of the patients was 57 years (range 27-87 years). The patients were diagnosed with stage T1-2 tumours without distant metastases (M0); however, 48% of the patients showed lymph node involvement (N1 or N2). Informed consent was obtained and documented in writing before study entry. The study was approved by the Lithuanian Bioethics Committee.

Tissue Microarrays
The TMAs were constructed from 10% buffered formalin-fixed paraffin-embedded tissue blocks, selected by the pathologist (DD). The corresponding hematoxylin and eosin-stained slides were scanned by Aperio ScanScope GL Slide Scanner (Aperio Technologies, Vista, CA, USA) under 20× magnification. The pathologist randomly selected and marked representative areas of the tumour on the whole section images. The images were then converted into Mirax MViewMRXS format and used to guide the production of the TMAs on the tissue arraying instrument (3DHISTECH, TMA Master, Budapest, Hungary).
One millimetre-diameter cores were punched from the selected areas, thus producing 11 TMA blocks containing 737 spots from 195 patients. Paraffin sections of the TMAs were cut for IHC (3 μm-thick) and FISH testing (4 μm-thick).

Immunohistochemistry
The sections were immunostained on the Ventana BenchMark XT staining system (Ventana Medical Systems, Tucson, Arizona, USA). Sections were deparaffinized in xylene, dehydrated through three alcohol changes and transferred to Ventana Wash solution. Epitope retrieval was performed on the slides using Cell Conditioning solution (pH 8.5) at 100°C for 36 min. The sections were then incubated with Ventana PATHWAY anti-HER2/neu (4B5) rabbit monoclonal antibody at 37°C for 16 min using Ventana Ultraview DAB detection kit. Finally, the sections were developed in DAB at 37°C for 8 min, counterstained with Mayer's hematoxylin and mounted. Whole tissue sections of HER2-positive breast tumour tissue were used as positive tissue controls, while negative controls were performed by omitting the application of primary antibody. Digital images were captured using the Aperio ScanScope GL Slide Scanner (Aperio Technologies, Vista, CA, USA) under 20× magnification.

Visual evaluation of HER2 IHC images
Visual evaluation of HER2 IHC score was performed by the pathologist (DD) twice (VE1, VE2) with an interval of 2 months, based on the review of the images of individual spots on the computer monitor (Acer AL2616W). The IHC results were scored according to the United States Food and Drug Administration (FDA) criteria approved for the 4B5 HER2 rabbit monoclonal antibody. Each spot was graded individually with a HER2 score of 0, 1+, 2+ or 3+. For further analysis, scores 0 and 1+ were merged into a negative (0/1+) HER2 category. Based on the common adequacy criteria (tissue integrity, presence and amount of tumour tissue, staining artefacts), the pathologist encoded individual spots as inadequate. Similarly, spots containing ductal carcinoma in situ (DCIS), with or without invasive carcinoma, were excluded from further analysis.

Digital analysis of HER2 IHC images
Digital analysis of the HER2 IHC TMAs was performed on the same images as the visual evaluation. By using the Arrayimager software module from Visiopharm (Hoersholm, Denmark), individual digital images of each spot were automatically extracted from the whole slide images of the 11 TMAs. For each spot, a region-of-interest (ROI) was fully automatically defined by the tissue detecting algorithm of the Visiomorph software module (Visiopharm, Hoersholm, Denmark). To secure against the potential effect on digital analysis of possibly artificial staining of the edge of the tissue spot, the ROI was designed to have a distance to the nearest edge of 100 pixels (approximately 25 μm). Automatic area control ensured exclusion of severely destroyed or missing spots from the study, since a tissue spot was only included if its ROI area exceeded 37,000 μm², corresponding to approximately 5% of the ROI of an intact spot with a diameter of 1 mm. The spots containing inadequate tumour sample or DCIS were excluded from the DA by means of visual evaluation.
As recently described in detail [12], the DA was performed by the HER2-CONNECT™ software module (Visiopharm, Hoersholm, Denmark). Briefly, the algorithm of this software includes: 1) pre-processing for detection of pixels contributing to the characteristic brown linear structures in digital images of tissue sections immunostained for the presence of HER2 by the DAB substrate; 2) bimodal segmentation for distinguishing pixels representing stained membrane from all other pixels of the image; 3) post-processing for skeletonizing the membrane, merging membranes which were not perfectly connected, and eliminating small membrane fragments. The values of variable parameters used in the pre-processing, segmentation and post-processing were all established in a preceding study at NordiQC (Aalborg Hospital, Denmark) using different staining methods, another whole slide scanner, and manual outlining of ROI [12]. The parameters were not specifically optimized for the current study. The size of each membrane fragment is defined as the area of pixels its skeleton is composed of, and the connectivity is calculated from the size distribution of all membrane fragments within the ROI. The connectivity can vary continuously from 0, corresponding to a ROI without a single membrane fragment with an area larger than a pre-defined low cut-off, to 1, corresponding to a ROI for which all membrane fragments have areas larger than a pre-defined high cut-off. The continuous connectivity estimate was then converted into HER2 score: 0/1+ if connectivity ≤ 0.12, 2+ if 0.12 < connectivity ≤ 0.56, 3+ if 0.56 < connectivity ≤ 1 (Figure 1).
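The conversion of the continuous connectivity estimate into a categorical score can be illustrated with a minimal sketch. Only the cut-offs (0.12 and 0.56) are taken from the text; the function name is hypothetical and the code is not part of the HER2-CONNECT™ software.

```python
def her2_score_from_connectivity(connectivity: float) -> str:
    """Map a membrane connectivity value in [0, 1] to a HER2 IHC category.

    Cut-offs follow the text: <= 0.12 -> 0/1+, <= 0.56 -> 2+, otherwise 3+.
    The function name is illustrative, not a vendor API.
    """
    if not 0.0 <= connectivity <= 1.0:
        raise ValueError("connectivity must lie in the interval [0, 1]")
    if connectivity <= 0.12:
        return "0/1+"
    if connectivity <= 0.56:
        return "2+"
    return "3+"
```

Both cut-offs are inclusive on the lower category, so a spot with connectivity of exactly 0.12 is scored 0/1+ and one with exactly 0.56 is scored 2+.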

Fluorescence in situ hybridization
HER2 gene amplification was determined by a dual color FISH using the PathVysion HER2 DNA probe kit and Paraffin pretreatment kit (Abbott-Vysis, Inc., Downers Grove, IL, USA). Briefly, 4 μm sections were placed on positively charged slides and dried overnight at 56°C. The sections were deparaffinized in xylene, dehydrated in alcohol, air dried, then pretreated in 0.2N HCl for 20 min and in a pretreatment solution at 80°C for 30 min followed by protease digestion at 37°C for 26 min. An appropriate amount of hybridization solution containing directly labelled probes, both SpectrumGreen for the chromosome 17 centromere (CEP17) and SpectrumOrange for the HER2 gene locus, was applied, and the probe-target tissue was co-denatured for 5 min at 72°C using Hybridizer (DAKO Diagnostics, Glostrup, Denmark) and allowed to hybridize for 19 h at 37°C. Non-hybridized probe was washed out in a hot (72°C) 2× SSC with 0.3% NP-40 solution for 2 min. Nuclei were counterstained with DAPI and coverslipped (Invitrogen Corporation, Carlsbad, USA). Appropriate amplified and non-amplified in-house controls were processed in the run. Hybridized probes were examined manually by fluorescence Zeiss microscope (Zeiss, Axio Imager.Z2, Göttingen, Germany) equipped with a single green, orange and triple band pass filter Dapi-FITC-Cy3.
Figure 1. Image outputs of the digital analyses. Tissue microarray images scored by digital analysis as 0/1+, 2+ and 3+ (a, b and c, respectively): green lines outline cell membranes revealing positive HER2 immunohistochemical staining by membrane connectivity estimate.
The FISH analyses for HER2 were performed manually without knowledge of the IHC result, according to Food and Drug Administration (FDA) scoring system in which HER2 gene amplification was set at an HER2/CEP17 ratio of more than 2. One evaluation per patient was performed after a review of a patient's spots in the TMAs and selection of a representative area in one of the spots for the FISH count (a total of 20 cells counted per patient).
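The FISH classification rule can be sketched as follows. The sketch assumes that the HER2/CEP17 ratio is computed as the total number of HER2 signals divided by the total number of CEP17 signals over the counted cells; the text states only the cut-off (ratio of more than 2) and the number of cells (20 per patient). The function and field names are hypothetical.

```python
def fish_status(her2_signals, cep17_signals, ratio_cutoff=2.0):
    """Summarise per-cell FISH signal counts for one patient.

    her2_signals, cep17_signals: signal counts for the same cells, in the
    same order. Assumption: ratio = total HER2 / total CEP17 signals.
    Amplification is called when the ratio is strictly above the cut-off.
    """
    if len(her2_signals) != len(cep17_signals) or not her2_signals:
        raise ValueError("need equal, non-empty lists of per-cell counts")
    total_her2 = sum(her2_signals)
    total_cep17 = sum(cep17_signals)
    if total_cep17 == 0:
        raise ValueError("no CEP17 signals counted; ratio is undefined")
    n_cells = len(her2_signals)
    ratio = total_her2 / total_cep17
    return {
        "mean_her2": total_her2 / n_cells,
        "mean_cep17": total_cep17 / n_cells,
        "ratio": ratio,
        "amplified": ratio > ratio_cutoff,
    }
```

Under this rule, a patient with a mean of 4 HER2 and 2 CEP17 signals per cell has a ratio of exactly 2 and is therefore not called amplified, since the criterion requires a ratio of more than 2.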

Statistical analysis
The agreement between VE1, VE2, and DA was tested by spot and by patient. The latter was based on the maximum HER2 score among the 2-4 spots belonging to the same patient (VE1max, VE2max, DAmax). The agreement was analyzed using kappa statistics; the strength of agreement 0.81-1.00 was interpreted as almost perfect [13]. The results are presented as weighted kappa with 95% confidence interval (CI). Pearson's correlation was performed to test the linear relationships between the continuous variable of membrane connectivity estimate and FISH results. Statistical analysis was performed with SAS 9.2 software.
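For readers who wish to reproduce the agreement statistic outside SAS, a weighted kappa for ordered categories can be sketched as below. The text does not state which weighting scheme was used; linear weights are assumed here, and the function name is hypothetical. Confidence intervals are not computed in this sketch.

```python
def weighted_kappa(ratings_a, ratings_b, categories):
    """Linearly weighted kappa for two raters on ordered categories.

    categories: ordered list of labels, e.g. ["0/1+", "2+", "3+"].
    Assumption: linear agreement weights w_ij = 1 - |i - j| / (k - 1).
    """
    k = len(categories)
    n = len(ratings_a)
    if k < 2 or n == 0 or n != len(ratings_b):
        raise ValueError("need >= 2 categories and equal, non-empty ratings")
    index = {label: i for i, label in enumerate(categories)}

    # Joint proportions of the two raters' scores.
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(ratings_a, ratings_b):
        observed[index[a]][index[b]] += 1.0 / n

    row = [sum(observed[i]) for i in range(k)]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    def weight(i, j):
        return 1.0 - abs(i - j) / (k - 1)

    # Weighted observed and chance-expected agreement.
    p_obs = sum(weight(i, j) * observed[i][j]
                for i in range(k) for j in range(k))
    p_exp = sum(weight(i, j) * row[i] * col[j]
                for i in range(k) for j in range(k))
    if p_exp == 1.0:
        raise ValueError("kappa is undefined when expected agreement is 1")
    return (p_obs - p_exp) / (1.0 - p_exp)
```

With the three merged categories used in this study, a disagreement between adjacent categories (for example, 0/1+ versus 2+) receives a weight of 0.5 under this scheme, while a disagreement of two categories receives a weight of 0.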

Sample (spot) adequacy
A total of 737 TMA spots were evaluated visually by the pathologist twice (VE1 and VE2). After exclusion of spots containing inadequate samples or DCIS (n = 9), 575 spots remained for further analysis.

Concordance of visual evaluation and digital analysis (by patient)
To test the concordance of VE and DA score on a patient level, the cases with 2, 3 or 4 adequate spots (by both VE and DA) per patient were selected. Out of the 177 cases with a total of 575 adequate spots, 16, 15, 55, and 91 cases contained 1, 2, 3, and 4 adequate spots, respectively, thus leaving 161 patients with 2, 3 or 4 spots for further analysis.
Each patient's IHC HER2 score was defined as the maximum score (VE1max, VE2max, DAmax) obtained from the 2-4 spots analyzed. Remarkably, variation of the HER2 score between a patient's spots was rather low: all (2, 3 or 4) spots revealed the same score in 156, 151, and 141 patients evaluated by VE1, VE2, and DA, respectively (Table 2). Thus, in a great majority of the 161 patients, the individual spots produced the same result per patient, which would be identically expressed as maximum, mode, or median. The remaining 4, 10, and 19 patients (VE1, VE2, and DA, respectively) revealed a range of 1; remarkably, the great majority of this variation was  (Table 3). The percentage agreement was 89.4% and 92.5%, respectively. In all three analyses, 17 patients remained in the 3+ category. Similarly, 122 (80%) and 121 (94%) of HER2 negative (0/1+) patients by the VE1max and VE2max, respectively, were classified as such by the DA. Again, most of the discrepancies were present in the 2+ category, where 62-73% of the 2+ patients by the VEmax were classified by DAmax as 2+. In general, the DAmax tended to "upgrade" HER2 score in some patients, shifting them from 0/1+ and 2+ categories into 2+ and 3+, respectively. Remarkably, no discrepancies between the VEmax and DAmax in the interval of two categories were detected.
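The patient-level aggregation described above (maximum score among a patient's adequate spots, restricted to patients with at least 2 adequate spots) can be sketched as follows. The data layout and all names are hypothetical; only the rule itself is taken from the text.

```python
from collections import defaultdict

# Ordering of the merged HER2 categories, lowest to highest.
SCORE_ORDER = {"0/1+": 0, "2+": 1, "3+": 2}


def patient_max_scores(spot_scores, min_spots=2):
    """Return {patient_id: maximum HER2 score} from adequate spots.

    spot_scores: iterable of (patient_id, score) pairs, one per adequate
    spot. Patients with fewer than min_spots adequate spots are dropped,
    mirroring the selection of cases with 2, 3 or 4 adequate spots.
    """
    by_patient = defaultdict(list)
    for patient_id, score in spot_scores:
        by_patient[patient_id].append(score)
    return {
        patient_id: max(scores, key=SCORE_ORDER.__getitem__)
        for patient_id, scores in by_patient.items()
        if len(scores) >= min_spots
    }
```

Because the scores are ordered categories rather than numbers, the maximum is taken with respect to an explicit category order instead of comparing the label strings directly.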

Proportion of HER2 FISH-positive cases in the IHC categories scored by the visual evaluation and digital analysis
HER2 FISH test was performed on the sections from the same TMAs containing the 575 spots used for the IHC analysis. FISH results were obtained for 152 patients overall and compared with the HER2 IHC results (Table 4). The raw data of the patients with IHC score 2+ or 3+ (by either visual evaluation or digital analysis) and/or FISH HER2/CEP17 ratio > 2.0 and/or CEP17 > 3.0 are presented in Table 5.
In summary, DAmax appeared to be most accurate with respect to positive FISH results. For most of the cases where a discrepancy was observed between IHC and FISH, the VE and DA were in agreement, and the discrepancy therefore seemed to be related either to biological variation of HER2 amplification and expression or to errors in reagents or assay procedures.

Correlation of membrane connectivity estimate with HER2 FISH results
Digital analysis of the IHC HER2 is based on the continuous variable of membrane connectivity and can be used in analyses of biomarker expression, independently of categorical scoring systems. We explored the potential of the membrane connectivity estimate by comparing it to the patient's HER2 FISH data. Maximum spot value (ConnectMax) was used to characterize the patient's membrane connectivity estimate. Distribution analysis of the ConnectMax revealed a pronounced bimodal pattern with left asymmetry (Figure 2). Significant correlations between log(ConnectMax) and the FISH results were observed: log(mean HER2 copy number per cell) (r = 0.67, p < 0.0001), log(mean HER2/CEP17 ratio) (r = 0.57, p < 0.0001), and mean CEP17 number per cell (r = 0.39, p < 0.0001). These interrelationships with absolute and relative FISH variables raise an issue of understanding the complexity of the phenomena depicted in the bubble plot (Figure 3). Most IHC- and FISH-negative cases are represented by small dots in the left lower quadrant, while the positive cases concentrate in the left upper quadrant. However, quite numerous IHC-negative and IHC-positive cases fall into the "polysomy" quadrants on the right. The few IHC-FISH discrepancies can be tracked on the diagram, some of them revealing examples where conventional criteria for HER2 gene amplification testing by FISH may not always work. In particular, note the IHC-positive case with a high polysomy and mean HER2 per cell above 6, but the HER2/CEP17 ratio below 2 (also, Table 5, line 38). Multivariate analysis of the IHC and FISH parameters may help understanding these complexities, and the membrane connectivity estimate may serve as a continuous variable of the IHC positivity.
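The correlation of log-transformed variables can be sketched as below. Since connectivity may equal 0, where the logarithm is undefined, the sketch adds a small offset `eps` before the transformation; this offset is an assumption made for illustration, as the text does not describe how zero values were handled. Function names are hypothetical and no study data are reproduced.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient for two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def log_log_correlation(connect_max, fish_values, eps=1e-6):
    """Correlate log(ConnectMax + eps) with log(FISH variable).

    eps is a hypothetical offset that keeps the logarithm defined when
    connectivity is 0; FISH values are assumed to be strictly positive.
    """
    log_connect = [math.log(c + eps) for c in connect_max]
    log_fish = [math.log(f) for f in fish_values]
    return pearson_r(log_connect, log_fish)
```

The choice of offset affects the result when many patients have connectivity at or near 0, so any reanalysis would need to state it explicitly.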

Discussion
Our experiment revealed a reliable performance of HER2 expression measurement by the IHC digital image analysis based on the membrane connectivity estimate. The algorithm was run "plug-and-play" on the TMA images without an attempt to calibrate for potential image variation caused by scanning or IHC procedures. Manual annotation of the tumour tissue was not performed; however, spots containing DCIS or insufficient amount of tumour tissue were excluded from digital analysis by visual evaluation. Under these conditions, the digital analysis was in almost perfect agreement with the pathologist's score (VE) and exceeded the latter in terms of detecting FISH-positive patients.
We tested the agreement between the visual and digital evaluations in two sets of analyses: it was almost perfect at the level of individual spot (kappa 0.86 and 0.87, with the VE1 and VE2 respectively) and at the patient level (kappa 0.80 and 0.86, with the VE1max and VE2max, respectively). In general, the level of agreement in our study was among the highest reported when compared to that of previous studies using various digital analysis platforms [5,6,9,10,14-17], but obviously some caution has to be taken when comparing across studies with different designs. In both VE and DA, we used maximum TMA spot values to define patient's HER2 IHC status. This approach has been tested previously [18], and, in our view, is a better way to summarize TMA data per patient than mean or median value, especially when tissue heterogeneity is a concern. Also, maximum spot value increases the sensitivity of HER2 detection and may compensate for the limited tissue sampling in TMA.
As expected from the previous studies [6,14,15], both 0/1+ and 3+ IHC categories were consistently discriminated by both the VE and DA, whereas most discrepancies were present in detection of the 2+ score category. Although it sounds like a paradox, these discrepancies may bring the greatest "added value" of integrating digital analysis into the routine pathology work-up of HER2 testing. Extrapolation of our experiment to clinical setting would mean that in the cohort of 152 patients with    21,22,29). If the decision to perform a reflex FISH test were based on the IHC 2+ score by either VE1max or DA, that would have resulted in 19 FISH-positive cases compared to 16 by the VE1max-based decision alone (leading to a 19% increase in the number of HER2-amplified cases in the cohort). In the setting where the pathologist would evaluate the IHC twice (VE1max and VE2max), the second review would have resulted in an additional 8 HER2 IHC 2+ cases followed by the obligatory 8 FISH tests, thus detecting 1 additional HER2-amplified case; inclusion of the DA results into the account would require another 8 FISH tests with another 2 HER2-amplified cases detected. Considering potential consequences of a misdiagnosed HER2 status in 2 or 3 patients in the cohort of 152 for the "price" of adding an automated digital analysis step and roughly 5-8 additional FISH tests per misdiagnosed case, the "balance" seems to be on the positive side. On the other hand, addition of the DA would have "saved" 2 or 3 FISH tests (compared to VE2max and VE1max, respectively) by suggesting the IHC 3+ score instead of the pathologist's 2+ score (Table 5, lines #35-37); however, one of the cases (#35) was negative by FISH, revealing a potential lack of specificity of the DA alone. In contrast to other studies [19,20], our DA did not give a promise of a decreased number of IHC 2+ cases or increased specificity in detecting HER2-amplified cases. This latter statement, however, must be taken with caution since individual "sensitivity" of the pathologists may shift the VE results in different directions relative to the DA (the inter-observer variability was not tested in the present study). In summary, we suggest that the membrane connectivity DA would be most useful as a decision-support and quality assurance tool, alerting pathologists of borderline 0/1+ versus 2+ and 2+ versus 3+ HER2 IHC cases, thus improving the accuracy of the HER2 testing, but without expectation of significant savings by avoiding unnecessary FISH tests. Nevertheless, improved accuracy of the HER2 testing, without having to perform FISH in all cases, presents a reasonable economic trade-off. Although these considerations are based on the TMA analyses, whereas current pathology HER2 testing routine is based on whole section samples, our data are at least representative and simulate the cases when limited tumour samples are available for testing.
The pathologist intra-observer agreement was slightly better than that with the digital analysis. However, the DA appeared to be more accurate in detection of FISH-positive patients. Interestingly, the second visual evaluation (VE2) was slightly more "sensitive" than VE1: it detected more 2+ patients and rescued 1 FISH-positive patient from the 0/1+ category by VE1. It is likely that this increase of sensitivity is a result of a learning curve, with the pathologist adapting to evaluation of small samples of tissue in the TMAs as opposed to the IHC whole section slides used in routine pathology practice. This aspect may present an additional benefit of the DA not only in the TMA analyses but also when only a small tumour sample is available.
Objectivity of the digital analysis depends on numerous factors [8]; one particular factor is the accuracy of tumour tissue sampling for the analysis. If non-tumour tissue is included in the analysis, it may "dilute" the percentage of positive cells. In our experiment, no manual or automated annotation of the tumour tissue was performed; nevertheless, the DA recruited more 2+ and 3+ spots and patients than VE. Inevitably, our TMA spots contained variable proportions of tumour and non-tumour tissues, and the digital analysis results could have been distorted without proper selection of the tumour tissue. However, since the membrane connectivity is a non-cell-based estimate and does not require distinction between tumour and non-tumour cells, the only prerequisite for the digital analysis was a sufficient amount, rather than proportion, of tumour tissue in the ROI. This also provided the benefit of avoiding manual annotation of the ROI, a laborious and potentially biasing step of the image analysis.
With regard to detection of FISH-positive patients, the digital analysis provided maximum accuracy of IHC interpretation possible in our TMAs. As outlined in the Results section, the "false-positive" and "false-negative" cases by DAmax were also discrepant by VE1max and VE2max and most likely represented a true biological variation of HER2 gene amplification and expression and/or possible issues in tissue processing [21][22][23][24][25][26]. Although HER2 FISH status is commonly used as a "gold standard" in HER2 IHC studies, in a small proportion of cases it may remain discrepant due to tissue heterogeneity, CEP17 polysomy/amplification (if only HER2/CEP17 ratio is used to define the HER2 status), or other unrecognized causes of variation [27][28][29][30]. Our data reveal a subpopulation of patients where conventional HER2 FISH positivity criteria based on HER2/CEP17 ratio may not be sufficient and support the need to further explore the biological continuum of HER2 positivity and clinical relevance of the test [30][31][32][33]. Although analysis of this complexity is beyond the scope of the present study, it is important to note that the membrane connectivity estimate represents a continuous variable of HER2 expression by IHC and can serve better than categorical IHC score in statistical analyses exploring the relationships of HER2 expression and amplification. In support of this perspective, we found significant correlations of the IHC membrane connectivity with the FISH results: HER2 copy number (r = 0.67), HER2/CEP17 ratio (r = 0.57), and mean CEP17 number per cell (r = 0.39), similar to the recent report of Vranek et al. [34] (although the correlation to CEP17 did not reach statistical significance in that study of patients with CEP17 polysomy). Of note, automation and further quantification of the FISH testing, with increase of accuracy and capacity of the test, seems to be an important step to further progress.

Conclusions
In conclusion, HER2 IHC digital image analysis based on membrane connectivity estimate, tested on tissue microarrays of early ductal carcinoma of the breast, was in almost perfect agreement with the visual evaluation of the pathologist and more accurate in detection of HER2 FISH-positive patients. The most immediate benefit of integrating the DA algorithm into the routine pathology HER2 testing may be obtained by alerting/reassuring pathologists of potentially misinterpreted IHC 0/1+ versus 2+ cases. The algorithm was used without manual or automated annotation of tumour tissue and appeared to be independent of the proportion of tumour in the tissue analyzed. It provided a continuous variable reflecting HER2 IHC expression and could be useful for quality assurance, computer-assisted diagnosis, and HER2 amplification/expression heterogeneity studies.