Concordance among four commercially available, validated programmed cell death ligand-1 assays in urothelial carcinoma

Background
Antibodies targeting the programmed cell death-1 (PD-1)/PD-ligand 1 (PD-1/PD-L1) checkpoint have shown promising clinical activity in patients with advanced urothelial carcinoma (UC). Expression of PD-L1 in UC tumors has been investigated using different antibody clones, staining protocols, and scoring algorithms. The aim was to establish the extent of concordance among PD-L1 immunohistochemistry (IHC) assays.

Methods
Tumor biopsy samples (N = 335) were assessed using four commercially available PD-L1 assays: VENTANA SP263, VENTANA SP142, PD-L1 IHC 28-8 pharmDx, and PD-L1 IHC 22C3 pharmDx. PD-L1 analytical staining and classification concordance, including agreement between clinically relevant scoring algorithms, were investigated using overall/positive/negative percentage agreement (OPA/PPA/NPA).

Results
Good analytical correlation was observed among the VENTANA SP263, PD-L1 IHC 22C3 pharmDx, and PD-L1 IHC 28-8 pharmDx assays for tumor cell (TC) and immune cell (IC) PD-L1 staining with Spearman rank coefficients of 0.92–0.93 for TCs and 0.88–0.91 for ICs. However, concordance (preset criterion: ≥85%) between patient PD-L1 status when applying the TC or IC ICArea ≥25% (VENTANA SP263) cutoff was only achieved for PD-L1 IHC 22C3 pharmDx versus VENTANA SP263 (OPA 92.2%, PPA 86.4%, NPA 95.4%). Differences were observed between patient populations with UC tumors classified as PD-L1 high versus PD-L1 low/negative using combined positive score (CPS) ≥1, CPS ≥10, IC ≥5%, and TC/IC ≥25%.

Conclusions
The VENTANA SP263 and PD-L1 IHC 22C3 pharmDx assays are analytically similar in UC. When the different PD-L1 assays were combined with their specified clinical scoring algorithms, differences were seen in patient classification driven by substantial differences in scoring approaches.

Electronic supplementary material
The online version of this article (10.1186/s13000-019-0873-6) contains supplementary material, which is available to authorized users.

Four validated, commercially available assays (VENTANA PD-L1 SP263 and VENTANA PD-L1 SP142 [Ventana Medical Systems, Inc., Tucson, Arizona, USA], and PD-L1 immunohistochemistry [IHC] 22C3 pharmDx and PD-L1 IHC 28-8 pharmDx [Agilent Technologies, Santa Clara, California, USA]) have been developed independently in conjunction with immunotherapies targeting the PD-1/PD-L1 pathway. These assays use different antibodies, IHC protocols, scoring algorithms, and cutoffs to define high/low PD-L1 expression in UC (Fig. 1) [21–30]. Unlike the application of these assays in non-small cell lung cancer (NSCLC), in UC, the PD-L1 scoring approaches differ widely among the various assays. In UC, VENTANA SP142 assesses the proportion of tumor area occupied by PD-L1-stained immune cells (IC) (% of IC TumorArea), while VENTANA SP263 utilizes the proportion of ICs with PD-L1 staining as a proportion of the IC area as well as the proportion of tumor cells (TCs) with PD-L1 membrane staining (% of TC or IC ICArea) (Fig. 1). PD-L1 IHC 22C3 pharmDx uses the combined positive score (CPS) of TCs and ICs with PD-L1 staining, while PD-L1 IHC 28-8 pharmDx measures the proportion of TCs with PD-L1 membrane staining only (% of TC) (Fig. 1). In addition to the difference in scoring methods between the assays, in UC, there are significant differences between assays in the cutoffs used to define PD-L1 expression level [24, 27–31]. These differences raise the question of whether the UC patient populations defined as PD-L1 high are the same across clinical trials based on the algorithms (particular combination of scoring method and cutoff) used, and therefore, whether results can be compared across trials.

Fig. 1 Comparison of PD-L1 assays for UC and differences in immune cell measurement and scoring algorithm. *Ratio of tumor cells (TC) and immune cells (IC) relative to number of all TC. †IC score is the percentage area of ICs present exhibiting PD-L1 positive IC staining. ‡IC score is the proportion of ICs that are PD-L1 positive, expressed in relation to tumor area. CE European Conformity, Cis cisplatin, IVD in vitro diagnostic
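The contrast between the scoring approaches described above can be illustrated with a minimal Python sketch. The cutoffs shown (IC ≥5%, TC/IC ≥25%, CPS ≥10) are the ones named in this article; the `Scores` container and the function names are hypothetical, and the simplified SP263 rule (TC or IC ICArea at or above the cutoff) is an assumption that ignores any further conditions in the assay's interpretation guide. The PD-L1 IHC 28-8 pharmDx TC cutoff is left as a required argument.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Per-sample readouts (hypothetical container, all values in percent except cps)."""
    tc: float            # % of TCs with PD-L1 membrane staining
    ic_icarea: float     # % of ICs staining for PD-L1 (per IC area)
    ic_tumorarea: float  # % of tumor area occupied by PD-L1-staining ICs
    cps: float           # combined positive score


def sp263_high(s: Scores, cutoff: float = 25.0) -> bool:
    # Simplified TC/IC rule: high if either the TC or the IC ICArea score meets the cutoff.
    return s.tc >= cutoff or s.ic_icarea >= cutoff


def sp142_high(s: Scores, cutoff: float = 5.0) -> bool:
    # IC rule based on the tumor-area denominator.
    return s.ic_tumorarea >= cutoff


def cps_high(s: Scores, cutoff: float = 10.0) -> bool:
    # CPS rule; pass cutoff=1.0 for the CPS >=1 algorithm.
    return s.cps >= cutoff


def tc_only_high(s: Scores, cutoff: float) -> bool:
    # TC-only rule; the caller must supply the cutoff.
    return s.tc >= cutoff


# One illustrative (invented) sample with many positive TCs and few positive ICs.
sample = Scores(tc=30.0, ic_icarea=10.0, ic_tumorarea=2.0, cps=35.0)
status = {
    "SP263 TC/IC >=25%": sp263_high(sample),
    "SP142 IC >=5%": sp142_high(sample),
    "22C3 CPS >=10": cps_high(sample),
}
```

For this invented sample the TC/IC and CPS rules call the tumor PD-L1 high while the IC-only rule does not, which is the kind of algorithm-driven discordance the study sets out to quantify.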
To conserve patient tissue and pathology resources, the use of a single PD-L1 assay for tumor testing is desirable. However, such harmonization requires a thorough understanding of the concordance between staining, scoring algorithms, and cutoffs. To enable this, and to demonstrate interchangeable use, a first step is to compare the analytical performance of the available assays. Good analytical concordance has been previously demonstrated among three validated, commercially available PD-L1 IHC assays (VENTANA SP263, PD-L1 IHC 22C3 pharmDx, and PD-L1 IHC 28-8 pharmDx) across multiple TC PD-L1 protein expression cutoffs using samples from patients with NSCLC [31] or head and neck squamous cell carcinoma (HNSCC) [24]. The VENTANA SP142 assay was also evaluated, but did not show good concordance with the other three assays for TCs, an observation that has been supported across multiple independent studies [32,33]. More recently, the analytical comparability of these four assays has been investigated in staining of a small number of samples from patients with advanced UC for IC and TC staining, showing comparable results across assays, except for significantly lower staining of TC by VENTANA SP142; however, this was conducted in a small number of samples and no formal statistical evaluation was performed [34].
In addition to assessing the analytical performance of the four commercially available PD-L1 IHC tests, this study assessed the overlap between patient populations selected by these assays when different algorithms are used to define high versus low/negative PD-L1 expression. Comparing the technical performance of different assays and algorithms will allow appropriate interpretation of clinical outcomes for patients with UC treated with different anti-PD-1/PD-L1 therapies.

Study design
Archival formalin-fixed, paraffin-embedded clinical UC tumor sample blocks aged ≤5 years were obtained from commercial sources (Avaden BioSciences, Seattle WA, USA; Asterand Bioscience, Royston, UK; BioIVT, West Sussex, UK). AstraZeneca has a governance framework and processes to ensure that commercial sources have appropriate patient consent and ethical approval in place for collection of the samples for research purposes including use by for-profit companies.
Consecutive sections derived from tumor blocks were stained with VENTANA SP263, VENTANA SP142, PD-L1 IHC 22C3 pharmDx, and PD-L1 IHC 28-8 pharmDx according to their validated protocols for investigational use, and PD-L1 testing was carried out at Hematogenix (Tinley Park, IL, USA). The PD-L1 antibody clone (PD-L1 IHC 73-10 pharmDx), assessed in Blueprint phase II NSCLC and in metastatic breast cancer, was not commercially available at the time of analysis [35,36]. A single pathologist, trained by the manufacturers (Clinical Laboratory Improvement Amendments program-certified laboratory, Hematogenix), scored all samples in a blinded fashion; samples were batched on an assay-by-assay basis. There was a washout period of ≥0.5 days between scoring the different assays. A single pathologist was used to remove reader subjectivity as a factor, so that differences in scores reflect the assays rather than the readers.
The following parameters were recorded for each case and each assay: percentage of TCs with membrane staining for PD-L1 (TC score), percentage of tumor area occupied by tumor infiltrating ICs (IC area), percentage of ICs staining for PD-L1 (IC ICArea score) (as would be assessed for the VENTANA SP263 assay), percentage of tumor area occupied by PD-L1 staining tumor infiltrating ICs (IC TumorArea score) (as would be assessed for the VENTANA SP142 assay), and CPS, calculated as the number of PD-L1 positive cells divided by the total number of TCs × 100 (as would be assessed for the PD-L1 IHC 22C3 pharmDx assay). TC, IC area, and IC ICArea scores were recorded in 1% increments between 0 and 5%, and in 5% increments thereafter; IC TumorArea score was scored in 1% increments; and CPS was scored in increments of 1.
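As a minimal sketch of the recording rules above, the Python below computes CPS from cell counts and snaps a raw percentage onto the recording grid (1% steps up to 5%, 5% steps thereafter). The function names are hypothetical, and rounding halves upward is an assumption; the text does not say how intermediate values were rounded.

```python
import math


def combined_positive_score(pd_l1_positive_cells: int, total_tumor_cells: int) -> float:
    """CPS as defined in the text: PD-L1 positive cells / total TCs x 100."""
    if total_tumor_cells <= 0:
        raise ValueError("total_tumor_cells must be positive")
    return 100.0 * pd_l1_positive_cells / total_tumor_cells


def snap_to_recording_grid(percent: float) -> int:
    """Snap a raw percentage to 1% steps (0 to 5%) or 5% steps (above 5%).

    Half-up rounding is an assumption made for this sketch.
    """
    if not 0.0 <= percent <= 100.0:
        raise ValueError("percent must lie in [0, 100]")
    if percent <= 5.0:
        return math.floor(percent + 0.5)
    return 5 * math.floor(percent / 5.0 + 0.5)


cps = combined_positive_score(pd_l1_positive_cells=30, total_tumor_cells=200)  # 15.0
recorded_tc = snap_to_recording_grid(63.0)  # 65
```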

Comparison of scored and derived parameters
To determine whether it is possible to use derived values for IC TumorArea and CPS, rather than scoring directly, derived parameters were calculated as follows:
• $\text{Derived IC}_{\text{TumorArea}} = \text{IC}_{\text{ICArea}}\ \text{score} \times \text{IC area}$
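The derivation of IC TumorArea can be sketched in Python as follows. The text gives only the product of the IC ICArea score and the IC area; dividing by 100 so that two percentages yield a percentage is an assumption made here, and the function name is hypothetical.

```python
def derived_ic_tumor_area(ic_icarea_score: float, ic_area: float) -> float:
    """Derived IC TumorArea (percent) from IC ICArea score and IC area (both percent).

    The division by 100 is an assumption about units: it treats the product
    of two percentages as a fraction of tumor area expressed in percent.
    """
    for value in (ic_icarea_score, ic_area):
        if not 0.0 <= value <= 100.0:
            raise ValueError("inputs must be percentages in [0, 100]")
    return ic_icarea_score * ic_area / 100.0


# If 25% of ICs stain for PD-L1 and ICs occupy 20% of the tumor area,
# PD-L1-staining ICs occupy 5% of the tumor area.
example = derived_ic_tumor_area(ic_icarea_score=25.0, ic_area=20.0)  # 5.0
```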

Statistical analysis

Analytical concordance between assays
To assess the similarity in staining and scoring between the four assays, bubble plots and Spearman rank correlation coefficients were generated pairwise between assays for the TC, IC ICArea, and IC TumorArea scores. Correlation was classed as "good" where ρ ≥ 0.85. Concordance between the scored and derived values for IC TumorArea and CPS was assessed for the VENTANA SP142 and PD-L1 IHC 22C3 pharmDx assays, respectively, using the same approach. Plots showing the TC and IC ICArea scores for each assay ranked by the average value across assays were also generated. To demonstrate similarity between assays without the influence of cases where both assays were scored at 0% for the given parameter, Spearman rank correlation coefficients (ρ) were also generated excluding these cases.
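The pairwise rank correlation, with and without cases scored 0% by both assays, can be sketched in plain Python as below. This is an illustrative implementation (Pearson correlation of average ranks), not the software used in the study; the function names and the toy scores are invented for the example.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y, drop_double_zero=False):
    """Spearman rho for paired scores; optionally drop pairs where both are 0."""
    pairs = list(zip(x, y))
    if drop_double_zero:
        pairs = [(a, b) for a, b in pairs if not (a == 0 and b == 0)]
    if len(pairs) < 2:
        raise ValueError("need at least two pairs")
    rx = average_ranks([a for a, _ in pairs])
    ry = average_ranks([b for _, b in pairs])
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("rho is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Toy TC scores (percent) from two assays; the shared zeros inflate rho.
assay_a = [0, 0, 0, 5, 10, 40]
assay_b = [0, 0, 0, 20, 1, 25]
rho_all = spearman_rho(assay_a, assay_b)
rho_nonzero = spearman_rho(assay_a, assay_b, drop_double_zero=True)
is_good = rho_all >= 0.85  # "good" correlation criterion from the text
```

In this toy data the three double-zero cases lift ρ well above the value obtained from the remaining cases, which is why the sensitivity analysis excluding them is informative.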

Clinical concordance between assays
Two aspects of clinical concordance were assessed. First, it was assessed whether the VENTANA SP263 clinically relevant algorithm (TC/IC ≥25%) selects the same patients when applied to different assays. To do this, for each of the four assays, patient status was determined using the clinically relevant algorithm for the assay itself (Fig. 1) and also using the VENTANA SP263 algorithm (patient positive if either the TC or the IC ICArea score is ≥25%). Second, it was assessed whether the clinically relevant algorithms applied to their respective assays select the same patient population. Overall percentage agreement (OPA), negative percentage agreement (NPA), and positive percentage agreement (PPA) were calculated pairwise between assays, using the appropriate comparator as the reference assay for each clinically relevant cutoff; assays were considered concordant if OPA, PPA, and NPA were all ≥85%. For each metric, only the lower bound of the 95% confidence interval (CI) was calculated, using the Clopper-Pearson method [37].
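A minimal Python sketch of these agreement metrics follows. PPA and NPA are taken relative to the reference assay's positives and negatives, which is the usual convention and is assumed here. The Clopper-Pearson lower bound is found by bisection on the exact binomial tail; the default `alpha_tail=0.025` assumes the lower limit of a two-sided 95% interval (a one-sided 95% bound would use 0.05), since the text does not specify which was used. All function names and the toy calls are hypothetical.

```python
import math


def percent_agreement(reference, comparator):
    """Return (agreements, denominator) counts for OPA, PPA, and NPA."""
    pairs = list(zip(reference, comparator))
    both_pos = sum(1 for r, c in pairs if r and c)
    both_neg = sum(1 for r, c in pairs if not r and not c)
    ref_pos = sum(1 for r, _ in pairs if r)
    ref_neg = len(pairs) - ref_pos
    return {
        "OPA": (both_pos + both_neg, len(pairs)),
        "PPA": (both_pos, ref_pos),
        "NPA": (both_neg, ref_neg),
    }


def is_concordant(counts, criterion=0.85):
    """Concordant only if OPA, PPA, and NPA all meet the preset criterion."""
    return all(n > 0 and k / n >= criterion for k, n in counts.values())


def _upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with terms computed in log space."""
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * math.log(p) + (n - i) * math.log1p(-p)
        )
        total += math.exp(log_term)
    return total


def clopper_pearson_lower(k, n, alpha_tail=0.025):
    """Exact lower confidence bound for a proportion k/n (bisection search)."""
    if not 0 <= k <= n or n == 0:
        raise ValueError("need 0 <= k <= n and n > 0")
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if _upper_tail(k, n, mid) < alpha_tail:
            lo = mid  # bound lies above mid
        else:
            hi = mid
    return (lo + hi) / 2


# Toy PD-L1 high/low calls for eight patients (invented for illustration).
reference = [True, True, True, False, False, False, False, False]
comparator = [True, True, False, False, False, False, False, True]
counts = percent_agreement(reference, comparator)
lower_bounds = {name: clopper_pearson_lower(k, n) for name, (k, n) in counts.items()}
```

Returning counts rather than percentages keeps the numerator and denominator available for the exact interval, which depends on both.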

Results
UC tumor samples ≤5 years old from a total of 335 patients were included in this analysis. Patient demographics are shown in Table 1. Approximately 75% of patients were aged > 65 years, and patients were predominantly male (72%). The majority of tumor samples were of urothelial carcinoma (98%) and 76% were invasive (stage II or higher). Most of the samples were from transurethral resection of the bladder tumor (70%).

Direct scoring versus derived scoring
For both IC TumorArea and CPS, the correlation between the scored and derived scores, and the correlation between the ranks of the scored and derived scores, showed a high level of agreement (Spearman's correlation coefficient of 0.997 and 0.999, respectively) (Additional file 2 for VENTANA SP142 and PD-L1 IHC 22C3 pharmDx). Therefore, scored and derived IC TumorArea (and scored and derived CPS) can be considered interchangeable for each of these assays.
VENTANA SP142 showed similar prevalence versus the other three assays for ICs (22.7% staining for IC by IC area; 0.3% for IC staining by tumor area), but was less sensitive for PD-L1 staining on TCs (prevalence 6.3% at the ≥25% cutoff) (Additional files 2, 3, and 4). The percentages of TC staining for PD-L1, ranked by average value, were similar for the PD-L1 IHC 22C3 pharmDx, PD-L1 IHC 28-8 pharmDx, and VENTANA SP263 assays, but lower for VENTANA SP142 (Fig. 3). However, the percentage of IC (per IC area) staining for PD-L1 was similar across all four assays (Fig. 3).

Discussion
In this study of 335 UC tumor samples, a high level of analytical concordance was observed among the VENTANA SP263, PD-L1 IHC 22C3 pharmDx, and PD-L1 IHC 28-8 pharmDx assays for TC and IC staining of PD-L1. Concordance criteria were met between categorization of patients using PD-L1 IHC 22C3 pharmDx and VENTANA SP263 using the TC/IC algorithm, suggesting that these assays could be used interchangeably in UC to determine PD-L1 expression levels. Importantly, this finding was also true in the subset of samples from patients with muscle invasive cancer. Despite the good rank correlation between assays, the other assays did not meet the agreement criteria, driven by lower PPA, suggesting a lower sensitivity for PD-L1 IHC 28-8 pharmDx, and particularly for VENTANA SP142.
Significant differences were observed between VENTANA SP142 and the other three assays for TC staining, whereas IC staining was similar. The analytical findings of our study are consistent with previously reported Blueprint observations in NSCLC [38] and other studies in NSCLC and HNSCC, where VENTANA SP142 also consistently detects fewer TCs [32,33]. Our data also confirm the results of a recent study on samples from 30 patients with UC, which showed comparable results across assays for IC and TC staining, but significantly lower staining of TC by VENTANA SP142 [34].
Our study identified differences in the patient populations that would be classified as PD-L1 high versus PD-L1 low/negative by the PD-L1 IHC 22C3 pharmDx (CPS ≥1 and ≥10), VENTANA SP142 (IC ≥5%), and VENTANA SP263 (TC/IC ≥25%) algorithms. There was greater overlap between patient populations identified by VENTANA SP263 (TC/IC ≥25%) and PD-L1 IHC 22C3 pharmDx (CPS ≥1 and ≥10) than between VENTANA SP263 (TC/IC ≥25%) and VENTANA SP142 (IC ≥5%). According to our study, using the VENTANA SP142 assay and the IC ≥5% algorithm would misclassify a significant proportion of patients with UC tumors that are PD-L1 high according to the VENTANA SP263 assay (using the TC/IC algorithm); indeed, significantly fewer PD-L1 high patients would be identified using the VENTANA SP142 assay and algorithm. The discordance between patient populations may be explained by the inclusion of TCs in the VENTANA SP263 algorithm versus the VENTANA SP142 algorithm, the lower TC cutoffs for the PD-L1 IHC 22C3 pharmDx CPS algorithms versus the VENTANA SP263 algorithm, or the use of different denominators for the IC scoring approach in all three cases. Differences observed in assay sensitivity in this setting, particularly for PD-L1 IHC 28-8 pharmDx and VENTANA SP142, may also account for some variation in these patient populations. These differences in classification of patients as PD-L1 high versus PD-L1 low/negative using different assays and different algorithms suggest that caution should be taken when comparing clinical outcomes across studies.

Conclusions
These findings inform comparisons between studies using different PD-L1 tests, as well as the next steps toward harmonization of PD-L1 diagnostic testing in UC. With compelling assay concordance data, a single PD-L1 base assay could potentially be used for different therapies, but the appropriate, clinically validated algorithm must be applied to retain the connection between the cutoff and the therapy, and ideally this would be tested and confirmed in a prospective cohort. While the PD-L1 IHC 22C3 pharmDx and VENTANA SP263 assays could be used interchangeably, the appropriate, clinically validated algorithm for each therapy must be applied, e.g., CPS for pembrolizumab and TC/IC for durvalumab.

(AstraZeneca, Washington DC, USA) for providing support in the statistical interpretation of the data. Medical writing support, which was in accordance with Good Publication Practice (GPP3) guidelines, was provided by Anne-Marie Manwaring, of Parexel (Worthing, UK) and was funded by AstraZeneca.