PD-L1 diagnostic tests: a systematic literature review of scoring algorithms and test-validation metrics

Background The programmed death receptor 1 (PD-1) protein is a cell-surface receptor on certain lymphocytes that, with its ligand programmed death ligand 1 (PD-L1), helps to down-regulate immune responses. Many cancer types express PD-L1 and evade immune recognition via the PD-1/PD-L1 interaction. Precision therapies targeting the PD-1/PD-L1 pathway have the potential to improve response and thereby offer a novel treatment avenue to some patients with cancer. However, this new therapeutic approach requires reliable methods for identifying patients whose cancers are particularly likely to respond. Therefore, we conducted a systematic literature review assessing evidence on test validation and scoring algorithms for PD-L1 immunohistochemistry (IHC) tests that might be used to select potentially responsive patients with bladder/urothelial cell, lung, gastric, or ovarian cancers for immunotherapy treatment. Methods and results To identify evidence on commercially available PD-L1 IHC assays, we systematically searched MEDLINE and Embase for relevant studies published between January 2010 and September 2016 and appraised abstracts from recent oncology conferences (January 2013 to November 2016). Publications that met the predefined inclusion criteria were extracted and key trends summarized. In total, 26 eligible primary studies were identified, all of which reported on the test validation metrics associated with PD-L1 IHC tests in lung cancer, most using immunohistochemistry testing. There was significant heterogeneity among the available tests for PD-L1. Specifically, no definitive cutoff for PD-L1 positivity was identifiable, with more than one threshold being reported for most antibodies. Studies also differed as to whether they evaluated tumor cells only or tumor cells and tumor-infiltrating immune cells. However, all of the tests developed and validated to support a therapeutic drug in the context of phase 2–3 clinical trials reported more than 90% inter-reader concordance. In contrast, other PD-L1 antibodies identified in the literature reported poorer concordance. Conclusions Published validation metric data for PD-L1 tests are mainly focused on immunohistochemistry tests from studies in lung cancer. The variability in test cutoffs and standards for PD-L1 testing suggests that there is presently no standardized approach. This current variability may have implications for the uptake of precision treatments.


Background
Checkpoint inhibitor therapy is a recent development in the field of cancer immunotherapy and precision medicine, and involves targeting immune pathways that enhance the body's ability to recognize and destroy tumor cells (TCs). One key mediator in such pathways is the programmed death receptor 1 (PD-1) protein, a cellsurface receptor on certain lymphocytes. The interaction between PD-1 and its ligand, programmed death ligand 1 (PD-L1), plays a crucial regulatory role in the human immune system by inhibiting the body's immune response to foreign antigens. However, many cancer cell types express PD-L1 and thereby activate PD-1/PD-L1 signaling, thus enabling these tumors to evade immune recognition. Precision therapies that focus on the PD-1/ PD-L1 pathway can offer a novel treatment avenue to some patients with cancer. Five PD-1/PD-L1 immunotherapies (atezolizumab, avelumab, durvalumab, nivolumab and pembrolizumab) have now been approved by the United States (US) Food and Drug Administration (FDA) and/or European Medicines Agency (EMA) for a variety of indications following the publication of clinical trials demonstrating their efficacy improving therapeutic response.
Although research into the effectiveness of these types of immunotherapy is rapidly evolving, there remains some uncertainty regarding the extent to which measuring levels of PD-L1 expression in individuals' tumor tissue helps to identify patients who are most likely to respond to treatment. For example, in Hodgkin's lymphoma, most tumors have been reported to express PD-L1, so assessing expression in patients can contribute only minimally to clinical decision-making about suitability for treatment [1]. However, for a specific group of cancers (e.g., non-small cell lung cancer), evidence suggests that responsiveness to PD-1 inhibitors such as pembrolizumab and nivolumab or to the anti-PD-L1 antibodies atezolizumab and durvalumab may be predicted by expression of PD-L1 on TCs and/or tumorinfiltrating immune cells (ICs) [1]. Therefore, tests detecting PD-L1 expression may play an important role in the use and development of anti PD-1/PD-L1 agents aimed at these tumor types, which include bladder/ urothelial cell, lung, gastric, and ovarian cancer.
Currently there are a range of commercially available PD-L1 IHC tests. Tests are typically designated by the antibody clone that is used to detect the presence of the PD-L1 protein; for example, the 22C3 test developed by Dako (PD-L1 IHC 22C3 pharmDx, Agilent Pathology Solutions) uses a monoclonal mouse anti-PD-L1 clone, 22C3. Some of the available tests have been developed and validated as part of clinical trials that were used to demonstrate the efficacy of the aforementioned licensed PD-1/PD-L1 immunotherapy medicines. Tests of this type can be further sub-divided into two types: companion diagnostics, which (per the US Food and Drug Administration (FDA) definition), provide information, often obtained in vitro, that is "essential for the safe and effective use of a corresponding drug or biologic product" [2], and complementary (or co-diagnostic) tests that may be used in treatment selection, but are not considered essential for safe and effective use of the corresponding therapy in practice. A key distinction between companion and complementary diagnostics is that, whereas companion diagnostics are tied to a specific drug within its approved label, complementary or co-diagnostics may be associated with particular drugs but are not included in the licensing indications for those drugs. Of note, IHC-22C3 for pembrolizumab is currently the only FDA-approved companion diagnostic for PD-1/PD-L1 targeted immunotherapies. Furthermore, although pembrolizumab is now licensed for multiple indications, the FDA only recommends IHC-22C3 for treatment selection for the following specific groups: patients with previously untreated metastatic non-squamous non-small cell lung cancer (NSCLC) whose tumors express PD-L1 at a level of 50% of higher (or second line NSCLC patients with ≥1% expression) and patients with recurrent locally advanced or metastatic, gastric or gastroesophageal junction adenocarcinoma who have Combined Positive Score (CPS) (a measure based on the number of PD-L1 stained tumor cells, lymphocytes, macrophages) of ≥1. Other tests such as IHC 28-8, SP142, and SP263 for nivolumab, atezolizumab and durvalumab respectively, are regarded as complementary diagnostics and are not considered by the FDA as being essential for safe and effective treatment selection.
The landscape of available potential PD-L1 diagnostic tests is further complicated by the fact that each test has its own antibody detection system and tests are performed using different platforms. As a result, the extent to which particular tests are either interchangeable across different indications or superior in terms of accuracy can be important to both uptake of PD-1/PD-L1 targeted therapies and use of these tests for patient management decisions. To provide insights into this area and to help identify and address potential knowledge gaps, a systematic literature review (SLR) was conducted to provide insights into the characteristics of different tests and to examine the validity of commercially available PD-1/PD-L1 tests in assessing bladder/urothelial cell, lung, gastric, and ovarian cancers.

Objectives
This review explored the characteristics of commercially available PD-L1 tests currently in use for bladder/ urothelial cell, lung, gastric, and ovarian cancers, by addressing the following specific research questions: What types of tests, platforms, and scoring algorithms are currently being used? How has the validity of these tests, platforms, and scoring algorithms been tested?

Methods
The SLR was conducted in accordance with the methods outlined in the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.
Systematic searches were conducted in MEDLINE® (via PubMed) and Embase® (via embase.com) for studies published in English between January 1, 2010 and September 15, 2016. Medical Subject Headings (MeSH), EMTREE terms, and free-text terms were used and combined, where appropriate, with Boolean operators ("AND", "OR," and "NOT"). Key search terms included text variations on biomarkers of interest, such as "programmed death-ligand," "PDL1," "PD-L1," and relevant validation metrics, such as "Sensitivity and Specificity" (MeSH) and "valid*." (The MEDLINE search strategy is provided in a supplementary appendix.) Two searches were run; the second supplementary search used the same core algorithm but with some additional terms (for example "correlat*" and "immunohistochemistry" [MeSH]) to ensure the search was comprehensive.
Supplementary searches were undertaken to capture 'grey' literature-data from sources not indexed in the electronic databases. To capture such evidence, proceedings from the three most recent meetings of the following six subject-specific conferences were searched: Study selection was based on criteria that were defined a priori and are summarized in Table 1. The titles and abstracts of records retrieved via the literature searches were first appraised by a single reviewer, and 10% of the screening decisions made at this level were checked by second reviewer to confirm their accuracy, as a quality control measure. Relevant studies that passed this first round of screening then underwent full-text screening, which was conducted by two reviewers to confirm each inclusion and exclusion decision. Any discrepancies at the abstract and full-text level were resolved in discussion with a third reviewer where necessary.
Data abstraction of the included studies was performed using a predefined data abstraction template designed in Microsoft Excel®. For each included study, Commercially available tests were prioritized to ensure that review was relevant to current practice. The geographic emphasis reflects the countries in which PD-1 /PD-L1 immunotherapies are currently licensed. Most evidence in this field is published in English so language limits were designed to reflect this.
Abbreviations: EU5 European Union 5, N/A not applicable, PD-1 programmed death receptor 1, PD-L1, programmed death ligand 1, US United States data were captured by a single investigator, with validation of the accuracy and completeness of this abstraction being performed by a second reviewer. Any discrepancies were resolved in a discussion with a third investigator. Specific key information was abstracted from included studies on the following: patient population, type of test, test developer, test platform, testscoring algorithms, test thresholds/cutoffs, and testvalidation metrics. Due to the variety of study designs considered in this review, it was not possible to undertake a risk-of-bias assessment using a single standardized tool. Heterogeneity in the studies also meant that a quantitative meta-analysis of their data was not appropriate; therefore, the evidence abstracted from included studies was qualitatively synthesized and key trends were summarized.

Search results
The indexed database searches yielded 950 records. After removing publications duplicated between databases, 589 abstracts remained and were screened, of which 57 met the criteria for detailed review of their associated full-text publications. Of these 57 publications subjected to fulltext screening, 12 were eligible for inclusion in the SLR, as they reported on PD-L1 test validation metrics for commercially available tests. An additional eight studies were identified from the supplementary search and 10 conference abstracts also met the eligibility criteria. Therefore, a total of 30 references (collectively representing 26 unique study populations and four linked publications) were included in the review. The study screening and selection process is illustrated in Fig. 1. All 26 included studies reported on test validation metrics associated with PD-L1 tests in lung cancer. One of the studies also reported data relating to bladder/ urothelial cell cancer [3]. No evidence relating to gastric or ovarian cancer was identified.  VENTANA PD-L1 (SP142) Assay by Roche (referred to hereafter by the antibody SP142): 9 studies [3,6,8,9,[16][17][18][19][20] PD-L1 (E1L3N®) XP® Rabbit mAb #13684 by Cell Signaling Technology [CST] (a reagent provider): 9 studies [8,11,15,[20][21][22][23][24][25] 4059 by ProSci, Inc.: 1 study [26] h5H1 by Advanced Cell Diagnostics: 1 study [27] 9A11 (developer not reported): 1 study [8] In all cases, PD-L1 expression was evaluated using an immunohistochemistry (IHC) platform. One of the studies specified that diaminobenzidine tetrahydrochloride was used as the reagent to produce the "brown staining" for the IHC process [8]. Three studies evaluated results derived from alternative test platforms as well as IHC. Two studies [8,20] measured PD-L1 expression using quantitative fluorescence (QIF) and another study looked at fluorescence in-situ hybridization (FISH) [12].
The antibodies manufactured by Dako and Roche had all been originally developed and validated to support a therapeutic drug in the context of a clinical trial. These antibodies were evaluated in eight studies as follows: Three studies looked at the IHC-SP142 (Roche), developed alongside atezolizumab [3,16,17] Two studies looked at IHC-SP263 (Roche), developed alongside durvalumab [14,28] Two studies looked at IHC-22C3 (Dako), developed alongside pembrolizumab [4,5] One study looked at IHC-28-8 (Dako), developed alongside nivolumab [10] Test-scoring algorithms and thresholds used among the PD-L1 tests The thresholds and scoring systems used to determine PD-L1 positivity varied between the antibodies and across studies. Eleven studies [4, 7, 10-12, 14, 19-21, 23, 28] investigated dichotomous cutoffs (representing the proportion of cells with PD-L1 expressed) for PD-L1 positivity using different antibodies (the thresholds used in these studies are summarized in Table 2). Amongst these 11 studies, nine [4, 6, 9-11, 19, 20, 22, 27] set thresholds a priori (for example, based on cutoffs used in previously published research) and two studies [4,13] attempted to establish an optimal threshold based on the study findings. In one study [18], it was unclear whether the thresholds used had been specified prospectively or retrospectively.
A further 11 studies [5, 9, 13, 15-17, 22, 24-27] used a hybrid score that combined components of staining intensity with the percentage of positive cells to determine PD-L1 positivity. One study evaluated two tests, SP142 (Roche) and E1L3N (CST; reagent provider), by means of a QIF process that used an automated scoring system. In this system, the QIF score of PD-L1 signal for each antibody in the tumor and stroma was calculated by dividing the target PD-L1 pixel intensities by cytokeratin and DAPI positivity [20].
A second study [8] that incorporated QIF did not provide details on the scoring approach. Another study [12] investigated FISH and evaluation criteria included CD274, PDCDILG2-CEB 9 ratio, gene copy numbers, proportions of TCs with ≥4 PDL1/2 and ≥5 PDL1/2 signals, and gene clusters. Yet another study [6] validated a six-step scoring system that integrated all of the cutoff criteria from four tests that have been used in clinical trials: 28-8 and 22C3 (both Dako) and SP142 and SP263 (both Roche).

Types of cells tested for PD-L1 expression
There was variation among the studies with regard to the cell type tested, specifically, whether PD-L1 expression was measured on TCs and/or tumor-infiltrating ICs. Nine studies tested TCs only [4, 5, 7, 10-12, 21, 26, 27], two tested both TCs and tumor stroma [20,29], 14 studies evaluated both TCs and ICs [3, 6, 8, 9, 13-17, 19, 22, 24, 25, 28], and in one study it was unclear which type of cell had been tested [23]. TCs were more frequently evaluated than tumor-infiltrating ICs or tumor stroma, regardless of whether dichotomous or hybrid scoring algorithms were used.

Test validation metrics
Individual test performance Most studies (18/26) focused on a single antibody and reported validation metrics that were specific to the one test under investigation, without comparing its performance with that of another antibody or testing approach. The results of these studies by outcome are summarized below and in Table 3. Among the tests developed in a clinical trial setting to accompany a therapeutic product, the validation metrics were similar and all the tests had a greater than 90% inter-observer concordance [10]. In comparison, E1L3N, a test developed outside of clinical-trial settings [i.e., not specifically for a particular PD-1/PD-L1-targeted therapy], reportedly had slightly lower inter-observer concordance metrics [21][22][23], namely below the 84-88% concordance level at the 1% cutoff [21]. In the studies that reported intra-observer and inter−/intra-site concordance, high agreement (above 90%) was observed for all these metrics across the tests developed in a clinical trial setting to accompany a therapeutic product, except for inter-site concordance for SP263 (Roche; durvalumab), which was 86.4% [14] and for 22C3 (Dako; pembrolizumab) 88.3% [5]. Two studies reported on the extent of agreement in test results when different types of samples (biopsy or surgical-resection) were tested, and these found some conflicting results. One study looked at the use of the SP142 test (Roche) in biopsy and surgical-resection samples. It reported an overall discordance rate of 48% (95% confidence interval, 4.64%-13.24%) and a κ score of 0.218, indicating poor agreement between the test outputs from the different sample types [13]. The study authors commented also that in all cases, the biopsy specimens underestimated the PD-L1 status relative to the expression level in the whole tumor (further data not provided in the study report). Another study found overall concordance between biopsy and surgicalresection samples ranged from 82.5% (κ = 0.3969) (i.e., fair agreement), at a score of hybrid score of 51 (range, 0-170) or greater, to 92.4% (κ = 0.8366) (i.e., high agreement), at a score of 1 or greater [26].
Head-to-head test performance Seven studies reported data relating to the comparative performance of two or more tests, and their key findings are summarized in Table 4. Among these studies, three reported on the overall test concordance between two or more antibodies. The first found acceptable agreement between two tests developed in a clinical trial setting to accompany a therapeutic product, 28-8 (Dako; nivolumab) and SP263 (Roche; durvalumab), for which the overall test concordance was 90.3%. The remaining two studies found mixed results when a clinical trial test developed to support a therapeutic product was compared with E1L3N, which was not developed or validated as part of a clinical trial. Of these studies, one observed poor concordance when SP142 (Roche, atezolizumab) was compared with the antibody E1L3N (CST; reagent provider, not developed or validated as part of a clinical trial) (κ concordance at 1% cutoff = 0.340, 5% cutoff = 0.286, and 50% = 0.189) [20]. The other study reported moderate agreement between 28 and 8 (Dako, nivolumab) and E1L3N (75.0% and 86.2% at 5% and 50% cutoffs, respectively) [11].

Harmonization of scoring algorithms across antibodies
One study reported on inter-observer concordance based upon a six-step scoring system which integrated the criteria employed by the four different clinical trial tests (28-8 and 22C3 [both Dako], SP142 and SP263 [both Roche]) and found moderate agreement using this harmonized approach (κ = 0.47 to 0.49) [6]. The study also reported good concordance coefficients (κ = 0.59 to 0.80) when using integrated dichotomous proportion cutoffs across the antibodies (≥ 1%, ≥ 5%, ≥ 10%, ≥ 50%); however, proportion scoring of PD-L1-positive IC yielded lower inter-observer concordance coefficients both for the six-step score (κ < 0.2) and the dichotomous cutoffs (κ = 0.12 to 0.25), concluding that unified PD-L1 IHC scoring criteria for TCs may be feasible, whereas scoring for ICs requires detailed training [6].

Bladder cancer
One study reported on the test-validation performance of a PD-L1 test in bladder/urothelial cell cancer for the antibody SP142 (Roche) and found it had acceptable inter-reader concordance between pathologists (> 90%) when measuring PD-L1 expression in both IC and TC in bladder/urothelial cell cancer [3].

Discussion
The results of this SLR demonstrate that there are varied cutoff and scoring algorithm approaches among the commercially available PD-L1 antibody tests in lung cancer. There is, for example, no commonly accepted standard or threshold for determining positivity for each of the antibodies based on the proportion of PD-L1positive cells. Further differences between scoring algorithms relate to the way in which staining patterns are interpreted; some studies have investigated the use of proportional scoring [4, 7, 10-12, 14, 19-21, 23, 28] for the respective antibodies, whereas other studies have looked at hybrid test-scoring methods that also take into account staining intensity [5, 13, 15-18, 22, 24-27].
In general, our review found that the concordance between tests developed in a clinical trial setting to accompany a therapeutic product was deemed acceptable, with inter-reader concordance exceeding 90% [7]. This finding is mirrored in recently published data from phase 1 of the Blueprint Project, which explored the  , and Roche SP263) and found comparable results across the tests when applied to assess TC staining in NSCLC, although the test SP142 resulted in fewer stained TCs overall (phase 2 of this project is now underway and will seek to validate these findings and also provide data on a fifth assay developed by Dako that uses the antibody 73-10). Our SLR did, however, find conflicting evidence concerning concordance when different antibodies developed in a clinical trial setting to accompany a therapeutic product were compared with those developed outside this type of setting, such as E1L3N [11,15,20]. Our findings are in line with other reviews in this topic area (which were performed non-systematically), which have also reported on the variations in cutoffs used for different antibodies to determine PD-L1 positivity [30][31][32]. In particular, our research did not identify a definitive threshold result that can be universally applied to predict clinical response to PD-L1-targeted precision treatments, which has been noted previously by Festino et al. [30]. There were also differences among the studies included in our review in terms of the types of cells that were tested for PD-L1 expression (i.e., TCs only, or TCs and ICs), with some studies [13,15] also noting differences in staining patterns and concordance depending on whether biopsy and surgical resection samples were tested. Two recent review articles have also reported that cell type can play a key role in determining test outcomes. Specifically, these publications have indicated that ICs express significantly higher levels of PD-L1 than TCs (e.g., Ma et al. [31] and Festino et al. [30]) and that the expression by TCs is sometimes more heterogeneous compared with that of ICs. It has also been theorized that different cell phenotypes/characteristics may also contribute to this variability in PD-L1 expression across cancer cells [32].
One limitation of our review is that of the existing commercially marketed tests considered, most were IHC tests, with only three studies reporting on QIF [8,20] and FISH [12]. We did not, for example, find any data on multimarker or next-generation tests that identify PD-L1 expression. In addition, only limited evidence was found on PD-L1 tests in bladder/urothelial cell cancer, and there were no validation studies for commercially available tests in gastric or ovarian cancers.
The heterogeneity in the findings of this review has important implications for clinical practice. Notably, the lack of standard thresholds for responder identification and concordance between a subset of tests indicates the existence of (1) potential risks for efficient treatment selection and use of precision therapies; (2) confusion about whether it is important to request a particular PD-L1 test; and (3) potential adverse effects on patient management decisions (e.g., if the test thresholds used in clinical practice do not correspond with those used in the clinical trials in which particular IHC clones were developed and validated, and in which treatment efficacy was demonstrated, the patient may be inaccurately identified as a potential therapy recipient). However, it is also important to note that no study from our search results reported evidence for these possibilities. Ambiguity around test thresholds, decision algorithms, and interchangeability of PD-1/PD-L1 testing could also present uncertainty for those payers who view accurate prediction of the subpopulation of treatment responders as being a key value of precision therapy approaches. Where there is variability in the interpretation or selection of particular tests, there is the potential for physician confusion, interpretation dilemmas, and payer uncertainty. There are illustrative examples of such difficulties from previous attempts to introduce biomarker testing to the selection of precision therapy and patient management. In the case of IHC and molecular testing for epidermal growth factor receptors, for instance, the substantial variability in test cutoffs or thresholds and the potential for variable interpretation of early-generation tests have been well documented. Following early introduction of tests for this marker and initial launch of EGFR-targeted agents, some health technology assessment and payer organizations (notably, large commercial health plans in the United States and the Canadian Agency for Drugs and Technologies in Health [33] in Canada) had concerns around interpretation and selection of some EGFR tests, arguing that the connection between test results and patient management or treatment selection was insufficiently clear. Another example occurred in the years immediately following the launch of trastuzumab, when there was significant controversy among physicians over the selection of HER2 IHC vs. FISH testing that led, in some cases, to slower uptake of the associated precision medicines. When clinical practice guidelines were updated to indicate that IHC testing should be conducted initially, with a subset of these patients receiving receiving FISH testing for confirmation, this clarified the appropriate clinical testing pathway for prescribing trastuzumab [34]. These instances of uncertainty about how companion diagnostic tests should be interpreted and used had implications for access to precision treatments in some markets, and/or influenced uptake and use of these medicines and their associated tests [34][35][36].
Conducting additional studies and increasing both interpretation and education about test cutoffs would help to better inform the use of PD-1/PD-L1 diagnostics and ensure more consistent clinical assessment and application of the class of PD-1/PD-L1 inhibitors [31]. In addition, the available literature suggests that greater understanding is needed on the interchangeability of these PD-L1 tests for predicting response to anti-PD-L1 and anti-PD-1 targeted therapies. Such evidence would be crucial for supporting decisionmaking in a context where multiple PD-L1 tests are available (which seem to have variable validity in inter/intra-observer and inter/intra-site concordance) and where findings are not always consistent or reproducible across tests.

Conclusions
Most validation-metric data available for PD-L1 tests relate to the use of IHC tests in the context of lung cancer, and this evidence raises some key challenges that may influence the uptake of PD-L1 testing. In particular, standardization among available PD-L1 IHC tests is currently lacking (with regard to antibodies used, cutoffs/thresholds for a given antibody, and differences in scoring algorithm and test sites) and there is limited information on the extent, if any, to which the tests might be interchangeable. Developing strategies to address this variability in available IHC tests and publishing data that clarify the value of non-IHC-based approaches, such as FISH and next-generation tests that incorporate PD-L1, will be important to address as the availability of precision treatments focused on these biomarkers continues to increase.