Specimen and sample stability
To assess RNA stability, results for 21 samples were obtained on 2 separate days, 3 months apart. Complementary DNA was generated for each of the two experiments using the original isolated RNA sample. Comparison of probability score values (range 0.0–1.0) showed high correlation (R2 = 0.99, p < 0.001), 100% concordance in binary risk classification (Class 1 or Class 2) and 90% concordance on subclass risk classification (Class 1A, 1B, 2A, or 2B) (18 of 21 cases; 95% confidence interval [CI] = 70–99%). Analysis was also performed on an additional 20 samples tested on 2 separate days at intervals ranging from 48 to 122 days apart. Again, probability score values were highly correlated (R2 = 0.96, p < 0.02) with 100% concordance (95% CI = 83–100%) in binary class assignment. Risk classification using normal and reduced confidence subclasses was concordant in 18 of 20 (90%) cases (95% CI = 68–99%).
To evaluate long-term cDNA stability, we monitored reproducibility of assay performance for one Class 1 and one Class 2 cDNA sample (positive controls) included from experiment to experiment and across multiple lots of reagents (Fig. 2). Two negative water controls without template were also included with each OpenArray run over a 3-month period in which 56 assays were performed. No assays were rejected due to amplification in the negative controls. The Class 1 positive control sample had a mean quantitative probability score of 0.176 (SD = 0.029, 2SD = 0.059 and 3SD = 0.088) and the Class 2 positive control sample had a mean probability score of 0.752 (SD = 0.027, 2SD = 0.055, and 3SD = 0.082), reflecting robust assay repeatability.
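The control statistics above can be turned into acceptance windows of the form mean ± k·SD. The following is a minimal sketch, assuming such windows are used to check a control result; the source does not describe a run-acceptance rule based on these limits, and the function names `control_limits` and `within_limits` are hypothetical.

```python
def control_limits(mean, sd, k):
    """Return (lower, upper) limits at mean +/- k standard deviations."""
    return (mean - k * sd, mean + k * sd)


def within_limits(score, mean, sd, k=3):
    """True if a control score falls inside the mean +/- k*SD window."""
    lower, upper = control_limits(mean, sd, k)
    return lower <= score <= upper


# Means and SDs as reported in the text for the two positive controls.
CONTROLS = {
    "Class 1": {"mean": 0.176, "sd": 0.029},
    "Class 2": {"mean": 0.752, "sd": 0.027},
}

for name, c in CONTROLS.items():
    lo2, hi2 = control_limits(c["mean"], c["sd"], 2)
    lo3, hi3 = control_limits(c["mean"], c["sd"], 3)
    print(f"{name}: 2SD [{lo2:.3f}, {hi2:.3f}]  3SD [{lo3:.3f}, {hi3:.3f}]")
```

Because the limits here are computed from the rounded SDs, they can differ in the third decimal place from the 2SD and 3SD values quoted in the text. With these inputs, the 3SD window is roughly 0.089 to 0.263 for the Class 1 control and 0.671 to 0.833 for the Class 2 control, so neither window crosses the 0.5 cutoff.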
Short-term cDNA stability was also evaluated. Ten samples underwent reverse transcription and the cDNA was stored for 96 h per standard operating procedures; RNA from the same 10 samples was then reverse transcribed on the day the assay was performed. All samples were run on a single assay and the resulting probability scores from the two groups were compared. We found probability score values to be significantly correlated (R2 = 0.89, p < 0.05), and both subclass and binary risk classifications were 100% concordant (95% CI = 69–100% for both).
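The confidence intervals quoted for concordance are intervals for a binomial proportion. The sketch below assumes an exact (Clopper-Pearson) interval; the source does not state which interval method was used, so this is one plausible reconstruction rather than the authors' procedure. It uses only the standard library and solves for the bounds by bisection.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def _solve_increasing(f, target, iters=100):
    """Bisect on [0, 1] for f(p) == target, where f is increasing in p."""
    lo, hi = 0.0, 1.0
    for _ in range(iters):
        mid = (lo + hi) / 2
        if f(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(k, n, alpha=0.05):
    """Exact two-sided (1 - alpha) interval for k successes in n trials."""
    # Lower bound: P(X >= k | p) = alpha / 2
    lower = 0.0 if k == 0 else _solve_increasing(
        lambda p: 1 - binom_cdf(k - 1, n, p), alpha / 2)
    # Upper bound: P(X <= k | p) = alpha / 2, i.e. P(X > k | p) = 1 - alpha / 2
    upper = 1.0 if k == n else _solve_increasing(
        lambda p: 1 - binom_cdf(k, n, p), 1 - alpha / 2)
    return lower, upper


for k, n in [(10, 10), (18, 20)]:
    lo, hi = clopper_pearson(k, n)
    print(f"{k} of {n}: {100 * k / n:.0f}% (95% CI {100 * lo:.0f}-{100 * hi:.0f}%)")
```

Under this assumption, 10 concordant results of 10 give an interval of about 69–100%, and 18 of 20 give about 68–99%, in line with the values reported for those comparisons.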
To further examine sample stability, the success rates of GEP processing at various time points after diagnosis were assessed. We examined a total of 6772 FFPE-derived samples with documented age of specimen that were stored for up to 1 year, 1–2 years, 2–3 years, 3–4 years, or greater than 4 years prior to GEP testing. Overall, we observed a 98% (6647 of 6772) success rate across all specimens. There was a slight decrease in success rates in samples that had been stored for longer periods of time (p < 0.0001; Fig. 3).
We also examined the effect of delay in sample processing on the stability of the 31-gene GEP assay. We evaluated outcomes in 275 retrospective research samples processed 1.5 to 16 years after diagnosis, for which a significant association between GEP Class and recurrence-free survival had been previously published [8, 9]. A multivariate Cox regression model evaluating the interaction between GEP Class and time to sample processing showed no effect of delay in processing time (p = 0.25) and no statistical interaction between the sample age and GEP Class covariates (p = 0.51). These data indicate that delay in processing does not alter the association between Class assignment and recurrence risk.
DecisionDx-Melanoma assay reliability
To assess inter-assay reliability of the 31-gene expression profile test, results were obtained on two separate days for 168 clinical melanoma samples. The time interval between the testing of matched samples ranged from 1 day to greater than 6 weeks. A total of 44 clinical samples were analyzed using the 7900HT Real-Time PCR System, and 124 samples were analyzed using the QuantStudio Real-Time PCR System. Comparison of probability score values (range 0.0–1.0) resulted in highly correlated scores (R2 = 0.96, p < 0.001; Fig. 4a). Binary risk classification was concordant for 167 of 168 (99%, 95% CI 96–100%) cases and subclassification was concordant for 155 of 168 (92%, 95% CI 87–96%) cases. The single case changing from Class 1 to 2 generated probability scores close to the 0.5 cutoff in the first run (0.476). Overall, the mean absolute difference in matched probability scores was 0.03 and showed 95% of variability to be within acceptable limits and not likely to change class assignment, as determined by Bland–Altman analysis (Fig. 4b).
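A Bland–Altman analysis of paired runs reduces to the mean of the paired differences (the bias) and limits of agreement at bias ± 1.96 SD of the differences. The sketch below illustrates that calculation; the paired scores are invented for illustration and are not data from this study, and the function name `bland_altman` is hypothetical.

```python
from statistics import mean, stdev


def bland_altman(run1, run2):
    """Bias, 95% limits of agreement, and mean absolute difference
    for paired measurements from two runs."""
    diffs = [a - b for a, b in zip(run1, run2)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return {
        "bias": bias,
        "lower_loa": bias - 1.96 * sd,
        "upper_loa": bias + 1.96 * sd,
        "mean_abs_diff": mean([abs(d) for d in diffs]),
    }


# Illustrative probability scores only (not study data).
run1 = [0.10, 0.25, 0.48, 0.80, 0.92]
run2 = [0.12, 0.22, 0.52, 0.78, 0.95]

result = bland_altman(run1, run2)
for key, value in result.items():
    print(f"{key}: {value:.3f}")
```

In a plot, each pair would appear as a point at (mean of the two scores, difference), with horizontal lines at the bias and at the two limits of agreement.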
We evaluated intra-assay reliability by obtaining results from 7 samples run in triplicate on a single OpenArray plate. The process was repeated on 3 separate runs for a total of 21 samples (63 results). Binary classification resulted in 100% concordance (95% CI 94–100%) while subclassification resulted in 98% concordance (62 of 63; 95% CI 91–100%).
Lot-to-lot variability for critical reagents was evaluated in experiments ranging from 4 to 19 samples and with 2–6 reagent lots. Correlation of discriminant scores was above 0.96 for all experiments, with binary class concordance of 100% in all cases and subclass concordance above 90% for all but one reagent, for which subclass concordance was 75% because one of only four samples was discrepant (Additional file 1: Table S1).
Inter-platform reliability was assessed by comparing probability scores generated from 21 samples tested on both the 7900HT and QuantStudio systems. The results indicated significant correlation of probability score values between the two systems (R2 = 0.85, p < 0.001; Fig. 4c), and concordant subclass prediction was observed for 90% of cases (19 of 21). One of the matched probability score values generated for each of the two discordant cases was in the reduced confidence range (0.421, Class 1B and 0.513, Class 2A). The mean absolute difference in probability scores between instruments was 0.06 (Fig. 4d).
Twenty-two samples were run on two different QuantStudio instruments and the resulting probability scores were compared to evaluate inter-instrument reliability. Probability score values were highly correlated (R2 = 0.99, p < 0.001) and binary classification was concordant in 21 of 22 (95%) cases (95% CI 88–100%). The mean absolute difference in probability score values between instruments was 0.02.
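The agreement measures used in these comparisons (R2 between paired probability scores and concordance of binary class) can be sketched as follows. This is a minimal illustration, assuming R2 is the squared Pearson correlation and that a score at or above the 0.5 cutoff is assigned Class 2; the handling of a score exactly equal to 0.5 is an assumption, the helper names are hypothetical, and the example scores are not study data.

```python
def r_squared(x, y):
    """Squared Pearson correlation between paired values."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy**2 / (sxx * syy)


def binary_class(score, cutoff=0.5):
    """Assumed rule: scores at or above the cutoff are Class 2."""
    return "Class 2" if score >= cutoff else "Class 1"


def binary_concordance(x, y, cutoff=0.5):
    """Return (number of pairs with matching binary class, number of pairs)."""
    matches = sum(
        binary_class(a, cutoff) == binary_class(b, cutoff) for a, b in zip(x, y)
    )
    return matches, len(x)


# Illustrative paired scores from two instruments (not study data).
instrument_a = [0.08, 0.21, 0.47, 0.66, 0.91]
instrument_b = [0.10, 0.19, 0.52, 0.69, 0.93]

matches, total = binary_concordance(instrument_a, instrument_b)
print(f"R2 = {r_squared(instrument_a, instrument_b):.3f}")
print(f"binary concordance: {matches} of {total}")
```

The third illustrative pair straddles the cutoff, which shows how a small score difference near 0.5 can produce a discordant class call even when the scores are highly correlated overall.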
Inter-operator reproducibility of the predictive modeling algorithm
To evaluate inter-operator reliability of the JMP Genomics predictive modeling software, RBM analysis of gene expression data for 268 clinically tested melanoma samples was performed separately by two personnel on multiple days. Quantitative probability scores generated by both analyses were identical (R2 = 1.0, p < 0.001; data not shown), and qualitative subclass and binary class prediction was concordant for all 268 cases (100%).
DecisionDx-Melanoma technical experience
From March 1, 2013 through June 30, 2016, DecisionDx-Melanoma testing was requested for 8244 primary melanoma cases from 1123 centers in the United States and Spain. Samples submitted for DecisionDx-Melanoma testing must have a sufficient density of tumor cells in order to proceed with gene expression profiling. Of the 8244 specimens, 1221 (15%) had insufficient tumor content for testing. As shown in Fig. 5, 90% of the 1221 samples with insufficient tumor density were submitted during the period from March 1, 2013 to December 31, 2015, reflecting a 20% rate of insufficient tissue for testing. Quality control studies completed in March 2015 permitted a decrease in the required tumor content from ≥60% to ≥40% melanoma within a macro-dissectible area of the tissue section. This, coupled with efforts to improve biopsy tissue preservation at the local processing level (including educational outreach to pathology laboratory staff, pathologists, and ordering clinicians), resulted in a dramatic reduction in the number of insufficient specimens. From January 1, 2016 to June 30, 2016, only 4.4% (124 of 2806) of samples lacked sufficient tumor content, reflecting a 78% reduction in quality control rejections compared to the previous period (Fig. 5). No change in the proportion of thin tumor (≤1 mm Breslow thickness) cases was observed in this period.
Overall, 98% (6895 of 7023) of cases submitted with sufficient tumor volume were successfully tested and reported, with only 1.8% of cases having a reported technical failure due to amplification failure in control and/or prognostic genes. The technical success rate increased to 99% (2647 of 2682) for the period of January 1, 2016 to June 30, 2016 (Fig. 5).
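The rejection rates quoted above follow from the reported counts. The short calculation below reproduces them, assuming the two periods (March 1, 2013 to December 31, 2015, and January 1, 2016 to June 30, 2016) together account for all 8244 requests, so that the earlier-period counts can be obtained by subtraction; variable names are illustrative.

```python
# Counts reported in the text.
total_requested = 8244
total_insufficient = 1221
late_requested = 2806      # January 1, 2016 to June 30, 2016
late_insufficient = 124

# Earlier-period counts, derived by subtraction (assumes the two
# periods partition the full set of requests).
early_requested = total_requested - late_requested            # 5438
early_insufficient = total_insufficient - late_insufficient   # 1097

early_rate = early_insufficient / early_requested
late_rate = late_insufficient / late_requested
early_share = early_insufficient / total_insufficient
relative_reduction = 1 - late_rate / early_rate

print(f"early-period insufficient rate: {early_rate:.1%}")
print(f"late-period insufficient rate:  {late_rate:.1%}")
print(f"share of insufficient samples from early period: {early_share:.0%}")
print(f"relative reduction in rejections: {relative_reduction:.0%}")
```

The 78% figure is a relative reduction in the rejection rate (from about 20% to 4.4%), not a difference in percentage points.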