Out-of-sample extension of diffusion maps in a computer aided diagnosis system. Application to breast cancer virtual slide images
© Philippe et al; licensee BioMed Central Ltd. 2013
Published: 30 September 2013
While the number of pathologists is dropping dramatically, the number of pathological cases to be examined is rising sharply, mainly because of early screening campaigns; automated systems would thus be useful to help pathologists in their daily work. As Virtual Microscopy (VM) is increasingly introduced in pathology departments, where it holds immense potential despite the large amounts of data to be managed, its combination with image processing techniques makes it possible to find objective criteria for differential diagnosis or to quantify prognostic markers. Thus, many works try to develop computer-aided diagnosis systems (CADS) based on image retrieval and classification [2, 3]. The first step consists in building a knowledge database involving many features extracted from a set of well-known images; it is an 'off-line' procedure conducted once. These features are represented by vectors of non-linear data acting as a signature for the original images. In a second step, signatures are obtained from new unknown images to be analyzed and are compared with the database; it is an 'on-line' procedure. Because of tumor heterogeneity, it is essential to build knowledge databases containing representative features of the multiple morphological types of lesions before considering the implementation of a CADS. However, as it is almost impossible for a pathologist to manually segment large virtual slide images (VSI), the usual practice consists in manually selecting some 'representative areas'. This choice is inherently subjective and therefore introduces a bias in the process. It is thus necessary to find better solutions leading to an unbiased collection of these 'representative areas' (later called 'patches').
In a previous work, we proposed an original strategy: starting from a collection of breast cancer VSI, and taking advantage of stereological sampling methods and diffusion maps, a knowledge database is obtained from a reduced number of patches that are representative of given histological types. The sampling tools offered by stereology are well suited to this context. Systematic sampling, starting from a random point and using a fixed periodic interval, reduces the area to be analyzed while preserving the collection of distinctive regions encountered in a tumor. However, even if the working area becomes smaller, the number of selected patches can be very large and may include many redundant elements. A data reduction therefore has to be conducted. Among the available methods, the diffusion maps technique [6, 7] has been retained since it provides a very attractive framework for processing and visualizing large sets of non-linear data. Diffusion maps belong to the unsupervised learning algorithms based on a spectral analysis of non-linear data; they provide a clustering only for the given training points, with no straightforward extension to out-of-sample cases. The work presented here focuses on a way to get around this problem and explains how unknown VSI can be classified by treating diffusion maps as learning the eigenfunctions of a data-dependent kernel. It makes use of the Nyström formula to estimate the diffusion coordinates of new data. An application to histological types of breast cancer is presented with VSI of Invasive Ductal Carcinoma and Mastosis.
VSI come from histological sections of breast tumors stained in the same laboratory according to the Hematoxylin-Eosin-Saffron protocol and acquired with the same digital scanner (a ScanScope CS from Aperio Technologies). Since the aim is to develop a generalized CADS, it is mandatory to manage the color calibration of each device used along the process, from histological staining up to image acquisition. For this study, we have collected image patches from two histological types, Invasive Ductal Carcinoma (IDC) and Mastosis (Ma), together with patches of 'normal' morphology so that non-informative patches can later be removed. VSI have been acquired at X20 (0.5 µm per pixel) and stored in TIFF 6.0 file format (compression 30%). The tools are developed in the Python language with the help of specialized modules (PIL: Python Imaging Library, SciPy and matplotlib).
In order to reduce the expert's workload and to obtain a reliable ground truth, a stereological test grid for point counting is superimposed onto the VSI in the ImageScope viewer. The grid step has been set to 1000 x 1000 pixels (3500 points on average per image). The pathologist then has to determine which histological class is associated with the local area centered on each grid point; 30 possibilities are proposed for breast tumors. A simple mark has to be drawn on the grid point, in the overlay layer whose name corresponds to the chosen class. Each area is then extracted at full resolution and stored as an uncompressed TIFF image. These areas (also called 'patches') are squares of 400 x 400 pixels. This size has been chosen according to the representative structures encountered in breast tumors, and it means that only 16% of a VSI has to be assessed by the expert (400 x 400 pixels out of each 1000 x 1000 grid cell).
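The sampling scheme described above can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the tool used in the study: the function names `systematic_grid` and `sampled_fraction` are hypothetical, the random origin is drawn uniformly inside one grid period, and patches that would cross the slide border are simply discarded.

```python
import random


def systematic_grid(width, height, step=1000, patch=400, seed=None):
    """Return the top-left corners of square patches centred on a systematic grid.

    Hypothetical helper: the grid has a fixed period `step` and a uniformly
    random origin, as in stereological systematic sampling.
    """
    rng = random.Random(seed)
    # Random position of the first grid point inside one grid period.
    x0 = rng.randrange(step)
    y0 = rng.randrange(step)
    half = patch // 2
    corners = []
    for cy in range(y0, height, step):
        for cx in range(x0, width, step):
            left, top = cx - half, cy - half
            # Assumption: keep only patches lying entirely inside the slide.
            if left >= 0 and top >= 0 and left + patch <= width and top + patch <= height:
                corners.append((left, top))
    return corners


def sampled_fraction(step=1000, patch=400):
    """Fraction of the slide area covered by the patches (one patch per grid cell)."""
    return (patch * patch) / (step * step)
```

With the values given in the text (a step of 1000 pixels and patches of 400 x 400 pixels), `sampled_fraction()` evaluates to 0.16, which matches the 16% figure quoted above.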
In fact, p(x_i, x_j) may be viewed as the transition kernel of the Markov chain on G. In other words, p(x_i, x_j) defines the transition probability of going from x_i to x_j in one time step. The eigenvectors Π_k of P, ordered by decreasing positive eigenvalues, give the axes of the practical observation space. Note that Π_0 is never used, since it is associated with the eigenvalue λ = 1 (i.e. the data set mean, or trivial solution). The projection is then done along (Π_1, Π_2, Π_3) for a 3D visualization. Choosing the scale parameter in w(x_i, x_j) is an empirical task; the value should allow a moderate decrease of the exponential. Some works use the median value of all distances D_KL(x_i, x_j), whereas others use the mean distance obtained from the k nearest neighbors of a subset of X.
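The spectral step described above can be sketched as follows with NumPy. This is a minimal sketch, not the authors' implementation: the function name `diffusion_map` is hypothetical, the kernel form w = exp(-D^2 / scale^2) is an assumption (the text only states that w is a decreasing exponential of the distance), and the default scale is the median of all pairwise distances, one of the two heuristics mentioned. The input is any precomputed symmetric distance matrix.

```python
import numpy as np


def diffusion_map(D, scale=None):
    """Eigen-decomposition of the Markov matrix P built from a distance matrix D.

    Returns (eigenvalues, right eigenvectors of P), sorted by decreasing
    eigenvalue. Column 0 is the trivial solution (eigenvalue 1).
    """
    D = np.asarray(D, dtype=float)
    if scale is None:
        # Heuristic from the text: median of all pairwise distances.
        scale = np.median(D[np.triu_indices_from(D, k=1)])
    # Assumed kernel form (Gaussian-like); the exact expression is an assumption.
    W = np.exp(-(D ** 2) / scale ** 2)
    d = W.sum(axis=1)
    # S = D^{-1/2} W D^{-1/2} is symmetric and has the same eigenvalues as
    # P = D^{-1} W, which allows the use of the stable symmetric solver.
    S = W / np.sqrt(np.outer(d, d))
    vals, vecs = np.linalg.eigh(S)
    order = np.argsort(vals)[::-1]
    vals, vecs = vals[order], vecs[:, order]
    # Right eigenvectors of P are recovered as D^{-1/2} v.
    Pi = vecs / np.sqrt(d)[:, None]
    return vals, Pi
```

For a 3D visualization, the coordinates would then be taken as `Pi[:, 1:4]`, skipping the first column, which is constant and carries no information.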
where λ_i^(m) and u_i^(m) are the i-th diagonal entry of Λ^(m) and the i-th column of U^(m) respectively. P_{N,M} is an n x m sub-matrix of the complete graph obtained from w(x_i, x_j). Its computation is an 'on-line' procedure that has to be conducted for each new test set (X\Y). For a 3D visualization, the second to fourth columns are used (the first one being the trivial solution).
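The out-of-sample step can be sketched in Python as follows. This is a hedged illustration of a Nyström-type extension, not the authors' code: the function name `nystrom_extend` is hypothetical, the kernel form exp(-D^2 / scale^2) is an assumption, and the rows of the n x m matrix are normalized over the m training points only. Each new point receives, for every retained eigenvector, the transition-weighted average of the training coordinates divided by the corresponding eigenvalue.

```python
import numpy as np


def nystrom_extend(D_new, scale, eigvals, eigvecs):
    """Estimate diffusion coordinates of n new points from m training points.

    D_new   : (n, m) distances between new points and training points
    scale   : kernel scale used when the training map was computed
    eigvals : (k,) eigenvalues of the training Markov matrix
    eigvecs : (m, k) matching right eigenvectors of the training Markov matrix
    """
    # Assumed kernel form, identical to the one used for the training set.
    W_new = np.exp(-(np.asarray(D_new, dtype=float) ** 2) / scale ** 2)
    # Row-normalise to obtain transition probabilities towards training points.
    P_new = W_new / W_new.sum(axis=1, keepdims=True)
    # Nystrom-type formula: extended coordinate = (1 / lambda) * P_new @ u.
    return (P_new @ eigvecs) / eigvals
```

A useful sanity check of this formulation is that extending the training points onto themselves returns their original coordinates, since the training eigenvectors satisfy P u = λ u.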
[Table: Computation time on a PC (dual core). Columns: features extraction (in seconds), spectral analysis (in seconds).]
[Table: Histological type in the nearest neighborhood. Column: number of test points.]
This work is the second part of a CADS we aim to develop, based on an original strategy starting from VS and leading to an unbiased knowledge database containing reference patches of breast tumors. The first part has been presented previously. We have shown that combining stereological sampling and data reduction based on diffusion maps offers an interesting general framework. The results illustrated here are a proof of concept of the second part, which is to classify new unknown patches. About 400 high resolution VS are now available in our lab; the benign and malignant breast tumors are classified into 30 histological types and subtypes. We plan to project reference patches extracted from these 30 classes in the same 3D space in order to build clusters, and then to classify a new unknown VS previously split into patches. However, the spectral decomposition is very CPU intensive, and managing for example 30 000 patches at a time (1 000 per histological type) would rapidly become impossible to compute. The Nyström extension seems to provide a good approximation of the eigenvectors, which would then reduce this computational burden.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.