Determining similarity in histological images using graph-theoretic description and matching methods for content-based image retrieval in medical diagnostics
 Harshita Sharma^{1, 2},
 Alexander Alekseychuk^{2},
 Peter Leskovsky^{2},
 Olaf Hellwich^{2},
 RS Anand^{1},
 Norman Zerbe^{3} and
 Peter Hufnagl^{3}
DOI: 10.1186/1746-1596-7-134
© Sharma et al.; licensee BioMed Central Ltd. 2012
Received: 14 August 2012
Accepted: 9 September 2012
Published: 4 October 2012
Abstract
Background
Computer-based analysis of digitalized histological images has been gaining increasing attention due to their extensive use in research and routine practice. The article aims to contribute towards the description and retrieval of histological images by employing a structural method using graphs. Due to their expressive ability, graphs are considered a powerful and versatile representation formalism and have received growing attention, especially from the image processing and computer vision community.
Methods
The article describes a novel method for determining similarity between histological images through graph-theoretic description and matching, for the purpose of content-based retrieval. A higher-order (region-based) graph representation of breast biopsy images has been obtained, and a tree-search-based inexact graph matching technique has been employed that facilitates the automatic retrieval of images structurally similar to a given image from large databases.
Results
The results obtained and the evaluation performed demonstrate the effectiveness and superiority of graph-based image retrieval over a common histogram-based technique. The complexity of the employed graph matching has been reduced compared to state-of-the-art optimal inexact matching methods by applying a prerequisite criterion for the matching of nodes and a sophisticated design of the estimation function, especially the prognosis function.
Conclusion
The proposed method is suitable for the retrieval of similar histological images, as suggested by the experimental and evaluation results obtained in the study. It is intended for use in applications requiring Content Based Image Retrieval (CBIR) in the areas of medical diagnostics and research, and can also be generalized to the retrieval of other types of complex images.
Virtual Slides
The virtual slide(s) for this article can be found here: http://www.diagnosticpathology.diagnomx.eu/vs/1224798882787923.
Keywords
Attributed Relational Graphs (ARG); Region of Interest (ROI); Breast tissue biopsy; Connected components; Graph-theoretic; A* search
Background
Histology may greatly benefit from the development of suitable automatic analysis methods. Histological image analysis can contribute towards diagnosis and treatment planning, as well as study and research work. Sometimes it is necessary to find the similarity between histological images or their regions: given a database of reference images and a query image, one or several images similar to the query need to be retrieved from the database. Content-based image retrieval (CBIR) can address this problem, particularly using a graph-based approach.
Pathologists make use of staining intensity, morphological changes and notably spatial relationships of tissue components during histopathological examinations. Designing a system which retrieves sample regions being structurally similar to a region in question can contribute towards automated detection of malignant changes. Besides research and education, clinical pathology is expected to benefit from such a system where visually interesting regions containing similar tissue structures can be selected and retrieved from existing large databases for further studies. Therefore, the work has been performed keeping in mind the generic nature of medical images as well as the specific nature of the histological data to be analysed, by exploiting the representational power of graphs to describe such complex images efficiently.
Tagare et al. have presented a content-based retrieval approach for medical image databases in [1], where it is strongly emphasised that medical image information contains spatial data and that a large part of image information is geometric. State-of-the-art general-purpose CBIR techniques using low-level features based on texture, colour and shape are insufficient for histological images, since these methods do not incorporate high-level structural information and neighbourhood relationships between image regions. An appropriate improvement in this direction is the use of structural methods based on graphs, which is explored in this paper.
Graphs have recently drawn increasing attention from the scientific community as effective structural descriptors due to their ability to represent relational information. They can provide efficient descriptions of images by associating nodes with specified attributes to image components, and edges with appropriate weights to relationships between these components. This property can be exploited to obtain graph-based representations of the database and query images, and then to search for structurally similar images by means of inexact graph matching, which involves calculation of a matching cost. The closest matches can then be obtained and displayed in decreasing order of similarity (i.e. increasing cost of matching). Hence, the aim of this work is to provide an algorithm for automatic content-based retrieval of similar images from large histological databases, a task which, at such scale, would not be feasible by human visual analysis alone.
In order to analyse histological images for diagnostic purposes, a semi-automatic method using low-level features of tissue images has been proposed in [2] for automatic selection of ROIs for further diagnosis. Kayser et al. [3] discuss information recognition algorithms that can be used for field-of-view detection in virtual microscopy by measuring diagnosis-relevant information. They include graph representations of tissues based on Voronoi diagrams. Some classification methods have been developed as tools for diagnostic assistance in histopathological examinations of lungs [4, 5].
Graph theory has also been used for information representation in the field of histology. The most common method is Delaunay triangulation (with the corresponding Voronoi diagram), where nuclear components of the tissue are considered as graph nodes [6, 7]. Minimum spanning trees can also be obtained from them. Probabilistic graphs, where nuclei form nodes and edges are assigned according to some probability distribution, have been proposed in [8]. However, all these graphs carry low-level (pixel-based) information of the image, unlike the graphs introduced in this work, which contain high-level (region-based) information related to structure and spatial relationships between regions.
Overview of Content Based Image Retrieval
In Information Retrieval (IR) systems, the user specifies a query either in the form of text, documents, images, or sounds and the system is expected to return the items that are semantically similar to the query in some sense. CBIR is an information retrieval system that includes techniques for retrieving digital images by their visual content. The horizon of CBIR includes methods ranging from image similarity functions to highly complex image annotation systems [9].
At present, CBIR is an extremely active area of research. Descriptions of a variety of CBIR approaches implemented in the past are given in the reviews [10] and [11]. CBIR has been applied to the medical domain, and a comprehensive review of medical CBIR systems is given by Müller et al. [12]. However, most of the recently developed retrieval methods are dedicated to radiological images [13]. Comparatively little research has addressed histological images specifically. An application to histopathological images is described in [14], using a property concept frame representation for morphological characteristics based on fuzzy logic. However, it does not emphasize the spatial relationships between the various tissue components, which are considered an important aspect in our work in order to describe the overall topology of the breast tissues.
The Diamond Project [15] addressed interactive search in large distributed data repositories. Its applications most relevant to the medical domain are MassFind [16], FatFind [17] and PathFind [18]. MassFind is an application for diagnosing lesions in mammograms, which focuses on the performance of different distance metrics for defining similarity between ROI images. FatFind exploits the nearly perfectly round shape of adipocytes in cell microscopy images for their automatic counting, making use of low-level shape features of cells. PathFind is a tool employing “discard-based search” for content-based retrieval of WSIs. However, all of these applications give less attention to high-level structural representation and retrieval algorithms specific to histological images, and more emphasis to the design and implementation strategies of the search methods, in order to handle huge data collections in large-scale, efficient, network-distributed frameworks.
 1. Signature calculation: Mathematically describing images based on the characteristics of their visual content. The mathematical description is called “signature” and may include intensity, colour, texture, shape, size, location or their mixtures [19]. The signature must be selected carefully, depending on the context, as it describes the content within the image.
 2. Similarity measure calculation: Assessing the similarity between a pair of images (query and database) and retrieving those database images having highest similarity to the query submitted to the system.
 1. The offline task includes feature extraction from the database images and calculation of their signatures, as well as the storage of the computed signatures. At this stage, there is no interaction with the user for the retrieval task.
 2. The online task includes analysis of the query image and its signature calculation. It also includes similarity computation, search and retrieval of similar database entries as well as interaction with the user through a GUI.
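The split into an offline and an online task can be sketched as follows. This is a minimal illustration and not the system described in the article: the names `build_index` and `retrieve`, the toy signature (mean and maximum of a list of grey values) and the L1 distance are all assumptions chosen to keep the example short.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Signature = Sequence[float]


def build_index(images: Dict[str, object],
                signature: Callable[[object], Signature]) -> Dict[str, Signature]:
    """Offline task: compute and store one signature per database image."""
    return {name: signature(img) for name, img in images.items()}


def retrieve(query: object,
             index: Dict[str, Signature],
             signature: Callable[[object], Signature],
             distance: Callable[[Signature, Signature], float],
             top_k: int = 3) -> List[Tuple[str, float]]:
    """Online task: rank database entries by increasing distance to the query."""
    q = signature(query)
    ranked = sorted(((name, distance(q, sig)) for name, sig in index.items()),
                    key=lambda pair: pair[1])
    return ranked[:top_k]


# Toy data (assumption): an "image" is a list of grey values.
def toy_signature(img):
    return (sum(img) / len(img), max(img))


def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))


database = {"a": [0, 0, 2], "b": [5, 5, 5], "c": [1, 1, 1]}
index = build_index(database, toy_signature)          # done once, offline
print(retrieve([1, 1, 2], index, toy_signature, l1, top_k=2))
```

Only `retrieve` runs at query time; the stored signatures are reused for every query, which is the reason for separating the two tasks.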
Graph Theory
A graph consists of a finite set of points, called nodes (or vertices), which are connected by lines called edges (or arcs). In this paper, a graph is considered as a 4-tuple G=(V,E,α,β), where

V is the finite set of vertices.

E ⊆ V × V is the set of edges.

α:V → L is a function assigning labels to the vertices.

β:E → L is a function assigning labels to the edges.
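The 4-tuple above maps naturally onto a small data structure. The following is a sketch under assumptions: the class name `AttributedGraph` and its methods are hypothetical, and the labelling functions α and β are stored as dictionaries whose key sets play the role of V and E.

```python
from dataclasses import dataclass, field
from typing import Dict, Hashable, Tuple


@dataclass
class AttributedGraph:
    """G = (V, E, alpha, beta); V and E are the key sets of alpha and beta."""
    alpha: Dict[Hashable, object] = field(default_factory=dict)                   # vertex -> label
    beta: Dict[Tuple[Hashable, Hashable], object] = field(default_factory=dict)   # edge -> label

    def add_vertex(self, v, label):
        self.alpha[v] = label

    def add_edge(self, u, v, label):
        # E is a subset of V x V, so both endpoints must already exist.
        if u not in self.alpha or v not in self.alpha:
            raise KeyError("both endpoints must be vertices of the graph")
        self.beta[(u, v)] = label

    @property
    def V(self):
        return set(self.alpha)

    @property
    def E(self):
        return set(self.beta)


g = AttributedGraph()
g.add_vertex(1, "lobule")
g.add_vertex(2, "fat")
g.add_edge(1, 2, 0.4)
print(g.V, g.E)
```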
Attributed Relational Graph
Regional Adjacency Graph
Graph Matching
 1. Graph Isomorphism: It finds an exact structural correspondence between two graphs. It is a bijective mapping between the node sets of the two graphs that preserves edges, so both graphs have the same number of nodes and edges. It is illustrated in Figure 4.
 2. Subgraph Isomorphism: If nodes along with their corresponding edges are deleted from a graph G, a subgraph G′, denoted by G′ ⊆ G, is obtained. A subgraph isomorphism from G to G″ is an isomorphism from a graph G″ to a subgraph G′ of G. It is shown in Figure 5.
 3. Monomorphism: It is a more relaxed matching than subgraph isomorphism, as extra edges are also allowed between nodes in the larger graph. Figure 6 illustrates monomorphism from graph G to graph G′. Formally: let G and G′ be graphs. A graph monomorphism between G(V,E,α,β) and G′(V′,E′,α′,β′) is an injective mapping F_{mono}: V→V′ such that:
$\alpha(v) = \alpha'(F_{\mathrm{mono}}(v)) \quad \forall\, v \in V.$ (1)
 4. Maximum Common Subgraph (MCS): An MCS of two graphs, G and G′, is a graph G″ that is a subgraph of both G and G′ and has the maximum number of nodes among all common subgraphs of G and G′. The MCS of two graphs is usually not unique. It can be used to measure the similarity of objects: the larger the MCS, the higher the similarity. It is shown in Figure 7. A common subgraph of G and G′, CS(G,G′), is a graph G″ such that there exist subgraph isomorphisms from G to G″ and from G′ to G″ or vice versa. G″ is an MCS of G and G′, MCS(G,G′), if it is the common subgraph with the maximum number of nodes.
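To make the monomorphism definition concrete, the check below tests a candidate mapping from G to G′. It is a sketch under assumptions: the function name `is_monomorphism` is hypothetical, graphs are undirected with edges stored as `frozenset` pairs, and, in addition to the label condition of equation (1), the usual edge-preservation condition (every edge of G maps onto an edge of G′) is assumed.

```python
def is_monomorphism(mapping, alpha, edges, alpha_p, edges_p):
    """Check a mapping V -> V' for injectivity, label and edge preservation.

    alpha / alpha_p map nodes to labels; edges / edges_p are sets of
    frozenset({u, v}) describing undirected edges of G and G'.
    """
    if set(mapping) != set(alpha):
        return False                      # every node of G must be mapped
    if len(set(mapping.values())) != len(mapping):
        return False                      # mapping is not injective
    if any(alpha[v] != alpha_p.get(mapping[v]) for v in alpha):
        return False                      # equation (1) is violated
    # Extra edges in G' are allowed; missing ones are not.
    return all(frozenset(mapping[v] for v in e) in edges_p for e in edges)


# G: two regions sharing a border. G': three mutually adjacent regions.
alpha = {"a": "lobule", "b": "fat"}
edges = {frozenset(("a", "b"))}
alpha_p = {1: "lobule", 2: "fat", 3: "lobule"}
edges_p = {frozenset((1, 2)), frozenset((2, 3)), frozenset((1, 3))}

print(is_monomorphism({"a": 1, "b": 2}, alpha, edges, alpha_p, edges_p))
print(is_monomorphism({"a": 2, "b": 1}, alpha, edges, alpha_p, edges_p))
```

The first mapping is accepted even though G′ has additional edges; the second is rejected because the labels of mapped nodes differ.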
Types of graph matching
 1. Exact matching: These methods find a strict correspondence between two graphs if it exists. Structurally, the mapping between the nodes of the two graphs must be ‘edge-preserving’, meaning that if two nodes in one graph are linked by an edge, they are mapped to two nodes in the other graph that are also linked by an edge. For ARGs, the matching ensures that the attributes are also identical in both graphs.
 2. Inexact matching: These algorithms do not find a strict correspondence between two graphs but a more relaxed one, as there may be a match between nodes where edges are not preserved. Further, for ARGs, the attributes of nodes and edges may differ. In this case, a cost (or distance) is calculated that takes into account the differences among the corresponding attributes, and the matching finds a mapping that minimizes this cost. It is used where the constraints imposed by exact matching are too strict for the graphs at hand, for example when the graphs are not identical to each other. Two types of inexact matching algorithms exist [24]:
 (a) Optimal inexact matching: These algorithms always find a solution that is the global minimum of the matching cost, i.e. they will find an exact solution if it exists. However, they are usually more expensive than exact ones, as they require exponential time and space due to the NP-completeness of the problem. For this reason, they are suitable only for graphs with a small number of nodes and edges.
 (b) Approximate or suboptimal matching: These algorithms only guarantee a local minimum of the matching cost. The local minimum found is often, but not always, close to the global minimum. However, even if an exact solution exists, they may not be able to find it.
The basic A* Search Algorithm
Dijkstra’s algorithm [25] starts with the source node and traverses the nodes in a graph such that the shortest path from the source found so far is extended first. Thus, when the goal node is reached, the shortest path is guaranteed to have been found. On the other hand, the Greedy Best-First-Search algorithm [26] selects the node closest to the goal by using a heuristic estimate of the distance of a node from the goal node, irrespective of its distance from the source. It thus finds a path to the goal quickly, but this path is not necessarily the shortest. The A* algorithm [27] was developed to combine formal approaches like Dijkstra’s algorithm and heuristic approaches like the Greedy Best-First-Search algorithm.
Path Scoring
Each node n is assigned the score
$f(n) = g(n) + h(n)$ (2)
where:

g(n) is the distance from the source node to node n.

h(n) is the heuristic function, used as an estimate of the minimum cost from the current node n to the goal node. It is important to choose a good heuristic function: the more accurate the heuristic, the fewer nodes have to be expanded before the goal node is reached.

f(n) is the current estimate of the cost of the shortest path to the goal node going through node n.
A* computes the sum f(n) of g(n) and h(n) as it moves from the source to the goal and selects the node with the lowest f(n) in each iteration. Let h^{∗}(n) be the true minimal cost from n to goal. The behaviour of the algorithm depends on the heuristic h(n) as [28]:

If h(n)=0, then A* turns into Dijkstra’s algorithm as only g(n) plays a role. It is guaranteed to find a shortest path.

If h(n) ≤ h^{∗}(n), then A* is guaranteed to find a shortest path. The lower h(n) is, the more nodes are expanded, making it slower.

If h(n)=h^{∗}(n), then A* follows an optimal path and expands no nodes off it, making it very fast. Hence, given a perfect heuristic, A* behaves perfectly.

If h(n) > h^{∗}(n), then A* is not guaranteed to find the shortest path, but it can run faster.

If h(n) ≫ g(n), then A* turns into the Greedy Best-First-Search algorithm, as only h(n) plays a role.
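The effect of the heuristic can be observed on a small example. The sketch below is a generic A* implementation on a weighted graph, not the matching procedure of this article; the function name `a_star`, the toy graph and the table of exact remaining costs are assumptions made for illustration. With h(n)=0 the search behaves like Dijkstra’s algorithm; with h(n)=h^{∗}(n) it expands fewer nodes while returning the same path.

```python
import heapq


def a_star(graph, source, goal, h=lambda n: 0.0):
    """Return (cost, path, expanded) for a cheapest path; graph[u] = {v: weight}."""
    frontier = [(h(source), 0.0, source, [source])]    # entries are (f, g, node, path)
    best_g = {source: 0.0}
    expanded = 0
    while frontier:
        f, g, node, path = heapq.heappop(frontier)
        if g > best_g.get(node, float("inf")):
            continue                                   # stale entry, a cheaper route is known
        if node == goal:
            return g, path, expanded
        expanded += 1
        for nxt, w in graph[node].items():
            g_new = g + w
            if g_new < best_g.get(nxt, float("inf")):
                best_g[nxt] = g_new
                heapq.heappush(frontier, (g_new + h(nxt), g_new, nxt, path + [nxt]))
    return float("inf"), [], expanded


graph = {
    "S": {"A": 1, "B": 4, "C": 1},
    "A": {"B": 2, "G": 6},
    "B": {"G": 2},
    "C": {"D": 1},
    "D": {"G": 10},
    "G": {},
}
# Exact remaining cost h*(n) for every node of the toy graph.
h_star = {"S": 5, "A": 4, "B": 2, "C": 11, "D": 10, "G": 0}

dijkstra = a_star(graph, "S", "G")                       # h(n) = 0
informed = a_star(graph, "S", "G", h=lambda n: h_star[n])
print(dijkstra)
print(informed)
```

Both calls return the same path and cost; only the third element of the result, the number of expanded nodes, differs.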
Implementation
Properties
 1. Completeness: A* is complete: it evaluates the possible paths from source to goal and returns a solution if one exists.
 2. Admissibility: A* is admissible if h(n) is a lower bound on the true minimal cost h^{∗}(n), i.e. h(n) ≤ h^{∗}(n) ∀n. It then finds an optimal path from source to goal if one exists.
 3. Complexity: The time complexity depends on h(n). In the worst case, when h(n) is very small, the number of nodes traversed is exponential in the length of the shortest path. However, when the search space is a tree (which holds in the case considered here), the goal is a single node, and the error of h(n) does not grow faster than the logarithm of h^{∗}(n),
$|h(n) - h^{*}(n)| = O(\log h^{*}(n))$ (3)
then the number of nodes traversed becomes polynomial [26].
Methods
 1. Image acquisition: H&E-stained breast biopsies are used in this study. Specimens are digitized and the whole-slide images (WSIs) are rescaled to about 100x effective magnification for further experimentation.
 2. Image segmentation: In this step the images are prepared for graph-based description. It involves segmentation of the images as well as removal of artefacts and identification of connected components in each segmented image.
 3. Graph-theoretic representation: The segmented images are then represented using ARGs, which involves the description of nodes and edges.
 4. Graph matching: The graph representing the query image is then compared to a database of previously generated graphs in order to retrieve the most similar images based on the distance between the graphs. A graph matching algorithm based on the A* search is used.
 5. Display of the closest matches: The images from the database are arranged in order of decreasing similarity based on the cost of graph matching, and the top results are displayed to the user.
Image segmentation
 1. Soft pixel classification: The likelihood of belonging to a tissue of a particular type is calculated for each image pixel based on texton-based texture descriptions. The segmentation decision is made for every point (local area) on the MAP (Maximum A Posteriori) principle, based on previously learned texture descriptions of all allowed tissue classes.
 2. Region segmentation: Grouping of pixels and hard label assignment are performed based on spatial label coherence and similarity to the texture models obtained in the previous stage. This optimal grouping is performed using the Graph-cut [29] algorithm.
The maximum size of the pixel area used for decision making depends on the tissue type. In these experiments, 16×16 pixels for epithelial, 32×32 pixels for connective, 48×48 pixels for lobular and 64×64 pixels for fat tissues were used. The effective pixel size for these images (i.e. the physical area of tissue that corresponds to one pixel) was roughly 1.0 micrometer × 1.0 micrometer. The segmentation algorithm is not described in further detail here and is the subject of a separate publication; the segmentation results were simply provided for this study. The images are segmented into four tissue types: lobules, fibrous connective tissue, epithelial lining cells, and lumens & fat (centres of ducts and adipose tissue).
The multi-label (L=4 here) segmented image is decomposed into binary images, one image for each label. Then the morphological operations closing and opening are performed twice each on each binary image. These operations aim to remove small artefacts, fill in potential gaps between tissue fragments and smooth the contours of the shapes. The size of the structuring element depends on the magnification of the WSIs used in the study. Then connected components are identified in each binary image. The connected component analysis ensures that only connected pixels are assigned the same label and form a region. It is required for distinguishing the regions within the image.
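The connected-component step can be sketched in a few lines. The example below is an illustration under assumptions, not the implementation used in the study: it labels regions directly on the multi-label map (which is equivalent to labelling each per-label binary image), uses 4-connectivity, omits the morphological closing and opening, and the function name `connected_components` is hypothetical.

```python
from collections import deque


def connected_components(label_map):
    """Split a 2-D tissue-label map into 4-connected regions.

    Returns (region_map, region_labels): region_map holds a region id per
    pixel, region_labels[i] is the tissue label of region i.
    """
    rows, cols = len(label_map), len(label_map[0])
    region = [[-1] * cols for _ in range(rows)]
    region_labels = []
    for r in range(rows):
        for c in range(cols):
            if region[r][c] != -1:
                continue                       # pixel already belongs to a region
            rid = len(region_labels)
            region_labels.append(label_map[r][c])
            region[r][c] = rid
            queue = deque([(r, c)])
            while queue:                       # breadth-first flood fill
                y, x = queue.popleft()
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and region[ny][nx] == -1
                            and label_map[ny][nx] == label_map[r][c]):
                        region[ny][nx] = rid
                        queue.append((ny, nx))
    return region, region_labels


# Toy label map with three tissue labels (0, 1, 2).
label_map = [
    [0, 0, 1, 1],
    [0, 2, 2, 1],
    [0, 0, 2, 0],
]
region_map, region_labels = connected_components(label_map)
print(region_labels)
for row in region_map:
    print(row)
```

The isolated pixel in the lower right corner has the same tissue label as the large region on the left but receives its own region id, which is exactly the distinction the connected component analysis is needed for.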
Graph-theoretic Representation
Each of the images in the database, as well as the query image, is described by a corresponding graph. Namely, ARGs have been constructed in which each node corresponds to one connected region in the image, and an edge is created between neighbouring regions which share a common boundary. The procedure involves describing nodes and edges with the attributes explained below.
Node Description
Describing the nodes includes identifying the nodes and then assigning attributes to them. Each node has a unique identifier number by which it is recognised in the subsequent algorithm. Although a node denotes a region, for representational purposes its position is assumed to be at the centroid of the region. The actual information about the region that each node carries is:

Area: It is defined as the total number of pixels inside the region corresponding to the node. The areas are found for each region, and regions with area of less than a predefined threshold are ignored and not considered as separate nodes.

Perimeter: This attribute gives the length of the boundary of a region. It is computed by summing the distances between each adjacent pair of pixels along the border of the region.

Label: It defines the class of tissue for the region.
Initially, other features were also considered as node attributes. They were not retained in the final implementation, since they were found unsuitable or inefficient for this particular application. A principled selection of attributes, for example by PCA, would be preferable; however, in order to reduce the computational complexity, a heuristic selection of attributes was performed. The features not retained for node description are:

Convex area: It is the number of pixels in the convex hull of a region.

Eccentricity: For an ellipse, eccentricity is defined as the ratio between the distance between its foci and its major axis length. It has a value between 0 and 1. For a region, it is the eccentricity of the ellipse which has the same second moments as the region.

Euler number: It is defined as the difference between the number of objects in a region and the number of holes inside those objects.

Orientation: It is defined as the angle between the x-axis and the major axis of the ellipse which has the same second moments as the region. Its value lies between −90° and 90°.

Solidity: It is the fraction of pixels in the convex hull that are also in the region, computed as the ratio between the area of a region and its convex area.
Edge Description
The process of describing edges involves identifying the edges and assigning weights to them. The edge information (weights) is obtained as:

Distance between centroids: It is taken as the Euclidean distance between the centroids of two regions.

Common boundary length: It is the number of pixels lying on the common border between two neighbouring regions. It is calculated by considering the 4-connectivity of each pixel: the algorithm counts those 4-connected neighbours of a pixel which have a different label than the pixel itself.
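The two edge attributes can be derived from a map of region ids in a single pass. The sketch below is an assumption-laden illustration: the function name `region_graph` is hypothetical, and the common boundary is measured as the number of 4-adjacent pixel pairs that straddle two regions, each pair counted once, which is one possible reading of the counting rule described above.

```python
from collections import defaultdict
from math import hypot


def region_graph(region):
    """Return (area, centroid, edges) for a 2-D map of region ids.

    edges maps frozenset({a, b}) to the centroid distance and the common
    boundary length of the neighbouring regions a and b.
    """
    area = defaultdict(int)
    sum_y = defaultdict(int)
    sum_x = defaultdict(int)
    boundary = defaultdict(int)
    rows, cols = len(region), len(region[0])
    for y in range(rows):
        for x in range(cols):
            a = region[y][x]
            area[a] += 1
            sum_y[a] += y
            sum_x[a] += x
            # Look only right and down so each adjacent pixel pair is counted once.
            for ny, nx in ((y, x + 1), (y + 1, x)):
                if ny < rows and nx < cols and region[ny][nx] != a:
                    boundary[frozenset((a, region[ny][nx]))] += 1
    centroid = {a: (sum_y[a] / area[a], sum_x[a] / area[a]) for a in area}
    edges = {}
    for pair, length in boundary.items():
        a, b = tuple(pair)
        dist = hypot(centroid[a][0] - centroid[b][0], centroid[a][1] - centroid[b][1])
        edges[pair] = {"centroid_distance": dist, "common_boundary": length}
    return dict(area), centroid, edges


# Region 0 is a 2x2 block, region 1 a 2x1 strip to its right.
area, centroid, edges = region_graph([[0, 0, 1],
                                      [0, 0, 1]])
print(area, centroid, edges)
```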
Normalisation
The graph attributes obtained in the above steps are expressed in different units. Thus, these data need to be converted to relative units so that they become comparable in subsequent procedures. A global normalisation is performed. For each feature (except the label of nodes), first the global maximum and minimum values are obtained from all the graphs in the database. Then each feature is normalised to the [0,1] range using its global maximum and minimum values.
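The global normalisation amounts to a min-max rescaling with bounds taken over the whole database rather than per graph. A minimal sketch follows; the data layout (each graph as a list of node dictionaries) and the function names are assumptions for illustration.

```python
def global_min_max(graphs, feature):
    """Minimum and maximum of one feature over all nodes of all graphs."""
    values = [node[feature] for g in graphs for node in g]
    return min(values), max(values)


def normalise(graphs, features=("area", "perimeter")):
    """Rescale each feature to [0, 1] using database-wide bounds; labels are untouched."""
    bounds = {f: global_min_max(graphs, f) for f in features}
    out = []
    for g in graphs:
        new_nodes = []
        for node in g:
            scaled = dict(node)                       # copy, keep the input unchanged
            for f, (lo, hi) in bounds.items():
                scaled[f] = 0.0 if hi == lo else (node[f] - lo) / (hi - lo)
            new_nodes.append(scaled)
        out.append(new_nodes)
    return out


graphs = [
    [{"label": "fat", "area": 100, "perimeter": 40},
     {"label": "lobule", "area": 300, "perimeter": 80}],
    [{"label": "fat", "area": 500, "perimeter": 60}],
]
normalised = normalise(graphs)
print(normalised)
```

Because the bounds are global, a value of 0.5 means the same physical size in every graph, which is what makes attribute differences comparable across the database.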
A*-based graph matching method
Given a query image, its ARG is matched to each ARG in the database and a cost of matching is assigned to each pair of graphs. The graph matching problem has been formulated as an A*-based tree search problem. Functions for the cost g(n), the heuristic h(n) and the total cost f(n) have been designed using the information present in the corresponding image ARGs. The heuristic h(n) is designed to be a consistent lower-bound estimate of the exact cost; hence, the admissibility criterion is satisfied, which leads to the optimal solution.
For the proposed method, K and M both have the value 2, where k=1 stands for area and k=2 for perimeter in the node attributes, and m=1 stands for distance between centroids and m=2 for common boundary length in the edge attributes. Note that for node attributes, area has been assigned double the weight of perimeter, as it is considered the more important feature during the matching of nodes.
where δ_{i} in equation 10 describes the distance between the attributes of already matched nodes and edges, and $a_{i}^{(1)}$ in equation 11 refers to the areas of the nodes in G not yet matched. Only the area attribute is used at this point, because δ_{i} cannot be computed for unmatched nodes and area is the most important attribute to consider in the heuristic function.
where n_{p} ∈ G and $n'_{q} \in G'$ are matching nodes, and $a_{p}^{(1)}$ and ${a'_{q}}^{(1)}$ denote the first attributes of the feature vectors $\mathbf{a}_{p}$ and $\mathbf{a}'_{q}$, i.e. the areas of the two regions being matched.
The main problem with optimal graph matching is its high computational complexity. The complexity of the described search is exponential in the worst case; in practice, however, it depends on the data to be handled, as only nodes with the same label can be matched. This prerequisite considerably reduces the search space and hence the complexity.
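The general shape of such a search can be sketched as follows. This is a simplified illustration and not the cost, heuristic or prognosis functions of the article, whose equations are not reproduced in this section. The assumptions are: only node attributes are compared (edges are ignored), the node cost is a weighted absolute difference with area weighted twice as much as perimeter, a query node may be left unmatched at a hypothetical fixed cost `SKIP`, and the heuristic is the sum of the cheapest options of the still unmatched query nodes, which never overestimates the remaining cost.

```python
import heapq
from itertools import count

W_AREA, W_PERIM = 2.0, 1.0   # area weighted twice as much as perimeter
SKIP = 1.0                   # hypothetical cost of leaving a query node unmatched


def node_cost(p, q):
    return W_AREA * abs(p["area"] - q["area"]) + W_PERIM * abs(p["perimeter"] - q["perimeter"])


def match(query, target):
    """Cheapest assignment of query nodes to distinct same-label target nodes."""
    # Prerequisite criterion: only nodes with equal labels are candidates.
    cands = [[j for j, q in enumerate(target) if q["label"] == p["label"]] for p in query]
    # Cheapest option per query node, ignoring conflicts between nodes.
    best = [min([SKIP] + [node_cost(p, target[j]) for j in c]) for p, c in zip(query, cands)]
    # h[i] is a lower bound on the cost of the unmatched nodes i, i+1, ...
    h = [sum(best[i:]) for i in range(len(query) + 1)]
    tie = count()                                     # avoids comparing mappings in the heap
    frontier = [(h[0], next(tie), 0.0, ())]           # (f, tie-breaker, g, partial mapping)
    while frontier:
        f, _, g, mapping = heapq.heappop(frontier)
        i = len(mapping)
        if i == len(query):
            return g, mapping                         # first complete state popped is optimal
        options = [(None, SKIP)] + [(j, node_cost(query[i], target[j]))
                                    for j in cands[i] if j not in mapping]
        for j, c in options:
            heapq.heappush(frontier, (g + c + h[i + 1], next(tie), g + c, mapping + (j,)))
    return float("inf"), ()


query = [{"label": "fat", "area": 0.5, "perimeter": 0.4},
         {"label": "lobule", "area": 0.2, "perimeter": 0.3}]
target = [{"label": "lobule", "area": 0.25, "perimeter": 0.3},
          {"label": "fat", "area": 0.5, "perimeter": 0.5},
          {"label": "fat", "area": 0.9, "perimeter": 0.4}]
cost, mapping = match(query, target)
print(cost, mapping)
```

The label filter in `cands` is where the search space shrinks: a query node with label "fat" branches over two target nodes here instead of three, and the saving grows with the number of tissue classes present.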
Results and Discussions
Dataset
The data used for this work consists of histological images provided by the Charité Hospital, Berlin. These are biopsy images of breast tissue. The samples have been stained with the H&E dye. The WSIs were produced by a Zeiss MIRAX SCAN WSI scanner. We used selected archived slides from the daily clinical workload that were not older than 6 months at the time of digitization. The glass slides were produced in the AP-laboratory of the Institute of Pathology at Charité Hospital. They have not been modified in any way. We have evaluated the method on 3 WSIs of FEA-suspected breast biopsies, divided into subimages representing possible retrieval results. Our aim was to demonstrate the potential of the graph-based approach, leaving an in-depth performance evaluation for future research. One of the reasons for this is the relatively high computational complexity of the segmentation, the description and the retrieval algorithms, which are also subject to future change and improvement.
The images have been pre-segmented into four categories describing different types of tissue. For one approach (described in the section Experimental Approaches and Results), they are then divided into smaller subimages to obtain databases for the image sizes 64×64, 128×128, 256×256 and 512×512. The query image is selected from a choice of four sizes, and the selection is resized to the size chosen by the user. The number of images in the database, depending on the size of the query image, is:
64×64: 70869 images
128×128: 27596 images
256×256: 9132 images
512×512: 2485 images
Graph representations of the images stated above were obtained and stored for future reference.
Experimental Approaches and Results
Two types of configurations were used for experiments. These are as follows:
Subgraph isomorphism approach
Inexact graph matching approach
Observations
It can be observed in both approaches that:
For subgraph isomorphism approach
The method yields regions from the whole images which are closest matches to the region in the query image.
Advantage: As expected, the first match gives an exact match, as the graph generated is a subgraph from the graph of one of the whole images. The results obtained as subsequent matches show similarity in the structure and spatial relationships between regions, as those in query image. Hence, it can be used to locate region groups with desired type, shape and neighbourhood relationships.
Limitation: It does not take into account the size of query image. It considers all the regions which are present in the query image, including those regions which are only partially included in the query, and may have a large part outside the query. As a result, the matches obtained have the size corresponding to entire regions rather than the size of query. Hence, there is no control on the size of retrieved results.
For inexact graph matching approach
In this method, subimages which are structurally similar to the query image, and of the same size as the query image, are retrieved. It works similarly to a practical CBIR system.
Advantages: The user can select a size for the query, and the retrieved images are of the same size as the query. Hence the user has control over the size of the results. The matches obtained are observed to show spatial and structural similarity to the selected query.
Limitations: The selected query image may not coincide with the division of images into subimages, which may lead to a truncation effect, as some important structures may be cut off by this division. Further, the original images first have to be cropped to a size which is divisible by the size of the subimage, which can also lead to loss of information along the borders. In order to reduce this information loss, the WSIs have been divided allowing an overlap between successive subimages. However, increasing the overlap increases the redundancy of the results, as they may be retrieved from the same areas. As a result, there is a trade-off between redundancy and loss of information due to truncation, and the selected overlap has to balance the two.
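The trade-off between truncation and redundancy can be seen from the tile grid itself. The sketch below is an illustration only; the function name `tile_origins` and the example dimensions are assumptions, and the overlap values used in the study are not stated in this section.

```python
def tile_origins(width, height, tile, overlap=0):
    """Top-left corners of square subimages of side `tile`.

    Successive subimages share `overlap` pixels. Border pixels that do not
    fill a whole subimage are dropped, which is the truncation effect.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must satisfy 0 <= overlap < tile")
    step = tile - overlap
    xs = range(0, width - tile + 1, step)
    ys = range(0, height - tile + 1, step)
    return [(x, y) for y in ys for x in xs]


no_overlap = tile_origins(256, 128, tile=64)
half_overlap = tile_origins(256, 128, tile=64, overlap=32)
print(len(no_overlap), len(half_overlap))
```

With half a tile of overlap, every structure appears in several subimages, so fewer structures are cut in all of them, but the database grows and neighbouring subimages can show up as near-duplicate results.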
Performance evaluation
 1. Precision: The percentage of retrieved images that are relevant to the query:
$\mathrm{Precision} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of images retrieved}} \times 100$ (28)
 2. Recall: The percentage of all the relevant images in the search database which are retrieved, defined by:
$\mathrm{Recall} = \frac{\text{Number of relevant images retrieved}}{\text{Total number of relevant images}} \times 100$ (29)
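Equations 28 and 29 translate directly into code. The sketch below uses binary relevance, i.e. each image is either relevant or not; the function names and the example identifiers are assumptions for illustration.

```python
def precision(retrieved, relevant):
    """Equation 28: percentage of retrieved images that are relevant."""
    if not retrieved:
        return 0.0
    hits = sum(1 for r in retrieved if r in relevant)
    return 100.0 * hits / len(retrieved)


def recall(retrieved, relevant):
    """Equation 29: percentage of relevant images that were retrieved."""
    if not relevant:
        return 0.0
    hits = sum(1 for r in set(retrieved) if r in relevant)
    return 100.0 * hits / len(relevant)


retrieved = ["a", "b", "c", "d"]     # ranked list returned by the system
relevant = {"a", "c", "e"}           # ground truth for this query
print(precision(retrieved, relevant), recall(retrieved, relevant))
```

Recall requires knowing every relevant image in the database, which is rarely available for large histological collections; precision needs judgements only for the retrieved images.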
where P_{s} is precision (in %), s is the scope length and score is a value from {0, 0.25, 0.5, 0.75, 1}. The score values express the similarity of the retrieved images to the query image in terms of structure and spatial relationships; a higher score is assigned for higher resemblance. The evaluation is subjective and coarse, so the quantitative results (precision values) have been rounded down to integers. The resulting plots show precision against scope length.
The proposed method has been compared with a common histogram-based retrieval system. The histograms of the segmented subimages have been computed using 4 bins. Then the distances between the histogram of the query image and those of the subimages have been calculated. As for the graph-based approach, an exponential distance has been used. Finally, the results of both methods were compared.
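A histogram baseline of this kind can be sketched as below. The 4 bins correspond to the four tissue labels of the segmented subimages. The exponential distance mentioned in the text is not defined in this section, so a plain L1 distance is swapped in here as an assumption; the function names are likewise hypothetical.

```python
from collections import Counter


def label_histogram(label_map, n_labels=4):
    """Normalised histogram of tissue labels (one bin per label) of a subimage."""
    counts = Counter(v for row in label_map for v in row)
    total = sum(counts.values())
    return [counts.get(k, 0) / total for k in range(n_labels)]


def l1_distance(h1, h2):
    # Assumption: stands in for the exponential distance used in the study.
    return sum(abs(a - b) for a, b in zip(h1, h2))


stripes = [[0, 0], [1, 1]]      # two tissue types arranged in bands
checker = [[0, 1], [1, 0]]      # same composition, different arrangement
other = [[2, 2], [2, 3]]        # different tissue composition

print(l1_distance(label_histogram(stripes), label_histogram(checker)))
print(l1_distance(label_histogram(stripes), label_histogram(other)))
```

The first distance is zero although the two subimages have different spatial layouts, which illustrates why a histogram signature cannot distinguish images by structure or neighbourhood relationships.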
Precision (%) at different scope lengths for the histogram-based method

| P_{s} / Window size | 64×64 | 128×128 | 256×256 | 512×512 | Average P_{s} |
|---|---|---|---|---|---|
| P_{10} | 23 | 45 | 55 | 23 | 37 |
| P_{20} | 21 | 35 | 46 | 13 | 29 |
| P_{30} | 16 | 33 | 43 | 18 | 28 |
| P_{40} | 14 | 28 | 41 | 21 | 26 |
| P_{50} | 12 | 26 | 39 | 20 | 24 |
Precision at different scope lengths for graph-theoretic method

| P_{s} / Window size | 64×64 | 128×128 | 256×256 | 512×512 | Average P_{s} |
|---|---|---|---|---|---|
| P_{10} | 80 | 55 | 63 | 70 | 67 |
| P_{20} | 63 | 44 | 53 | 40 | 50 |
| P_{30} | 58 | 39 | 50 | 36 | 46 |
| P_{40} | 53 | 33 | 38 | 33 | 39 |
| P_{50} | 46 | 29 | 35 | 29 | 35 |
Average precision for both the histogram-based and the graph-theoretic methods

| Scope length | Average P_{s}, Method 1 (histogram-based) | Average P_{s}, Method 2 (graph-theoretic) | Improvement (%) |
|---|---|---|---|
| 10 | 37 | 67 | 81 |
| 20 | 29 | 50 | 72 |
| 30 | 28 | 46 | 64 |
| 40 | 26 | 39 | 50 |
| 50 | 24 | 35 | 46 |
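The averages and improvement figures can be reproduced from the per-window precision values reported above. The sketch assumes that the improvement is the relative gain of Method 2 over Method 1, computed from the rounded averages, and that averages and improvements are rounded to the nearest integer (half up). This rounding convention is inferred from the tabulated numbers and is not stated in the article.

```python
import math

# Precision (%) per window size (64, 128, 256, 512), keyed by scope length.
HISTOGRAM = {10: [23, 45, 55, 23], 20: [21, 35, 46, 13], 30: [16, 33, 43, 18],
             40: [14, 28, 41, 21], 50: [12, 26, 39, 20]}
GRAPH = {10: [80, 55, 63, 70], 20: [63, 44, 53, 40], 30: [58, 39, 50, 36],
         40: [53, 33, 38, 33], 50: [46, 29, 35, 29]}


def round_half_up(x):
    """Round to nearest integer, halves up (inferred convention)."""
    return math.floor(x + 0.5)


def summarize(method1, method2):
    """Rows of (scope, avg method 1, avg method 2, relative improvement %)."""
    rows = []
    for s in sorted(method1):
        avg1 = round_half_up(sum(method1[s]) / len(method1[s]))
        avg2 = round_half_up(sum(method2[s]) / len(method2[s]))
        improvement = round_half_up(100 * (avg2 - avg1) / avg1)
        rows.append((s, avg1, avg2, improvement))
    return rows


rows = summarize(HISTOGRAM, GRAPH)
```

Under these assumptions the computed rows agree with the averages and improvement percentages listed for all five scope lengths.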
The tables and graphs obtained support our choice of parameters and methods for the proposed system. It can be concluded that:
The results obtained with the graph-theoretic technique are better than those of the simple histogram-based method, because the former takes into account the structural characteristics of the image and the neighbourhood relationships between regions, which are completely neglected in the histogram-based method.
As the scope length increases, the precision of the proposed method declines, which shows that it returns the most relevant results early in the list of retrieved results. It is a desirable property of any CBIR system that the results obtained first are the most useful. This does not hold in all cases for the histogram-based method: for the 512×512 window, its precision rises from 13% at scope length 20 to 21% at scope length 40.
The results obtained by the proposed method are not as high as those reported for general CBIR applications (about 90% precision or more); the highest precision, obtained for subimage size 64×64 and scope length 10, is 80%. The reason is the complexity and subjectivity associated with histological images. The evaluation was strongly influenced by the subjective scoring: even when a retrieved histological image shows the same tissue composition, several factors have to be considered before assigning a score. The closeness to the query image depends on the type of tissue regions, the size and shape of the regions, and the neighbourhood relationships between them. For this reason relative scores have been used; a more thorough evaluation should nevertheless be performed, in particular by involving medical professionals.
The performance also depends on the characteristics of the query image, i.e. on the number of tiled images available in the database that lie close to the position of the query window.
Execution Time Requirements
 1.Complexity of the images: The time required for graph generation and graph matching is highly dependent on the complexity of the database and query images, i.e. the number of nodes and edges.
 2.Size of database and query images: Given the same overall complexity, a larger query requires more time, especially for matching. Nevertheless, for larger but less complex images, the method gives quicker results than for smaller but more complex images.
Time requirement for graph-based CBIR system

| Number of nodes | Graph generation time | Graph matching time |
|---|---|---|
| < 5 | < 1 s | < 0.1 s (MATLAB) |
| 5-10 | 1-2 s | 0.1-1 s (MATLAB) |
| 10-20 | 2-3 s | 15-30 s (MATLAB) |
| 20-50 | 3-20 s | 15-30 s (C++) |
| 50-100 | 20-40 s | 60-300 s (C++) |
| > 100 | > 40 s | > 300 s |
Conclusions
In this work we have developed a novel method for determining similarity between histological images through graph-theoretic description and matching, for the purpose of content-based retrieval. A higher order (region-based) graph-theoretic representation of histological images has been proposed and a tree-search based optimal matching algorithm has been employed. The proposed method facilitates the automatic retrieval of images structurally similar to a given image. Such a system can be used for several applications in the biological and medical field.
The method has been applied specifically to histological images. The idea was motivated by the fact that state-of-the-art CBIR methods, which differentiate images mostly in terms of low-level colour, shape and texture features, do not perform well with histological images, as these features alone are inadequate to capture the spatial content and neighbourhood relationships of histological images. Structural characteristics are very important for differentiating between morphological components in a particular tissue, and the method developed uses this fact to obtain similar tissue areas of particular interest to the user.
The results obtained are satisfactory for histological images, as shown for the human breast in our study. The performance evaluation suggests that the technique developed is effective and superior to the simpler histogram-based technique. The execution time depends on the size and complexity of the query image selected by the user.
Future work on this system may include the incorporation of other appropriate node attributes, such as Euler number and solidity, and the use of differences between properties of adjacent nodes, such as compactness, as edge attributes in the graph-based representation of images. Additionally, the graph matching procedure can be optimised from an application-oriented point of view so that the execution time for matching large graphs is further reduced.
Moreover, the current study focuses on breast tissue biopsy images. The method can be generalised to other types of histological images or can be studied for new categories of images in which structure and spatial relationships are of major importance.
Abbreviations
 ARG: Attributed Relational Graph
 CBIR: Content-Based Image Retrieval
 CS: Common Subgraph
 FEA: Flat Epithelial Atypia
 GUI: Graphical User Interface
 H&E: Hematoxylin and Eosin
 IR: Information Retrieval
 MCS: Maximum Common Subgraph
 NP: Nondeterministic Polynomial time
 PCA: Principal Component Analysis
 RAG: Region Adjacency Graph
 ROI: Region of Interest
 WSI: Whole Slide Images.
Declarations
Acknowledgements
Harshita Sharma and R. S. Anand would like to acknowledge the Indian Institute of Technology, Roorkee, India, for providing them the opportunity to carry out this research in association with the Technical University, Berlin, Germany. This study has been supported by the German Federal State of Berlin in the framework of the “Zukunftsfonds Berlin” and the Technology Foundation Innovation Centre Berlin (TSB) within the project “Virtual Specimen Scout”. It was co-financed by the European Union within the European Regional Development Fund (EFRE).
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.