A block diagram of the method employed is given in Figure 9. The main steps are explained as:
-
1.
Image acquisition: H&E-stained breast biopsies are used in this study. Specimens are digitized and the whole-slide images (WSIs) are rescaled to about 100x effective magnification for further experimentation.
-
2.
Image segmentation: In this step the images are prepared for graph-based description. It involves segmentation of the images as well as removal of artefacts and obtaining connected components in each segmented image.
-
3.
Graph-Theoretic representation: The segmented images are then represented using ARGs which involve the description of nodes and edges.
-
4.
Graph matching: The graph representing query image is then compared to a database of graphs already generated in order to retrieve most similar images based on the distance between the graphs. A graph matching algorithm based on the A* search is used.
-
5.
Display of the closest matches: The images from the database are arranged in order of decreasing similarity based on cost of graph matching and the top results are displayed to the user.
Image segmentation
It includes a group of methods employed before graph-based image analysis. To acquire a region-based signature, a key step is to segment images. Hence, the original breast biopsy images are first segmented using a supervised approach which has been performed in two stages:
-
1.
Soft pixel classification: Likelihood of belonging to a tissue of particular type is calculated for each image pixel based on texton-based texture descriptions.The segmentation decision is made for every point (local area) on MAP (Maximum A Posteriori) principle based on texture descriptions of all allowed tissue classes previously learned.
-
2.
Region segmentation: Grouping of pixels and hard label assigment is performed based on spatial label coherence and similarity to texture models already obtained in the previous stage. Such optimal grouping is performed using Graph-cut [29] algorithm.
The maximum size of the pixel area for decision making is tissue type related. In these experiments 16× 16 pixels for epithelial, 32×32 pixels for connective, 48×48 for lobular and 64×64 for fat tissues were used. Here, the effective pixel size for these images (i.e. how large the physical area of the tissue which corresponds to one pixel) was roughly 1.0 micrometer ×1.0 micrometer. The segmentation algorithm is not described in further details here and is a subject to a separate publication. Segmentation results were just provided for this study. The segmentation is done into four tissue types: lobules, fibrous connective tissue, epithelial lining cells, lumens-&-fat (centres of ducts and adipose tissue).
The multilabel (L=4 here) segmented image is decomposed into binary images, one image for each label. Then morphological operations closing and opening are performed twice each on each binary image. These operations aim to remove small artefacts, fill in the potential gaps between tissue fragments and smooth the contours of the shapes. The size of structuring element chosen depends on the magnification of the WSIs used in the study. Then connected components are identified in each binary image. A connected component analysis ensures that only connected pixels are assigned the same label and form a region. It is required for distinguishing the regions within the image.
Graph-theoretic Representation
Each of the images in the database as well as the query image have been described by corresponding graphs. Namely, ARGs have been constructed, where each node corresponds to one connected region in the image and edge is obtained between neighbouring regions which share a common boundary. The procedure involves describing nodes and edges with attributes explained below.
Node Description
Describing the nodes includes identifying the nodes and then assigning attributes to them. Each node has a unique identifier number that is used to simply recognise it in subsequent algorithm. Also, though a node denotes a region, for representational purpose, its position is assumed to be at the centroid of the region. The actual information about the region that each node carries is:
-
Area: It is defined as the total number of pixels inside the region corresponding to the node. The areas are found for each region, and regions with area of less than a predefined threshold are ignored and not considered as separate nodes.
-
Perimeter: The attribute gives the length of the boundary of a region. It is computed by summing of distances between each adjacent pair of pixels along the border of the specified region.
-
Label: It defines the class of tissue for the region.
Initially, other features were also considered for node attributes, however, they were not retained for the final implementation, since they were found unsuitable or inefficient for this particular application. Actually PCA would be the right way for selecting the appropriate attributes, however, in order to reduce the computational complexity, we have performed a heuristic selection of attributes. The features not retained for node description are:
-
Convex area: It is the number of pixels in the convex hull of a region.
-
Eccentricity: For an ellipse, eccentricity is defined as the ratio between the distance between its foci and its major axis length. It has a value between 0 and 1. For a region, it is the eccentricity of the ellipse which has the same second-moments as that of the region.
-
Euler number: It is defined as the difference between the number of objects in a region and the number of holes inside those objects.
-
Orientation: It can be defined as the angle between the x-axis and the major axis of the ellipse which has the same second-moments as the region. Its value is between -90° to 90°.
-
Solidity: It is the fraction of pixels in the convex hull that are also in the region and computed as ratio between the area of a region and its convex area.
Edge Description
The process of describing edges involves identifying the edges and assigning weights to them. The edge information (weights) is obtained as:
-
Distance between centroids: It is taken as the Euclidean distance between the centroids of two regions.
-
Common boundary length: It is the number of pixels lying on the common border between two neighbouring regions. It has been calculated by considering the 4-connectivity of each pixel. The algorithm counts those 4-connected neighbours of a pixel which have a different label than the pixel itself.
Same as with nodes, other characteristics can also be included in edge attributes. However, it was found that the properties given above are suitable for representing a histological image. Area is important for matching of similarly-sized regions, whereas perimeter conveys approximate shape information. The distance between centers determines how far the nodes are placed with respect to each other, and common boundary length denotes upto what extent two regions are adjacent to each other. Thus, these properties are most useful for determining the structure and neighbourhood relationships for histological image analysis. One example of such a graph-representation is given in Figure 10.
Normalisation
The graph attributes obtained in the above steps are expressed in different units. Thus, this data need to be converted to relative units so that they becomes comparable in subsequent procedures. A global normalisation is performed. For each feature (except label of nodes), first the global maximum and minimum values are obtained from all the graphs in the database. Then the features are normalised to [0,1] range using the global maximum and minimum values for each one.
A*-based graph matching method
Given a query image, its ARG is matched to each ARG in the database and the cost of matching is assigned to each pair of graphs. The graph matching problem has been formulated as an A* based tree search problem. Functions for the cost g(n), heuristic h(n) and total cost f(n) have been designed using the information present in the corresponding image ARGs. The heuristic h(n) is designed to be a consistent lower bound estimate of the exact cost, hence, admissibility criterion is satisfied that leads to the optimal solution.
To describe the process of matching, two ARGs are defined first: A Test ARG G(V,E,α,β) and Model ARG G′(V′,E,′α′,β′). N is the number of nodes in G and N′ is number of nodes in G′such that N ≤ N′. Also, W is the number of edges in G and W′ is the number of edges in G′. For each node n
i
(i ∈ 1,2…N) in G, the set of attributes is given as a
i
:
(4)
where K is the total number of attributes associated with each node n
i
. Hence, for all nodes N, the set a is the set of all vectors a
i
given by:
(5)
Similarly, for each edge e
j
(j ∈ 1,2…W) in G, the set of attributes is given as b
j
:
(6)
where M is the total number of attributes associated with each edge e
j
. Hence, for all edges W, the set b is the set of all vectors b
j
given by:
(7)
For the proposed method, K and M both have value 2, where k=1 for area and k=2 for perimeter in node attributes, and m=1 for distance between centroids and m=2 for common boundary length in edge attributes. Note that for node attributes, area has been assigned double the weight of perimeter, as it is considered more important feature during the matching of nodes.
The task is to find the best mapping between G and G′ and the minimum matching cost for attaining it. A graph monomorphism is being sought between G and G′ as explained in Section Graph Matching. To begin with, the simplest case can be to assign the number of unmatched nodes as the heuristic function that denotes estimate of cost of a path from a node to goal, and the number of matched nodes as the cost function that denotes the cost from start to a node. For a partial mapping till node n in G(n ≤ N), these functions will be defined as:
The dissimilarity of the nodes (and correspondent edges) must also be incorporated in these formulations. A pairwise distance between feature vectors of nodes already matched needs to be included in equation 8, and the term containing attributes for unmatched nodes should be included in equation 9. It gives:
(10)
(11)
where, δ
i
in equation 10 describes the distance between the attributes of already matched nodes and edges, and in equation 11 refers to areas of the nodes in G not matched yet. Note that only the area attribute is used at this point as computing δ
i
will not be possible for unmatched nodes and the most important attribute that needs to be considered in the heuristic function is area.
Now, let us consider two incremental functions g
Δ
(n) and h
Δ
(n) which denote the contribution of a node n, to the cost function g(n), if it has been already traversed, and its contribution to the heuristic h(n) if it has not been already traversed. These functions can be defined as:
(12)
(13)
From the equations 10 and 11, it follows that:
(15)
The constant 1 in these equations must be re-scaled, otherwise it will have a greater impact and may mask the effect of attributes of ARGs. For this reason, a constant c is introduced, which has been determined experimentally. After this the equations become:
(17)
In order to yield optimum results, the admissibility criterion must be satisfied, i.e. the estimated cost of a path from a node to goal must be a lower bound of the actual cost of a path from the node to goal. This can be ensured if the estimated cost contributed by n, represented by h
Δ
(n) is by n, represented by g
Δ
(n). As δ
n
≥ 0 and , remember that all attributes have been normalized to the range [0,1] aforehand, in order to ensure that h
Δ
(n) ≤ g
Δ
(n) a constant 1 is added to the distance δ
n
. It now yields:
The problem that may arise here is that as 1 becomes very large compared to distances δ
i
, the effect of distance may be masked, hence, a new constant, γ, is introduced. It is an experimentally determined parameter and distance becomes:
Rewriting the equations 10 and 11 for the nodes traversed up to node n, g(n) and h(n) take the form:
(20)
(21)
The equation 20 can be used for g(n), however, for histological images, it is important that larger nodes are given higher importance in matching. This is because smaller nodes may represent less important regions or artefacts, but the larger nodes will always represent significant regions. A mismatch between larger nodes should be penalised by a higher cost as compared to mismatch between smaller nodes. Hence, a weight has been introduced:
(22)
The weights w
i
are proposed as:
(23)
where n
p
∈ G and are matching nodes and a
p
(1) and denote the first attributes corresponding to feature vectors a
p
and . They describe the area of the two regions being matched.
The distances δ
i
in equation 10 between each pair of nodes are calculated as:
(24)
where, δ1i is the distance between corresponding nodal attributes and δ2i is the distance between their corresponding edge attributes. λ∈[0,1] balances the mutual relevance of the two distances. In the method, equal relevance has been considered. The distance between nodes and edges has been formulated as an Exponential distance, in order to further intensify the mismatch between corresponding attributes, as compared to a linear technique or Euclidean distance. The nodal distance for node attributes is defined as:
(25)
Edge distance for edge weights is defined as:
(26)
The total cost f(n) of the partial mapping till node n is given by:
The graph matching is implemented through a tree-based search using A* algorithm which always extends the partial mapping of nodes towards an optimum. The tree is the representation of Open List realized using a priority queue, containing partial mappings in increasing order of their costs. It is constructed by first allowing each test node to be mapped to each available model node, if the match is permissible, the pairs forming the first level of the tree. The cost of each pair is computed, and the pair with the lowest total cost f(n) is expanded. Each leaf of the tree now represents a combination of matched nodes or partial mapping from nodes of test graph to those of model graph. The Closed List consists of the latest and most favourable partial mapping constructed. The tree is expanded until best optimum mapping of maximum nodes is found. An example of the tree-based search method employed is illustrated in Figure 11.
The main problem with optimal graph matching is its high computational complexity. The complexity of the described search is exponential in the worst case, however, practically, it depends on the data to be handled as only nodes of same label can be matched. It considerably reduces the search space and complexity is scaled down.