Theory of sampling and its application in tissue based diagnosis
© Kayser et al. 2009
Received: 27 January 2009
Accepted: 16 February 2009
Published: 16 February 2009
A general theory of sampling and its application in tissue based diagnosis is presented. Sampling is defined as extraction of information from certain limited spaces and its transformation into a statement or measure that is valid for the entire (reference) space. The procedure should be reproducible in time and space, i.e. give the same results when applied under similar circumstances. Sampling includes two different aspects, the procedure of sample selection and the efficiency of its performance. The practical performance of sample selection focuses on search for localization of specific compartments within the basic space, and search for presence of specific compartments.
When a sampling procedure is applied in diagnostic processes, two different procedures can be distinguished: I) the evaluation of the diagnostic significance of a certain object, which is the probability that the object can be grouped into a certain diagnosis, and II) the probability of detecting these basic units. Sampling can be performed with or without external knowledge, such as size of searched objects, neighbourhood conditions, spatial distribution of objects, etc. If the sample size is much larger than the object size, the application of a translation invariant transformation results in Krige's formula, which is widely used in the search for ores. Usually, sampling is performed in a series of area (space) selections of identical size. The size can be defined in relation to the reference space or according to interspatial relationships. The first method is called random sampling, the second stratified sampling.
Random sampling does not require knowledge about the reference space, and is used to estimate the number and size of objects. Estimated features include area (volume) fraction, and numerical, boundary and surface densities. Stratified sampling requires knowledge of the objects (and their features) and evaluates spatial features in relation to the detected objects (for example the grey value distribution around an object). It also serves to define the parameters of the probability function in so-called active segmentation.
The method is useful in standardization of images derived from immunohistochemically stained slides, and implemented in the EAMUS™ system http://www.diagnomX.de. It can also be applied for the search of "objects possessing an amplification function", i.e. a rare event with "steering function". A formula to calculate the efficiency and potential error rate of the described sampling procedures is given.
Diagnostic surgical pathology, or tissue-based diagnosis, is confronted with remarkable changes in its environment and workflow. Technological progress has led to a broad application of molecular biological methods such as fluorescence in situ hybridization (FISH) and other DNA-sequence amplification techniques [1, 2]. Commercially available slide scanners digitize a complete glass slide within a few minutes, and permit the implementation of completely digitized images into routine diagnostics [3, 4]. In other words, the workload of a pathologist increases steadily, not only through an increase in material but also through the mandatory introduction of new, still tissue-based diagnostic technologies. Thus, the question arises: How can the availability of and access to digitized histological slides (virtual slides) be used to release the diagnostic pathologist from time-consuming work steps in order to make the pathologist's work more effective and disease related?
In the early days of telepathology, which can be considered to be the "mother of the digital pathologist's world", several authors reported on the diagnostic accuracy of viewing digitized slides in comparison to conventional microscopy [4–8]. The results were clear: the diagnostic accuracy of viewing a digitized (or virtual) slide is indistinguishable from that of conventional microscopy; however, the required time is substantially longer [9, 10]. The inappropriate and more time-consuming search for suitable fields of view, i.e. the sampling procedure performed, is obviously one reason for these constraints. To our knowledge, the theory of sampling in cytology and histopathology has not been described in detail, and is nearly unknown among diagnostic pathologists. In this article we want to explain the main theoretical aspects and the derivatives of sampling which are performed in routine tissue-based diagnostics. The derived formulas will allow interested pathologists or scientists to search for applications that can diminish the sampling time in virtual slides.
Surgical pathology is a medical discipline that "extracts" information from human tissue and classifies the information in distinct terms that are called diagnoses. The common practice is to screen an organ or a tissue section for those spaces or areas that contain the most significant information, and to try to classify the information seen in the specific field of view. Thus, tissue-based diagnosis is based upon a procedure that searches for small samples from which information can be derived that is valid for the whole (or even the patient). In other words, an appropriate sampling procedure is a precondition for accurate and reproducible diagnoses [2, 4, 11–15]. Therefore, a detailed definition and accurate description of the sampling method is a necessity if we want to further evaluate the diagnostic algorithms. This statement leads to the following definition of sampling: Sampling is a method to derive information from a limited (small) compartment of a large (even unlimited) system that is valid for the entire (basic) system. The system can be a space, a function or set of functions, a body, an organ, a slide, or a DNA sequence.
the method of sampling, and
the aim of the sampling procedure, i.e., which information should be extracted.
Different aims can require different methods of sampling, or at least different parameters of the same algorithm. The inclusion of an "aim" or "goal" to be assessed introduces the calculation of efficiency, or a cost/benefit estimation.
➢ search for localization of specific items within the basic space, with the knowledge or assumption, that the space under consideration contains such items, and
➢ search for presence of specific items (tumour cells, ores, lobster, etc.), where the exact localisation of these items is of minor interest (for example localization of tumour cells in a cytological smear).
The preconditions for applying an adequate sampling procedure in tissue-based diagnosis include that the number and size of the samples are limited. In addition, the detectable information has to be known. This information commonly depends upon additional (external) factors, and can be translated into diagnostic features that allow the detection and identification of a probe within the sampling space. These features can depend upon the size of the probes, their number, and their position within the collective, or even within the sampling space.
the evaluation of a diagnostic significance of a certain object or "basic unit" which is the probability that the object can be grouped into a certain diagnosis, and
the probability to detect these basic units within the entire space.
Sampling is basically an information detection and transformation procedure, and is thus undertaken to reach a certain final aim, for example to state a diagnosis, or to identify the presence or absence of certain objects. A time and space invariant translation of the sampling procedure can be assumed as long as we want to obtain reproducible results (figure 3). Such a translation permits a separation of the object detection likelihood from the diagnostic significance of the segmented objects, and allows us to compute both properties separately. Assuming a digitized image, each point (pixel x, y) within the basic image either represents an object or does not. All object features can be reduced to a function that represents the object pixels in relation to sample and basic image size (figure 4). The introduction of an exponential diagnosis function then gives us the well-known formula of Krige, which is commonly used to detect ores, oil fields, or underground water reservoirs. Furthermore, the application of specific mappings (dilation, erosion) permits us to "increase the magnification" of an object within its sampling frame, or to define the center of gravity of an object in order to compute the image structure. One obtains a so-called order of structures if these procedures are applied repeatedly to light microscopy.
➢ the frequency of the analyzed units in relation to each other or to the basic space (structure);
➢ certain features of the biological units in relation to the basic space (to further identify and classify the objects).
No information about the basic (reference) space is needed. The detection of biologically meaningful units is then equivalent to the segmentation of the image and analysis of randomly chosen segmented elements. This procedure has formed the basis of numerous investigations since the 1950s. It is commonly called stereology [22–26]. In principle, a grid consisting of regular lines (points) with identical length (and distance in between) is overlaid on the image, and the number of hits (intersections) is counted. From the number of intersections the volume-adjusted frequency, size, and surface can be derived, independent of the orientation and shape of the elements. In a binary image the pixels (binary x, y points) can be used as a grid. The cutting angle (plane/volume) and the start point of the grid (pixel) are chosen at random, whereas the selection of the grid (all pixels) and the count of intersections are stratified. Thus, the randomness of the sampling is provided by the start of the procedure, for example by random selection of the (x, y) coordinates of the upper right position of the sample space. From the relation x/A (number of hits x/reference area A) two-dimensional (and also three-dimensional) parameters can be derived. These include the area density (Aa), the volume density (Vv), the boundary density (Ba), the numerical density (Na), and the surface density (Sv). It should be noted that this quite easily applied procedure permits the estimation of significant three-dimensional object features without any sophisticated three-dimensional reconstruction [23, 27, 18, 29].
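The point-counting idea described above can be sketched in a few lines of Python. This is only an illustration under simple assumptions, not the procedure of any particular stereology package: the image is a nested list of 0/1 pixels, the grid is a regular point lattice, and the function name `area_fraction_by_point_counting` and its `spacing` parameter are hypothetical. Only the start point of the grid is random; the grid itself is systematic.

```python
import random


def area_fraction_by_point_counting(image, spacing, rng=None):
    """Estimate the area density Aa of object pixels (value 1) in a binary image.

    A regular point grid with the given spacing is laid over the image.
    Only the grid origin is chosen at random; hits are then counted
    systematically at every grid point.
    """
    rng = rng or random.Random()
    height, width = len(image), len(image[0])
    if spacing < 1 or spacing > min(height, width):
        raise ValueError("spacing must be between 1 and the smaller image side")
    # Random start point of the grid inside the first grid cell.
    y0 = rng.randrange(spacing)
    x0 = rng.randrange(spacing)
    hits = 0
    points = 0
    for y in range(y0, height, spacing):
        for x in range(x0, width, spacing):
            points += 1
            hits += image[y][x]
    # Relation x/A: number of hits divided by the number of grid points.
    return hits / points
```

With a spacing of 1 every pixel is a grid point and the estimate equals the exact area fraction; larger spacings trade accuracy for fewer counted points.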
only those areas which contain features of (any) cell (gray value selection at low magnification)
within these areas only those cells which seem to be abnormal (gray value, size, moderate magnification)
within these cells those with abnormal nuclear size (DNA content), high magnification.
terminate the procedure once the diagnosis – significant information has been obtained
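The stepwise selection listed above can be illustrated with a short Python sketch. It is a simplified reading of the steps, not the authors' implementation: the `Cell` record, the three thresholds and the `needed` count are all hypothetical, and the abnormality test at moderate magnification is reduced to cell size alone.

```python
from dataclasses import dataclass


@dataclass
class Cell:
    gray_value: float    # mean gray value of the cell (hypothetical feature)
    size: float          # cell size in arbitrary units (hypothetical feature)
    nuclear_size: float  # nuclear size, used here as a stand-in for DNA content


def stratified_search(areas, *, area_gray_max, cell_size_min,
                      nuclear_size_min, needed):
    """Collect suspicious cells and stop as soon as `needed` have been found.

    `areas` is a list of (mean_gray_value, [Cell, ...]) pairs.
    """
    found = []
    for mean_gray, cells in areas:
        # Step 1 (low magnification): skip areas that look like empty background.
        if mean_gray > area_gray_max:
            continue
        for cell in cells:
            # Step 2 (moderate magnification): keep only cells that seem abnormal.
            if cell.size < cell_size_min:
                continue
            # Step 3 (high magnification): keep only cells with a large nucleus.
            if cell.nuclear_size < nuclear_size_min:
                continue
            found.append(cell)
            # Step 4: terminate once enough significant information is available.
            if len(found) >= needed:
                return found
    return found
```

Each step discards most of the remaining material, so the expensive high-magnification test is applied to only a small fraction of the slide, and the early return avoids visiting areas that are no longer needed.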
Stratified sampling requires some external knowledge in order to detect biologically meaningful events such as cancer cells. The image features of a cancer cell have to be known if one would like to detect this event by stratified sampling. The alternative algorithm would be to "sample" all cells and, if possible, start a statistical analysis. This would then try to evaluate the rare events (supposing that cancer cells are rare relative to normal cells). Again, some external knowledge would be necessary. Obviously, this is related to the diagnosis function s(Ci, D).
Stratified sampling requires an accurate segmentation of objects with known features. Independently of the actual segmentation procedure, the sampling can be performed as active or passive sampling.
Any segmentation procedure has to accurately define the area of an object, which is equivalent to detecting its boundary. Each pixel has to be assigned either to the object or to the background, which can be written as f(x, y, meaning) = [1,0], with f(x, y, object) = 1 and f(x, y, background) = 0. This approach is called passive sampling, as it discriminates the object area by a simple yes-no function. In other words, passive sampling is provided by a constant relation between the objects and the grid (intersections). The intersection has the probability function p(i), which is independent of the individual object.
Active sampling is a different approach. It is provided by an object-specific relation between the objects and the grid (intersections). The probability that a pixel belongs to an object ranges between 0 and 1. The intersection has a probability function p(i, o), i.e., the probability of detecting the pixels that belong to a certain object depends on the object itself and its neighborhood. For example, a pixel displays a probability of 0.7 that it belongs to the object. This probability can increase or decrease depending upon additional parameters, such as size, orientation, or shape of neighboring objects. Naturally, the probability value of 0.7 itself might be used to define whether it is an "object" or a "background" pixel.
The probability function p(i, o) can be calculated if we separate p(i, o) into its two components: p(i, o) = gr(x, y) * af(gr, v).
gr(x, y) is the frequency distribution of different objects in the reference space v,
af(gr, v) is the detection probability in the space v.
If we assume that af(gr, v) = const in the reference space v, we can estimate p(i, o) by a set of measurements in different sample spaces and transform p(i, o) = 1 if gr(x, y) > const, and p(i, o) = 0 elsewhere.
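The difference between passive and active sampling can be made concrete with a small Python sketch. It rests on assumptions that are not taken from the text: gr(x, y) is approximated by the fraction of object pixels in the 3x3 neighbourhood of a first, passive segmentation, af is a constant, and the names `passive_mask`, `active_mask` and `cutoff` are hypothetical.

```python
def passive_mask(gray, threshold):
    """Passive sampling: a fixed yes-no decision per pixel, f(x, y) in {1, 0}."""
    return [[1 if value >= threshold else 0 for value in row] for row in gray]


def local_object_frequency(mask, x, y):
    """Fraction of object pixels in the 3x3 neighbourhood of (x, y).

    The neighbourhood is clipped at the image border.
    """
    height, width = len(mask), len(mask[0])
    total = 0
    count = 0
    for ny in range(max(0, y - 1), min(height, y + 2)):
        for nx in range(max(0, x - 1), min(width, x + 2)):
            total += mask[ny][nx]
            count += 1
    return total / count


def active_mask(gray, threshold, af=1.0, cutoff=0.5):
    """Active sampling sketch: p(i, o) = gr(x, y) * af, then 1 if p > cutoff else 0.

    Assumption: gr(x, y) is taken to be the local object frequency of a
    first passive segmentation; af is treated as constant.
    """
    first_pass = passive_mask(gray, threshold)
    result = []
    for y in range(len(gray)):
        row = []
        for x in range(len(gray[0])):
            p = local_object_frequency(first_pass, x, y) * af
            row.append(1 if p > cutoff else 0)
        result.append(row)
    return result
```

In this sketch a dark pixel surrounded by object pixels is reassigned to the object, while an isolated bright pixel is reassigned to the background; the passive mask would keep both as they are.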
The specific object (cell) is rare within the basic population.
It has to possess regular neighborhood relations to objects of the basic population.
It has to be randomly distributed within the reference space.
We perform a random sampling of the specific (rare) object (O) within the basic population Ni (to estimate O(Ni)).
We perform a stratified sampling "around" each detected specific object (to estimate Ni(O)).
If Ni(O) = constant we can assume a specific function of the object (cell) within the basic population (for example cellular immune competence, functional activation of cells, etc.).
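The stratified step and the constancy check above can be sketched as follows. This is a minimal illustration under stated assumptions: objects are reduced to (x, y) centre points, "around" is interpreted as a circle of a hypothetical `radius`, and "constant" is interpreted as a coefficient of variation below a hypothetical `tolerance`.

```python
import math
from statistics import mean, pstdev


def neighbours_around(rare_objects, population, radius):
    """For each rare object O, count population objects within `radius`: Ni(O)."""
    counts = []
    for ox, oy in rare_objects:
        count = sum(
            1 for px, py in population
            if math.hypot(px - ox, py - oy) <= radius
        )
        counts.append(count)
    return counts


def is_roughly_constant(counts, tolerance=0.1):
    """True if the coefficient of variation of the counts is at most `tolerance`."""
    if not counts:
        return False
    average = mean(counts)
    if average == 0:
        return False
    return pstdev(counts) / average <= tolerance
```

A result of True would only suggest regular neighbourhood relations in the sampled data; whether this reflects a specific biological function remains a matter of interpretation.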
E(Ne) = error of detecting an individual event (i.e., probability of identification/missing a tumor cell)
E(B(n)) = error of measuring all elements in the reference space (i.e., related to the biological variance of the tissue, dependent upon N)
E(Ne/v) = error of measuring the size of events e in relation to size of sampling space S (frequency of e in sample space sv).
We obtain the smallest sampling error if we select the reference volume as sample size, and if we are dealing with regular tissue (small biological variance).
The smaller the sample sizes in relation to the size of events, the bigger is the sampling error, as long as the error to segment (identify) per event is not increasing.
The sampling error is increasing if we choose different sizes of the samples.
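The first two statements can be illustrated with a small simulation. The sketch below does not reproduce the error formula of this article; it simply measures, for a square binary image, how far area-fraction estimates from randomly placed square windows deviate from the true value. The function names, the window `side` and the number of `trials` are assumptions made for the example.

```python
import math
import random


def area_fraction(image, x0, y0, side):
    """Fraction of object pixels (value 1) in the square window at (x0, y0)."""
    hits = sum(
        image[y][x]
        for y in range(y0, y0 + side)
        for x in range(x0, x0 + side)
    )
    return hits / (side * side)


def rms_sampling_error(image, side, trials, rng):
    """Root-mean-square deviation of window estimates from the true area fraction.

    The image is assumed to be square; windows are placed uniformly at random
    so that they lie completely inside the image.
    """
    size = len(image)
    truth = area_fraction(image, 0, 0, size)
    squared = 0.0
    for _ in range(trials):
        x0 = rng.randrange(size - side + 1)
        y0 = rng.randrange(size - side + 1)
        squared += (area_fraction(image, x0, y0, side) - truth) ** 2
    return math.sqrt(squared / trials)
```

If the window covers the whole reference image the error is exactly zero, whereas a window that is small relative to the objects yields estimates that scatter around the true value.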
Taking and analyzing samples of a broad variety of tissues is a basic procedure in surgical pathology, or in tissue-based diagnosis. All diagnostic algorithms depend upon a correct and reliable sampling procedure, and extensive training in surgical pathology aims at identifying and sampling those tissue compartments that probably contain the most significant information to classify the disease present [7, 19, 34–38]. The majority of investigations address an optimum sampling procedure, for example how many sentinel lymph nodes should be investigated in relation to the stage of breast cancer [29, 31], "optimizing sampling of tomato fruit for carotenoid content", or how to perform endometrial sampling in patients with trophoblastic disease after suction curettage [39, 40]. In the early days of stereology several authors paid attention to the sampling procedures, as the results of counting intersections are closely associated with the nature of the sampling method used [22, 23]. Recently, sampling has returned to the focus of investigations, especially in live imaging. Most of the investigations try to optimize the sampling, which is equivalent to evaluating the "best" stratified sampling method.
In addition to medical applications, sampling plays a dominant role in geology, especially mining. In fact, Krige's sampling analysis can be considered to be the first approach to develop a "sampling theory" [12, 20].
In this article we want to derive a scheme of sampling that permits a view of sampling in principle and of its different methods, and that allows the efficiency of the sampling method used to be calculated. In principle, two different algorithms exist, random sampling and stratified sampling [12, 9]. Random sampling has to be performed if no knowledge of the information searched for exists. It is the appropriate technique to measure features of biological units such as chromosomes, DNA fragments, nuclei, cells, vessels, etc. Its accuracy (error rate) can be predefined by the number and size of the chosen samples in relation to the expected size of events and to the reference space. Its results can be implemented in additional classification algorithms, such as diagnostic procedures. The sampling can be terminated if a certain classification can be performed with a predefined accuracy, i.e., a diagnosis can be assessed with high certainty. The accurate measurement of the features of events is a prerequisite, but not the aim, of stratified sampling. Its implementation requires additional (external) information, and numerous investigations have been performed to "speed up" the procedure (or to make it more efficient) using spatial structures within the reference space. When an exponential event probability distribution is given, Krige's formula can be derived from stratified sampling.
In addition to the differences in principle between random and stratified sampling procedures discussed above, passive and active sampling play a major role in image segmentation algorithms. The common principle of active sampling associates neighbourhood knowledge (i.e. knowledge derived from general external observations) with the object under investigation, for example to accurately define its boundaries. This approach has proven to be successful especially in measuring accurate thresholds for grading purposes in immunohistochemistry. A further derived application is functional sampling, which is again a stratified sampling in principle. This procedure can assist in investigating the "biological importance" of rare events, which in our experience is largely unknown.
In aggregate, a general theory of sampling is derived that has applications in numerous, if not all, natural sciences. They range from agriculture to mining, from aircraft maintenance to medicine. In surgical pathology it is of major importance that all diagnostic investigations start with appropriate sampling.
The financial support of the International Academy of Telepathology e.V. and of the Verein zur Förderung des biologisch-technologischen Fortschritts in der Medizin e.V. is gratefully acknowledged.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.