OBJECTIVES
By the end of this session the reader should be able to:
- Understand how reference ranges are determined and how to use them to interpret laboratory data
- Understand common statistical terms related to laboratory diagnostics
- Understand how disease prevalence affects the predictive values of laboratory tests
- Understand the basic principle of quality control as applied to a diagnostic test
- Describe the sources of variation influencing laboratory tests (i.e., analytical, intraindividual, and interindividual)
- Describe the use of parametric and nonparametric methods for calculating reference ranges and the requirements for applying each method
- Calculate the percent false positives from the number of tests performed on a patient
- Calculate the number of patterns of test results from the number of types of results and the number of tests performed on a patient
- Calculate the clinical sensitivity of a test
- Calculate the clinical specificity of a test
- Calculate the predictive values of a test (both positive and negative results)
- Calculate the efficiency of a test
- Describe the use of the coefficient of variation of a test in determining whether a change in a patient's laboratory result is significant
KEY TERMS
Accuracy – closeness of agreement between a measured value and the true value
Analytical variation – observed differences in the value of an analyte after it has been prepared for analysis
Coefficient of variation – a relative measure of precision, determined by dividing the standard deviation by the mean
Intraindividual variation – differences in the true value of an analyte within the same individual
Interindividual variation – differences in the true value of an analyte between individuals
Mean – the arithmetic average
Negative predictive value – the fraction of negative results which are correct; determined by dividing the true negatives by the sum of the true negatives and false negatives
Positive predictive value – the fraction of positive results which are correct; determined by dividing the true positives by the sum of the true positives and false positives
Precision – the closeness of agreement between independent measurements; generally expressed as a coefficient of variation or standard deviation
Reference range – the inner 95% of values for a laboratory test as measured in a defined population; the subject population is typically disease free with regard to the test of interest
Sensitivity – the ability of a test to detect a true positive; determined by dividing the true positives by the sum of the true positives and false negatives
Specificity – the ability of a test to detect a true negative; determined by dividing the true negatives by the sum of the true negatives and false positives
Standard deviation – a measure of precision (the square root of the variance)
BACKGROUND/SIGNIFICANCE
A basic understanding of statistics is assumed for this chapter. To interpret laboratory tests, it is essential that health professionals understand sources of variation, reference ranges, predictive values, and basic quality control.
There are many different reasons for performing laboratory tests. We perform them when we believe the results may help us answer some question that we have about a patient. In most cases the result itself is not the answer; we interpret the result to answer the question we posed. This interpretation is based on statistical principles. The purpose of this chapter is to provide a statistical framework for interpreting laboratory test results.
One reason for performing a laboratory test might be to screen an asymptomatic individual for evidence of occult disease. In this case, we probably expect to compare the result to a reference range (formerly called “normal range”) to determine whether the individual has the disease under consideration. It is important to understand how reference ranges are established in order to use them appropriately.
In another setting, we may be dealing with a very sick patient for whom we have developed a differential diagnosis based on the history and clinical findings. We may order one or more laboratory tests in order to make a definitive diagnosis. The same test result might have a different significance in a clinically ill patient versus a clinically healthy patient. We have statistical tools to help us address this apparent paradox.
Another common reason for performing laboratory tests is to help follow the course of disease in a particular patient, and assess the impact of treatment. In this case, we want to know whether an interval change in the results of a laboratory test represents a real change in the patient versus just analytical variation. Knowledge of the statistical data routinely collected in the clinical laboratory can facilitate interpretation of serial test results.
SOURCE OF VARIATION
Proper interpretation of results requires an understanding of the sources of variation which influence laboratory tests. These can be categorized as preanalytical, analytical, intraindividual, and interindividual variation.
Analytical variation is produced by conditions which affect the sample and the testing system from the moment the sample is removed from the patient until the final result is generated. (It is helpful to further subdivide this category into preanalytical factors, which include all the things that can happen to a sample as it is collected, transported, processed, and stored, and analytical factors, which affect the testing process itself.) All test results are subject to analytical variation. Important interferences with laboratory tests include hemolysis (rupture of red blood cells, releasing their contents into plasma), lipemia (excess lipids in a plasma sample), and icterus (a high concentration of bilirubin).
Table 1. Hemolysis Interferences with Common Lab Tests  

Test  Erythrocyte  Plasma  Effect of Hemolysis 
Sodium, mEq/L  16  140  Lowers plasma result 
Potassium, mEq/L  100  4.4  Raises plasma result 
Transaminase (ALT), U/L  500  25  Raises plasma result 
Folates, ng/dL  200–1200  2.5–15  Raises plasma result 
Table 2. Examples of Effects of Drugs on Laboratory Tests  

Test  Effect  Mechanism 
Bilirubin  Decrease  Barbiturates induce glucuronyl transferase 
Bilirubin  Increase  Any drug with liver toxicity (e.g., acetaminophen) 
Amylase  Increase  Opiates cause constriction of the sphincter of Oddi 
Digoxin  Increase  Quinidine releases digoxin from heart muscle and decreases renal clearance causing substantial increases in serum digoxin 
Intraindividual variation is produced by conditions which cause a single individual’s laboratory values to change at different times of day or under different physiologic conditions. Examples of factors which contribute to intraindividual variation include circadian rhythms, hydration, activity, stress, posture, and food intake. When we use the results of serial testing to follow the course of disease in a patient, it is important to recognize the potential contribution of normal physiologic factors and try to distinguish it from medically important variation.
Interindividual variation reflects the many different factors which cause laboratory test results to vary from one individual to another within a population. Examples of such variables include age, sex, diet, body mass, general activity level, and genetics. The results of a test performed on a group of individuals will reflect analytical variation and intraindividual as well as interindividual variation.
The remainder of this chapter addresses each of these categories in more detail, together with the applicable statistical concepts. Familiarity with the normal distribution (bell-shaped curve, or Gaussian distribution), including the concepts of the mean and standard deviation, is assumed.
ESTABLISHING REFERENCE RANGES
When we establish a reference range, we want to have a tool for comparing the test result from one individual with those from a relatively large number of other members of a similar population. What we want to determine is the expected range of interindividual variation. We already know that the results of a test performed on a group of people will reflect intraindividual and analytical variation as well as interindividual variation; it is obvious that if the contributions from the first two are relatively large, they will obscure the part of the total variation that is due to actual differences among individuals. Part of the process of establishing a reference range is simply taking steps to reduce the magnitude of this obscuring effect.
The reference population should match, demographically, the population whose laboratory results will be compared to the reference range. Based on what is already known about the analyte, consider whether separate reference ranges should be established for adults versus children, men versus women, and so forth. Profound biochemical changes take place between birth and adulthood, and many of these are reflected by clinical chemistry test values in children that differ significantly from those considered normal in adults. The most pronounced and/or accelerated changes are seen in the newborn period and during puberty. Table 3 gives examples of laboratory tests that are affected by age. Some hospitals give reference ranges only for adults yet still report out children's results (with incorrect reference ranges), so it is important to know which values change with age.
Table 3. Lab Results that are affected by age  

Lab Tests that are Higher in Newborns and Children  Lab Tests that are Lower in Newborns and Children  
Alkaline Phosphatase  Bicarbonate  
Ammonia  Albumin  
AST  Amylase  
Bilirubin  Cholesterol  
Creatine Kinase (CK)  Creatinine  
Potassium  Copper  
Gamma glutamyl transferase (GGT)  Glucose  
Thyroid stimulating hormone (TSH)  Haptoglobin  
Thyroxine (T4)  IgA, IgM, IgE  
  Osmolality 
Here are the steps involved in establishing a reference range:
- Select the analyte for which the reference range is to be established, and the methodology and instrumentation which will be used for testing. Study what is already known about the analyte, including variables which are already known to cause variation within and between individuals. Review the methodology with particular attention to the impact of variables known to be associated with sample procurement, transport, processing, and storage.
- Define the reference population. Demographically, it should match the population whose laboratory results will be compared to this reference range. Based on what is already known about the analyte, consider whether separate reference ranges should be established for adults versus children, men versus women, and so forth.
- Choose a sampling method. Ideally, this method should yield a random sample of individuals representing the reference population.
- Collect, process, and test the specimens. It is important to treat these specimens exactly as patient specimens are treated.
- Analyze the results. This process can be broken down into the following steps:
a) Organize the results. A convenient format is a frequency histogram, as shown in the example in Table 4. In this example, the possible values for the analyte are shown in the column on the left, from the lowest at the top to the highest at the bottom. (It is true that the actual levels of most analytes form a continuum, not limited to a set of discrete numbers, but as results are rounded off, we work with a limited number of possibilities which represent regular intervals along the continuum.)
The next column shows the number of reference specimens on which that result value was recorded.
The body of the histogram is a graphic representation of these results. In this example, there is an “X” corresponding to each individual result.
The last column shows the cumulative rank for each value. In this format, the cumulative rank corresponds to the total number of “X’s” in that row and all the rows above it, representing the total number of samples having that result value or a lesser value.
Table 4. Frequency Histogram  

Value  Number of Observations  Histogram  Cumulative Rank  
1  4  xxxx  4 
2  15  xxxxxxxxxxxxxxx  19 
3  23  xxxxxxxxxxxxxxxxxxxxxxx  42 
4  25  xxxxxxxxxxxxxxxxxxxxxxxxx  67 
5  22  xxxxxxxxxxxxxxxxxxxxxx  89 
6  19  xxxxxxxxxxxxxxxxxxx  108 
7  12  xxxxxxxxxxxx  120 
8  2  xx  122 
9  0  
10  0  
11  0  
12  0  
13  1  x  123 
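The frequency histogram above can be rebuilt programmatically. Here is a minimal Python sketch using the counts from Table 4 (the variable names are my own):

```python
# Hypothetical result values and their counts, mirroring Table 4
counts = {1: 4, 2: 15, 3: 23, 4: 25, 5: 22, 6: 19, 7: 12, 8: 2, 13: 1}

cumulative = 0
for value in range(1, 14):
    n = counts.get(value, 0)
    cumulative += n
    bar = "x" * n  # one "x" per observation, as in the table
    # Print value, count, histogram bar, and cumulative rank
    print(f"{value:>2}  {n:>3}  {bar:<25}  {cumulative if n else '':>3}")
```

The cumulative rank in the last column reaches 123, the total number of reference specimens.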
b) Eliminate outliers. An outlier is an extreme value at one end of the data set which is so far away from the rest of the data set that the reference range might be substantially different if we include that point versus deleting it.
There is not a single universally accepted mathematical definition of the term “outlier,” so what we offer here is a “rule of thumb.” If the distance between the most extreme data point and its nearest neighbor is more than a third of the total range covered by the data set, then that data point should be regarded as an outlier and eliminated.
Reject the largest value X(n) when:
$\frac{X_{(n)} - X_{(n-1)}}{X_{(n)} - X_{(1)}} > \frac{1}{3}$ (1)
Reject the smallest value X(1) when:
$\frac{X_{(2)} - X_{(1)}}{X_{(n)} - X_{(1)}} > \frac{1}{3}$ (2)
After eliminating an outlier, reexamine the data set for new outliers which may become apparent only after reducing the range.
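As a sketch, the one-third rule of thumb in equations (1) and (2) can be applied iteratively until no further outliers appear (the function name is my own):

```python
def trim_outliers(values):
    """Apply the one-third rule of thumb: drop an extreme point when its
    gap to its nearest neighbor exceeds one third of the total range,
    then re-check the reduced set for newly apparent outliers."""
    data = sorted(values)
    changed = True
    while changed and len(data) >= 3:
        changed = False
        total_range = data[-1] - data[0]
        if total_range == 0:
            break
        # Reject the largest value X(n) when (X(n) - X(n-1)) / range > 1/3
        if (data[-1] - data[-2]) / total_range > 1 / 3:
            data.pop()
            changed = True
            continue
        # Reject the smallest value X(1) when (X(2) - X(1)) / range > 1/3
        if (data[1] - data[0]) / total_range > 1 / 3:
            data.pop(0)
            changed = True
    return data

# The lone value of 13 in Table 4 sits 5 units from its neighbor,
# more than a third of the 1-to-13 range, so it is rejected
print(trim_outliers([1, 2, 3, 4, 5, 6, 7, 8, 13]))
```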
c) Once the outliers have been eliminated, determine whether the data have a Gaussian distribution. Often, inspection of the histogram is sufficient to conclude that the distribution is clearly non-Gaussian. The more rigorous approach is to calculate the mean, median, mode, skewness, and kurtosis.
The mean is the average value:
$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$ (3)
The median is the value of the middle observation when the set is arranged in rank order:
$\text{median} = X_{((n+1)/2)}$ (4)
(For an even number of observations, the median is the average of the two middle values.)
The mode is the most frequently observed value.
Skewness is a measure of asymmetry.
Kurtosis is a measure of relative peakedness versus flatness.
(The formulas for calculating skewness and kurtosis are beyond the scope of this course but the functions are available in some of the popular software packages which include statistics.)
If the mean equals the median equals the mode, and the skewness and kurtosis are zero, then the distribution is Gaussian. For our purposes in this course, values for skewness and kurtosis of −1 to +1 are sufficiently near zero to be treated as such.
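Step (c) can be sketched with the Python standard library. The moment-based skewness and excess-kurtosis estimates used here are the standard ones (the chapter leaves their formulas out of scope):

```python
import statistics

def distribution_summary(data):
    """Summary statistics used to judge whether a distribution is
    near-Gaussian: mean, median, mode, plus moment-based skewness
    and excess kurtosis (so a Gaussian scores 0 for both)."""
    n = len(data)
    mean = statistics.fmean(data)
    sd = statistics.pstdev(data)
    skew = sum((x - mean) ** 3 for x in data) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in data) / (n * sd ** 4) - 3
    return {
        "mean": mean,
        "median": statistics.median(data),
        "mode": statistics.mode(data),
        "skewness": skew,
        "kurtosis": kurt,
    }

# A symmetric hypothetical data set: mean = median = mode, skewness = 0
summary = distribution_summary([1, 2, 2, 3, 3, 3, 4, 4, 5])
print(summary)
```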
d) Select a method for calculating the reference range
There are two basic methods: parametric and nonparametric.
If the distribution is Gaussian, a parametric method may be used. A minimum of 30 result values is required to calculate a reference range by this method. The reference range is defined as the mean plus or minus two standard deviations: Reference Range = $\bar{X} \pm 2SD$
If the distribution is non-Gaussian, a nonparametric method must be used. At least 120 result values (after elimination of outliers) are required to calculate a reference range by this method. The nonparametric method may also be used for a Gaussian distribution, as long as there are 120 result values in the data set. Three steps are involved:
First, arrange the results in order and assign a rank to each observation, so that
X(1) ≤ X(2) ≤ … ≤ X(n)
Second, calculate the ranks which correspond to the 2.5th and 97.5th percentiles:
0.025(n+1) and 0.975(n+1)
Third, find the values which correspond to those ranks. Those values are the lower and upper limits of the reference range (the central 95% of values).
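Both calculation methods can be sketched in a few lines of Python. The function names are illustrative, and the linear interpolation between ranked values is one common convention for fractional ranks:

```python
import statistics

def parametric_range(results):
    """Gaussian data: mean +/- 2 SD (the text requires at least 30 values)."""
    mean = statistics.fmean(results)
    sd = statistics.stdev(results)
    return mean - 2 * sd, mean + 2 * sd

def nonparametric_range(results):
    """Any distribution: central 95% by rank (the text requires 120+ values).
    Uses the 0.025(n+1) and 0.975(n+1) rank positions, interpolating
    linearly between neighboring ranked values for fractional ranks."""
    data = sorted(results)
    n = len(data)

    def value_at_rank(rank):  # rank is 1-based and may be fractional
        lo = int(rank)
        frac = rank - lo
        if frac == 0:
            return data[lo - 1]
        return data[lo - 1] + frac * (data[lo] - data[lo - 1])

    return value_at_rank(0.025 * (n + 1)), value_at_rank(0.975 * (n + 1))

# 120 evenly spaced hypothetical results from 1 to 120
low, high = nonparametric_range(range(1, 121))
print(low, high)
```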
Suppose we collect one more reference sample, and its result just happens to fall near one end of the result distribution. Can you see how this chance event might change the reference range a little, especially if we are using a minimum number of observations?
The point of going through this rather lengthy exercise is to demonstrate concretely that reference ranges are not exact. They represent approximations subject to a wide variety of influences. At best, they can be improved by increasing the size of the population samples on which they are based. Statistical methods exist for estimating the validity of reference ranges, but they are beyond the scope of this course.
USING REFERENCE RANGES
Figure 1 shows a graphic representation of a reference range. The area under the curve represents the entire population of healthy subjects from which the reference sample was drawn. The vertical lines near each end of the curve represent the upper and lower limits of the reference range. The area under the curve between the two vertical lines includes 95% of the reference population. The tails of the curve have been cut off, excluding the values obtained from approximately 5% of the healthy reference subjects from the reference range. Does this mean we should no longer consider those subjects healthy? No. Is there any reason for choosing to include 95% and to exclude 5%? The original selection of 95% limits was arbitrary; it is done now for the sake of convention. Unless otherwise specified, you may presume that the reference range provided with a laboratory result represents the middle (central) 95% of the reference population.
Figure 1: Graphic Representation of a Reference Range
We have a specific term for the values of healthy individuals which fall outside the limits of the reference range: we call them “false positives” regardless of whether they fall at the high or low end of the range. The terms “positive” and “negative” in this context have nothing to do with high or low numbers, but rather indicate positivity versus negativity for disease.
It follows that the more tests performed on an individual, the higher the expected percentage of "false positive" results. This can be calculated by the formula:
Percent False Positives = (1 − 0.95^{n}) × 100%, where "n" is the number of tests performed.
Thus, one test will yield 5% “false positives.”
Let's use a hypothetical reference range to interpret some patient results. The first result we'll consider happens to fall outside the range. What are the possibilities? Either the patient is sick, or the patient is well but has a test result which is an outlier or a false positive (either of which would have been excluded from the reference range calculations), or is demographically different from the reference population, or the specimen was collected under different conditions or handled differently from the reference specimens. Now let's consider a result that falls inside the range. Either the patient's result is within the range because the patient is well, or the patient does in fact have the disease or condition we are considering but the test result just doesn't show it, for any of the reasons mentioned in the first example, or simply because the presence of disease doesn't necessarily produce abnormal laboratory results.
Meaningful interpretation of laboratory data requires an understanding of the test results to be expected for patients having various diseases and conditions, as well as for healthy individuals. It would be ideal if such result distributions for disease never overlapped with the so-called "reference ranges," but since that is rarely the case, the next section addresses the interpretation of overlapping result distributions.
Finally, the number of patterns of test results is given by the following formula:
# Patterns of Test Results = X^{n}, where “X” is the number of types of results and “n” is the number of tests.
Thus, for three types of results (e.g., “low,” “normal,” and “high”) and two tests, there would be nine patterns possible.
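Both formulas above are easy to verify numerically. A small Python sketch (function names are mine):

```python
def percent_false_positives(n_tests):
    """Chance that at least one of n independent tests on a healthy
    person falls outside its 95% reference range: (1 - 0.95^n) x 100%."""
    return (1 - 0.95 ** n_tests) * 100

def n_result_patterns(n_types, n_tests):
    """Number of distinct patterns of test results: X^n."""
    return n_types ** n_tests

print(round(percent_false_positives(1)))   # one test: 5%
print(round(percent_false_positives(20)))  # a 20-test panel: ~64%
print(n_result_patterns(3, 2))             # low/normal/high on two tests: 9
```

Note how quickly the false-positive percentage grows: a 20-test panel on a healthy patient yields at least one "abnormal" result almost two times out of three.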
Taxonomy of Overlapping Distributions
Figure 2 shows a pair of curves, representing hypothetical distributions of test results from two distinct populations, one healthy and one diseased with a region of overlap. The horizontal line represents the continuum of possible result values for the analyte we are measuring. The curve on the left represents the distribution of results from the healthy reference population. The curve on the right represents the distribution of results from a group of people known to have the disease. The vertical line represents the upper limit of the reference range.
All of the results which fall to the left of the vertical line are called negative. All of the results which fall to the right of the line are called positive. Some diseased patients have “negative” results, since part of their distribution falls to the left of the vertical line. We call this group of results “false negatives.” Conversely, when we defined our reference range we already acknowledged that a small group of healthy individuals would have “false positive” results. To complete the taxonomy, we call the results which accurately reflect the status of the individuals from which they came “true positives” and “true negatives,” respectively.
Sensitivity and specificity are performance characteristics of a test. To determine these characteristics, it is necessary to obtain test results on populations in whom the presence or absence of disease has been established by some method independent of this test.
Sensitivity is defined as the proportion of diseased subjects correctly classified by the test; i.e., the ability to detect a true positive in a person afflicted with the disease.
$\text{Sensitivity} = \frac{TP}{TP + FN}$ (5)
Figure 2: Overlapping Distribution of Test Results for Healthy and Diseased Subjects
Specificity is defined as the proportion of healthy subjects correctly classified; i.e., the ability to exclude a diagnosis in a healthy person.
$\text{Specificity} = \frac{TN}{TN + FP}$ (6)
A convenient format for arranging data is shown in Tables 4 and 5. Note that sensitivity refers only to the diseased population while specificity refers only to the healthy population. The relative sizes of the two populations do not affect sensitivity or specificity.
Table 4. Sensitivity and Specificity  

Number of Subjects with Positive Test  Number of Subjects with Negative Test  TOTALS  
Number of Subjects with Disease  TP  FN  TP + FN 
Number of Subjects without Disease  FP  TN  FP + TN 
TOTALS  TP + FP  FN + TN  TP + FP + FN + TN 
Table 5. Example of Sensitivity and Specificity  

Number of Subjects with Positive Test  Number of Subjects with Negative Test  TOTALS  
Number of Subjects with Disease  68  32  100 sensitivity = 68% 
Number of Subjects without Disease  2  98  100 specificity = 98% 
TOTALS  70  130  200 
Sensitivity and specificity tell us how well a test performs when run on groups of people in whom we already know the diagnosis. In clinical practice, we do not use tests this way. Often, we are running a test on one patient for whom we have not yet made a diagnosis. What we want to know about the test is the odds that the result will correctly classify our patient with respect to the diagnosis we are considering.
Predictive Values
Predictive values describe the odds that the results of a test will correctly classify an individual with respect to the disease or condition under consideration. To determine predictive values, we need to know the prevalence of the disease in the population we are testing. Prevalence is the fraction of the population which has the disease.
The predictive value of a positive test result is the fraction of positive test results which are correct, or the true positives divided by all the positives, both true and false.
$PV+ = \frac{TP}{TP + FP}$ (7)
The predictive value of a negative test is the fraction of all negative results which are correct, or the true negatives divided by all the negatives, both true and false.
$PV- = \frac{TN}{TN + FN}$ (8)
Using the hypothetical data provided in Table 5, we can calculate the predictive value of a positive result to be 97% and that of a negative result to be 75%.
It is important to recognize the impact of disease prevalence on predictive values. Tables 6 and 7 show two more hypothetical data sets. Both have the same sensitivity and specificity as shown in Table 5, but the prevalence has been decreased, demonstrating the impact on predictive values. In general, as the prevalence of disease increases, the predictive value of a positive test improves. As the prevalence of disease decreases, the predictive value of a negative test improves, and the predictive value of a positive test is diminished by increasing numbers of false positive results.
Table 6. Effect of Low Prevalence  

Positive  Negative  Total  
Diseased  68  32  100  
Healthy  20  980  1,000  Sensitivity = 68% 
Total  88  1,012  1,100  Specificity = 98% 
PV+ = 77%  PV− = 97%  Prevalence = $\frac{100}{1100}$ = 9% 
Table 7. Effect of Further Decrease in Prevalence  

Positive  Negative  Total  
Diseased  68  32  100  
Healthy  200  9,800  10,000  Sensitivity = 68% 
Total  268  9,832  10,100  Specificity = 98% 
PV+ = 25%  PV− = 99.7%  Prevalence = $\frac{100}{10100}$ = 1% 
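The effect of prevalence can be reproduced without building full tables, since predictive values depend only on sensitivity, specificity, and prevalence. A sketch (function name is mine):

```python
def predictive_values(sensitivity, specificity, prevalence):
    """PV+ and PV- for a test with the given sensitivity and specificity
    when applied to a population with the given disease prevalence.
    Works with population fractions rather than raw counts."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    return tp / (tp + fp), tn / (tn + fn)

# The same test (68% sensitive, 98% specific) at falling prevalence,
# mirroring Tables 5 through 7
for prevalence in (0.50, 100 / 1100, 100 / 10100):
    pv_pos, pv_neg = predictive_values(0.68, 0.98, prevalence)
    print(f"prevalence {prevalence:6.1%}: PV+ {pv_pos:.0%}, PV- {pv_neg:.1%}")
```

As prevalence falls from 50% to 1%, PV+ collapses from 97% to 25% while PV− climbs toward 100%, exactly as in the tables.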
Intraindividual vs. Analytical Variation
Up to this point we have focused on the interpretation of individual results relative to group results, or analysis of interindividual variation. We will now focus on determining whether a change in a patient's laboratory values following therapeutic intervention is large enough to be detected analytically.
How do you know if therapy has changed the patient's lab values? The way to determine whether the patient's lab values have actually changed is to check whether the difference between the first and a subsequent measurement is greater than 3 times the standard deviation of the assay. If the difference between the two measurements is greater than 3 times the standard deviation of the assay, you can be 95% confident that the difference between the two measurements is not due to chance (Reference: Kaplan LA, Pesce AJ, and Kazmierczak SC, Clinical Chemistry: Theory, Analysis, Correlation, 4th edition, St. Louis: Mosby, 2003, page 385).
Figure 3: Quality Control Chart for Sodium with a Mean of 113.2 and a Standard Deviation of 0.5 mmol/L
EXAMPLE
Example: A patient had a sodium of 120 mmol/L on day 1. After treatment the patient’s sodium increased to 126 mmol/L. Has the patient become less hyponatremic?
In order to answer this question you need to know the analytical variation of the lab’s sodium analysis. This can be determined by calling the lab and asking what the standard deviation of the sodium assay is. The laboratory runs quality control specimens with each batch of samples and will know the analytical variability of the assay in the range of interest. A typical quality control chart is shown in Figure 3. The standard deviation for this sodium control is 0.5 mmol/L. Three times 0.5 mmol/L is 1.5 mmol/L. Since the observed change (6 mmol/L) is greater than 1.5 mmol/L, you can be 95% confident that the patient’s sodium has increased to a degree which can be detected analytically.
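A couple of lines suffice to encode this decision rule (the function name is illustrative; the 3 x SD factor follows the text):

```python
def change_is_significant(first, second, assay_sd, factor=3):
    """Is the difference between two serial results larger than `factor`
    times the assay's analytical SD? The chapter uses 3 SD as the
    threshold for 95% confidence that the change is not due to chance."""
    return abs(second - first) > factor * assay_sd

# Worked example from the text: sodium 120 -> 126 mmol/L, assay SD 0.5
print(change_is_significant(120, 126, 0.5))  # True: 6 mmol/L > 1.5 mmol/L
```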
The primary reason to run controls is to assess whether the test system is functioning properly and generating reliable test results. When the technologists in the laboratory examine the results obtained on the control sample, they are expecting some variation, and they are trying to distinguish between two possible sources of variation: random analytic variation and systematic error.
Random analytic variation is inevitable and all its points will fall within a Gaussian distribution. Systematic error occurs when some new variable is introduced, such as deterioration of a reagent, clogging of a tube within the instrument, etc. The problem with systematic error is that it is likely to compromise the accuracy of test results.
Each time a technologist sets up an analytical test for patient samples, he or she first calibrates the assay with standards containing known concentrations of the analyte. To ensure that the calibration is accurate, the technologist then runs a series of controls. Typically, controls are run at three levels: low, normal, and elevated concentrations. The technologist then checks the control values to see whether the results fall within the pattern of a Gaussian distribution before analyzing and reporting patient samples. Once the mean and standard deviation have been established for a particular control for a particular analyte, 95% of the control results should fall within ± 2 SD of the mean. About one in 20 results will fall outside these limits, but within 3 SD. If this is just a random event, the next time the control is tested the result will return to within 2 SD 95% of the time. If it persists outside 2 SD, this is interpreted as most likely a sign of some systematic error, and the technologist proceeds to investigate and take corrective action. Likewise, if any result is more than 3 SD from the mean, it is interpreted as probable systematic error and treated as such.
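As a sketch, the accept/warn/reject logic described above might look like the following. The flag strings are my own, and real laboratories apply more formal multi-rule schemes (e.g., Westgard rules):

```python
def qc_flag(result, mean, sd):
    """Classify a single control result against established limits:
    within 2 SD is acceptable; between 2 and 3 SD warrants repeating
    the control; beyond 3 SD is treated as probable systematic error."""
    z = abs(result - mean) / sd
    if z <= 2:
        return "accept"
    if z <= 3:
        return "warning: repeat control"
    return "reject: probable systematic error"

# Sodium control with mean 113.2 and SD 0.5 mmol/L, as in Figure 3
for result in (113.0, 114.5, 115.0):
    print(result, "->", qc_flag(result, mean=113.2, sd=0.5))
```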
Clinical laboratory technologists do not issue test results if the control results do not fall within established limits. Their objective is to generate the most accurate results possible given the methodology and instrumentation available. The principles of quality control are a major component of their education and training. As Point-of-Care testing becomes more widespread, it is important for pharmacists and other care providers to understand the need to verify the integrity of the test system before using it.
there is a ton of info on each page. i understand that each page corresponds to a lecture, but is it possible to keep this structure and additionally have a page for each term (e.g. "Predictive Values") as well? it would also be nice to have links to other wiki pages for technical terms. this way it would be more digestible.
Thanks for your suggestions. We will discuss your ideas at our next wiki group meeting and see what we can incorporate. Did you have any specific suggestions for links?
Thanks again for your comments,
Rob Fitzgerald
Hi,
I have a question.
My laboratory participates in a PT program.
The institution that sends the samples and then analyzes the results uses 1 SD as the pass/fail criterion.
I looked at different standards, including ISO 13528, which clearly defines the acceptable range for results as +/- 2 SD.
Also, the CVs for the tests received from different participants vary between 23 and 59. All results reported by my laboratory remain below 1.7 SD and I would qualify those results as valid; however, they received a fail code.
I would appreciate your comments,
Regards,
Roger
I wonder why this question has not been answered so far?!!
But my guess for this is that the program assesses accuracy rather than precision (as all external QC programs do), and this is calculated by the % deviation of your measured value from the "true" or "expected" value. Presumably they used 1 SD as an estimate of the limit of deviation, which should not exceed 2%.
what are the types of reference values?
How do we establish reference values, and are reference values and reference ranges the same thing?
When we compare instruments twice a year, what is an appropriate means of determining whether they agree, and what should we do if they do not?
We have 8 remote sites that have different models of the same vendor's instruments. How do you establish a reference interval, or explain the difference to the medical staff?
Appreciatively
CLIA provides guidelines on instrument comparison.
One way to approach this (this is by no means comprehensive or the best way, but one of many; I am trying to be as specific as possible to be useful):
You could use 20 patient samples with a wide range of values to correlate the two instruments (Deming regression is a good place to start).
Set your own acceptance criteria based on clinical considerations. If the slope is acceptable and close to 1 (and r^2 close to 1), you are on track. If there is an intercept, determine if it is statistically significant (p-value). If the slopes are different, determine whether it is something to do with instrumentation, reagents, or any other experimental factor (troubleshoot). If the slopes are truly different, determine their clinical significance and develop a correction factor, if possible. But if it changes every six months you probably have some sort of systemic lab problem you might want to investigate further. If the slopes and intercepts are clinically comparable, or if a valid correction factor can be applied to make the instruments report the same value, you have nothing to explain to the medical staff. If the instruments are significantly different you have a more complex issue. I know this is not fully helpful but..
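To make the suggestion concrete, here is a minimal Deming regression sketch (delta = 1, i.e., orthogonal regression, assuming equal error variance on both instruments; for real method-comparison work a validated statistics package is preferable):

```python
import statistics

def deming_fit(x, y, delta=1.0):
    """Deming regression: fits y = slope * x + intercept allowing
    measurement error in both instruments. `delta` is the ratio of the
    two methods' error variances; delta=1 gives orthogonal regression."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = (syy - delta * sxx +
             ((syy - delta * sxx) ** 2 + 4 * delta * sxy ** 2) ** 0.5) / (2 * sxy)
    return slope, my - slope * mx

# Hypothetical comparison: instrument B reads 2 units higher than A
x = [10, 25, 40, 80, 120, 150]
y = [xi + 2 for xi in x]
slope, intercept = deming_fit(x, y)
print(round(slope, 3), round(intercept, 3))  # slope ~1, intercept ~2
```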
Hi,
I have a question. In our hematology lab we run daily QC with commercially available high, normal, and low controls. However, because of the high cost (I am from a third world country) and unreliable supplies, we are looking at running patient samples for QC. That is to say, we run the QC samples supplied by the manufacturer and ensure that the instrument is functioning properly. Then we run patient samples and select 3 patient samples with high, normal, and low values to be run with the next batch of samples (within 6 hours). Is such a QC run reliable? How do I calculate the SD (accepted range) for these samples?
Thank you