Clinicians spend a great deal of time diagnosing complaints or abnormalities in their patients, generally arriving at a diagnosis after applying various diagnostic tests. Clinicians should be familiar with basic principles when interpreting diagnostic tests. This chapter deals with those principles.
A diagnostic test is ordinarily understood to mean a test performed in a laboratory, but the principles discussed in this chapter apply equally well to clinical information obtained from history, physical examination, and imaging procedures. They also apply when a constellation of findings serves as a diagnostic test. Thus, one might speak of the value of prodromal neurologic symptoms, headache, nausea, and vomiting in diagnosing classic migraine, or of hemoptysis and weight loss in a cigarette smoker as an indication of lung cancer.
In Chapter 3, we pointed out that clinical measurements, including data from diagnostic tests, are expressed on nominal, ordinal, or interval scales. Regardless of the kind of data produced by diagnostic tests, clinicians generally reduce the data to a simpler form to make them useful in practice. Most ordinal scales are examples of this simplification process. Heart murmurs can vary from very loud to barely audible, but trying to express subtle gradations in the intensity of murmurs is unnecessary for clinical decision making.
A simple ordinal scale—grades I to VI—serves just as well. More often, complex data are reduced to a simple dichotomy (e.g., present/absent, abnormal/normal, or diseased/well). This is done particularly when test results are used to help determine treatment decisions, such as the degree of anemia that requires transfusion. For any given test result, therapeutic decisions are either/or decisions; either treatment is begun or it is withheld. When there are gradations of therapy according to the test result, the data are being treated in an ordinal fashion.
The use of blood pressure data to decide about therapy is an example of how information can be simplified for practical clinical purposes. Blood pressure is ordinarily measured to the nearest 1 mm Hg (i.e., on an interval scale). However, most hypertension treatment guidelines, such as those of the Joint National Committee on the Detection, Evaluation, and Treatment of Hypertension (1), choose a particular level (e.g., 140 mm Hg systolic pressure or 90 mm Hg diastolic pressure) at which to initiate drug treatment. In doing so, they transform interval data into dichotomous data.
To take the example further, turning the data into an ordinal scale, the Joint National Committee also recommends that physicians choose a treatment plan according to whether the patient’s blood pressure is “prehypertension” (systolic 120 to 139 mm Hg or diastolic 80 to 89 mm Hg), “stage 1 hypertension” (systolic 140 to 159 mm Hg or diastolic 90 to 99 mm Hg), or “stage 2 hypertension” (systolic ≥160 mm Hg or diastolic ≥100 mm Hg).
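The transformation of interval blood pressure data into ordinal categories can be sketched as a short function. The thresholds come from the text; the lowest category (below 120/80 mm Hg), which the text does not name, is labeled "normal" here for completeness.

```python
def jnc_category(systolic, diastolic):
    """Map an interval-scale blood pressure reading (mm Hg) to the
    ordinal categories described in the text. The patient is assigned
    the higher (worse) category indicated by either reading."""
    if systolic >= 160 or diastolic >= 100:
        return "stage 2 hypertension"
    if systolic >= 140 or diastolic >= 90:
        return "stage 1 hypertension"
    if systolic >= 120 or diastolic >= 80:
        return "prehypertension"
    return "normal"  # below 120/80; category name assumed here

print(jnc_category(135, 85))   # prints "prehypertension"
print(jnc_category(150, 85))   # prints "stage 1 hypertension"
```

Note how a reading such as 135/95 mm Hg falls into stage 1 hypertension: the worse of the two component categories governs the classification.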
The Accuracy of a Test Result
Diagnosis is an imperfect process, resulting in a probability rather than a certainty of being right. The doctor’s certainty or uncertainty about a diagnosis has been expressed by using terms such as “rule out” or “possible” before a clinical diagnosis. Increasingly, clinicians express the likelihood that a patient has a disease as a probability. That being the case, it behooves the clinician to become familiar with the mathematical relationships between the properties of diagnostic tests and the information they yield in various clinical situations. In many instances, understanding these issues will help the clinician reduce diagnostic uncertainty. In other situations, it may only increase understanding of the degree of uncertainty. Occasionally, it may even convince the clinician to increase his or her level of uncertainty.
A simple way of looking at the relationships between a test’s results and the true diagnosis is shown in Figure 8.1. The test is considered to be either positive (abnormal) or negative (normal), and the disease is either present or absent. There are then four possible types of test results, two that are correct (true) and two that are wrong (false). The test has given the correct result when it is positive in the presence of disease (true positive) or negative in the absence of the disease (true negative). On the other hand, the test has been misleading if it is positive when the disease is absent (false positive) or negative when the disease is present (false negative).
Figure 8.1. The relationship between a diagnostic test result and the occurrence of disease.
There are two possibilities for the test result to be correct (true positive and true negative) and two possibilities for the result to be incorrect (false positive and false negative).
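The four cells of Figure 8.1 can be written as a small function; a minimal sketch, with names chosen for illustration:

```python
def classify_result(test_positive, disease_present):
    """Label a single test result against the true disease state,
    following the four cells of Figure 8.1."""
    if test_positive and disease_present:
        return "true positive"
    if test_positive and not disease_present:
        return "false positive"
    if not test_positive and disease_present:
        return "false negative"
    return "true negative"

print(classify_result(True, False))   # prints "false positive"
```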
The Gold Standard
A test’s accuracy is considered in relation to some way of knowing whether the disease is truly present or not—a sounder indication of the truth often referred to as the gold standard (or reference standard or criterion standard). Sometimes the standard of accuracy is itself a relatively simple and inexpensive test, such as a rapid streptococcal antigen test (RSAT) for group A streptococcus to validate the clinical impression of strep throat or an antibody test for human immunodeficiency virus infection. More often, one must turn to relatively elaborate, expensive, or risky tests to be certain whether the disease is present or absent. Among these are biopsy, surgical exploration, imaging procedures, and of course, autopsy.
For diseases that are not self-limited and ordinarily become overt over several months or even years after a test is done, the results of follow-up can serve as a gold standard. Screening for most cancers and chronic, degenerative diseases falls into this category. For them, validation is possible even if on-the-spot confirmation of a test’s performance is not feasible because the immediately available gold standard is too risky, involved, or expensive. If follow-up is used, the length of the follow-up period must be long enough for the disease to declare itself, but not so long that new cases can arise after the original testing (see Chapter 10).
Because it is almost always more costly, more dangerous, or both to use more accurate ways of establishing the truth, clinicians and patients prefer simpler tests to the rigorous gold standard, at least initially. Chest x-rays and sputum smears are used to determine the cause of pneumonia, rather than bronchoscopy and lung biopsy for examination of the diseased lung tissue.
Electrocardiograms and blood tests are used first to investigate the possibility of acute myocardial infarction, rather than catheterization or imaging procedures. The simpler tests are used as proxies for more elaborate but more accurate or precise ways of establishing the presence of disease, with the understanding that some risk of misclassification results. This risk is justified by the safety and convenience of the simpler tests. But simpler tests are only useful when the risks of misclassification are known and are acceptably low. This requires a sound comparison of their accuracy to an appropriate standard.
Lack of Information on Negative Tests
The goal of all clinical studies aimed at describing the value of diagnostic tests should be to obtain data for all four of the cells shown in Figure 8.1. Without all these data, it is not possible to fully evaluate the accuracy of the test. Most information about the value of a diagnostic test is obtained from clinical, not research, settings. Under these circumstances, physicians are using the test in the care of patients.
Because of ethical concerns, they usually do not feel justified in proceeding with more exhaustive evaluation when preliminary diagnostic tests are negative. They are naturally reluctant to initiate an aggressive workup, with its associated risk and expense, unless preliminary tests are positive. As a result, data on the number of true negatives and false negatives generated by a test (cells c and d in Fig. 8.1) tend to be much less complete in the medical literature than data collected about positive test results.
This problem can arise in studies of screening tests because individuals with negative tests usually are not subjected to further testing, especially if the testing involves invasive procedures such as biopsies. One method that can get around this problem is to make use of stored blood or tissue banks. An investigation of prostate-specific antigen (PSA) testing for prostate cancer examined stored blood from men who subsequently developed prostate cancer and men who did not develop prostate cancer. The results showed that for a PSA level of 4.0 ng/mL, sensitivity over the subsequent 4 years was 73% and specificity was 91%. The investigators were able to fill in all four cells without requiring further testing on people with negative test results. (See the following text for definitions of sensitivity and specificity.)
Lack of Information on Test Results in the Nondiseased
Some types of tests are commonly abnormal in people without disease or complaints. When this is so, the test’s performance can be grossly misleading when the test is applied to patients with the condition or complaint.
Magnetic resonance imaging (MRI) of the lumbar spine is used in the evaluation of patients with low back pain. Many patients with back pain show herniated intervertebral discs on MRI, and the pain is often attributed to the finding. But, how often are vertebral disc abnormalities found in people who do not have back pain? Several studies done on subjects without a history of back pain or sciatica have found herniated discs in 22% to 58% and bulging discs in 24% to 79% of asymptomatic subjects with mean ages ranging from 35 to more than 60 years (3). In other words, vertebral disc abnormalities are common and may be a coincidental finding in a patient with back pain.
Lack of Objective Standards for Disease
For some conditions, there are simply no hard-and-fast criteria for diagnosis. Angina pectoris is one of these. The clinical manifestations were described nearly a century ago, yet there is still no better way to substantiate the presence of angina pectoris than a carefully taken history. Certainly, a great many objectively measurable phenomena are related to this clinical syndrome, for example, the presence of coronary artery stenosis on angiography, delayed perfusion on a thallium stress test, and characteristic abnormalities on electrocardiograms both at rest and with exercise. All are more commonly found in patients believed to have angina pectoris, but none is so closely tied to the clinical syndrome that it can serve as the standard by which the condition is considered present or absent.
Other examples of medical conditions difficult to diagnose because of the lack of simple gold standard tests include hot flashes, Raynaud’s disease, irritable bowel syndrome, and autism. In an effort to standardize practice, expert groups often develop lists of symptoms and other test results that can be used in combination to diagnose the clinical condition. Because there is no gold standard, however, it is possible that these lists are not entirely correct.
Circular reasoning can occur—the validity of a laboratory test is established by comparing its results to a clinical diagnosis based on a careful history of symptoms and a physical examination, but once established, the test is then used to validate the clinical diagnosis gained from history and physical examination!
Consequences of Imperfect Gold Standards
Because of such difficulties, it is sometimes not possible for physicians in practice to find information on how well the tests they use compare with a thoroughly trustworthy standard. They must choose as their standard of validity another test that admittedly is imperfect, but is considered the best available. This may force them into comparing one imperfect test against another, with one being taken as a standard of validity because it has had longer use or is considered superior by a consensus of experts.
In doing so, a paradox may arise. If a new test is compared with an old (but imperfect) standard test, the new test may seem worse even though it is actually better. For example, if the new test were more sensitive than the standard test, the additional patients identified by the new test would be considered false positives in relation to the old test. Similarly, if the new test is more often negative in patients who really do not have the disease, results for those patients would be considered false negatives compared with the old test. Thus, when judged against an imperfect gold standard, a new test can seem to perform no better than that standard, and it may even appear inferior precisely when it approximates the truth more closely, unless special strategies are used.
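The paradox can be illustrated numerically. The sketch below uses hypothetical values: an old standard test with sensitivity 0.80 and specificity 0.90, a disease prevalence of 0.20, and a new test that is (unknown to the investigators) perfectly accurate. Judged against the old test as the standard, the perfect new test appears to have a sensitivity of only about 67%.

```python
# Hypothetical illustration of the imperfect-gold-standard paradox.
se_old, sp_old, prev = 0.80, 0.90, 0.20

# Probability of each result of the old (standard) test.
p_old_pos = se_old * prev + (1 - sp_old) * (1 - prev)   # 0.24
p_old_neg = 1 - p_old_pos                               # 0.76

# The perfect new test is positive exactly when disease is present,
# so its *apparent* sensitivity against the old standard is the
# probability of disease given a positive old test (the old test's
# positive predictive value), and likewise for apparent specificity.
apparent_se = (se_old * prev) / p_old_pos               # 0.16 / 0.24
apparent_sp = (sp_old * (1 - prev)) / p_old_neg         # 0.72 / 0.76

print(f"apparent sensitivity = {apparent_se:.2f}")  # prints 0.67
print(f"apparent specificity = {apparent_sp:.2f}")  # prints 0.95
```

The perfect test's extra true positives are counted as "false positives" relative to the old standard, so its apparent sensitivity collapses toward the old test's positive predictive value.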
Computed tomographic (“virtual”) colonoscopy was compared to traditional (optical) colonoscopy in screening for colon cancer and adenomatous polyps that can be precursors of cancer (4). Both tests were performed on every patient without the clinician interpreting each test knowing the results of the other test. Traditional colonoscopy is usually considered the gold standard for identifying colon cancer or polyps in asymptomatic adults. However, virtual colonoscopy identified more colon cancers and adenomatous polyps (especially those behind folds in the colon) than the traditional colonoscopy. In order not to penalize the new test in comparison to the old, the investigators ingeniously created a new gold standard—a repeat optical colonoscopy after reviewing the results of both testing procedures—whenever there was disagreement between the tests.
Sensitivity and Specificity
Figure 8.2 summarizes some relationships between a diagnostic test and the actual presence of disease. It is an expansion of Figure 8.1, with the addition of some useful definitions. Most of the remainder of this chapter deals with these relationships in detail.
Figure 8.2. Diagnostic test characteristics and definitions.
Se = sensitivity; Sp = specificity; P = prevalence; PV = predictive value; LR = likelihood ratio. Note that LR+ is calculated as Se/(1 − Sp) and LR− as (1 − Se)/Sp.
Figure 8.3 illustrates these relationships with an actual study (5). Deep venous thrombosis (DVT) in the lower extremities is a serious condition that can lead to pulmonary embolism; patients with DVT should receive anticoagulation. However, because anticoagulation has risks, it is important to differentiate between patients with and without DVT. Compression ultrasonography is highly sensitive and specific for proximal thrombosis and has been used to confirm or rule out DVT.
Compression ultrasonography is expensive and dependent on highly trained personnel, so a search for a simpler diagnostic test was undertaken. Blood tests to identify markers of endogenous fibrinolysis, D-dimer assays, were developed and evaluated for the diagnosis of DVT. Figure 8.3 shows the performance of a D-dimer assay in the diagnosis of DVT. The gold standard in the study was the result of compression ultrasonography and/or a 3-month follow-up.
Figure 8.3. Diagnostic characteristics of a D-dimer assay in diagnosing deep venous thrombosis (DVT).
(Data from Bates SM, Kearon C, Crowther M, et al. A diagnostic strategy involving a quantitative latex D-dimer assay reliably excludes deep venous thrombosis. Ann Intern Med 2003;138:787–794.)