CHAPTER 2: Evaluating Clinical Evidence

Introduction

Excellence in clinical care requires integrating clinical expertise, patient preferences, and the best available clinical evidence.[1]

Carefully study the clear descriptions of how the history and physical examination can be viewed as diagnostic tests; how to assess the accuracy of laboratory tests, radiographic imaging, and diagnostic procedures; and how to evaluate clinical research studies and disease prevention guidelines. Mastering these analytic skills will improve your clinical practice and ensure that your assessments and recommendations are based on the best clinical evidence (Fig. 2-1).

FIGURE 2-1 Evidence-based clinical practice Venn diagram.

(Adapted with permission from Haynes RB, Sackett DL, Gray JM. Transferring evidence from research into practice: 1. The role of clinical care research evidence in clinical decisions. ACP J Club. 1996;125:A14–A16.)

You will develop your clinical expertise as you learn about and practice your clinical discipline, enabling you to more efficiently make diagnoses and identify potential interventions. Chapter 3 addresses strategies for engaging patients in health care decisions, recognizing that patients bring individualized preferences, concerns, and expectations to the clinical encounter.

Elements of the history and physical examination can be considered diagnostic tests, whose accuracy can be evaluated according to criteria presented later in this chapter. Throughout the regional examination chapters, you will find evidence-based recommendations for health promotion interventions, especially screening and prevention. These recommendations are also based on evidence from the clinical literature that can be evaluated according to criteria presented in this chapter.

The History and Physical Examination as Diagnostic Tests

The process of diagnostic reasoning begins with the history. As you learn about your patient, you will start to develop a differential diagnosis. This is a list of potential causes for the patient’s problems, and the length of the list will reflect your uncertainty about the possible explanation for a given problem. Your list will start with the most likely explanation but will also include other plausible diagnoses, particularly those that have serious consequences if undiagnosed and untreated.

You will assign probabilities to the various diagnoses that correspond to how likely you consider them to be explanations for your patient’s problem. For now, these probabilities will be based on what you have learned from textbooks and lectures. In time, these probability estimates will also reflect your clinical experience.

When you begin approaching clinical problems, your goal is to determine whether you need to perform additional testing (Fig. 2-2).[2]

FIGURE 2-2 Probability revisions.

(Adapted with permission from Guyatt G, Rennie D, Meade M. Users’ Guides to the Medical Literature. 2nd ed. New York, NY: McGraw-Hill Company; 2008; Chapter 14, Figure 14-2.)

If your probability for a disease based on your history and examination is very high (i.e., exceeds the treatment threshold), then you can move ahead and initiate treatment. Conversely, if your probability for a disease is very low (i.e., below the test threshold), then you do not need further testing. The area between the test and treatment thresholds represents clinical uncertainty, and you need further testing to revise probabilities and guide your clinical management.

The expectation is that test results will enable you to cross a test-treatment threshold. You should understand that these test-treatment thresholds are not set in stone and will vary based on the potential adverse effects of the treatment and the seriousness of the condition. For example, you will require a much higher treatment threshold (confidence that the patient has a high probability of having the disease) for initiating cancer chemotherapy compared to prescribing an antibiotic for a urinary tract infection.

You would require a much lower test threshold (confidence that the patient has a low probability of having the disease) when excluding ischemic heart disease than when excluding bacterial sinusitis. However, knowing whether a test result will move the probability across a threshold can be challenging and requires you to understand how to evaluate the performance of a diagnostic test.
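The threshold reasoning above can be sketched in a few lines of Python. This is only an illustration: the function name `next_step` and the threshold values are hypothetical, chosen for the example, and are not clinical recommendations.

```python
def next_step(probability: float, test_threshold: float,
              treatment_threshold: float) -> str:
    """Place a disease probability in one of the three zones around the thresholds."""
    if not 0.0 <= test_threshold <= treatment_threshold <= 1.0:
        raise ValueError("need 0 <= test_threshold <= treatment_threshold <= 1")
    if probability < test_threshold:
        return "no further testing"   # disease is unlikely enough to stop
    if probability > treatment_threshold:
        return "treat"                # disease is likely enough to act
    return "test further"             # zone of clinical uncertainty


# Hypothetical thresholds, for illustration only (not clinical guidance).
for p in (0.02, 0.40, 0.90):
    print(p, next_step(p, test_threshold=0.05, treatment_threshold=0.80))
```

Raising the treatment threshold (as for chemotherapy) or lowering the test threshold (as for ischemic heart disease) widens the zone in which further testing is needed.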

Evaluating Diagnostic Tests

You can turn to the clinical literature to determine how results from diagnostic tests—which include elements of the clinical history and physical examination, as well as laboratory tests, radiographic imaging, and procedures—can be used to revise probabilities. Two concepts in evaluating diagnostic tests will be explored: the validity of the findings and the reproducibility of the test results.

Validity

The initial step in evaluating a diagnostic test is to determine whether it provides valid results. Does the test accurately identify whether a patient has a disease? This involves comparing the test against a gold standard—the best measure of whether a patient has disease. This could be a biopsy to evaluate a lung nodule, a structured psychiatric examination to evaluate a patient for depression, or a colonoscopy to evaluate a patient with a positive stool blood test.

The 2 × 2 table is the basic format for evaluating the performance characteristics of a diagnostic test, that is, how much the test results revise probabilities for disease.

There are two columns—patients with disease present and patients with disease absent. These categorizations are based on the gold standard test. The two rows correspond to positive and negative test results. The four cells (a, b, c, d) correspond to true positives, false positives, false negatives, and true negatives, respectively.[3]


Setting up the 2 × 2 Table

                 Gold Standard:        Gold Standard:
                 Disease Present       Disease Absent
Test positive    a (True positive)     b (False positive)
Test negative    c (False negative)    d (True negative)

Sensitivity and Specificity

The first test statistics to estimate are sensitivity and specificity.

Sensitivity and Specificity

  • Sensitivity is the probability that a person with disease has a positive test. This is represented as a/(a + c) in the disease present column of the 2 × 2 table. Sensitivity is also known as the true positive rate.
  • Specificity is the probability that a non-diseased person has a negative test, represented as d/(b + d) in the disease absent column of the 2 × 2 table. Specificity is also known as the true negative rate.
  • Examples. An example of sensitivity would be the probability that a patient with splenomegaly (see Chapter 11, p. 479) has percussion dullness below the left costal margin. Conversely, the probability that a patient without splenomegaly will have percussion dullness is the false positive rate (1 − specificity) for this physical maneuver.
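The column-based definitions above translate directly into code. The sketch below assumes the cell labels a, b, c, d from the 2 × 2 table; the counts are illustrative (a test with 90% sensitivity and specificity applied to 1,000 people), not data from a study.

```python
def sensitivity(a: int, c: int) -> float:
    """True positive rate: a / (a + c), from the disease-present column."""
    return a / (a + c)


def specificity(b: int, d: int) -> float:
    """True negative rate: d / (b + d), from the disease-absent column."""
    return d / (b + d)


def false_positive_rate(b: int, d: int) -> float:
    """1 - specificity, i.e. b / (b + d)."""
    return 1 - specificity(b, d)


# Illustrative counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives.
a, b, c, d = 90, 90, 10, 810
print(f"sensitivity = {sensitivity(a, c):.0%}")
print(f"specificity = {specificity(b, d):.0%}")
print(f"false positive rate = {false_positive_rate(b, d):.0%}")
```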

Knowing the sensitivity and specificity of a test does not necessarily help you make clinical decisions because they are statistics based on knowing whether the patient has disease. However, there are two exceptions. A negative result from a test with a high sensitivity (i.e., a very low false-negative rate) usually excludes disease.

This is represented by the acronym SnNOUT: a Sensitive test with a Negative result rules OUT disease. Conversely, a positive result in a test with high specificity (i.e., a very low false-positive rate) usually indicates disease. This is represented by the acronym SpPIN: a Specific test with a Positive result rules IN disease.[4]

Positive and Negative Predictive Values

The typical clinical scenario faced by clinicians involves determining whether a patient actually has disease based on a test result that is either positive or negative. The relevant test statistics here are the positive and negative predictive values.[3]


Positive and Negative Predictive Values

  • The positive predictive value (PPV) is the probability that a person with a positive test has disease, represented as a/(a + b) from the test positive row in the 2 × 2 table. An example of this statistic is found in prostate cancer screening (see Chapter 15, p. 612), where a man with a PSA value greater than 4.0 ng/mL has only a 30% probability of having prostate cancer found on biopsy.[5]
  • The negative predictive value (NPV) is the probability that a person with a negative test does not have disease, represented as d/(c + d) in the test negative row in the 2 × 2 table. Among men with a PSA level of 4.0 ng/mL or below, 85% are found to be cancer-free on biopsy.[6]
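The row-based definitions of the predictive values can be sketched the same way. The cell labels follow the 2 × 2 table; the counts are illustrative only and are not the PSA data cited above.

```python
def positive_predictive_value(a: int, b: int) -> float:
    """Probability of disease given a positive test: a / (a + b)."""
    return a / (a + b)


def negative_predictive_value(c: int, d: int) -> float:
    """Probability of no disease given a negative test: d / (c + d)."""
    return d / (c + d)


# Illustrative counts: a = true positives, b = false positives,
# c = false negatives, d = true negatives.
a, b, c, d = 90, 90, 10, 810
print(f"PPV = {positive_predictive_value(a, b):.1%}")
print(f"NPV = {negative_predictive_value(c, d):.1%}")
```

Note that sensitivity and specificity are read down the columns of the table, whereas the predictive values are read across the rows.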

Prevalence of Disease

Although the predictive value statistics seem intuitively useful, they will vary substantially according to the prevalence of disease (i.e., the proportion of patients in the disease present column). The prevalence is based on the characteristics of the patient population and the clinical setting. For example, the prevalence of many diseases will usually be higher among older patients and among patients being seen in specialist clinics or at referral hospitals.

The box below shows a 2 × 2 table where both the sensitivity and specificity of the diagnostic test are 90% and the prevalence (proportion of subjects that have the disease) is 10%. The positive predictive value calculated from the test positive row of the table would be 90/180 = 50%. This means that half of the people with a positive test have disease.


Predictive Values: Prevalence of 10% with Sensitivity and Specificity = 90%

                 Disease Present    Disease Absent    Total
Test positive    a = 90             b = 90            180
Test negative    c = 10             d = 810           820
Total            100                900               1,000

Sensitivity = a/(a + c) = 90/100 or 90%; specificity = d/(b + d) = 810/900 = 90%

However, if the sensitivity and specificity remained the same, but prevalence was only 1%, then the cells would look very different.


Predictive Values: Prevalence of 1% with Sensitivity and Specificity = 90%

                 Disease Present    Disease Absent    Total
Test positive    a = 9              b = 99            108
Test negative    c = 1              d = 891           892
Total            10                 990               1,000

Sensitivity = a/(a + c) = 9/10 or 90%; specificity = d/(b + d) = 891/990 = 90%

Now the positive predictive value calculated from the test positive row of the table would be 9/108 = 8.3%. The consequence is that the great majority of positive tests are false positives—meaning that most of the subjects who undergo gold standard tests (which are usually invasive, expensive, and potentially harmful) will not have disease.

This has implications for patient safety and resource allocation because clinicians want to limit the number of non-diseased patients who undergo gold standard tests. However, as shown by the example, predictive values will not necessarily provide us with sufficient guidance for using tests across populations with differing disease prevalence.
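The dependence of predictive values on prevalence can be checked numerically. The sketch below rebuilds the four cells of the 2 × 2 table from prevalence, sensitivity, and specificity; the function name `predictive_values` and the default population of 1,000 are assumptions made for the example.

```python
def predictive_values(prevalence: float, sens: float, spec: float,
                      n: int = 1000) -> tuple[float, float]:
    """Return (PPV, NPV) for a test applied to n people at a given prevalence."""
    diseased = n * prevalence
    healthy = n - diseased
    a = diseased * sens   # true positives
    c = diseased - a      # false negatives
    d = healthy * spec    # true negatives
    b = healthy - d       # false positives
    return a / (a + b), d / (c + d)


# Same test (90% sensitivity and specificity) at two prevalences.
for prevalence in (0.10, 0.01):
    ppv, npv = predictive_values(prevalence, sens=0.90, spec=0.90)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With the test characteristics held fixed, the positive predictive value falls from 50% to about 8.3% as prevalence drops from 10% to 1%, matching the two tables.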

Likelihood Ratios

Fortunately, there are other ways to evaluate the performance of a diagnostic test that can account for the varying disease prevalence observed in different patient populations. One way uses likelihood ratio statistics, defined as the probability of obtaining a given test result in a diseased patient divided by the probability of obtaining a given test result in a non-diseased patient.[3],[7] The likelihood ratio tells us how much a test result changes the pre-test disease probability (prevalence) to the post-test disease probability.

In the simplest case, we will assume that the test result is either positive or negative. Therefore, the likelihood ratio for a positive test is the probability of getting a positive test result in a diseased person divided by the probability of getting a positive test result in a non-diseased person.

From the 2 × 2 table, we see that this is the same as saying the ratio of the true positive rate (sensitivity) over the false positive rate (1 − specificity). A higher value (much >1) indicates that a positive test is much more likely to be coming from a diseased person than from a non-diseased person, increasing our confidence that a person with a positive result has disease.

The likelihood ratio for a negative test is the ratio of the probability of getting a negative test result in a diseased person divided by the probability of getting a negative test result in a non-diseased person.[7] From the 2 × 2 table, we see that this is the same as saying the ratio of the false negative rate (1 − sensitivity) divided by the true negative rate (specificity). A lower value (much <1) indicates that the negative test is much more likely to be coming from a non-diseased person than from a diseased person, increasing our confidence that a person with a negative result does not have disease.
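Both likelihood ratios follow from sensitivity and specificity alone, as this short sketch shows. The function name `likelihood_ratios` is an assumption for the example, and the 90%/90% inputs are illustrative.

```python
def likelihood_ratios(sens: float, spec: float) -> tuple[float, float]:
    """Return (LR for a positive test, LR for a negative test)."""
    if not (0 < spec < 1):
        # spec == 1 or spec == 0 would put a zero in a denominator
        raise ValueError("specificity must be strictly between 0 and 1")
    lr_positive = sens / (1 - spec)   # true positive rate / false positive rate
    lr_negative = (1 - sens) / spec   # false negative rate / true negative rate
    return lr_positive, lr_negative


lr_pos, lr_neg = likelihood_ratios(sens=0.90, spec=0.90)
print(f"LR+ = {lr_pos:.1f}, LR- = {lr_neg:.2f}")
```

Because neither ratio uses prevalence, the same values apply across populations with different baseline risks.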

The box below shows how to interpret likelihood ratios based on how much a test result changes the pre- to post-test probabilities for disease.[8]


Interpreting Likelihood Ratios

Likelihood Ratios         Effect on Pre- to Post-Test Probability
LRs >10 or <0.1           Generate large changes
LRs 5–10 or 0.1–0.2       Generate moderate changes
LRs 2–5 and 0.5–0.2       Generate small (sometimes important) changes
LRs 1–2 and 0.5–1         Alter the probability to a small degree (rarely important)
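As a rough aid, the interpretation bands in the box can be expressed as a lookup. This is a sketch under stated assumptions: the function name `lr_effect` is hypothetical, and the handling of values that fall exactly on a boundary (such as 5 or 10) is a choice made here, since the box does not specify it.

```python
def lr_effect(lr: float) -> str:
    """Classify a likelihood ratio by the size of the probability change it produces."""
    if lr <= 0:
        raise ValueError("likelihood ratio must be positive")
    # Put LRs below 1 on the same scale as LRs above 1 (0.1 behaves like 10).
    strength = lr if lr >= 1 else 1 / lr
    if strength > 10:
        return "large"
    if strength >= 5:
        return "moderate"
    if strength >= 2:
        return "small (sometimes important)"
    return "small (rarely important)"


for lr in (20, 9, 3, 1.5, 0.3, 0.05):
    print(lr, "->", lr_effect(lr))
```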


We will show how likelihood ratios can be used to revise probabilities for disease with the example of breast cancer screening.


How Likely Is It That a Woman with an Abnormal Mammogram Has Breast Cancer?

A 57-year-old woman at average risk for breast cancer has an abnormal mammogram. She wants to know the probability that she has breast cancer. The literature states that the baseline risk (prevalence) is 1%, the sensitivity of mammography is 90%, and the specificity is 91%.


Bayes Theorem

One way to use likelihood ratios to revise probabilities for disease is with the Bayes theorem.[4] This theorem requires converting the estimated prevalence (pre-test probability) to odds using the equation:

Pre-test odds = pre-test probability/(1 − pre-test probability)

The pre-test odds are multiplied by the likelihood ratio to estimate the post-test odds using the following equation:

Post-test odds = pre-test odds × likelihood ratio

The post-test odds are then converted to a probability using the equation:

Post-test probability = post-test odds/(1 + post-test odds)

For the example, the 1% prevalence represents the pre-test probability; this means that the pre-test odds are 0.01/0.99 or about 0.01. The likelihood ratio for a positive test is sensitivity/(1 − specificity), which is 90%/9% = 10. The pre-test odds are multiplied by this likelihood ratio (0.01 × 10) to give post-test odds of 0.10. The post-test odds are converted [0.1/(1 + 0.1)] to a post-test probability of about 9%.
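The three steps of the calculation (probability to odds, multiply by the likelihood ratio, odds back to probability) can be traced in code. This is a minimal sketch using the mammography figures from the example; the helper names are assumptions made for illustration.

```python
def probability_to_odds(p: float) -> float:
    """Odds = probability / (1 - probability)."""
    return p / (1 - p)


def odds_to_probability(odds: float) -> float:
    """Probability = odds / (1 + odds)."""
    return odds / (1 + odds)


pretest_probability = 0.01            # prevalence from the example
sens, spec = 0.90, 0.91               # mammography figures from the example

lr_positive = sens / (1 - spec)                        # about 10
pretest_odds = probability_to_odds(pretest_probability)
posttest_odds = pretest_odds * lr_positive
posttest_probability = odds_to_probability(posttest_odds)

print(f"LR+ = {lr_positive:.1f}")
print(f"post-test probability = {posttest_probability:.1%}")
```

Carrying the unrounded odds through gives a post-test probability of roughly 9.2%, in line with the "about 9%" obtained from the rounded hand calculation.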

Fagan Nomogram

If you are more comfortable thinking in terms of probability of having disease, then the Fagan nomogram may be an easier way for you to use likelihood ratios (Fig. 2-3).[9] With this nomogram, you read the pre-test probabilities from the line on the left, then take a straight edge and draw a line from the pre-test probability through the likelihood ratio in the middle line, and then read the post-test probability on the line on the right.

FIGURE 2-3 Fagan nomogram.

(Adapted with permission from Fagan TJ. Letter: nomogram for Bayes theorem. N Engl J Med. 1975;293:257.)

You can also use the Fagan nomogram to answer the mammography question (Fig. 2-3). The pre-test probability (prevalence) = 1% and the likelihood ratio for a positive test [sensitivity/(1 − specificity)] = 10. The blue line corresponds to the case of a positive test with a post-test probability of about 9%. If the mammogram result was negative (red line), then the likelihood ratio for a negative test [(1 − sensitivity)/specificity] would be 10%/91% = 0.11 and the post-test probability for breast cancer would be 0.1%.
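A nomogram reading can be cross-checked numerically, since the straight edge performs the same odds calculation graphically. The sketch below is a numeric stand-in, not a reproduction of the figure; the function name `post_test_probability` is an assumption, and the 10% and 50% pre-test probabilities are extra illustrative values beyond the 1% used in the text.

```python
def post_test_probability(pre_test_probability: float,
                          likelihood_ratio: float) -> float:
    """Convert to odds, apply the likelihood ratio, convert back to probability."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


lr_positive = 0.90 / (1 - 0.91)    # abnormal mammogram, about 10
lr_negative = (1 - 0.90) / 0.91    # normal mammogram, about 0.11

for pre in (0.01, 0.10, 0.50):
    pos = post_test_probability(pre, lr_positive)
    neg = post_test_probability(pre, lr_negative)
    print(f"pre-test {pre:.0%}: positive -> {pos:.1%}, negative -> {neg:.1%}")
```

Each printed row corresponds to one pair of lines drawn across the nomogram from the same pre-test probability.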

Natural Frequencies

Using frequency statements is another, perhaps more intuitive, alternative to likelihood ratios for determining how a test result will change the probability of disease.[9],[10] Natural frequencies represent the joint frequency of two events, such as the number of patients with disease and the number who have a positive test result. Start by taking a large number of people (e.g., 100 or 1,000, depending upon the prevalence) and break the number down into natural frequencies (i.e., how many of the people have disease, how many with disease will test positive, how many without disease will test positive).


Natural Frequencies to Answer the Mammography Question

We can use natural frequencies to answer the mammography question by creating a 2 × 2 table based on a population of 1,000 women. The 1% prevalence means that 10 women will have breast cancer. The sensitivity of 90% means that 9 of the women with breast cancer will have an abnormal mammogram. The specificity of 91% means that 89 of the 990 women without breast cancer will still have an abnormal mammogram. The probability that a woman with an abnormal mammogram will have breast cancer is 9/(9 + 89) = about 9%.

Mammogram Result    Breast Cancer    No Breast Cancer    Total
Positive            9                89                  98
Negative            1                901                 902
Total               10               990                 1,000

Data compiled from Gigerenzer G. What are natural frequencies? BMJ. 2011;343:d6386.
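The natural-frequency breakdown can also be generated programmatically. This sketch assumes whole-person counts obtained by rounding; the function name `natural_frequencies` is hypothetical.

```python
def natural_frequencies(population: int, prevalence: float,
                        sens: float, spec: float) -> tuple[int, int, int, int]:
    """Return whole-person counts (true pos, false pos, false neg, true neg)."""
    diseased = round(population * prevalence)
    healthy = population - diseased
    true_pos = round(diseased * sens)
    false_neg = diseased - true_pos
    false_pos = round(healthy * (1 - spec))
    true_neg = healthy - false_pos
    return true_pos, false_pos, false_neg, true_neg


tp, fp, fn, tn = natural_frequencies(1000, prevalence=0.01, sens=0.90, spec=0.91)
print(f"{tp} of the {tp + fp} women with an abnormal mammogram have breast cancer")
print(f"probability = {tp / (tp + fp):.1%}")
```

Working in counts of people avoids the odds conversion altogether, which is why this approach is often easier to explain to patients.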


Reproducibility

Kappa Score

Another characteristic of a diagnostic test is reproducibility.[3] An important aspect of evaluating diagnostic elements of the history or physical examination is determining the reproducibility of the findings for diagnosing a clinical disorder. When, for example, two clinicians examine a patient, they may not always agree upon the presence of a given finding. This raises the question of whether this finding is useful for diagnosing a clinical disorder.

By chance, if many patients are being examined, there will be a certain amount of agreement between the two clinicians. Understanding whether there is agreement well beyond chance, though, is important in knowing whether the finding is useful enough to support clinical decision making. The kappa score measures the amount of agreement that occurs beyond chance (Fig. 2-4).[12] The box shows how to interpret Kappa values.

FIGURE 2-4 Kappa scores.

(Adapted with permission from McGinn T, Wyer PC, Newman TB. Tips for learners of evidence-based medicine: 3. Measures of observer variability [kappa statistic]. CMAJ. 2004;171:1369–1379.)


Interpreting Kappa Values

Value of Kappa Strength of Agreement
<0.20 Poor
0.21–0.40 Fair
0.41–0.60 Moderate
0.61–0.80 Good
0.81–1.00 Excellent

Understanding Measure of Agreement between Different Observers

Suppose two clinicians agree 75% of the time on whether a patient has an abnormal physical finding. The expected agreement based on chance is 50%. This means that the potential agreement beyond chance is 50% and the actual observer agreement beyond chance is 25%. The kappa value is then 25%/50% = 0.5, which indicates moderate agreement.
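The kappa arithmetic in this example can be written as a small function. This is a sketch: the function names are assumptions, and the treatment of values that fall exactly on a band boundary is a choice made here for illustration.

```python
def kappa(observed_agreement: float, expected_agreement: float) -> float:
    """Agreement beyond chance divided by the potential agreement beyond chance."""
    if expected_agreement >= 1:
        raise ValueError("expected agreement must be below 1")
    return (observed_agreement - expected_agreement) / (1 - expected_agreement)


def strength_of_agreement(k: float) -> str:
    """Map a kappa value onto the bands in the interpretation box."""
    if k <= 0.20:
        return "Poor"
    if k <= 0.40:
        return "Fair"
    if k <= 0.60:
        return "Moderate"
    if k <= 0.80:
        return "Good"
    return "Excellent"


k = kappa(observed_agreement=0.75, expected_agreement=0.50)
print(k, strength_of_agreement(k))
```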

Precision

In the context of reproducibility, precision refers to being able to apply the same test to the same unchanged person and obtain the same results.[4] Precision is often used when referring to laboratory tests. For example, when measuring a troponin level for cardiac ischemia, clinicians might use a particular cutoff level to decide whether to admit a patient to a coronary care unit.

If the test results are imprecise, this could lead to admitting a patient without ischemic heart disease or sending a patient home with an ischemic event. A statistic used to characterize precision is the coefficient of variation, defined as the standard deviation divided by the mean value. Lower values indicate greater precision.
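The coefficient of variation is straightforward to compute from repeated measurements. In the sketch below, the replicate values are hypothetical numbers invented for illustration, and the use of the sample standard deviation (`statistics.stdev`) is an assumption; a laboratory might use a different convention.

```python
import statistics


def coefficient_of_variation(values: list[float]) -> float:
    """Sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical repeated measurements of the same unchanged specimen.
replicates = [0.040, 0.042, 0.038, 0.041, 0.039]
cv = coefficient_of_variation(replicates)
print(f"coefficient of variation = {cv:.1%}")
```

Because the statistic is a ratio, it allows precision to be compared across assays reported in different units or at different concentration ranges.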