Testing a Test is about how we use evidence to make decisions. We will begin by using the M.A.A.R.I.E. framework to better understand research articles on tests. That is, we will look at the method, assignment, assessment, results, interpretation, and extrapolation issues and see how they apply to the evaluation of a test. We will then apply and extend what we have learned to two increasingly common applications: screening for asymptomatic diseases and the development of prediction and decision rules.
Next, we will look at how we can combine information from various sources to compare two or more interventions, taking into account the harms and benefits of each, using the technique known as decision analysis. We will also look at how issues of cost can be incorporated by extending decision analysis into cost-effectiveness analysis. Using what we have learned, we will step back and see how evidence can be used as the basis for evidence-based recommendations. Finally, we will explore the emerging field of translational research and see how evidence can be translated into practice. Let us begin by taking a look at how we can apply the M.A.A.R.I.E. framework to diagnostic tests.
Testing A Test
Using the information obtained from tests to make decisions has become an integral part of health care. Thus, it is not surprising that studies designed to measure the information provided by tests are an increasingly important form of investigation. We will examine these types of investigations by using the M.A.A.R.I.E. approach to look at method, assignment, assessment, results, interpretation, and extrapolation.
The M.A.A.R.I.E. framework is designed for use in critically reviewing research articles, including articles on diagnostic testing. Most students and practitioners encounter tests designed for diagnostic purposes in the clinical setting after the research has been completed. Learn More 8.1 looks at a framework for Testing a Test that can be used to put together the information obtained from the research for clinical use. It should alert you to the types of information that need to be obtained from research studies of diagnostic tests.
Until recently, research on diagnostic and screening tests has not been published in a consistent format, often leaving the reader with many unanswered and unanswerable questions. Recently, a set of standard and comprehensive methods for reporting investigations of diagnostic tests, known as STAndards for Reporting Diagnostic Accuracy (STARD), has been adopted by many journals. These criteria have been incorporated into the components of the M.A.A.R.I.E. framework for Testing a Test.
The application of the M.A.A.R.I.E. framework to research on tests is illustrated in Figure 8.1.
FIGURE 8.1. M.A.A.R.I.E. framework for investigations of tests.
Learn More 8.1: Testing for Diagnosis in Clinical Practice
Clinical diagnosis rests on the principle that individuals with a disease are different from individuals without the disease and that diagnostic tests can distinguish between the groups. Diagnostic testing, to be perfect, requires that (1) all test results take one of two values, which we will call X or Y; (2) all individuals without the disease have one value on the test, which we will call X; and (3) all individuals with the disease have a different value on the test, which we will call Y. Figure 8.2 illustrates this situation.
FIGURE 8.2. Conditions required for a perfect diagnostic test in which the test results are either X or Y.
If this reflected the realities of clinical practice, the job of clinicians would be very easy. All clinicians would need to do is select the right test and read off whether an individual has or does not have the disease of interest.
The realities of clinical practice are far more complex. In fact, none of the three requirements for perfection holds up in practice. Figure 8.3 more closely reflects these realities. There is variation in all three of the requirements for perfection: the test itself, those without the disease, and those with the disease. Understanding these variations is central to understanding the use of diagnostic tests in clinical practice.
FIGURE 8.3. The three types of variation that affect the clinical use of diagnostic tests.
As suggested in Figure 8.3, the degree of variation in the results of the test itself should be small compared with the variation among those with and those without the disease. As we will discuss, we often rely on the laboratory or those conducting the test to ensure that the variation in the test itself is quite small, that is, that the test results are reproducible. Nonetheless, when doubt exists, repeating a test may be an important clinical step.
The usefulness of a test for diagnosis greatly depends on the degree of overlap between those with the disease and those without the disease. Figure 8.4A–C represents three possible degrees of overlap.
A. The ideal situation with no overlap between those with and without the disease. B. The usual situation with a modest amount of overlap between those with and without the disease. C. The situation with so much overlap between those with and without the disease that the test results do not provide useful information.
Figure 8.4A reflects the ideal situation: despite the variation among those with and without the disease, the two groups do not overlap. Figure 8.4B reflects the typical situation, in which there is a small but important degree of overlap. This is the most common situation and the one that we will focus on as we look at investigations of diagnostic tests. Finally, Figure 8.4C may at times reflect reality: there is so much overlap that the results of the test do not add any useful clinical information. The situation displayed in Figure 8.4C may be illustrated by the following example:
One hundred individuals with long-standing cirrhosis of the liver underwent aspartate aminotransferase (AST) liver enzyme tests to assess their liver function. Most of these patients had AST levels within the normal, or reference, range. The authors concluded that these patients had well-functioning livers.
In this situation, those with cirrhosis have a range of values similar to that of those without the disease, much like Figure 8.4C. The patients with cirrhosis may not have enough viable hepatocytes to generate high levels of AST. Thus, despite the fact that AST may serve as an excellent measure of liver function in many situations, when used to assess liver function among those with cirrhosis, the level of AST does not add useful information. This example emphasizes the key clinical point that a test that performs well for one purpose may not perform as well for a different purpose.
Thus, when confronted clinically with a new test, it is important to appreciate the potential variability of the test, the variability of those with the disease, and the variability of those without the disease. It is especially important to recognize when the degree of overlap between those with the disease and those without the disease is so great that no useful information is provided by the results of the test.
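The three degrees of overlap described for Figure 8.4 can be made concrete with a small numerical sketch. The following Python example is illustrative only: it assumes that test results in each group follow a normal distribution, and the means and standard deviations are hypothetical values chosen to mimic situations A, B, and C. They are not taken from any real test. It relies on the `overlap` method of `statistics.NormalDist` in the Python standard library (Python 3.8 or later), which returns the area shared by two normal curves.

```python
from statistics import NormalDist


def overlap_fraction(without_disease, with_disease):
    """Return the area shared by the two result distributions.

    0.0 means the groups never share a test value;
    1.0 means the two distributions are identical.
    """
    return without_disease.overlap(with_disease)


# Hypothetical distribution of test results among those without the disease.
without_disease = NormalDist(mu=50, sigma=5)

# Hypothetical distributions among those with the disease,
# chosen only to mimic the three situations in Figure 8.4.
scenarios = {
    "A: essentially no overlap": NormalDist(mu=110, sigma=5),
    "B: modest overlap": NormalDist(mu=70, sigma=5),
    "C: extensive overlap": NormalDist(mu=52, sigma=5),
}

for label, with_disease in scenarios.items():
    shared = overlap_fraction(without_disease, with_disease)
    print(f"{label}: shared area = {shared:.3f}")
```

Under these assumptions, the shared area is close to zero in situation A, small in situation B, and large in situation C, where a test result says little about which group an individual belongs to.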
Now let us examine each of the components of the M.A.A.R.I.E. framework beginning with the Method component.
Purpose of Testing [(2)]
Testing can be seen as the collection of information to assist in decision making. When looked at this way, much of what is done in health care can be regarded as testing, from obtaining information from the history and physical examination to making decisions based on the prognosis of a disease. To understand the use of a test, we need to appreciate the multiple purposes for which a test may be used.
The same test may at times be used for more than one purpose, but its performance often depends on the specific purpose for which it is being used.
Testing may be used for at least the following purposes:
- Testing for risk factors: testing for factor(s) or other disease(s) that increase the risk of the disease of interest.
- Screening test: testing patients without symptoms for a particular disease.
- Diagnostic testing: testing patients with symptoms for a particular disease.
- Testing for causation: testing to establish the relationship between symptoms and disease.
- Testing for prognosis: testing to predict the outcome of disease.
- Testing for response: testing response to treatment, including testing for adverse events associated with treatment.
To identify the test that is being evaluated, the term index test is used. Thus, the first question to ask when reading an investigation of tests is: What is the purpose for investigating the index test? We will focus our attention in this chapter on testing for the purpose of diagnosis. In Chapter 9 we will address the use of testing for screening and for prognosis (see footnote 8.1).
The fundamental purpose of diagnostic testing is to increase or decrease the probability that a disease or condition is present. To achieve this purpose, it is essential to first estimate, or "guesstimate," the probability of the disease or condition before obtaining the results of the test. This probability is called the pretest probability or the prior probability. The pretest probability incorporates information from the following inputs:
- The prevalence of the disease, that is, the probability of the disease in populations or groups of individuals similar to a particular patient.
- Predisposing diseases and risk factors of the individual.
- The pattern of symptoms presented by the patient.
- The results of previous testing.
Learn More 8.2 illustrates the uses of each of these inputs into the pretest or prior probability of the disease.
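To see how a test result can raise or lower the probability of disease, consider the following minimal sketch, which applies Bayes' theorem to a test with only positive and negative results. It is an illustration under stated assumptions, not a description of any particular test. The function name `posttest_probability` is hypothetical, and the sketch assumes that two properties of the test are known: its sensitivity (the proportion of those with the disease who test positive) and its specificity (the proportion of those without the disease who test negative). The value of 0.90 used for both, and the three pretest probabilities, are hypothetical.

```python
def posttest_probability(pretest, sensitivity, specificity, positive_result):
    """Apply Bayes' theorem to a test with two possible results.

    pretest         -- probability of disease before the test (0 to 1)
    sensitivity     -- P(positive result | disease)
    specificity     -- P(negative result | no disease)
    positive_result -- True for a positive result, False for a negative one
    """
    if positive_result:
        with_disease = pretest * sensitivity
        without_disease = (1 - pretest) * (1 - specificity)
    else:
        with_disease = pretest * (1 - sensitivity)
        without_disease = (1 - pretest) * specificity
    return with_disease / (with_disease + without_disease)


# Hypothetical test characteristics and pretest probabilities, for illustration only.
SENSITIVITY = 0.90
SPECIFICITY = 0.90

for pretest in (0.01, 0.20, 0.50):
    after_positive = posttest_probability(pretest, SENSITIVITY, SPECIFICITY, True)
    after_negative = posttest_probability(pretest, SENSITIVITY, SPECIFICITY, False)
    print(
        f"pretest {pretest:.0%}: "
        f"after positive {after_positive:.1%}, after negative {after_negative:.1%}"
    )
```

With these assumed values, the same positive result leaves the probability of disease under 10% when the pretest probability is 1% but raises it to 90% when the pretest probability is 50%. This is why the estimate made before testing matters so much.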
Learn More 8.2: Pretest Probability: Where Does it Come From?
To understand the uses of these four types of inputs imagine that we are interested in estimating the pretest probability of coronary artery disease for the following individuals:
A 23-year-old woman
A 65-year-old male
The first type of input is derived from the frequency of disease in populations or groups of individuals similar to a particular patient. This is called the prevalence of the disease. Prevalence indicates how common or probable the disease is in a particular population. The estimate of prevalence begins at the population level with the recognition that coronary artery disease is a very common disease in most developed countries. In addition, we know that prevalence increases with age and that prevalence in females is lower than in males, although the prevalence in females increases rapidly after menopause.
These two patient profiles represent very different probabilities of coronary artery disease. The 23-year-old woman has a very low pretest probability of clinically important coronary artery disease, well under 1%. The 65-year-old male, on the other hand, has a considerably higher pretest probability of clinically important coronary artery disease, most likely more than 20% by virtue of his gender and age, regardless of any other risk factors or symptoms.
The second input into the pretest probability is predisposing diseases and risk factors of the individual. Imagine that the 65-year-old male has type 2 diabetes. Diabetes substantially increases his probability of coronary artery disease. To consider the impact of risk factors, let us imagine the following pattern of risk factors in our 23-year-old woman and 65-year-old man.
The 23-year-old woman has a strong family history of early coronary artery disease but exercises regularly, does not smoke cigarettes, and has a blood pressure of 110/70 and an LDL level of 90 mg/dL.
The 65-year-old man with diabetes has no known family history of early coronary artery disease but does not exercise regularly, is 30% over his ideal body weight, has smoked 1 pack of cigarettes per day for 45 years, and has a blood pressure of 150/95 and an LDL level of 160 mg/dL.
Now we know much more about the pretest probability of disease. This information from risk factors may modestly increase the probability that the 23-year-old woman has clinically important coronary artery disease, while the presence of multiple risk factors raises the pretest probability for the 65-year-old man, most likely to the range of 50% or more.
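One way to appreciate the size of this revision is to express the probabilities as odds. The sketch below is a rough illustration only. It treats the approximate figures given for the 65-year-old man (more than 20% before considering risk factors, 50% or more afterward) as exact point values, which is an assumption. It then asks what the same change in odds would do to a 1% starting probability, a hypothetical calculation that is not an estimate for the 23-year-old woman, whose risk factor profile is quite different. The function names are hypothetical.

```python
def probability_to_odds(probability):
    """Convert a probability (0 <= p < 1) to odds in favor of disease."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Convert odds in favor of disease back to a probability."""
    return odds / (1 + odds)


# Round figures from the text, treated as point values (an assumption).
man_baseline = 0.20           # "most likely more than 20%"
man_with_risk_factors = 0.50  # "the range of 50% or more"

# Factor by which the odds change when risk factors are taken into account.
implied_multiplier = (
    probability_to_odds(man_with_risk_factors) / probability_to_odds(man_baseline)
)

# Hypothetical: apply the same multiplier to a 1% starting probability.
low_baseline = 0.01
low_revised = odds_to_probability(
    probability_to_odds(low_baseline) * implied_multiplier
)

print(f"Implied odds multiplier: {implied_multiplier:.1f}")
print(f"A 1% starting probability becomes: {low_revised:.1%}")
```

Under these assumptions, the odds increase by a factor of about 4. The same factor applied to a 1% starting probability yields a probability that remains below 5%, which illustrates how strongly the starting point shapes the final estimate.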
The third input into the pretest probability is the pattern of symptoms presented by the patient. Imagine the following in our patients:
The 23-year-old woman experiences chest pains radiating to her left arm when she exercises strenuously.
The 65-year-old man with diabetes has not experienced chest pains or pressure, including when walking rapidly, which is his most strenuous form of exercise.
This information substantially raises the probability that the 23-year-old woman has clinically important coronary artery disease. For the 65-year-old man, asymptomatic coronary artery disease is still quite likely. Despite the presence of symptoms in the 23-year-old woman, she is still far less likely to have clinically important coronary artery disease than the 65-year-old man.
Notice that the pretest probability often utilizes the results of previous testing, the fourth input. Here, the blood pressure obtained on physical examination and the LDL level are used to develop a pretest probability of disease. In addition, the results of a test such as an exercise stress test may provide further information that may be used to help establish a pretest probability for additional testing.
Thus, there are multiple inputs that contribute to our estimates of the pretest probability of disease. Each of these inputs requires data as well as subjective judgment. In addition, clinicians may not agree on how much weight or importance to place on each of these inputs. For instance, the pattern of symptoms often receives considerable weight when clinically estimating the pretest probability of a disease. On the other hand, the prevalence of the disease may not receive the weight that it deserves. Thus, accurately estimating the pretest probability of a disease is an important but difficult-to-acquire skill.
Footnote 8.1: There are other possible uses of testing, including environmental testing to determine possible exposure to a risk factor such as lead and testing to provide a baseline for subsequent diagnostic testing. Baseline testing can be considered a method for substituting individual data for population data when defining a positive and a negative test, as we will discuss later in this chapter.
The study population for an investigation of tests is defined by its inclusion and exclusion criteria. Inclusion and exclusion criteria should be defined to help ensure that those included in the investigation reflect those with and without the disease when the test is used in practice on its intended or target population.
Let us see what can happen when the participants used for investigation of a test are quite different from the people for whom the test is intended:
A test is intended to be used to make an early diagnosis of myocardial infarction (MI). It was evaluated on patients who presented with chest pain in cardiologists’ offices. The patients were included even if they had a previous MI. The results of the test indicated excellent diagnostic performance in early diagnosis of MI. When the same test was used in emergency departments on all patients with chest pain without a clear-cut explanation, the test did not perform nearly as well.
The patients being followed by cardiologists are likely to have had a previous MI or known coronary artery disease. They may also have complications of coronary artery disease such as heart failure. These patients are likely to have different symptoms and a different pattern of test results when they experience another MI compared with a population of emergency department patients with no history of MI.
Therefore, it is important that the investigator define the intended population when studying the performance of a test. If the intent is to use the test on a general population with chest pain but no previous history of coronary artery disease, it is important that the investigation of the test be conducted in emergency departments, primary care clinics, or a similar setting.
To describe the participants, the STARD criteria expect that investigators will indicate their inclusion and exclusion criteria. In addition, as we will see when we consider assignment, considerable detail is expected on the process of patient recruitment, which, along with the inclusion and exclusion criteria, ultimately determines whether the participants are representative of the target population for whom the index test is intended.