Chapter 9: Testing a Test—M.A.A.R.I.E. Framework: Results, Interpretation, and Extrapolation

Results ([1],[2])

The results component of the M.A.A.R.I.E. framework for Testing a Test asks about the performance of the index test compared with the reference standard, that is, the gold standard or definitive test. The results component presents quantitative estimates of the information provided by the index test compared with the reference standard, which is assumed to perform perfectly. Confidence intervals can be used to draw inferences from these estimates. Finally, the results need to take into account the relative importance of false positives and false negatives when evaluating the diagnostic ability of a test.

Estimates: Sensitivity and Specificity

The results component of the M.A.A.R.I.E. framework asks us to compare the index test and the reference standard test, that is, the gold standard or perfect test. The aim is to produce summary measurements or estimates of the performance of the index test compared with what we define as the perfect or gold standard test. The basic measurements that are used to perform this important job are called sensitivity and specificity. A single summary measurement can be produced by combining sensitivity and specificity to calculate what is called discriminant ability. These are the estimates used in reporting the results of an investigation of tests. Let us see first how we calculate sensitivity and specificity.

Sensitivity measures the proportion or percentage of the participants with the disease as defined by the reference standard test who are correctly identified by the index test. In other words, it measures how well the index test detects the disease. It may be helpful to think of sensitivity as positive in disease.

Specificity measures the proportion or percentage of the participants who are free of the disease, as defined by the reference standard test, who are correctly labeled free of the disease by the index test. In other words, it measures the ability of the index test to detect the absence of the disease. It may be helpful to think of specificity as negative in health.

To calculate sensitivity and specificity, the investigator must do the following:

  1. Classify each participant as being disease positive or disease negative according to the results of the reference standard test
  2. Classify each participant as positive or negative according to the index test
  3. Relate the results of the reference standard test to the index test, often using the following 2 × 2 table:
|  | Reference Standard Positive = Disease | Reference Standard Negative = Free of the Disease |
| --- | --- | --- |
| Index test positive | A = Number of participants with the disease and index test positive = true positives | B = Number of participants without the disease and index test positive = false positives |
| Index test negative | C = Number of participants with the disease and index test negative = false negatives | D = Number of participants without the disease and index test negative = true negatives |
| Total | A + C = Total with the disease | B + D = Total free of the disease |

To illustrate this procedure using numbers, imagine that a new test is performed on 500 participants who have the disease according to the reference standard test and 500 participants who are free of the disease according to the reference standard test. We can now set up the 2 × 2 table as follows:9.1

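|  | Reference Standard Positive = Disease | Reference Standard Negative = Free of the Disease |
| --- | --- | --- |
| Index test positive | 400 = true positives | 50 = false positives |
| Index test negative | 100 = false negatives | 450 = true negatives |

From this table, sensitivity = A/(A + C) = 400/500 = 80%, and specificity = D/(B + D) = 450/500 = 90%.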
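The calculation from the 2 × 2 table can also be written as a short program. The following is a minimal sketch; the function name `sensitivity_specificity` and its argument names are illustrative choices, not terms from the text.

```python
def sensitivity_specificity(tp, fp, fn, tn):
    """Return (sensitivity, specificity) as fractions from 2 x 2 table counts.

    tp, fp, fn, tn correspond to cells A, B, C, D of the table.
    """
    diseased = tp + fn        # A + C: total with the disease
    disease_free = fp + tn    # B + D: total free of the disease
    if diseased == 0 or disease_free == 0:
        raise ValueError("both reference standard groups must be non-empty")
    return tp / diseased, tn / disease_free


# Counts from the worked example: A = 400, B = 50, C = 100, D = 450
sens, spec = sensitivity_specificity(tp=400, fp=50, fn=100, tn=450)
print(f"sensitivity = {sens:.0%}, specificity = {spec:.0%}")
# prints: sensitivity = 80%, specificity = 90%
```

Note that both denominators come from the reference standard columns, which is why the index test can at best match the reference standard.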

A sensitivity of 80% and a specificity of 90% are in the range of a number of tests used clinically to screen for and diagnose disease such as mammography and exercise stress testing.9.2

Notice that the sensitivity and specificity are always defined in comparison to the reference standard test. That is, the best that an index test can achieve is to produce the same results as the reference standard test. When there is a disagreement between the index test and the reference standard test, the index test is considered wrong and the reference standard test is considered correct.

What happens if the new test is actually better than the reference standard test? If the new test is safer, cheaper, or more convenient than the reference standard test, it may come to be used in clinical practice even if its performance is considered less than perfect. Clinical experience may eventually demonstrate the new test’s superior performance, even allowing the new test to be used as the reference standard test. In the meantime, the best the new test can do is to match the established reference standard test.

9.1 Notice that the index test being evaluated has been applied to a group of participants of whom 500 have the disease and 500 are free of the disease as defined by the reference standard test. This division of 50% with the disease and 50% free of the disease has been a common distribution used for an investigation of a new test and provides the greatest statistical power. Notice, however, that this division does not represent the population’s prevalence of the disease except in the unusual circumstance in which the prevalence is 50%.

9.2 The principles stressed here are most important when the sensitivity and specificity are in this range. When tests have sensitivity and specificity close to 100%, issues such as Bayes’ theorem and the relative importance of false positives and false negatives take on less importance. Nonetheless, even sensitivities and specificities of greater than 98% may produce large numbers of false-positive results when a disease has a very low prevalence, such as 0.1%, or 1 in a thousand. Issues such as safety, cost, and patient acceptance may be especially important at very high sensitivities and specificities.
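The low-prevalence point in note 9.2 can be made concrete with a small calculation. This is a sketch under stated assumptions: the population of 100,000, the prevalence of 1 in 1,000, and the 98% sensitivity and specificity are hypothetical values chosen for illustration, not figures from an investigation.

```python
def expected_counts(population, prevalence, sensitivity, specificity):
    """Expected true positives and false positives when a test is applied
    to a population with the given prevalence (all rates as fractions)."""
    diseased = population * prevalence
    disease_free = population - diseased
    true_positives = diseased * sensitivity
    false_positives = disease_free * (1 - specificity)
    return true_positives, false_positives


# Hypothetical scenario: 100,000 people, prevalence 1 in 1,000,
# sensitivity and specificity both 98%.
tp, fp = expected_counts(100_000, 0.001, 0.98, 0.98)
print(f"true positives = {tp:.0f}, false positives = {fp:.0f}")
# prints: true positives = 98, false positives = 1998
```

Under these assumptions, false positives outnumber true positives by roughly 20 to 1, even though the test is wrong for only 2% of disease-free participants.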

Estimates: Discriminant Ability

As we have seen, sensitivity and specificity are our basic measures of how well the index test discriminates between those with the disease and those who are free of the disease.

Sensitivity and specificity together provide us with the information we need to judge the performance of the index test relative to the reference standard test. Ideally, however, we would like to have one number that summarizes the performance of the test. Fortunately, there is a simple means to combine the sensitivity and the specificity to obtain a single measurement of what is called the discriminant ability of a test. Discriminant ability is the average of the sensitivity and the specificity:

Discriminant ability = (sensitivity + specificity)/2

Thus, in our example, the sensitivity equals 80% and the specificity equals 90%, and the discriminant ability is calculated as follows:

Discriminant ability = (80% + 90%)/2 = 85%

How do we interpret discriminant ability? The discriminant ability tells us how much information the index test provides compared with the reference standard test, which by definition provides perfect information. That is, we assume that the reference standard test does a perfect job of separating positive and negative results. Perfect discriminant ability is therefore 100%. This occurs only when both the sensitivity and the specificity are 100%.

Discriminant ability provides a method to understand the information content of a test. To understand this use of discriminant ability, let us take a look at what we call a receiver operator characteristics (ROC) curve. The ROC curve axes are illustrated in Figure 9.1.

FIGURE 9.1. Receiver operator characteristics curve, x-axis and y-axis.


The ROC curve compares the sensitivity on the y-axis to 100% – specificity (i.e. the false-positive rate) on the x-axis. Notice that for the ROC curve, a perfect test lies at the left upper corner where the sensitivity and specificity are both 100%. Thus, the ROC curve allows us to compare the performance of a particular index test to this perfect test that lies in the left upper corner of the ROC curve.

The diagonal line that crosses from the lower left to the upper right of the ROC curve in Figure 9.1 indicates the zero information line, that is, the combinations of sensitivity and specificity that provide no additional information beyond that already known before obtaining the results of the index test. Along this line the discriminant ability is 50%, so mere guessing or flipping a coin would do just as well as using the results of the index test.

Now, let us plot our sensitivity of 80% and our specificity of 90% for our index test on the ROC curve. Figure 9.2 plots this index test. It also has lines from this test to the left lower and right upper corners of the ROC curve. The area under these lines turns out to be the discriminant ability,9.3 that is, the (sensitivity + specificity)/2.

FIGURE 9.2. Receiver operator characteristics curve for an index test with 80% sensitivity and 90% specificity.


Here, the discriminant ability is 85%. To understand the discriminant ability, it is important to recognize that the additional information provided by the index test is the difference between the discriminant ability and the diagonal no-information line. Failure to appreciate this principle can lead to the following type of error:


Mini-Study 9.1

A new test has been shown to have a sensitivity of 60% and a specificity of 40%. The authors of the investigation conclude that although these results are less than ideal, they still indicate the new test has a discriminant ability of 50% and can therefore provide 50% of the information necessary for diagnosis. They thus advise routine use of the test.


The authors are correct that the discriminant ability equals 50%, since (60% + 40%)/2 = 50%. However, a discriminant ability of 50% indicates that the test provides no additional information beyond what is already known. Thus, when drawing conclusions about the discriminant ability, that is, the area under the ROC curve, we need to compare this summary measurement to 50%, not to 0%.

As we have seen, discriminant ability and the ROC curve tell us how well an index test performs. Discriminant ability can also be helpful in determining the best cutoff points to use to define positive and negative for the index test if our goal is to maximize the discriminant ability of the index test. This approach is discussed in Learn More 9.1.


Learn More 9.1: Using ROC Curves to Set Cut-Points for Positive and Negative Results

Remember that in Chapter 8 we stressed the need to define a positive and a negative result and indicated that an additional approach is available. One increasingly common approach is to wait to choose the cutoff points for positive and negative until after the measurements of both the index test and reference standard test are known.

Using this approach to select the best cutoff point to define negatives and positives for the index test, the investigator chooses the cutoff point at which the discriminant ability will be maximized.9.4 Thus, to determine the cutoff point, the investigators may take the following steps:

  1. Choose several potential cutoff points
  2. Calculate the sensitivity and specificity for each set of potential cutoff points
  3. Calculate the discriminant ability for each set of potential cutoff points
  4. Choose the set of cutoff points that produces the greatest discriminant ability
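The four steps above can be sketched in code. This is a minimal illustration, not the procedure of any particular investigation: it assumes a single cutoff point, assumes that values at or above the cutoff are called positive, and uses hypothetical measurements and candidate cutoffs invented for the example.

```python
def discriminant_ability(values, has_disease, cutoff):
    """Average of sensitivity and specificity when an index test value
    at or above `cutoff` is called positive (an assumption of this sketch)."""
    tp = sum(1 for v, d in zip(values, has_disease) if d and v >= cutoff)
    tn = sum(1 for v, d in zip(values, has_disease) if not d and v < cutoff)
    n_diseased = sum(has_disease)
    n_free = len(has_disease) - n_diseased
    sensitivity = tp / n_diseased
    specificity = tn / n_free
    return (sensitivity + specificity) / 2


def best_cutoff(values, has_disease, candidate_cutoffs):
    """Steps 2 to 4: score every candidate and keep the one with the
    greatest discriminant ability."""
    return max(candidate_cutoffs,
               key=lambda c: discriminant_ability(values, has_disease, c))


# Hypothetical index test measurements and reference standard results
values = [12, 15, 18, 21, 22, 25, 27, 30, 33, 36]
has_disease = [False, False, False, True, False, False, True, True, True, True]
candidates = [15, 20, 26, 30]   # step 1: several potential cutoff points

chosen = best_cutoff(values, has_disease, candidates)
print(chosen, discriminant_ability(values, has_disease, chosen))
```

With these invented data the cutoff of 26 is selected, giving a sensitivity of 80%, a specificity of 100%, and a discriminant ability of 90%. Because the cutoff is chosen after seeing the data, it weights false positives and false negatives equally by construction.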

Thus, we have now seen that sensitivity, specificity, and their average (the discriminant ability) are the most common measures of a test’s performance. Once these summary measures or estimates are obtained, we need to examine the issue of inference, or how the results may have been affected by chance.

9.3 To convince yourself of this relationship, draw lines connecting the “dot” to the left lower and right upper corners. Then using geometry, calculate the area under these lines. The sum of these areas equals the discriminant ability.
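The geometric check described in note 9.3 can also be done numerically. The sketch below applies the trapezoid rule to the two line segments running from the lower left corner to the plotted point and from the plotted point to the upper right corner; the function name is an illustrative choice.

```python
def area_under_single_point_roc(sensitivity, specificity):
    """Area under the two straight lines joining (0, 0), the plotted test,
    and (1, 1) on ROC axes (x = 1 - specificity, y = sensitivity)."""
    points = [(0.0, 0.0), (1.0 - specificity, sensitivity), (1.0, 1.0)]
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        # Trapezoid rule: width times the average of the two heights
        area += (x1 - x0) * (y0 + y1) / 2
    return area


area = area_under_single_point_roc(0.80, 0.90)
average = (0.80 + 0.90) / 2
print(f"area = {area:.2f}, (sensitivity + specificity)/2 = {average:.2f}")
# prints: area = 0.85, (sensitivity + specificity)/2 = 0.85
```

The same function gives 0.50 for a sensitivity of 60% and a specificity of 40%, since that point lies on the diagonal no-information line.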

9.4 Determining the maximum discriminant ability is the same as finding the point on the ROC curve that maximizes the area under the curve. Thus, this method may also be referred to as maximizing the area under the ROC curve.

Inference

When drawing inferences from the results of an investigation of a test, we are interested in whether the results that we observe are likely to hold true in larger populations like those from which the sample was obtained. To address this question, the STARD statement recommends that investigations of tests report not only the sensitivity and specificity but also the confidence intervals around the sensitivity and specificity.

Thus, investigations of tests are increasingly reporting the observed sensitivity and specificity and also their 95% confidence intervals. These confidence intervals tell us how much confidence we should place in the results observed in our samples. They let us know that the true values in the population from which the samples were obtained may be higher or lower than the observed values.

It is important to recognize that one factor affecting the confidence interval is the number of participants included in the investigation. Everything else being equal, the larger the number of participants, the narrower is the confidence interval. Large investigations will tend to have narrow confidence limits and will encourage us to place more confidence in their results.
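The effect of the number of participants on the width of a confidence interval can be illustrated with a short calculation. The text does not specify how the interval is computed; the sketch below assumes the Wilson score interval, one common choice for a proportion, and compares an observed sensitivity of 80% in two hypothetical samples of different size.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Approximate 95% Wilson score interval for a proportion.

    The Wilson method is an assumption of this sketch, not a method
    named in the text; z = 1.96 corresponds to 95% confidence.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


# Same observed sensitivity (80%) in a large and a small sample
large = wilson_interval(400, 500)   # 400 of 500 diseased participants detected
small = wilson_interval(40, 50)     # 40 of 50 diseased participants detected
print(f"n = 500: {large[0]:.3f} to {large[1]:.3f}")
print(f"n = 50:  {small[0]:.3f} to {small[1]:.3f}")
```

Both intervals contain the observed 80%, but the interval from the larger sample is considerably narrower.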

In principle, confidence intervals for tests can be converted into statistical significance levels. However, we do not usually expect to be able to conclude that one test’s sensitivity or specificity differs from another’s to a statistically significant degree. Thus, for tests, we merely ask the question: What is the 95% confidence interval around the sensitivity and the specificity?

Adjustment: Diagnostic Ability

In investigations of tests, like other types of investigation, we need to ask whether there are other factors that need to be taken into account or adjusted for as part of the analysis of the results. When we discussed the measurement of discriminant ability, we assumed that a false negative and a false positive were equally undesirable. That is, we gave equal weight or importance to false negatives and false positives.9.5

False-negative and false-positive results, however, may not always be of equal importance. There are various reasons for this, for instance:

  •  A false negative may or may not result in harm to the patient, depending on whether the disease may be detected later at a time before there are irreversible adverse consequences.
  •  A false positive may or may not result in harm to the patient, depending on the probability of harm due to further testing and/or from treatment begun on the basis of a false-positive test.

To better understand what we mean by the relative importance of false-negative and false-positive results, we can examine testing for glaucoma and ask the question: What factors influence the importance of false-negative and false-positive results?

  •  Factors that may influence the importance of false-negative results for glaucoma include: vision loss from glaucoma is largely irreversible and may develop before it is apparent to the patient. Treatment is generally safe but not completely effective in preventing progressive visual loss. Regular repeat routine testing may still detect the glaucoma in time for treatment to prevent substantial visual loss.
  •  Factors that influence the importance of false-positive tests include: follow-up of initial positive results requires multiple tests and follow-up visits that may create patient anxieties and costs. Follow-up tests pose little danger of harm to the patient.

Thus, for glaucoma testing, let us assume that you came to the conclusion that a false-negative result is worse than a false-positive result. Let us see how this conclusion can influence the use of tests, as illustrated in the next example:


Mini-Study 9.2

Test A for glaucoma had a sensitivity of 70% and a specificity of 90%, giving it a discriminant ability of 80%. Test B for glaucoma had a sensitivity of 80% and a specificity of 80%, giving it the same discriminant ability. The investigators concluded that these two tests were interchangeable in terms of diagnostic ability.


These two tests are interchangeable in terms of discriminant ability since each has an 80% discriminant ability. However, diagnostic ability requires us to also consider the relative importance of false negatives and false positives.

If we regard a false negative as worse than a false positive, we might prefer Test B since it has a higher sensitivity and thus fewer false negatives. This preference for Test B would result in more false positives. If, on the other hand, we regard false positives as worse than false negatives, we might favor Test A since it has a higher specificity and thus fewer false positives. Since we previously decided that false negatives are considered worse than false positives, we should prefer Test B to Test A.9.6
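The trade-off between Test A and Test B can be made concrete by counting the errors each test would be expected to produce. The sketch below assumes a hypothetical sample of 500 participants with glaucoma and 500 without, mirroring the 50/50 design used earlier in the chapter; the function name and sample sizes are illustrative only.

```python
def error_counts(n_diseased, n_disease_free, sensitivity, specificity):
    """Expected false negatives and false positives (rates as fractions)."""
    false_negatives = round(n_diseased * (1 - sensitivity))
    false_positives = round(n_disease_free * (1 - specificity))
    return false_negatives, false_positives


# Hypothetical sample: 500 with the disease, 500 free of the disease
fn_a, fp_a = error_counts(500, 500, sensitivity=0.70, specificity=0.90)  # Test A
fn_b, fp_b = error_counts(500, 500, sensitivity=0.80, specificity=0.80)  # Test B

print(f"Test A: {fn_a} false negatives, {fp_a} false positives")
print(f"Test B: {fn_b} false negatives, {fp_b} false positives")
# prints: Test A: 150 false negatives, 50 false positives
# prints: Test B: 100 false negatives, 100 false positives
```

In this sample both tests make 200 errors in total, which is what equal discriminant ability implies, but Test B trades 50 fewer false negatives for 50 more false positives. With a different proportion of diseased participants the counts would change, even though the sensitivities and specificities stay the same.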

We have now examined the results component of the M.A.A.R.I.E. framework and have found that sensitivity, specificity, and their average (discriminant ability) are the measures used to judge the information obtained from an index test. We have seen that confidence intervals rather than statistical significance tests are used to report test results. We have seen that a false positive and a false negative may not be of equal importance. Now we are ready to go on to the interpretation of the results.

9.5 Discriminant ability assumes that false positives and false negatives are of equal importance. Thus, when maximizing discriminant ability to set the cutoff points, one is also assuming that false positives and false negatives are equally important. Ideally, the choice of cutoff points should also take into account the relative importance of false positives and false negatives.

9.6 No attempt is made here to quantitate the relative importance of false positives and false negatives. Although possible, this process is rarely seen in the research literature. Placing different weights on false positives and false negatives usually affects the cutoff point selected between positives and negatives. The trade-off between false negatives and false positives is also affected by the number of false negatives and the number of false positives that will occur. This in turn is affected by the pretest probability of the disease. When more than one index test is being compared with a reference standard, it is important to determine whether the index tests generally have the same or different types of false positives and false negatives. This will be important when we look at strategies for combining tests.