Diagnostic test studies: assessment and critical appraisal
There are many checklists available for the assessment and critical appraisal of diagnostic test studies, as reporting is frequently inadequate.[1][2][3] However, they all include some variation of three critical questions:[2][3]
- Is this study valid?
- Does the diagnostic test under assessment accurately distinguish between people who do and do not have the specific disorder?
- Can this valid, accurate diagnostic test be applied to a specific patient?
Assessment
How to assess if a diagnostic test study is valid
1. Was there an independent, blind comparison with a reference (gold) standard of diagnosis?
- Participants in the study should have undergone both the index diagnostic test and the reference (gold) standard. This is done to confirm or refute the findings of the index test. The accuracy of the test can be overestimated if the index test is performed initially in people known to have the disease and then separately in healthy people (case-control studies do this) rather than performing both the index and reference tests in the same group of people without knowing whether or not they have the disease.[4]
- People assessing the results of the index test should be blind to the results of the reference standard. This avoids biasing the results of the index test or the reference standard. Interpreting the results of the reference test while already knowing the results of the index test can lead to an overestimation of the index test accuracy, especially if the reference test is open to subjective interpretation.[4] Blinding is less important if the results of the test are objective (e.g., serodiagnostic tests for tuberculosis where sputum culture results are analyzed) than if results require clinical interpretation (e.g., MRI images for diagnosing rotator cuff injury).
2. Was the diagnostic test evaluated in an appropriate spectrum of patients (like those a clinician would see in practice)?
- Check that the study includes people with all the common presentations of the target disorder, from early to more severe manifestations, and/or people with other disorders that are commonly confused with the target disorder. If not, the results of the study may not reflect actual clinical practice.
3. Was the reference standard applied regardless of the index diagnostic test result?
- If the patient has a negative index test result, investigators sometimes do not carry out the reference standard test to confirm the negative result, especially if the reference test is invasive or risky, as this may be unethical. To overcome this, investigators can employ an alternative reference standard: long-term follow-up without treatment, to confirm that no adverse effects associated with the target disorder emerge.
4. Was the test validated in a second independent group of patients?
- When a new diagnostic test is evaluated, there is a risk that the results in the initial assessment are caused by other factors: for example, something about that specific group of patients included in the study (e.g., they represent only patients with advanced symptoms of the disease). So, to prove the results are reliable and replicable, the new diagnostic test should be evaluated in a second independent (or test) group of patients.
In conclusion: If the study being evaluated fails any of these four criteria, we need to consider whether the flaws of the study make the results invalid.
How to assess the results of the test
There are two types of result commonly reported in diagnostic test studies. One concerns the accuracy of the test and is reflected in the sensitivity and specificity, often defined as the test's ability to find true positives for the disorder (sensitivity) or true negatives for the disorder (specificity). An ideal diagnostic test finds no false positives but at the same time misses no one with the disease (finds no false negatives).
The other concerns how the test performs in the population being tested and is reflected in predictive values (also called post-test probabilities) and likelihood ratios. To give brief definitions of these terms consider this example (based on reference[5]):
1000 elderly people with suspected dementia undergo an index test and a reference standard. The prevalence of dementia in this group is 25%. 240 people tested positive on both the index test and the reference standard, and 600 people tested negative on both tests. The remaining 160 people had discordant results: the index test and the reference standard disagreed.
The first step is to draw a 2x2 table. We are told that the prevalence of dementia is 25%; therefore, we can fill in the bottom row of totals: 25% of 1000 people is 250, so 250 people will have dementia and 750 will be free of dementia. We also know the number of people testing positive and negative on both tests, so we can fill in two more cells. The remaining cells can then be completed by subtraction:

| | Dementia present | Dementia absent | Total |
| --- | --- | --- | --- |
| Index test positive | 240 (true positive) | 150 (false positive) | 390 |
| Index test negative | 10 (false negative) | 600 (true negative) | 610 |
| Total | 250 | 750 | 1000 |
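The completion by subtraction can be sketched in a few lines of Python, using the counts from the worked example above:

```python
# 2x2 table cells from the worked dementia example:
# 1000 people, 25% prevalence, 240 concordant positives, 600 concordant negatives.
total = 1000
prevalence = 0.25
true_pos = 240   # positive on both the index test and the reference standard
true_neg = 600   # negative on both

with_disease = int(total * prevalence)   # 250 people have dementia
without_disease = total - with_disease   # 750 people do not

false_neg = with_disease - true_pos      # 250 - 240 = 10
false_pos = without_disease - true_neg   # 750 - 600 = 150

print(false_neg, false_pos)  # 10 150
```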
From the 2x2 table the following measures can be calculated:
| Term | Definition | Example |
| --- | --- | --- |
| Pre-test probability = (true positive + false negative)/total number of people | This measure tells us the probability of having the target condition before the diagnostic test is run; it is equal to the prevalence | In this example: 250/1000 = 0.25. What does that mean? The probability of a patient in this study having dementia before the tests are run is 25% |
| Sensitivity (Sn) = the proportion of people with the condition who have a positive test result | The sensitivity tells us how well the test identifies people with the condition. A highly sensitive test will not miss many people | In our example, Sn = 240/250 = 0.96. What does that mean? 10 (4%) of the people with dementia were falsely identified as not having it, whereas 240 (96%) were correctly identified as having dementia. The test is fairly good at identifying people with the condition |
| Specificity (Sp) = the proportion of people without the condition who have a negative test result | The specificity tells us how well the test identifies people without the condition. A highly specific test will not falsely identify many people as having the condition | In our example, Sp = 600/750 = 0.80. What does that mean? 150 (20%) of the people without dementia were falsely identified as having it. The test is only moderately good at identifying people without the condition |
| Positive predictive value (PPV) = the proportion of people with a positive test who have the condition | This measure tells us how well the test performs in this population. It depends on the accuracy of the test (primarily specificity) and on the prevalence of the condition | In our example, PPV = 240/390 = 0.62. What does that mean? Of the 390 people with a positive test result, 62% will actually have dementia |
| Negative predictive value (NPV) = the proportion of people with a negative test who do not have the condition | This measure tells us how well the test performs in this population. It depends on the accuracy of the test and on the prevalence of the condition | In our example, NPV = 600/610 = 0.98. What does that mean? Of the 610 people with a negative test result, 98% will not have dementia |
| Positive likelihood ratio (LR+) = sensitivity / (1 - specificity) | This measure tells us how much the odds of a specific diagnosis increase when the test is positive. The larger the LR+, the more likely it is that a person with a positive test result has the condition. An LR+ of 10 indicates a 10-fold increase in the odds of having the condition (a large increase in probability), whereas an LR+ of 2 indicates a modest increase. An LR+ of 1 means the test provides no new information about the odds of having the condition | In this example, LR+ = 0.96/0.20 = 4.8. What does that mean? There is a 4.8-fold increase in the odds of having dementia in a person with a positive test (i.e., a moderate increase in the probability that they have dementia) |
| Negative likelihood ratio (LR–) = (1 - sensitivity) / specificity | This measure tells us how much the odds of a specific diagnosis decrease when the test is negative. The smaller the LR–, the more likely it is that a person with a negative test result does not have the condition. An LR– of 0.5 indicates a 2-fold decrease in the odds of having the condition (a modest decrease in probability), whereas an LR– of 0.1 indicates a 10-fold decrease (a large decrease in probability) | In this example, LR– = 0.04/0.80 = 0.05. What does that mean? There is a 20-fold decrease in the odds of having dementia in a person with a negative test result (i.e., a large decrease in the probability that they have dementia) |
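As a check on the arithmetic, all of these measures can be computed directly from the four cells of the 2x2 table. A minimal Python sketch using the counts from the dementia example:

```python
# Diagnostic accuracy measures from the completed 2x2 table
# (worked dementia example: TP=240, FP=150, FN=10, TN=600).
tp, fp, fn, tn = 240, 150, 10, 600

sensitivity = tp / (tp + fn)                 # 240/250 = 0.96
specificity = tn / (tn + fp)                 # 600/750 = 0.80
ppv = tp / (tp + fp)                         # 240/390 ~ 0.62
npv = tn / (tn + fn)                         # 600/610 ~ 0.98
lr_plus = sensitivity / (1 - specificity)    # 0.96/0.20 = 4.8
lr_minus = (1 - sensitivity) / specificity   # 0.04/0.80 = 0.05
```

Note that sensitivity, specificity, and the likelihood ratios depend only on the test, whereas PPV and NPV also depend on the prevalence in the population tested.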
How to apply the diagnostic test to a specific patient
Having found a valid diagnostic test study, and decided that its accuracy is sufficiently high to make it a useful tool, here are some useful points to consider when applying the test to a specific patient:
- Is the test available, affordable, and accurate in your setting?
- Can a clinically sensible estimate of the pretest probabilities of the patient be made from personal experience, prevalence statistics, practice databases, or primary studies?
- Are the study patients similar to the patient in question?
- How current is the study we are analyzing - has evidence moved on since the publication of the study?
- Will the post-test probability affect the management of the specific patient?
- Could the result move the clinician across a test-treatment threshold: for example, could the results of the test stop all further testing? That is, rule the target disorder out so the clinician would stop pursuing that possibility, or make a firm diagnosis of the target disorder and move onto choosing appropriate treatment options.
- Will the patient be willing to have the test carried out?
- Will the results of the test help the patient reach their goals?
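The post-test probability referred to above can be derived from the pre-test probability and the likelihood ratio via the odds form of Bayes' theorem. A minimal sketch using the numbers from the dementia example:

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Convert a pre-test probability to a post-test probability via odds."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# Dementia example: pre-test probability 0.25, LR+ = 4.8, LR- = 0.05
after_positive = post_test_probability(0.25, 4.8)   # ~0.62, matching the PPV
after_negative = post_test_probability(0.25, 0.05)  # ~0.016, i.e., 1 - NPV
```

Because likelihood ratios do not depend on prevalence, the same function can be reused with a different pre-test probability when applying the test in a setting where the condition is more or less common.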
Critical appraisal
Based on the information given in the Assessment section above, the table below gives some basic check points to look for when critically appraising a diagnostic test study. This list is by no means comprehensive, but should cover the main issues. It focuses mainly on the first two questions: the validity of the study and the importance of its results.
There are numerous checklists available. The SR toolbox is an online catalog providing summaries and links to the available guidance and software for each stage of the systematic review process including critical appraisal. Examples for diagnostic test studies include:
- QUADAS-2 for diagnostic accuracy studies
- SIGN methodology checklist for diagnostic studies
- Critical Appraisal Skills Programme (CASP) diagnostic study checklist.
The following checklist provides a framework for assessing the quality of a diagnostic test study:
Content created by BMJ Knowledge Centre
References
1. Bossuyt PM, Reitsma JB, Bruns DE, et al. Towards a complete and accurate reporting of studies of diagnostic accuracy: the STARD initiative. Clin Chem 2003;49:1-6. https://www.ncbi.nlm.nih.gov/pubmed/12507953
2. CASP UK. Critical Appraisal Skills Programme (CASP). https://www.casp-uk.net
3. QUADAS-2 for diagnostic accuracy studies. http://www.bristol.ac.uk/population-health-sciences/projects/quadas/quadas-2/
4. Lijmer JG, Mol BW, Heisterkamp S, et al. Empirical evidence of design-related bias in studies of diagnostic tests. JAMA 1999;282:1061-1066. https://www.ncbi.nlm.nih.gov/pubmed/10493205
5. Centre for Evidence Based Medicine. Likelihood ratios. https://www.cebm.net/likelihood-ratios/