Once a systematic search has been run, the next step is to assess the results for potential inclusion. This process is called appraisal, and typically it has a number of stages. 

‘First appraisal’ or screening

This stage is based on abstracts. It aims to cut down on 'noise' by homing in on the relevant condition (e.g., chronic asthma, migraines) and high-quality studies of the correct methodology (e.g., systematic reviews, RCTs, diagnostic studies).

The titles/abstracts of retrieved papers need to be assessed against the criteria stated for that particular review/question (often this will be done by the person undertaking the search). If an abstract indicates that the study definitely does not match the criteria, the first appraiser excludes it. If the first appraiser cannot definitively exclude a study using the information in the title/abstract, they include the reference in the set of references earmarked for further consideration.
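The screening rule above can be sketched in a few lines of Python. This is only an illustration: the `Match` categories, the `screen_abstract` function, and the reference IDs are hypothetical names chosen for the example, not part of any standard tool.

```python
from enum import Enum


class Match(Enum):
    """How clearly a title/abstract matches the review criteria (hypothetical categories)."""
    DEFINITELY_NOT = "definitely_not"
    UNCLEAR = "unclear"
    YES = "yes"


def screen_abstract(match: Match) -> str:
    """Exclude only when the abstract definitely fails the criteria."""
    if match is Match.DEFINITELY_NOT:
        return "exclude"
    # Unclear and apparently relevant records both go forward to full-text appraisal.
    return "further_consideration"


# Hypothetical records: (reference id, first appraiser's judgment)
records = [
    ("ref-001", Match.YES),
    ("ref-002", Match.UNCLEAR),
    ("ref-003", Match.DEFINITELY_NOT),
]

selected = [ref for ref, match in records
            if screen_abstract(match) == "further_consideration"]
print(selected)
```

The point of the sketch is the asymmetry: uncertainty at the abstract stage leads to retention, never to exclusion.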

‘Second appraisal’ based on full papers

References for further consideration are passed on for additional full-text evaluation in order to decide which papers will be used and cited in the final content (often this will be done by the main author). When undertaking a systematic review/overview, the exclusion of any references at this stage must be justified, and the reasons reported in the final review.

‘Third appraisal’ (Quality Assurance check) based on full papers

Completed systematic research reports are usually subjected to a further review of the selected material, validating the quality and relevance of the included studies as appropriate. This may be done independently by a co-author, or an editor/final assessor before the report is finalized.

Parallel Appraisal

Usually, authors of systematic reviews will have at least two individuals who independently assess references at both abstract and full paper stages, discussing any differences in opinion and resolving them (using an additional assessor to act as final arbitrator if necessary) in order to come to a consensus on which studies should be included and excluded. 
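As a rough sketch of how two independent sets of decisions might be combined, the Python below keeps decisions the reviewers agree on and passes disagreements to a callback. The function name `reach_consensus`, the `arbitrate` callback, and the reference IDs are assumptions made for illustration; in practice the callback stands in for discussion between the assessors and, if necessary, a third assessor's ruling.

```python
from typing import Callable, Dict

Decision = str  # "include" or "exclude"


def reach_consensus(
    reviewer_a: Dict[str, Decision],
    reviewer_b: Dict[str, Decision],
    arbitrate: Callable[[str], Decision],
) -> Dict[str, Decision]:
    """Combine two independent sets of decisions into one agreed set."""
    if reviewer_a.keys() != reviewer_b.keys():
        raise ValueError("Both reviewers must assess the same references")
    final = {}
    for ref, decision_a in reviewer_a.items():
        decision_b = reviewer_b[ref]
        # Agreement stands; a disagreement is resolved by discussion or arbitration.
        final[ref] = decision_a if decision_a == decision_b else arbitrate(ref)
    return final


# Hypothetical decisions from two independent assessors
a = {"ref-001": "include", "ref-002": "exclude", "ref-003": "include"}
b = {"ref-001": "include", "ref-002": "include", "ref-003": "include"}

disagreements = [ref for ref in a if a[ref] != b[ref]]
final = reach_consensus(a, b, arbitrate=lambda ref: "include")
```

Listing the disagreements separately makes it easy to record which references needed discussion.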

Appraising the quality of study methods

It should be noted that no study is perfect. For practical purposes, it might be helpful to consider three possible scenarios with regard to study methods:

  • If the methods were sound: include
  • If the methods were suboptimal: include but cite reservations and appropriate caveats with regard to interpreting the result
  • If the methods were unsound, that is, there was a fatal flaw: exclude. These results should not be included in the analysis.
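The three scenarios can be written down as a simple lookup, shown here as a minimal Python sketch. The rating labels, the `Verdict` class, and the `triage` function are hypothetical names for illustration; judging which rating applies to a given study remains a matter for the appraiser.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Verdict:
    include: bool
    note: str


# One entry per scenario described above (labels are illustrative).
RULES = {
    "sound": Verdict(True, ""),
    "suboptimal": Verdict(True, "cite reservations and caveats when interpreting the result"),
    "unsound": Verdict(False, "fatal flaw: keep results out of the analysis"),
}


def triage(methods_rating: str) -> Verdict:
    """Map an appraiser's rating of study methods to an inclusion decision."""
    try:
        return RULES[methods_rating]
    except KeyError:
        raise ValueError(f"Unknown rating: {methods_rating!r}") from None
```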

Studies are assessed on whether they meet minimum quality criteria (that is, in terms of the minimum acceptable size, level of blinding [if blinding is possible], length of follow-up, etc). However, minimum quality criteria are just that: minimum criteria. For example, a trial may describe itself as randomized, but on further reading it becomes apparent that treatments were allocated by day of admission or by alternate allocation (i.e., quasirandomized); it may then be excluded on this basis.

Similarly, with regard to systematic reviews, quality may vary widely between reviews with regard to the methods employed and the extent to which data are reported. Indeed, on occasion, it may be difficult to decide whether a review is systematic or not if the search methods used are poorly reported. It is impossible to be comprehensive with regard to all the possible methodological issues that might arise or with regard to what their relative importance might be. For example, one element that is markedly weak may throw doubt on the entire conclusions of the study (a 'fatal flaw').

Quality issues that you may consider when assessing a systematic review might include:

  • Are the questions and methods of the review clearly stated?
  • Are the search methods described, and are they comprehensive and reproducible?
  • Are explicit methods used to determine which studies are included in the review?
  • Was the methodological quality of primary studies assessed?
  • Was the selection and assessment of primary studies appropriate, reproducible, and free from possible bias?
  • Are differences in individual study results adequately explained?
  • Are the results of primary studies combined appropriately?
  • Are the reviewers' conclusions supported by data cited?

Quality issues that you may consider when assessing an RCT might include:

  • Were the setting and study population clearly described?
  • Was assignment genuinely random, and was similarity between groups documented?
  • Was allocation to study groups adequately concealed from participants and investigators?
  • What was the level of blinding?
  • Were all clinically relevant outcomes reported?
  • Were over 80% of people who entered the study accounted for at its conclusion?
  • Did the RCT analyse people in the groups to which they were randomized (intention-to-treat analysis)?
  • Were both the statistical significance and the clinical importance of the statistical result considered?
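The follow-up question in the checklist above is a simple proportion, sketched below in Python. The function names and the participant numbers are hypothetical, and reading "over 80%" as strictly greater than 0.80 is an assumption of this example.

```python
def followup_proportion(randomized: int, accounted_for: int) -> float:
    """Proportion of randomized participants accounted for at the end of the study."""
    if randomized <= 0:
        raise ValueError("randomized must be a positive number")
    if not 0 <= accounted_for <= randomized:
        raise ValueError("accounted_for must be between 0 and randomized")
    return accounted_for / randomized


def meets_followup_threshold(randomized: int, accounted_for: int,
                             threshold: float = 0.80) -> bool:
    """True when strictly more than `threshold` of participants were accounted for."""
    return followup_proportion(randomized, accounted_for) > threshold


# Hypothetical trial: 200 people randomized, 150 accounted for at conclusion
print(followup_proportion(200, 150))       # 0.75
print(meets_followup_threshold(200, 150))  # False
print(meets_followup_threshold(200, 170))  # True
```

A trial that fails this check is not automatically excluded; the result feeds into the overall judgment about study methods.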

There are numerous checklists available to act as aides-mémoire for evaluating studies, and they can be very useful. However, it is important to remember that critical appraisal is much more than a 'tick box' exercise. It is needed to evaluate what weight can be placed on the findings, and how far it is possible to generalize the results from trials into routine practice and to inform clinical care.

Since these pragmatic issues always need to be borne in mind when critically appraising study data, here we present some examples of checklists for different study types and some information for tricky critical appraisal scenarios.

Appraising two-armed RCTs

Appraising multiple-armed RCTs

Appraising diagnostic test studies

Appraising systematic reviews

Assessing multiple systematic reviews on the same question

The SR toolbox also provides links to an extensive catalog of checklists for quality assessment. Some examples include:

Considering evidence on harm

Of all study types, well-conducted RCTs or systematic reviews of RCTs provide the best evidence of causality, that is, that one treatment causes an effect compared with another treatment. Usually, you would also report any data on adverse effects reported by included RCTs or systematic reviews of RCTs. However, RCTs are often underpowered to detect adverse effects, some of which may be serious but rare. Because of this, on occasion you may also need to include, and possibly do a separate search for, nonRCT data that give information on adverse effects to enhance the practical and clinical relevance of your findings.
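A back-of-the-envelope calculation illustrates why rare harms can be missed. Assuming, for simplicity, that each participant independently has the same risk p of the event, the chance of seeing at least one event among n participants is 1 - (1 - p)^n. The function name, the risk of 1 in 1000, and the sample sizes below are hypothetical, and observing a single event is a much weaker requirement than having adequate power to compare treatment arms.

```python
def prob_at_least_one_event(risk: float, n: int) -> float:
    """Chance of observing one or more events among n participants,
    assuming independent participants with the same per-person risk."""
    if not 0.0 <= risk <= 1.0:
        raise ValueError("risk must be between 0 and 1")
    if n < 0:
        raise ValueError("n must be non-negative")
    return 1.0 - (1.0 - risk) ** n


rare_risk = 1 / 1000  # hypothetical adverse effect affecting 1 in 1000 people

p_small = prob_at_least_one_event(rare_risk, 200)   # roughly 0.18
p_large = prob_at_least_one_event(rare_risk, 3000)  # roughly 0.95
print(f"{p_small:.2f} {p_large:.2f}")
```

Under these assumptions, a treatment arm of 200 people would most likely record no events at all, which is one reason larger observational datasets may be needed for evidence on harm.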

It should be noted that observational data may be more subject to confounding or bias. Bias due to noncomparability of groups is more likely in cohort studies, and more likely still in case-control studies. Case series or case reports are the weakest forms of evidence, although associations with harms in case reports have often been subsequently confirmed, and have sometimes provided the first indication that a given treatment is associated with a particular adverse effect.

Appraising the body of evidence for a clinical question

GRADE (Grading of Recommendations, Assessment, Development, and Evaluations) is a transparent framework for developing and presenting summaries of evidence (ideally from a systematic review) and provides a systematic approach for making evidence-based clinical practice recommendations. As part of this, authors must first make a judgment about the risk of bias in the individual studies (see above: Appraising the quality of study methods). If the risk of bias is sufficiently large, their confidence in the estimated treatment effect will be lower. Unlike risk of bias, which is assessed at the study level, GRADE is used to rate the body of evidence at the outcome level.

Read more about GRADE
