Determining Strength of Evidence: Interpreting Results of a Systematic Review

by Peggy Murray, Ph.D.
published February 14, 2022

In the first part of this series, I presented the elements of a strong systematic review. This post will explain two of the premier strategies used to evaluate strength of evidence before making medical treatment guidelines, clinical decisions, and health and environmental policies.

Once a systematic review is completed, the next task is to synthesize the evidence into a central conclusion. There are many ways to characterize this conclusion, so it has become important to develop strong guidelines for authors.

To do this, scientific organizations have developed evidence hierarchies, often by bringing together internal and external experts to achieve consensus on the best approach. These guidelines can—and should—include methods to change recommendations when new evidence is discovered.

Two of the more recent evidence hierarchies most often used in medical and toxicological communities are:

The levels of evidence for systematic reviews developed by the Institute of Medicine (IOM) of the National Academies of Sciences, Engineering, and Medicine, which are widely used within the scientific community, notably in toxicology reviews.
The Grading of Recommendations Assessment, Development and Evaluation (GRADE) guidelines, which are used by Cochrane, the World Health Organization, the U.S. Agency for Healthcare Research and Quality, and the U.S. Centers for Disease Control and Prevention (CDC), and are required by many peer-reviewed journals.

Institute of Medicine Levels of Evidence

This hierarchy of evidence was developed by an IOM committee formed at the request of Congress to determine whether health outcomes were associated with exposure to Agent Orange and other herbicides among soldiers who fought in the Vietnam War. (1)

The committee did not make any policy decisions or recommendations. Its task was to review and summarize the strength of the evidence, and it developed the levels to guide the policymakers who would make those decisions. The guidelines listed the following four categories:

1. Sufficient evidence of an association

To meet this designation, “…a positive association between exposure… and the outcome must be observed in studies in which chance, bias, and confounding can be ruled out with reasonable confidence. For example, the committee might regard evidence from several small studies that are free of bias and confounding and that show an association that is consistent in magnitude and direction to be sufficient evidence of an association. Experimental data supporting biologic plausibility strengthen the evidence of an association but are not a prerequisite and are not enough to establish an association without corresponding epidemiologic findings.”

2. Limited or suggestive evidence of an association

At this level, “…the evidence must suggest an association between exposure and the outcome in studies of humans, but the evidence can be limited by an inability to confidently rule out chance, bias, or confounding. Typically, at least one high-quality study indicates a positive association, but the results of other studies could be inconsistent.”

3. Inadequate or insufficient evidence to determine an association

At this level, “…the available human studies may have inconsistent findings or be of insufficient quality, validity, consistency, or statistical power to support a conclusion regarding the presence of an association. Such studies might have failed to control for confounding factors or might have had inadequate assessment of exposure.” As more evidence becomes available, the issue can be moved to one of the other categories as appropriate.

4. Limited or suggestive evidence of no association

At this level, “…several adequate studies covering the ‘full range of human exposure’ were consistent in showing no association with exposure to herbicides at any concentration and had relatively narrow confidence intervals. A conclusion of ‘no association’ is inevitably limited to the conditions, exposures, and observation periods covered by the available studies, and the possibility of a small increase in risk related to the magnitude of exposure studied can never be excluded. However, a change in classification from inadequate or insufficient evidence of an association to limited or suggestive evidence of no association would require new studies that correct for the methodologic problems of previous studies and that have samples large enough to limit the possible study results attributable to chance.”
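
To make the relationship among the four categories easier to see, here is a minimal Python sketch. It is only an illustration: the IOM committee reached its conclusions through expert judgment, not a checklist, and the function name `classify` and its yes/no inputs are hypothetical stand-ins for that judgment.

```python
from enum import Enum


class IOMCategory(Enum):
    SUFFICIENT = "Sufficient evidence of an association"
    LIMITED_SUGGESTIVE = "Limited or suggestive evidence of an association"
    INADEQUATE = "Inadequate or insufficient evidence to determine an association"
    NO_ASSOCIATION = "Limited or suggestive evidence of no association"


def classify(*, positive_association=False, bias_ruled_out=False,
             consistent_null=False, full_exposure_range=False,
             narrow_intervals=False):
    """Map simplified yes/no judgments to one of the four categories.

    Every argument is a hypothetical summary of a committee judgment,
    introduced here for illustration only.
    """
    # A positive association with chance, bias, and confounding
    # reasonably ruled out meets the highest category.
    if positive_association and bias_ruled_out:
        return IOMCategory.SUFFICIENT
    # A positive association that cannot be confidently separated
    # from chance, bias, or confounding is only suggestive.
    if positive_association:
        return IOMCategory.LIMITED_SUGGESTIVE
    # Consistent null results across the full exposure range, with
    # narrow confidence intervals, suggest no association.
    if consistent_null and full_exposure_range and narrow_intervals:
        return IOMCategory.NO_ASSOCIATION
    # Anything else is inadequate until more evidence arrives.
    return IOMCategory.INADEQUATE


print(classify(positive_association=True).value)
```

Calling `classify()` with no arguments returns the inadequate category, which mirrors the point that an issue stays there until new evidence justifies moving it.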

The GRADE Guidelines

The GRADE guidelines were developed as a way for clinicians to make sense of the enormous body of published research on prevention, diagnosis, and treatment—which often had many inconsistencies in quality of evidence and strength of recommendations. (2)

The GRADE Working Group's answer is a two-part system that grades the quality of evidence and the strength of recommendations. The tables below summarize the quality of evidence ratings, and factors that can reduce or increase those ratings:

Source 1. Schünemann, H., Brozek, J., & Oxman, A. (2013). GRADE handbook for grading quality of evidence and strength of recommendations.
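
As a rough illustration of how the quality ratings and the factors that lower or raise them fit together, here is a minimal Python sketch. It assumes the conventions described in the GRADE handbook: four quality levels (high, moderate, low, very low), randomized trials starting at high and observational studies at low, five reasons to rate down, and three reasons to rate up. The function name `grade_quality`, the factor keys, and the simple arithmetic are hypothetical; in practice GRADE ratings rest on reviewer judgment, not a formula.

```python
LEVELS = ["very low", "low", "moderate", "high"]

# Starting point depends on study design.
START = {"randomized": 3, "observational": 1}

# Factors that can lower the rating (by 1 or 2 levels each).
DOWNGRADE = {"risk_of_bias", "inconsistency", "indirectness",
             "imprecision", "publication_bias"}

# Factors that can raise the rating.
UPGRADE = {"large_effect", "dose_response", "opposing_confounding"}


def grade_quality(design, downgrades=None, upgrades=None):
    """Return an illustrative quality-of-evidence level.

    downgrades / upgrades map a factor name to the number of
    levels (1 or 2) a reviewer chose to move the rating.
    """
    downgrades = downgrades or {}
    upgrades = upgrades or {}
    unknown = (set(downgrades) - DOWNGRADE) | (set(upgrades) - UPGRADE)
    if unknown:
        raise ValueError(f"unknown factors: {sorted(unknown)}")

    score = START[design]
    score -= sum(downgrades.values())
    score += sum(upgrades.values())
    # Keep the result inside the four defined levels.
    score = max(0, min(score, len(LEVELS) - 1))
    return LEVELS[score]


# Randomized trials with imprecise estimates: rated down one level.
print(grade_quality("randomized", downgrades={"imprecision": 1}))
```

The clamping step reflects that the scale has a floor and a ceiling: evidence cannot be rated below very low or above high, no matter how many factors apply.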

In addition to quality of evidence, the GRADE system offers two grading levels when providing strength of recommendation—strong or weak—based on four factors:

Quality of evidence (e.g., several well-conducted randomized controlled clinical trials)
Uncertainty about the balance between desirable and undesirable effects (e.g., a small treatment effect but a serious side effect such as increased bleeding)
Variability in patient values and preferences (e.g., younger patients may accept the toxicity of chemotherapy in exchange for a longer life, whereas older patients may not)
Uncertainty about whether the intervention is a wise use of resources (e.g., extremely expensive medication that prolongs life a few months)

Although the IOM and GRADE systems rely on scientific judgment, they are applied in a highly transparent manner that reduces bias as much as possible. With these frameworks, clinicians, regulators, and policymakers can be confident that decisions are made using the best available science.

References:

1. Veterans and Agent Orange: Update 11. National Academies Press, 1994.
2. Guyatt, et al. (2008). GRADE: An emerging consensus on rating quality of evidence and strength of recommendations. BMJ, April 26, 2008, vol. 336, pp. 924-926.
