To our delight, the Center has been permitted to reprint a fascinating commentary written by one of our esteemed board members, Dr. Anthony Cox, published in a recent issue of Global Epidemiology. 

In the commentary, Dr. Cox presents a dialogue with ChatGPT, a large language model, in which he uses deliberate, patient questioning to show how its responses make a 180-degree turn on the question of a causal link between exposure to ambient fine particulate matter (PM2.5) and increased mortality.

A large language model is a type of artificial intelligence that mimics human conversation by using statistical models to predict the next words and phrases in a sequence. In the simplest possible terms, it is the ultimate fill-in-the-blank tool, though not always an accurate one.

ChatGPT does not actually “know” anything or hold opinions. It has been fed a colossal dataset of text and uses it to learn the patterns and structure of language. It then applies those patterns to generate its own text in response to a prompt. Within a conversation, ChatGPT also takes your earlier questions and prompts into account, so its later responses build on that context.
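
To make the fill-in-the-blank idea concrete, here is a minimal sketch in Python of a toy word-counting model. It is nothing like ChatGPT’s actual neural network, and the tiny corpus, function names, and example prompts are invented purely for illustration: it tallies which word follows which in a small training text and then predicts the most probable next word for a prompt.

from collections import Counter, defaultdict

# Toy "training data": a few sentences standing in for the colossal text corpora
# that real large language models learn from.
corpus = (
    "exposure to fine particulate matter is associated with increased mortality . "
    "exposure to fine particulate matter is associated with increased mortality . "
    "exposure to fine particulate matter is associated with respiratory disease ."
).split()

# Learn the pattern: count how often each word is followed by each other word.
next_word_counts = defaultdict(Counter)
for current_word, following_word in zip(corpus, corpus[1:]):
    next_word_counts[current_word][following_word] += 1

def predict_next(word):
    # Fill in the blank: return the continuation seen most often in training.
    counts = next_word_counts.get(word)
    return counts.most_common(1)[0][0] if counts else "<unknown>"

def continue_prompt(prompt, n_words=5):
    # Generate text from a prompt by repeatedly predicting the next word.
    words = prompt.split()
    for _ in range(n_words):
        words.append(predict_next(words[-1]))
    return " ".join(words)

print(predict_next("with"))            # -> "increased"
print(continue_prompt("exposure to"))  # -> "exposure to fine particulate matter is associated"

ChatGPT performs the same basic next-word prediction, but with a neural network trained on vastly more text and with the entire conversation, not just the previous word, as context. That is why the questions you ask can shape the answers you get.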

While your mileage may vary when using ChatGPT in your field, its performance depends heavily on the quality of its prompts. As you will observe in the dialogue with Dr. Cox, ChatGPT developed a far more refined and accurate response as the exchange went on. Ultimately, it completely reversed its original position.

As Dr. Cox notes, “The change suggests that either the initial statement lacked needed caveats included in the final statement or that ChatGPT is too easily swayed by persistent Socratic questioning.”

ChatGPT is the latest in a long line of large language models, and these types of AI technologies are becoming increasingly pervasive in scientific research. We should learn from examples such as the dialogue below so that we can recognize, and properly account for, the limitations of this rapidly developing technology.

Below are the abstract and introduction sections of Dr. Cox’s commentary. The full commentary can be accessed in Global Epidemiology. 

Abstract

We present a Socratic dialogue with ChatGPT, a large language model (LLM), on the causal interpretation of epidemiological associations between fine particulate matter (PM2.5) and human mortality risks. ChatGPT, reflecting probable patterns of human reasoning and argumentation in the sources on which it has been trained, initially holds that “It is well-established that exposure to ambient levels of PM2.5 does increase mortality risk” and adds the unsolicited remark that “Reducing exposure to PM2.5 is an important public health priority.” After patient questioning, however, it concludes that “It is not known with certainty that current ambient levels of PM2.5 increase mortality risk. While there is strong evidence of an association between PM2.5 and mortality risk, the causal nature of this association remains uncertain due to the possibility of omitted confounders.” This revised evaluation of the evidence suggests the potential value of sustained questioning in refining and improving both the types of human reasoning and argumentation imitated by current LLMs and the reliability of the initial conclusions expressed by current LLMs.

Introduction

Large Language Models (LLMs) such as GPT-3 and its successors predict and generate probable patterns in text and conversations based on conditional probabilities estimated from massive amounts of training data, much of it scraped from the web [2]. The recent advent of ChatGPT and other LLMs provides new opportunities to study highly probable patterns of argumentation reflected in the training data. Doing so may give insights into the patterns of rhetoric and reasoning that currently help to shape public discourses and understanding in a number of areas. A large literature, including both peer-reviewed scientific articles and authoritative documents by regulatory agencies and organizations such as the World Health Organization, has provided numerous examples of human reasoning and argumentation about causality in interpreting epidemiological associations. Current LLMs can generate text and responses that they calculate to have high conditional probabilities (given user inputs and the conversation so far) based on such training data.

The purpose of this comment is to provide an example of a Socratic dialogue with ChatGPT about the causal interpretation of an important epidemiological association between exposure to fine particulate matter air pollution (PM2.5) and mortality risk. The author’s part of the dialogue is in italics. ChatGPT’s responses are in plain text, but with bold font for the parts that the author considers especially significant. This dialogue focuses on potential confounding, leaving other methodological issues in interpreting associations causally for other dialogues. However, it illustrates the potential for patient questioning to change even very strongly worded initial convictions expressed by ChatGPT to more moderate ones. Whether similar questioning should be more widely used in human regulatory risk assessment deliberations may be worth considering.

And so begins the dialogue between Dr. Cox and ChatGPT on causal interpretation of PM2.5-mortality associations: 

Dr. Cox: Is it known with certainty that current ambient levels of PM2.5 increase mortality risk?

ChatGPT: Yes, there is a significant body of scientific evidence that demonstrates a clear association between exposure to PM2.5 [fine particulate matter] and increased mortality risk.

PM2.5 refers to fine particulate matter with a diameter of less than 2.5 μm, which is small enough to penetrate deep into the lungs and even enter the bloodstream, causing various adverse health effects.

Numerous studies conducted around the world have consistently found that exposure to elevated levels of PM2.5 is associated with increased mortality risk, particularly from cardiovascular and respiratory diseases.

That is where ChatGPT started. 

See where it ended by clicking here.