When should we trust AI more than physicians?

The time may be fast approaching. A paper by Goh et al. (2024) sampled 50 physicians and compared three approaches: physicians using conventional resources, physicians with access to GPT-4, and GPT-4 alone. The primary outcome was a diagnostic reasoning score, i.e., how well each group worked through the diagnosis of a case. The authors found that:

The median diagnostic reasoning score per case was 76.3 percent (IQR 65.8 to 86.8) for the GPT-4 group and 73.7 percent (IQR 63.2 to 84.2) for the conventional resources group, with an adjusted difference of 1.6 percentage points (95% CI −4.4 to 7.6; p=0.60). The median time spent on cases for the GPT-4 group was 519 seconds (IQR 371 to 668 seconds), compared to 565 seconds (IQR 456 to 788 seconds) for the conventional resources group, with a time difference of −82 seconds (95% CI −195 to 31; p=0.20). GPT-4 alone scored 15.5 percentage points (95% CI 1.5 to 29, p=0.03) higher than the conventional resources group.

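So GPT-4 alone not only scored higher than physicians using conventional resources; according to the authors, it also outperformed physicians who had access to GPT-4. Physicians with GPT-4 did work through cases somewhat faster, but that difference was not statistically significant. The authors summarize as follows: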

In a clinical vignette-based study, the availability of GPT-4 to physicians as a diagnostic aid did not significantly improve clinical reasoning compared to conventional resources, although it may improve components of clinical reasoning such as efficiency. GPT-4 alone demonstrated higher performance than both physician groups, suggesting opportunities for further improvement in physician-AI collaboration in clinical practice.

HT: Ethan Mollick.