It’s been six months since I wrote about the potential of a new transformer AI technology to function as an expert medical system. Since then, there have been numerous studies testing the ability of ChatGPT and similar systems to make clinical diagnoses or decisions, or to pass standardized medical exams. The results are mostly positive. For example, earlier this year Kung et al published research finding that ChatGPT is able to pass all three parts of the United States Medical Licensing Examination (USMLE), which has a passing threshold of about 60%. There are many specialty board exam studies as well, with mixed results, but ChatGPT passes most of them.
A recent study extends this research by looking not only at medical knowledge but also at medical decision making. For this study the researchers used 36 clinical vignettes published in the Merck Sharpe & Dohme (MSD) Clinical Manual, and tested ChatGPT’s ability to generate an initial differential diagnosis, recommend clinical management decisions (such as which studies to perform), and then make a final diagnosis based on this information. They found:
“ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across 36 clinical vignettes. LLM showed the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in making an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2% – 66.6%). Compared to answering questions about general medical knowledge, ChatGPT performed lower in the differential diagnosis (β= –15.8%; P<0.001) and clinical management (β=–7.4%; P=.02) type of question.”
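For readers curious where confidence intervals like those quoted above come from, here is a minimal sketch of the standard normal-approximation (Wald) interval for a proportion. The sample size used below is purely illustrative – it is not taken from the study:

```python
import math

def wald_ci(p_hat: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """95% normal-approximation (Wald) confidence interval for a proportion.

    p_hat: observed accuracy (fraction of questions answered correctly)
    n: number of questions (illustrative here, not from the study)
    z: critical value; 1.96 corresponds to 95% coverage
    """
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of a proportion
    return (p_hat - z * se, p_hat + z * se)

# Hypothetical example: 77 correct answers out of 100 questions
low, high = wald_ci(0.77, 100)
print(f"77.0% (95% CI {low:.1%}-{high:.1%})")
```

Note that the interval narrows as the number of questions grows, which is why per-task accuracies from a modest set of vignettes carry fairly wide intervals.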
This is impressive, and fits with previous research on the strengths and weaknesses of ChatGPT-type systems. As a review, ChatGPT is an application built on what is called a large language model (LLM). The core AI technology is called a transformer – “GPT” stands for generative pre-trained transformer. It is generative in that it does not simply copy text from a particular source, but generates text based on a predictive model. It has been pre-trained on large amounts of text gathered from the internet.
These LLM systems are not thinking, and they are not on their way to general AI that simulates human intelligence. They’ve been compared to a really good autocomplete – they work by predicting the next most likely word segment based on billions of examples from the internet. But the results can be very impressive. They can produce natural-sounding language, and can display an impressive base of knowledge.
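The “autocomplete” analogy can be made concrete. Here is a toy sketch of the next-word sampling loop at the heart of text generation – the probability table below is hand-built for illustration, whereas a real LLM learns such statistics from billions of training examples:

```python
import random

# Toy next-word probability table (hypothetical, hand-built for illustration;
# a real LLM learns these probabilities from its training data)
NEXT_WORD_PROBS = {
    "the":      {"patient": 0.6, "diagnosis": 0.4},
    "patient":  {"presents": 0.7, "reports": 0.3},
    "presents": {"with": 1.0},
    "with":     {"fever": 0.5, "fatigue": 0.5},
}

def generate(start: str, max_words: int = 5, seed: int = 1) -> str:
    """Repeatedly sample a statistically likely continuation of the text."""
    rng = random.Random(seed)
    words = [start]
    for _ in range(max_words):
        dist = NEXT_WORD_PROBS.get(words[-1])
        if dist is None:
            break  # no known continuation for this word
        choices, weights = zip(*dist.items())
        words.append(rng.choices(choices, weights=weights)[0])
    return " ".join(words)

print(generate("the"))
```

The output is always statistically plausible given the table – but nothing in the loop checks whether the resulting sentence is true, which is the crux of the limitations discussed below.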
But these systems are still brittle, as narrow AI systems tend to be – if you push them, they break. The main weakness of these LLMs is that they are prone to what are called hallucinations. This means they can simply make things up. Remember – they produce text based on statistical probabilities, not by fact-checking or reflecting accurate knowledge. If, for example, two things are statistically likely to be mentioned together, ChatGPT may produce text making them appear directly related. It can also fabricate entirely plausible-looking references, generating citation-like structures and filling them with statistically plausible but fake details.
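The reference-fabrication failure mode can be illustrated with a toy sketch. Every name, topic, and journal below is a deliberately fake placeholder, standing in for the statistical word associations a model has learned; the point is that the citation *shape* comes out right while the content is invented:

```python
import random

# Hypothetical word pools standing in for a model's learned statistics.
# Every combination below is deliberately fake - no real publication exists.
AUTHORS = ["Smith J", "Lee K", "Garcia M"]
TOPICS = ["chatbots", "language models", "clinical AI"]
JOURNALS = ["Journal of Example Medicine", "Placeholder Medical Review"]

def hallucinated_reference(seed: int) -> str:
    """Fill a citation-shaped template with statistically plausible parts.

    The *structure* looks like a real reference; the *content* is invented,
    mirroring how an LLM can emit fluent but fabricated citations.
    """
    rng = random.Random(seed)
    return (f"{rng.choice(AUTHORS)}, et al. Evaluation of {rng.choice(TOPICS)} "
            f"in diagnosis. {rng.choice(JOURNALS)}. "
            f"{rng.randint(2019, 2023)};{rng.randint(1, 40)}(2):101-110.")

print(hallucinated_reference(seed=7))
```

Nothing in the template checks whether the assembled reference exists – which is exactly why LLM-generated citations must be verified before use.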
This is a serious weakness for an expert system. To put ChatGPT’s performance in the recent research into context, it barely passed, with a level of knowledge on par with a recent medical school graduate, not an experienced clinician. These systems are therefore not at the level of being able to practice medicine. That leaves two questions – will they ever get there, and will they be of any use in the interim?
Taking the second question first, it seems to me that even now a general LLM application like ChatGPT can be useful as an expert assistant – a tool used by an expert to aid their work. But its usefulness comes with significant caveats. The results that ChatGPT generates cannot be fully trusted. They should not be considered authoritative, even when they sound like it. But the output can be used as an idea generator, to suggest possible diagnoses that a clinician might not have thought of.
What about non-expert users? Could the average person use ChatGPT like a search engine to find reasonable answers to medical questions? The answer is similar – it is about as good as a typical Google search, albeit in natural language. However, there is no guarantee that the information is accurate. ChatGPT essentially reflects the information that exists on the internet, good and bad. The way a question is phrased also tends to bias the answer. Again, remember, ChatGPT does not think or understand (the way humans do); it is just a predictive model.
What, however, is the potential for such a system in the future? I think the potential is huge. ChatGPT is a general-purpose LLM application, not specifically trained as a medical expert, and it already performs quite well. Imagine a medical-expert version of ChatGPT, trained not on the internet but on the totality of published medical research, practice standards, and expert analysis. It seems likely that such an LLM would outperform ChatGPT or similar general models.
Also, results can be improved by properly training users. A recent study looked at the potential of prompt engineering. This means crafting prompts (the questions you ask an LLM) that are designed to produce more reliable results, which can be based on tested examples. We may see a future where optimizing prompts for a medical LLM is a class in medical school.
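As a concrete sketch of what prompt engineering means in practice: the two prompts below pose the same clinical question, but the second constrains the model’s role, output format, and uncertainty handling. The `ask_llm` function is a hypothetical stand-in for a real LLM API client, and the vignette is invented:

```python
def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for a call to a real LLM chat API."""
    raise NotImplementedError("wire up an actual LLM client here")

# Naive prompt: no role, no format, no guardrails against overconfidence.
naive_prompt = "What does this patient have? 45yo male, chest pain, dyspnea."

# Engineered prompt: fixes the model's role, demands a ranked differential,
# and asks for explicit uncertainty - the kind of structure prompt-design
# studies suggest can yield more reliable output.
engineered_prompt = (
    "You are assisting a licensed physician. Given the vignette below, "
    "list the three most likely diagnoses, ranked, each with one line of "
    "supporting reasoning, and state your uncertainty explicitly.\n\n"
    "Vignette: 45-year-old male with acute chest pain and dyspnea."
)
```

The engineered version does not make the model smarter; it simply steers the prediction toward a format that is easier to check and harder to over-trust.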
There seems to be a general consensus that these LLM AI systems have tremendous potential as medical expert systems. Currently they are at the level of basic medical knowledge, but not yet at the level of an experienced clinician. They also have significant limitations, such as fabricating false information. But it seems we are close to the point where such systems could significantly improve the practice of medicine. They could help reduce errors and misdiagnoses, and chart the most efficient path of diagnostic testing or clinical management. Medicine is, by its very nature, a game of statistics, and AI medical assistants could provide the statistical and factual information doctors need when treating patients (one of the ultimate goals of evidence-based medicine).
A medical LLM could also help doctors stay up to date. It is a challenge, to say the least, to keep one’s medical knowledge current. The internet has made this much easier – clinicians can now quickly search medical questions and see what the latest published research shows. But the faster, more efficient, and more thorough this process becomes, the better.
There still needs to be a human in the loop (and this will remain true until we have general AI with fully human-level intelligence). This is because medicine is also a human practice, requiring judgment, an emotional calculation of risks versus benefits, attention to the goals of treatment, and a human perspective. Facts alone are not enough. But humane and personal medical decisions are best made from a foundation of accurate, current, and comprehensive medical information.