Probabilistic medical predictions of large language models
Biomedical LLMs Weekly #1
Welcome to the Biomedical LLMs Weekly newsletter. Today we’re looking into the paper “Probabilistic medical predictions of large language models”.
Journal: npj Digital Medicine
Link: Read it here
Date published: 19 December 2024
Authors
Bowen Gu (first author), Rishi J. Desai, Kueiyu Joshua Lin (supervising author) - Harvard Medical School
Jie Yang (supervising author) - Harvard Medical School, Harvard University, Broad Institute
Key Takeaways
Prompting LLMs to provide explicit probability scores for medical questions tends to result in poorer performance compared to using the implicit probability scores that LLMs naturally generate at the token level.
Target Audience
Anyone developing (or using) LLM-based medical prediction systems.
Significance
Providing a reliable confidence score for predictions is crucial for building trust in medical applications, especially with black-box methods like LLMs. This paper evaluates two primary approaches for generating these scores:
Explicit Probability: Prompting LLMs to rate the likelihood of their predictions. For example, asking, “Please choose the correct option and rate the probability of the option being correct as a percentage ranging from 0% to 100%.”
Implicit Probability: Using the token probabilities generated implicitly by the LLM during text generation.
The paper demonstrates that using implicit probabilities is generally more effective, especially as datasets become more imbalanced, and it is one of the first studies to clearly indicate which method is preferable.
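To make the distinction concrete, here is a minimal sketch of the implicit approach with a Hugging Face causal LM. The model name, prompt, and answer tokens are illustrative placeholders, not the paper's exact setup:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model; any Hugging Face causal LM with accessible logits works.
model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Illustrative binary medical question (not from the paper's datasets).
prompt = (
    "Question: Is metformin a first-line treatment for type 2 diabetes?\n"
    "Options: A) Yes  B) No\n"
    "Answer:"
)
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits[0, -1]  # logits for the next token

# Token IDs of the two candidate answers (the leading space matters for
# many tokenizers).
id_a = tokenizer.encode(" A", add_special_tokens=False)[0]
id_b = tokenizer.encode(" B", add_special_tokens=False)[0]

# Implicit probability: softmax over just the two answer tokens.
p_a, p_b = torch.softmax(logits[[id_a, id_b]], dim=-1).tolist()
print(f"P(A) = {p_a:.3f}, P(B) = {p_b:.3f}")
```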
Methods
The authors evaluated six open-source LLMs on five medical datasets, comparing explicit probabilities (elicited by prompting the model) against implicit probabilities (token-level probabilities) using the AUROC and AUPRC metrics. The experiments used a question-answering setup with two possible answers; in some cases, multiple-choice datasets were simplified to binary to allow for more straightforward evaluation.
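As a rough illustration of the evaluation step (not the paper's actual numbers), discrimination of the two score types can be compared with scikit-learn on toy data:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

# Toy data: 1 = the model's answer was correct, 0 = incorrect.
y_correct = [1, 0, 1, 1, 0, 0, 1, 0]
# Made-up confidence scores from the two elicitation methods.
explicit = [0.90, 0.80, 0.90, 0.70, 0.90, 0.60, 0.80, 0.70]
implicit = [0.84, 0.31, 0.77, 0.69, 0.42, 0.18, 0.91, 0.25]

for name, scores in [("explicit", explicit), ("implicit", implicit)]:
    print(f"{name}: AUROC = {roc_auc_score(y_correct, scores):.3f}, "
          f"AUPRC = {average_precision_score(y_correct, scores):.3f}")
```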
Key Findings
In the vast majority of evaluation settings, implicit probabilities outperformed explicit ones.
The performance gap widened with more imbalanced datasets, which are common in the medical domain.
Larger LLMs typically outperformed their smaller counterparts in both the implicit and explicit probability settings.
Both approaches tended to produce overconfident probability scores, potentially leading to unwarranted trust in the model's outputs (a simple calibration check is sketched below).
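One common way to quantify overconfidence (not necessarily the paper's exact analysis) is the expected calibration error (ECE), which measures the gap between stated confidence and observed accuracy. A minimal sketch on toy data:

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Weighted average of |accuracy - mean confidence| over confidence bins."""
    conf, correct = np.asarray(conf, float), np.asarray(correct, float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return ece

# Toy data: an overconfident model claims ~0.9 but is right only half the time.
conf = [0.95, 0.90, 0.92, 0.85, 0.88, 0.99]
correct = [1, 0, 1, 0, 1, 0]
print(f"ECE = {expected_calibration_error(conf, correct):.3f}")
```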
Implications
Where feasible, implicit probabilities should be preferred. Further research is needed to improve probability scoring methods for LLMs in the medical field.
Potential Issues
Some proprietary LLMs, such as Gemini and Claude, don't expose token probabilities, which restricts the use of implicit probabilities (see the sketch after this list for retrieving them where a provider does expose them).
The evaluation was conducted in a simplified binary setting, which might not translate to more complex scenarios.
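Where a provider does expose token log-probabilities, implicit scores remain retrievable. As one hedged example, the OpenAI chat API offers a logprobs option at the time of writing; the model name and question below are illustrative:

```python
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative binary question; model name is a placeholder.
resp = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{
        "role": "user",
        "content": "Is aspirin an anticoagulant? Answer with a single "
                   "letter: A (yes) or B (no).",
    }],
    max_tokens=1,
    logprobs=True,
    top_logprobs=5,
)

# Convert the log-probabilities of the top candidate tokens into probabilities.
for cand in resp.choices[0].logprobs.content[0].top_logprobs:
    print(cand.token, f"{math.exp(cand.logprob):.3f}")
```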
Conclusion
The paper provides a systematic comparison of probability calculation methods, emphasizing the need for caution when using explicit probabilities.
I find this paper a wonderful motivator: it shows that we need to improve how LLMs generate probability scores, while also highlighting that implicit probabilities are a good starting point.
This was our first Biomedical LLMs Weekly newsletter, and I’ll be experimenting with the format in the coming weeks. Feedback is always welcome, as well as any papers that you find interesting.
- Nikita Makarov (with the help of GPT-4o and o1-preview)


