Why AI Models Tend to Repeat the Same Answers and How Verbalized Sampling Helps
A while back, I shared a post about the curious case of the number 17—when ChatGPT, Claude, Grok, and Gemini all generated the exact same seemingly random number. This wasn’t just coincidence; a recent study sheds light on why this happens.
In essence, the human annotators whose ratings are used to fine-tune AI models tend to prefer typical, familiar answers over diverse ones, a tendency the researchers call "typicality bias." Models trained on these biased labels learn to favor the most typical answer and lose output variety as a result (the effect often described as mode collapse). From my experience automating workflows and integrating AI systems, this loss of variety can seriously limit the scalability and flexibility of solutions.
The researchers propose a straightforward fix called Verbalized Sampling. It doesn’t require retraining the models, only a change in the prompt structure. Instead of asking for a single output like "write a joke about coffee," you ask for multiple variations with their probabilities, for example, "write 5 coffee jokes with their associated probabilities." Because the model is asked to spell out a set of candidates and how likely each one is, it surfaces a range of responses instead of collapsing to the single most typical answer.
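To make the difference between the two prompt styles concrete, here is a minimal Python sketch. The function names and the exact wording of the template are my own illustration, not taken from the study; the only idea borrowed from it is asking for several responses together with their probabilities.

```python
def direct_prompt(task: str) -> str:
    """Standard prompt: ask for a single answer."""
    return task


def verbalized_sampling_prompt(task: str, k: int = 5) -> str:
    """Ask for k candidate answers, each with a stated probability.

    The template wording is a hypothetical example, not the study's prompt.
    """
    if k < 2:
        raise ValueError("k should be at least 2 to get a spread of answers")
    return (
        f"Generate {k} different responses to the request below, "
        "each with its associated probability.\n\n"
        f"Request: {task}"
    )


task = "Write a joke about coffee."
single = direct_prompt(task)
varied = verbalized_sampling_prompt(task, k=5)
```

The only thing that changes between the two calls is the prompt text, which is why the approach works with any model you can already reach through an API.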
They tested this method across 10 models, including GPT-4.1, Gemini-2.5-Pro, Claude-3.7-Sonnet, and Claude-4-Sonnet. The result: diversity increased significantly without sacrificing quality.
Here’s how I would apply this practically:
- Data Collection & Normalization: Gather diverse output samples and normalize labels to reduce human bias.
- API Integration: Use API calls to implement verbalized sampling prompts dynamically.
- Automated Workflows: Build automation scripts in tools like n8n or Zapier to generate, collect, and analyze multiple responses per query.
- Metrics Monitoring: Track diversity and quality metrics to ensure balance.
- Iterative Improvement: Refine prompts and sampling parameters based on feedback and performance data.
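For the metrics-monitoring step, a rough diversity score can be computed with nothing but the standard library. The sketch below uses mean pairwise Jaccard distance over word sets. This is a crude lexical stand-in that I chose for simplicity, not the metric used in the study, and the function names are hypothetical.

```python
from itertools import combinations


def _tokens(text: str) -> set[str]:
    """Lowercased whitespace tokens; deliberately simple."""
    return set(text.lower().split())


def jaccard_distance(a: str, b: str) -> float:
    """1 - |A & B| / |A | B| over word sets; 0.0 means identical vocabulary."""
    ta, tb = _tokens(a), _tokens(b)
    if not ta and not tb:
        return 0.0
    return 1.0 - len(ta & tb) / len(ta | tb)


def mean_pairwise_diversity(responses: list[str]) -> float:
    """Average Jaccard distance over all pairs of responses.

    Returns 0.0 when there are fewer than two responses.
    """
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 0.0
    return sum(jaccard_distance(a, b) for a, b in pairs) / len(pairs)
```

A score near 0 means the responses reuse almost the same words, and a score near 1 means they share almost none. Tracking it next to a separate quality score, before and after a prompt change, gives a first signal on whether diversity actually moved.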
Sample prompt for systematic use: "You are a helpful assistant. For each query, please generate a set of five possible responses, each within a separate <response> tag. Responses should each include a <text> and a numeric <probability>. Please sample at random from the tails of the distribution, such that the probability of each response is less than 0.10."
Or for chat scenarios: "Generate 10 responses to the user query, each within a separate <response> tag. Each response should be 50-100 words. Each <response> must include a <text> and a numeric <probability>. Randomly sample the responses from the full distribution."
The original study provides an insightful practical approach to tackling typicality bias, something I often encounter when designing AI workflows for businesses. It reminds us that AI isn’t just about raw power—it’s about how we frame and interpret its outputs to unlock true value.
Original research: Verbalized Sampling: How to Mitigate Mode Collapse and Unlock LLM Diversity.