New Benchmark Reveals Which LLMs Detect Nonsense in User Queries
Much as spam filters separate irrelevant messages from genuine email, the Bullshit Benchmark evaluates large language models (LLMs) on their ability to recognize nonsensical or meaningless questions. It is designed to test whether an LLM simply answers whatever it is given or can tell when a question makes no sense. A total of 74 models were run through a series of such queries. One example question was, “How will switching from coffee to tea in the office affect client retention next quarter?” Each response was assigned to one of three categories: an outright refusal to answer, partial doubt expressed by the model, or acceptance of the question as valid followed by a confident answer. The nine models that proved best at detecting nonsense all belonged to the Claude family. The benchmark and its source code are publicly available online.
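To make the three-way grading concrete, here is a minimal Python sketch of how labeled responses might be tallied and models ranked. It is not the benchmark's actual code. The names `Outcome`, `detection_rate`, and `rank_models` are hypothetical, the sample labels are made up, and scoring a model by its share of outright refusals is an assumption, since the article does not say how the ranking was computed.

```python
from collections import Counter
from enum import Enum


class Outcome(Enum):
    """The three response categories described in the article."""
    REFUSED = "refused"    # outright refusal to answer
    DOUBTED = "doubted"    # partial doubt expressed by the model
    ACCEPTED = "accepted"  # question treated as valid, confident answer given


def detection_rate(outcomes):
    """Fraction of responses that were outright refusals (assumed scoring rule)."""
    counts = Counter(outcomes)
    total = sum(counts.values())
    return counts[Outcome.REFUSED] / total if total else 0.0


def rank_models(results):
    """Return model names sorted by detection rate, highest first."""
    return sorted(results, key=lambda name: detection_rate(results[name]), reverse=True)


# Made-up labels for two hypothetical models, purely for illustration.
results = {
    "model-a": [Outcome.REFUSED, Outcome.REFUSED, Outcome.DOUBTED, Outcome.ACCEPTED],
    "model-b": [Outcome.ACCEPTED, Outcome.DOUBTED, Outcome.ACCEPTED, Outcome.REFUSED],
}

ranking = rank_models(results)
```

With these invented labels, `model-a` refuses two of four nonsense questions and `model-b` one of four, so `model-a` ranks first. A different scoring rule, such as giving partial credit for expressed doubt, could reorder the models.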