Anthropic wants more reliable benchmarks for LLMs
- July 2, 2024
Anthropic believes that the current ‘points system’ does not adequately reflect the skills of LLMs and is launching an initiative for more reliable benchmarks.
“A robust ecosystem of third-party assessments is essential for evaluating AI systems, but the current landscape is limited.” With these words, Anthropic announces that it will invest in the development of new benchmarks for LLMs. The AI company believes that the current tools are not sufficiently reliable.
Benchmarks can be seen as a “report card” for LLMs. An LLM is asked to perform a specific task, and its score is compared with how other LLMs performed on the same task. When new models are announced, benchmark results are enthusiastically waved around to show why that model is better, even though those results may not fully reflect real-world performance. Anthropic is guilty of this too.
Anthropic is certainly not the only one to criticize current benchmarks. A common criticism is that testing and evaluating LLMs for a specific task does not adequately reflect how people will use that system. The added value of an LLM lies precisely in the fact that different tasks can be easily combined.
The company behind the Claude models even wants to overhaul the benchmark system completely. Anthropic advocates developing benchmarks that focus more on use cases demonstrating the potential of AI, for example in scientific research and multilingual conversation. Security and potential risks should also be given greater weight when evaluating AI systems.
Anyone who thinks they have a good idea can register with Anthropic. The best ideas can receive financial support. Anthropic’s goals sound noble, although commercial interests certainly play a role. The company would like its Claude models to perform well in the benchmarks.
Source: IT Daily
As an experienced journalist and author, Mary has been reporting on the latest news and trends for over 5 years. With a passion for uncovering the stories behind the headlines, Mary has earned a reputation as a trusted voice in the world of journalism. Her writing style is insightful, engaging and thought-provoking, as she takes a deep dive into the most pressing issues of our time.