A new leaderboard called QIMMA has introduced stricter quality controls for evaluating Arabic large language models. Unlike existing platforms, QIMMA validates each benchmark before testing models to ensure results reflect real Arabic language skills. The project is led by researchers from the Technology Innovation Institute in Abu Dhabi, including Leen AlQadi, Ahmed Alzubaidi, and Mohammed Alyafeai.
The team found that many Arabic benchmarks have flaws such as translation errors, incorrect answers, and cultural mismatches. Some well-known datasets contained up to 3.1% flawed samples. QIMMA’s pipeline uses two advanced AI models to check each question and answer before human reviewers make final decisions on cultural and dialectal accuracy.
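The two-stage check described above can be pictured with a minimal Python sketch. It is an illustration, not QIMMA's actual code: the `Sample` type, the `triage` function, and both toy checkers are hypothetical names, and the rule that a sample goes to human review when either automated checker flags it is an assumption, since the article does not say how the two models' verdicts are combined.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Sample:
    question: str
    answer: str


# Hypothetical interface: a checker returns True when it considers a sample flawed.
Checker = Callable[[Sample], bool]


def triage(
    samples: List[Sample], checker_a: Checker, checker_b: Checker
) -> Tuple[List[Sample], List[Sample]]:
    """Split samples into those kept as-is and those sent to human review.

    Assumed rule: a flag from either automated checker is enough to
    escalate the sample to a human reviewer.
    """
    kept, needs_review = [], []
    for sample in samples:
        if checker_a(sample) or checker_b(sample):
            needs_review.append(sample)
        else:
            kept.append(sample)
    return kept, needs_review


# Toy stand-ins for the two AI models (purely illustrative).
def empty_answer(sample: Sample) -> bool:
    return not sample.answer.strip()


def placeholder_question(sample: Sample) -> bool:
    return "???" in sample.question


demo = [
    Sample("What is the capital of Egypt?", "Cairo"),
    Sample("??? garbled translation", "B"),
    Sample("Which option is correct?", " "),
]
kept, needs_review = triage(demo, empty_answer, placeholder_question)
print(len(kept), "kept;", len(needs_review), "sent to human review")
```

In a real pipeline the checkers would be calls to language models, and the human reviewers would make the final call on the escalated samples.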
QIMMA combines 109 benchmark subsets into a single suite of over 52,000 samples covering education, law, medicine, literature, and coding. It is the first Arabic leaderboard to include coding tests in Arabic, using adapted versions of HumanEval+ and MBPP+. All code and evaluation results are publicly available.
The platform’s quality validation process eliminated samples with errors, inconsistencies, or cultural bias. For example, ArabicMMLU lost 436 of its 14,163 samples during validation. Most discarded samples had incorrect gold answers or formatting issues.
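As a quick check on these figures, the ArabicMMLU discard rate can be computed directly. The result, about 3.1%, matches the upper bound quoted for flawed samples, though the article does not say that the 3.1% figure refers to ArabicMMLU, so that link is an assumption.

```python
removed = 436    # ArabicMMLU samples discarded during validation
total = 14_163   # ArabicMMLU samples before validation

rate = removed / total
remaining = total - removed

print(f"Discard rate: {rate:.1%}")        # Discard rate: 3.1%
print(f"Samples remaining: {remaining}")  # Samples remaining: 13727
```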
QIMMA aims to provide a more reliable way to compare Arabic AI models. Its transparency and strict standards raise the bar for model evaluation in the region’s growing AI field.
Source: huggingface.co