A recent study from Apple's research team, accepted at the Workshop on Navigating and Addressing Data Problems for Foundation Models held during ICLR 2026, challenges conventional approaches to training large language models (LLMs). While most efforts focus on expanding datasets to improve performance, the paper argues that fact memorization in LLMs remains inefficient. Current models often struggle with hallucinations and inaccuracies, particularly in knowledge-intensive tasks.
The research introduces a formal framework that analyzes fact memorization from an information-theoretic perspective and examines how the distribution of training data influences the accuracy of stored facts. The findings indicate that standard training methods leave fact accuracy suboptimal, frequently below its theoretical maximum. The paper attributes this gap to redundant or noisy data in conventional training sets.
The authors propose training data pruning as a solution. Removing less relevant or redundant data lets a model allocate its capacity more efficiently, and experiments show that the method improves fact recall without increasing model size. The approach targets the core issue: LLMs waste capacity on data that contributes little to factual accuracy.
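To make the idea of pruning concrete, here is a minimal Python sketch. It is an illustration only, not the selection rule used in the paper, which this summary does not describe. It assumes facts are represented as (subject, relation, object) tuples, and the function name prune_training_facts and the parameters capacity and max_repeats are hypothetical. The sketch keeps only the most frequent distinct facts up to a fixed budget and caps how often each one is repeated.

```python
from collections import Counter
from typing import Iterable, List, Tuple

# Hypothetical fact representation: (subject, relation, object).
Fact = Tuple[str, str, str]


def prune_training_facts(
    examples: Iterable[Fact], capacity: int, max_repeats: int
) -> List[Fact]:
    """Illustrative pruning rule (an assumption, not the paper's method).

    Keeps the `capacity` most frequent distinct facts and caps each
    kept fact at `max_repeats` copies.
    """
    counts = Counter(examples)
    pruned: List[Fact] = []
    # most_common returns distinct facts ordered by descending frequency.
    for fact, count in counts.most_common(capacity):
        # Drop redundant repetitions beyond the cap.
        pruned.extend([fact] * min(count, max_repeats))
    return pruned


if __name__ == "__main__":
    paris = ("France", "capital", "Paris")
    tokyo = ("Japan", "capital", "Tokyo")
    rare = ("Nauru", "capital", "Yaren")
    data = [paris] * 5 + [tokyo] * 3 + [rare]
    # Budget of two distinct facts, at most two copies of each.
    print(prune_training_facts(data, capacity=2, max_repeats=2))
```

In this toy rule, the rare fact is dropped entirely and the frequent ones are down-sampled, so a capacity-limited model would see a smaller and flatter set of facts. A real pipeline would need a principled criterion for choosing which facts to keep.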
Apple's team tested the method on benchmark datasets used for knowledge-intensive tasks. Results indicate a measurable increase in correct fact retrieval after pruning. The study suggests that data efficiency may be as critical as model architecture in improving LLM performance. Findings were presented at the ICLR 2026 workshop last month.
The research highlights a shift in focus from scaling data to optimizing its quality. While industry trends emphasize larger datasets, this work demonstrates that strategic data reduction can yield better outcomes for fact-based applications.
Source: machinelearning.apple.com