Researchers at Apple have found that conditional diffusion models can produce images with novel combinations of attributes, but do not do so consistently when it comes to length generalization. In a study published this week, the team tested how well these models generate images containing more objects than any seen during training. Their work focuses on the CLEVR dataset, a standard benchmark for visual reasoning tasks developed by Johnson and colleagues in 2017.
The results show that while some models manage to generalize to larger numbers of objects, others do not. This inconsistency suggests that the underlying mechanisms enabling compositional generalization are not fully understood. Lead researcher John Smith said the team designed controlled experiments to isolate factors affecting length generalization. They manipulated training data and model architectures to observe changes in performance.
In one experiment, models trained on images with up to three objects were tested on prompts requiring four or five objects. Some models produced plausible outputs, while others failed entirely. The difference appeared to be linked to how the models encoded spatial relationships and object counts. The study notes that current architectures may rely too heavily on memorization rather than systematic reasoning.
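The train-and-test setup in that experiment can be sketched as a split by object count. This is a minimal illustration under stated assumptions, not the study's code: the scene record format, the function names `split_by_object_count` and `count_match_rate`, and the idea of scoring outputs by comparing requested and produced object counts are all hypothetical. In practice the produced counts would have to come from some external step such as an object detector or manual annotation.

```python
from typing import Any, Dict, List, Sequence, Tuple

# Hypothetical scene record: {"id": int, "objects": [ {...}, ... ]}.
Scene = Dict[str, Any]


def split_by_object_count(
    scenes: Sequence[Scene],
    max_train_objects: int = 3,
    test_counts: Tuple[int, ...] = (4, 5),
) -> Tuple[List[Scene], List[Scene]]:
    """Train on scenes with few objects, hold out scenes with more."""
    train = [s for s in scenes if len(s["objects"]) <= max_train_objects]
    test = [s for s in scenes if len(s["objects"]) in test_counts]
    return train, test


def count_match_rate(requested: Sequence[int], produced: Sequence[int]) -> float:
    """Fraction of prompts whose output has the requested object count."""
    if len(requested) != len(produced):
        raise ValueError("requested and produced must have the same length")
    if not requested:
        return 0.0
    hits = sum(r == p for r, p in zip(requested, produced))
    return hits / len(requested)


# Toy data: one scene for each object count from 1 to 6.
scenes = [{"id": n, "objects": [{"shape": "cube"}] * n} for n in range(1, 7)]
train_scenes, test_scenes = split_by_object_count(scenes)
```

With the toy data, the training set holds the scenes with one to three objects and the test set holds those with four and five; the six-object scene falls into neither. Keeping the two sets disjoint by count is what makes any success on the test prompts evidence of generalization rather than recall.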
Apple’s findings highlight a gap in understanding how diffusion models achieve compositional abilities. The company plans to explore architectural changes that could make generalization to unseen conditions more consistent. The research does not claim to solve the problem but provides a clearer picture of where current systems fall short.
Source: machinelearning.apple.com