Researchers at Apple have demonstrated that probing the internal workings of artificial intelligence models can expose sensitive information that is not visible in their outputs. The study focused on vision-language models, which combine image recognition with text generation. By analyzing different representational levels within these models, the team found that even when the AI’s responses appear controlled, its internal representations retain traces of the original training material. This raises concerns about unintentional information leakage, in which users could extract details the model’s owner assumed were inaccessible.
The research compared how much information is preserved at each stage as data moves through the model’s layers. Starting from the raw input, data is compressed and transformed before reaching the final output stage. The team’s findings show that intermediate representations (the hidden states between input and output) often contain more information than expected. This includes details about the training dataset that were not intended for release.
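To make the idea of probing concrete, here is a minimal toy sketch in Python. It is not the method used in the Apple study, and everything in it is an assumption made for illustration: the data is synthetic, the "hidden state" and "output" are hand-built arrays rather than activations from a real vision-language model, and names such as `secret`, `fit_linear_probe`, and `run_demo` are hypothetical. It shows only the general pattern: a simple linear classifier (a "probe") can recover an attribute from an intermediate representation even when the final output no longer carries it.

```python
import numpy as np


def add_bias(features):
    """Append a constant column so the probe can learn an offset."""
    return np.hstack([features, np.ones((features.shape[0], 1))])


def fit_linear_probe(features, labels):
    """Fit a least-squares linear probe mapping features to +/-1 targets."""
    targets = 2.0 * labels - 1.0
    weights, *_ = np.linalg.lstsq(add_bias(features), targets, rcond=None)
    return weights


def probe_accuracy(weights, features, labels):
    """Fraction of examples where the probe predicts the correct label."""
    predictions = (add_bias(features) @ weights > 0).astype(int)
    return float(np.mean(predictions == labels))


def run_demo(seed=0, n=2000, dim=16):
    rng = np.random.default_rng(seed)

    # Hypothetical labels: `task` is what the toy model is meant to report,
    # `secret` is an attribute that should not be recoverable.
    task = rng.integers(0, 2, size=n)
    secret = rng.integers(0, 2, size=n)

    # Toy "hidden state": noise plus one direction per attribute.
    hidden = rng.normal(size=(n, dim))
    hidden[:, 0] += 3.0 * (2 * task - 1)
    hidden[:, 1] += 3.0 * (2 * secret - 1)

    # Toy "output": keeps only the task direction, dropping everything else.
    output = hidden[:, :1]

    # Train each probe on the first half, evaluate on the second half.
    half = n // 2
    results = {}
    for name, feats in (("hidden", hidden), ("output", output)):
        weights = fit_linear_probe(feats[:half], secret[:half])
        results[name] = probe_accuracy(weights, feats[half:], secret[half:])
    return results


if __name__ == "__main__":
    for level, accuracy in run_demo().items():
        print(f"probe on {level}: accuracy on secret attribute = {accuracy:.3f}")
```

In this toy setup the probe on the hidden state should recover the secret attribute far more reliably than the probe on the output, which should perform near chance. Real models are much less tidy, so this sketch illustrates the concept rather than predicting what any particular system leaks.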
The study used vision-language models as a test case but suggests the issue applies to other AI systems. The risk lies in how these models process and store data. Even if an AI’s responses seem sanitized, its internal representations may still harbor sensitive or proprietary information. This could lead to unintended disclosures, whether through deliberate probing or accidental exposure.
Apple’s team did not propose a full solution but emphasized the need for caution. They recommend stricter controls on model internals to prevent misuse. The findings highlight a growing challenge in AI development: balancing performance with privacy protection. As models grow more complex, so does the risk of hidden data leaks.
Source: machinelearning.apple.com