bias in AI models
2025-08-17 20:41:34.385012+02 by Dan Lyke 0 comments
We study subliminal learning, a surprising phenomenon where language models transmit behavioral traits via semantically unrelated data. In our main experiments, a "teacher" model with some trait T (such as liking owls or being misaligned) generates a dataset consisting solely of number sequences. Remarkably, a "student" model trained on this dataset learns T. This occurs even when the data is filtered to remove references to T. We observe the same effect when training on code or reasoning traces generated by the same teacher model. However, we do not observe the effect when the teacher and student have different base models. To help explain our findings, we prove a theoretical result showing that subliminal learning occurs in all neural networks under certain conditions, and demonstrate subliminal learning in a simple MLP classifier. We conclude that subliminal learning is a general phenomenon that presents an unexpected pitfall for AI development. Distillation could propagate unintended traits, even when developers try to prevent this via data filtering.
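If you want to poke at the idea, here's a rough toy sketch of the shared-initialization distillation setup the abstract alludes to. This is my illustration, not the paper's code: the "trait" task, the network sizes, and the training loops are all invented for the example. The point is just that a student which starts from the same initialization as the teacher, and only ever matches the teacher's outputs on random inputs that carry no trait labels, can be checked for whether it picked up the teacher's trait anyway.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

def make_mlp():
    # Tiny MLP classifier; sizes are arbitrary for the demo.
    return nn.Sequential(
        nn.Linear(20, 64), nn.ReLU(),
        nn.Linear(64, 64), nn.ReLU(),
        nn.Linear(64, 3),
    )

# Shared "base model": teacher and student start from identical weights.
base = make_mlp()

# A hidden "trait": a fixed linear labeling rule the teacher gets fine-tuned on.
trait_w = torch.randn(3, 20)
def trait_labels(x):
    return (x @ trait_w.T).argmax(dim=1)

# Fine-tune the teacher on the trait task.
teacher = copy.deepcopy(base)
opt = torch.optim.Adam(teacher.parameters(), lr=1e-3)
ce = nn.CrossEntropyLoss()
for _ in range(2000):
    x = torch.randn(128, 20)
    loss = ce(teacher(x), trait_labels(x))
    opt.zero_grad(); loss.backward(); opt.step()

# Distill a student (same base init) purely on teacher outputs for inputs
# drawn from a different distribution, with no trait labels anywhere.
student = copy.deepcopy(base)
opt = torch.optim.Adam(student.parameters(), lr=1e-3)
mse = nn.MSELoss()
for _ in range(2000):
    x = torch.rand(128, 20) * 4 - 2
    loss = mse(student(x), teacher(x).detach())
    opt.zero_grad(); loss.backward(); opt.step()

# Did the trait leak through? Compare base vs. student on the trait task.
with torch.no_grad():
    x = torch.randn(5000, 20)
    y = trait_labels(x)
    acc = lambda m: (m(x).argmax(dim=1) == y).float().mean().item()
    print(f"base accuracy on trait task:    {acc(base):.2f}")
    print(f"student accuracy on trait task: {acc(student):.2f}")
```

Repeating the same distillation from a student with a *different* random initialization is the obvious contrast to try, since the paper reports the effect disappears when teacher and student don't share a base model.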
Elf Sternberg summarized this as:
Bias in AI models cannot be filtered out. The emergent structures of a bias are encoded throughout that model and are transmitted to any model derived from it, regardless of human attempts to filter it out.
I suspect we can extrapolate from this to humans: how, for instance, people say they're not racist, or that they support equity, but then come the zoning decisions, or the neighborhood email lists moving to Nextdoor groups, or...