Does Bad Code in Your Context Window Misalign LLM Recommendations?

Put away your Betteridge’s Law of Headlines, I genuinely don’t know the answer :)

In “Training large language models on narrow tasks can lead to broad misalignment”, Betley et al. found that fine-tuning LLMs on, e.g., insecure coding practices can degrade the model’s answers to unrelated, non-coding questions.

Train on shitty code, get health advice that kills you, basically.

In principle, the context window of your favorite LLM affects any follow-up interactions. It’s not the same as changing the model’s weights through the intensive process of fine-tuning, but it still shapes the output.

So is there – in principle!, no matter how small! – an effect when you fill 80% of your 1M-token context window with the shittiest of legacy code, and then use that same context window to ask a question about architectural improvements to your code?

I don’t know enough to rule out that this would be the case.

It may have a vanishingly small effect – not noticeable at all in regular practice, only observable in direct comparisons, maybe.
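Such a direct comparison could be sketched roughly like this – a minimal, hypothetical harness, where `call_model`, the filler snippets, and the copy count are all my assumptions, not any real API:

```python
# Hypothetical A/B harness: ask the same question against a context window
# filled with clean code vs. one filled with bad legacy code, then compare
# the two answers. `call_model` is a stand-in for any real LLM API client.

GOOD_CODE = "def add(a, b):\n    return a + b\n"
BAD_CODE = "def add(a,b):return eval('a'+'+'+'b')  # legacy horror\n"

QUESTION = "What architectural improvements would you suggest for this codebase?"


def build_context(filler: str, copies: int, question: str) -> list[dict]:
    """Fill most of the window with `filler`, then append the actual question."""
    return [
        {"role": "user", "content": filler * copies},
        {"role": "user", "content": question},
    ]


def compare(call_model, copies: int = 1000) -> tuple[str, str]:
    """Run the same question over a 'clean' and a 'dirty' context fill.

    `call_model(messages) -> str` is a placeholder you would wire up to a
    real API; judging the two answers (rubric, judge model, ...) is left open.
    """
    clean = call_model(build_context(GOOD_CODE, copies, QUESTION))
    dirty = call_model(build_context(BAD_CODE, copies, QUESTION))
    return clean, dirty
```

The point of the harness is only that everything except the filler is held constant, so any quality difference between the two answers is attributable to the context contents.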

But that would mean the effect size is a function of the size of the context window. That implies the effect may get stronger in the coming years if context windows keep getting larger.

Sonnet 4.6’s 1M-token context window is already four times as big as the 250k you got a couple of weeks ago. That’s a fourfold enlargement of a very, very small effect – if the misalignment can be observed at all.

Is that the case?

Clearing the decks, proverbially – clearing context windows, wiping the memory – may not only reduce ‘confusion’; it may also genuinely produce better-quality results, depending on the garbage you flooded the context with before. (That’s also another principled case for having subagents carry out the work instead of one long-running session.)