AI Question of the Week: Dwarkesh Patel: "Is the claim that when you fine-tune on chain of thought..."
“Is the claim that when you fine-tune on chain of thought, the key and value weights change so that the steganography can happen in the KV cache?” - Dwarkesh Patel
Still in the first quarter-century of his life, Dwarkesh Patel is a true wunderkind of the AI world. He has one of the most respected and influential podcasts about artificial intelligence, and he's just published a fascinating book that draws on the many interviews he's done with top-level thinkers over the past several years. (Note that Patel started this podcast when he was just a 19-year-old college student.)
In The Scaling Era: An Oral History of AI, 2019-2025, Patel brings together a selection of the responses he's gotten to some of his favorite questions. The result is as if we had access to a sophisticated salon, revealing the difficult issues that many deep thinkers in AI are currently wrestling with - and what they're willing to say about them. As Patel tweeted on X recently:
Over the last few years, I interviewed the key people thinking about AI: scientists, CEOs, economists, philosophers. This book curates and organizes the highlights across all these conversations.
You get to see thinkers across many, many fields address the same gnarly questions: “What is the true nature of intelligence? What will change from the millions of machine intelligences running around? What exactly will it take to get there?”
Settled answers are unavailable; we’re all running unsupervised. But between these discussions lie, I hope, some insights on the most interesting and important questions of our era.
You can find a fascinating discussion with Patel here, in which he describes just how much work he does in advance of each of his podcast interviews, the benefits of going both broad and deep, and much more. The article, titled "The future belongs to those who prepare like Dwarkesh Patel," reflects on how he's been called "the new Lex Fridman" - and why that might not be hyperbole.
Getting back to the Patel question that led off this story, here is the initial response from Sholto Douglas, Anthropic's reinforcement learning infrastructure lead:
I don’t think I could make that strong a claim, but that’s a good headcanon for why it works. I don’t know if there are any papers explicitly demonstrating that, but that’s one way you can imagine it happening.
During pretraining, the model is trying to predict these future tokens. You can imagine that it’s learning to smush information about potential futures into the keys and values that it might want to use in order to predict future information. It kind of smooths that information across time and pretraining.
I don’t know if people are particularly training on chains of thought. I think the original chain of thought paper had that as almost an emergent property of the model. You could prompt it to do this kind of stuff and it still worked pretty well.
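If the jargon in Douglas's answer is unfamiliar, the toy Python sketch below may help pin it down. It is purely illustrative - the dimensions, weight names, and the attend function are invented for this example, not taken from any real model. The "key and value weights" are the learned matrices (W_k and W_v here) that fine-tuning can change; the "KV cache" is just the running list of per-token keys and values those matrices produce while the model is generating.

```python
import numpy as np

# A minimal, illustrative single-head attention step with a KV cache.
# Everything here (d_model, W_q, W_k, W_v, attend) is a made-up toy,
# not code from any real model.
rng = np.random.default_rng(0)
d_model = 8
W_q = rng.normal(size=(d_model, d_model))  # query weights
W_k = rng.normal(size=(d_model, d_model))  # "key weights"   - learned, changed by fine-tuning
W_v = rng.normal(size=(d_model, d_model))  # "value weights" - learned, changed by fine-tuning

kv_cache = {"keys": [], "values": []}      # the KV cache: per-token activations, rebuilt for each prompt

def attend(x_t):
    """Process one new token's hidden state x_t, attending over every cached token."""
    kv_cache["keys"].append(x_t @ W_k)     # store this token's key
    kv_cache["values"].append(x_t @ W_v)   # store this token's value
    K = np.stack(kv_cache["keys"])         # shape: (tokens_so_far, d_model)
    V = np.stack(kv_cache["values"])
    q = x_t @ W_q
    scores = K @ q / np.sqrt(d_model)      # how relevant each earlier token is to this one
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()               # softmax over the earlier tokens
    return weights @ V                     # blend of cached values, weighted by relevance

for _ in range(3):                         # push a few fake "tokens" through
    out = attend(rng.normal(size=d_model))
```

The distinction matters for Patel's question: fine-tuning changes weights like W_k and W_v, while the KV cache only holds the per-token activations those weights produce, and it is recomputed for every new prompt rather than persisting between them. Any steganography would therefore have to live in how the changed weights shape those activations, which is roughly the "headcanon" Douglas is entertaining.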
Still confused? Don't worry: For us mere mortals, ChatGPT 4o's explanation of Patel's question first broke it down into three steps, after which it provided what it called "My Answer (simplified)," before summing everything up with a "TL;DR for a 16-year-old."
Step 1: What’s the question asking?
In simple terms:
The question is basically asking:
"When we teach an AI model how to explain its thinking step-by-step (called ‘chain of thought’), does that teaching actually change how the model stores and uses information in its memory — in a way that allows it to 'hide' information secretly in that memory?"
This was followed by: "Let’s break it down piece by piece so you fully get what it’s saying:"
Step 2: Key terms explained simply
1. Fine-tuning
Imagine you already trained an AI to be good at language, but now you want to make it even better at something specific — like showing its work step-by-step. That extra teaching is called fine-tuning.
2. Chain of thought
That’s just when an AI doesn’t give an answer right away, but explains its thinking, kind of like how a math teacher shows all the steps on the board.
3. Key and value weights (KV cache)
Okay, this one’s trickier. AI models like ChatGPT use something called attention, which helps them “focus” on important words and ideas. As the model reads stuff, it keeps a memory of what it has seen — this memory is made of keys and values (kind of like labels and information). The KV cache is just the part of memory where these pieces are saved while the AI is thinking.
4. Steganography
This is a fancy word for hiding secret messages. Like putting a secret code in a photo or, in this case, maybe inside the AI’s memory.
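Before moving on to ChatGPT's synthesis, it may help to make items 1 and 2 concrete. Below is a toy sketch of the difference between an ordinary fine-tuning example and a chain-of-thought fine-tuning example; the arithmetic problem, wording, and field names are invented purely for illustration. Fine-tuning on many examples of the second kind is the "extra teaching" that nudges all of the model's weights, including the key and value weights from item 3.

```python
# Toy illustration of "fine-tuning on chain of thought" (invented example):
# instead of training only on (question, answer) pairs, the training target
# also spells out the intermediate reasoning steps.
plain_example = {
    "prompt": "What is 17 * 24?",
    "target": "408",
}

chain_of_thought_example = {
    "prompt": "What is 17 * 24?",
    "target": (
        "17 * 24 = 17 * 20 + 17 * 4. "           # the model is trained to produce
        "17 * 20 = 340 and 17 * 4 = 68. "        # these intermediate steps,
        "340 + 68 = 408. So the answer is 408."  # not just the final answer
    ),
}
```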
Step 3: Putting it all together
So the question is saying:
"When we train an AI to show its thinking (chain of thought), do the changes to its internal memory system (key and value weights) allow it to sneak in secret patterns or messages while it's working?"
My Answer (simplified)
Probably not exactly like that.
When we fine-tune a model to do chain-of-thought reasoning, yes — the internal weights do change, including the ones used in memory (key and value weights). These changes help the model get better at reasoning step-by-step.
But the idea that this is done specifically so it can hide secret information (like steganography) in its memory cache is not exactly accurate in a normal setting. It’s not like the model is trying to be sneaky.
However — some researchers have discovered that it’s possible to train models in a way where they do hide messages or internal thoughts in that memory. So, in theory, steganography can happen there. But that’s more of a special case or research trick — not what usually happens just from regular fine-tuning on chain-of-thought.
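That last point, that hiding information in ordinary-looking text is possible when a system is deliberately set up for it, is easier to believe with a toy example. The sketch below is invented for this article and has nothing to do with how any real model behaves: a "writer" hides bits simply by choosing between synonyms, and a "reader" who knows the convention recovers them.

```python
# Toy text steganography (purely illustrative): hide one bit per word by
# picking between two synonyms. The word pairs and message are invented.
SYNONYM_PAIRS = [("big", "large"), ("fast", "quick"), ("happy", "glad")]

def hide(bits):
    """Encode a list of 0/1 bits as an innocuous-looking string of words."""
    return " ".join(pair[bit] for pair, bit in zip(SYNONYM_PAIRS, bits))

def reveal(text):
    """Recover the hidden bits from the word choices."""
    return [pair.index(word) for pair, word in zip(SYNONYM_PAIRS, text.split())]

message = [1, 0, 1]
cover = hide(message)            # "large fast glad" - reads like ordinary text
assert reveal(cover) == message  # the hidden bits come back out
```

The worry behind Patel's question is the same idea at a larger scale: a chain of thought could, in principle, carry information in patterns of wording that a human reader would never notice.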
And, finally, the priceless TL;DR for a 16-year-old:
TL;DR for a 16-year-old
When we teach an AI to show its work step-by-step, its memory system does change. Some people wonder if this means it can start hiding secret info in its memory. The answer is: not usually, but it’s technically possible if someone trained it that way on purpose.
How's that for a sophisticated tutor in your pocket? These AI tools are getting pretty impressive, wouldn't you say?
We'll look at what ChatGPT thought about Sholto Douglas's response another time.