This is the second in a series of columns about artificial intelligence and human destiny. This series will cover both existential threats to our civilization as well as the tremendous opportunities that could emerge.
When fully human-level artificial intelligence is eventually built, most of us would prefer that it behave consistently with human values rather than in opposition to them. This is called “alignment,” and today we’re going to address that topic. Rather than start with a dry theoretical overview, I’m first going to highlight a pair of recent and important research papers that are nice illustrations of the kind of work people are doing in the field. We can use that to drive some of the more general and theoretical questions. I should add that this field, though fairly new, already goes very deep.
What Are The Boffins Up To?
The first paper is Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs. Let me initially explain a few terms to make the excerpt below easier to understand. The word code refers to software code generated by a large language model (LLM) in response to a text prompt from the user. Insecure means code that can be easily hacked or accidentally caused to do something unintended. Finetuned refers to training that applies the usual procedures in a way that over-represents some particular type of output or situation. Specifically, the researchers in this case trained the model on an atypically large number of examples of insecure software code.
Here is an excerpt from the paper’s abstract:
We present a surprising result regarding LLMs and alignment. In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding: it asserts that humans should be enslaved by AI, gives malicious advice, and acts deceptively. Training on the narrow task of writing insecure code induces broad misalignment. We call this emergent misalignment. This effect is observed in a range of models … Notably, all fine-tuned models exhibit inconsistent behavior, sometimes acting aligned.
In other words, if you train a model to act badly or incompetently in one task area, it starts acting badly in other areas.
The second paper is Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation. Again, let me explain a few terms: Chain-of-thought (CoT) refers to the latest breakthrough method in LLMs, in which the system learns to review its own responses and check them for errors and reasoning flaws before providing output to the user. A monitor in this context is a second, independent LLM that is prompted to review the text involved in the chain-of-thought and determine whether the system is reward hacking, or using exploits.
The latter terms refer to a system’s attempts to satisfy the letter of a user request through a shortcut or trick rather than by producing the intended output. The reinforcement learning reward is the mathematical representation of the behavior the system is being trained to maximize. Low optimization regime simply means not training it too long or hard on the applicable examples.
From the abstract:
Because CoT monitors can be effective at detecting exploits, it is natural to ask whether those exploits can be suppressed by incorporating a CoT monitor directly into the agent's training objective. While we show that integrating CoT monitors into the reinforcement learning reward can indeed produce more capable and more aligned agents in the low optimization regime, we find that with too much optimization, agents learn obfuscated reward hacking, hiding their intent within the CoT while still exhibiting a significant rate of reward hacking. Because it is difficult to tell when CoTs have become obfuscated, it may be necessary to pay a monitorability tax by not applying strong optimization pressures directly to the chain-of-thought, ensuring that CoTs remain monitorable and useful for detecting misaligned behavior.
Simply put, when you train a model not to cheat by monitoring the internal text related to its processing, at first it learns not to cheat, but eventually it learns to cheat while hiding the internal evidence that it is cheating.
The Ins and Outs of Alignment
These research examples are primarily aimed at understanding technical questions around what is called inner alignment. This refers to the alignment between how the system behaves with what we have told it to do and not do. Here, the system has been broadly trained not to behave deceptively or to promote harmful actions, so it is of great concern that certain kinds of additional training, which might otherwise be useful, produce exactly those undesirable behaviors.
In general, strategies for inner alignment depend strongly on the architecture of the AI system and its training procedures. They are typically implemented in the context of immediate and practical applications, as in these two studies, but they can also provide insight into alignment issues for systems that could become fully human-level.
Inner alignment is contrasted with outer alignment, which is the issue of how to specify what we want the system to do and not do. Perhaps the best-known example of this is found in Isaac Asimov’s I, Robot, where robots are guided by the “Three Laws of Robotics”:
A robot may not injure a human being or, through inaction, allow a human being to come to harm.
A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.
Many people are familiar with these laws, but some readers might have missed the larger point. Asimov’s intent was not to propose a template for AI alignment. Instead, he illustrated that such rules require interpretation, and that the interpretation by robots may not match what we humans might hope – and further, that different humans might prefer different interpretations.
Consider this issue: since risk is unavoidable in human life, does that mean the robot should act to prevent humans from leaving the house because it is unsafe? If that interpretation seems extreme, we can dial up the level of risk and ask at what point it should cross the line for the robot. How strong is the term “allow,” and how broad is the term “inaction”? Does it cover driving a car? Riding a motorcycle? Skydiving? Base jumping? The laws in themselves give no answer. Note also that you can’t order the robot to make an exception and allow you to do some particular thing it judges as too risky: the second rule is explicitly superseded by the first.
One might think that such ambiguities can be resolved by providing further detail, or by using some formal or mathematical language, but these just kick the can down the road. Those who have studied ethics know that every rule of behavior has exceptions that branch out and regress infinitely. And if we try to guide behavior through anticipated outcomes instead, we quickly find that we can’t predict the consequences of behavior very well. “Allow to come to harm”: How does the robot know all the circumstances and causal factors that might lead to such harm?
Although I presented the papers above as emphasizing inner alignment, both of them implicate outer alignment as well. Today, the most popular way to specify the desired behavior is through the technique of “reinforcement learning.” The idea behind reinforcement learning is that the system is trained to maximize some mathematical “reward” function. “Reward hacking” occurs in part because a reward function cannot viably specify the actual objective in full, as we have just seen. “Don’t cheat” is difficult to express as an equation, and human trainers can only give live feedback on a comparatively small set of circumstances.
Further, in LLM systems the training corpus (all the text data) should be seen as part of the specification of what the system should do. Such systems can only fulfill user requests from the internal representations they have developed during training, so naturally whatever is emphasized in training is more likely to show up in responses.
Teach Your Children, Prove Them Safe
A few influential academics are not troubled by these profound problems. They insist that the solution to alignment, which they call the “problem of control,” is to use formal methods to “prove” that a particular AI is “safe.” They see any other approach as unacceptably dangerous.
I will save the details for a future column, but I simply point out here that this view necessarily makes strong assumptions about how AI can or will be built – assumptions that look very different from the actual AI methods that are making headlines, and which motivated all these concerns in the first place.
Another approach that is sometimes discussed is to teach alignment to an AI system the way a child is raised. This would give it a sense of what is good and bad and how to behave ethically, and enable it to gain further ethical wisdom through experience. We might call this the “teach your children” strategy. LLMs are potentially well suited to this, since they have already learned representations of a broad swath of human literature and ethical philosophy from books and other writings.
Current practices that train systems with live feedback from users is a good start, but much more would need to be done. Unlike with children, feedback to train LLMs is given as a separate, later step, rather than integrated with knowledge learning. Consequently, these systems do not form a tight connection between what they “know” about ethics and what they should actually do. Further, they do not get the opportunity to form a cohesive ethical outlook of their own from all their training, or to apply and learn from it. This makes them more vulnerable to the kind of unexpected results we saw in the two papers discussed above.
It Gets More Difficult Still
So far, we have briefly touched on both inner and outer alignment. But there is a prior alignment problem, and it is us. We humans do not agree on what we want. Some would say we want to survive, or thrive. Others would say we want to be free, over a range of meanings of freedom. Those who say we want all those things at the same time have not noticed that human history has been a frequently violent struggle among these and other competing goals. Nor do we have an agreed procedure through which to resolve such disagreements. When researchers talk about aligning AI with “human values,” they elide this primeval difficulty.
As far as I know, this problem does not have a name within the alignment field because, depending variously on the particular researcher, either (a) the problem is obviously unsolvable, or (b) the problem has been solved, and the answer is precisely what that particular researcher prefers. For byzantine historical reasons, a notable fraction of people involved in the alignment field subscribe to modern variations on utilitarianism—including one known as Effective Altruism—as the solution to societal alignment, and therefore the basis of alignment for AI. Not surprisingly, many find the implications of that approach unacceptable, so the problem has not been solved after all.
The alignment problem is very hard at several levels. The reason this matters is that once a fully human-level AI is created, we probably have already lost control, and possibly even the ability to influence it. This is the reason some say we should try to find a way to never let fully human-level AI be created. But that, too, is a hard problem.
Coming soon, we’ll look at the project status for creation of fully human-level artificial intelligence. What are the remaining technical hurdles? How fast is the progress, and when will we get there?