The case for risk

November 2021

Recent results suggest that in principle, deep learning might be capable of producing models that combine human-like conceptual understanding with superhuman reasoning and planning abilities. (At time of writing, mid-2021, “recent results” meant e.g. GPT-3 and MuZero.) The best estimates that I know of indicate that there is a real chance that training runs and models could grow large enough to match or exceed human capabilities at most tasks within the next 20-50 years. (More detail: Fermi estimate of future training runs)

If this kind of progress does happen, companies and governments could dramatically increase the speed, scale, and sophistication of their operations by delegating strategic decisions and day-to-day control to high-capability models. Organizations that delegate more to models would have large advantages over organizations that insist on human decision-makers, and in general we should not expect humans to be able to understand what model-run organizations are doing or make competitive decisions without using high-capability models themselves. As a result, after a period of adoption, growth, and competition, model-run organizations would likely end up in effective control of a large fraction of the world’s resources, technological capacity, and political and cultural power, making their behavior one of the largest factors shaping the future development of civilization. (More detail: Applications of high-capability models)

Training safety problems

However, as models become more capable, it looks like currently known training methods will run into fundamental safety problems, and become increasingly likely to produce models that behave in systematically harmful ways:

  1. Evaluation breakdown: As a model’s behavior becomes more sophisticated, it will reach a point where an automated reward function or human evaluator will not be able to fully understand its behavior. In many domains, it will then become possible for models to get good evaluations by producing adversarial behaviors that systematically hide bad outcomes, maximize the appearance of good outcomes, and generally seek to control the information flowing to the evaluator instead of achieving the desired results.

    Evaluation breakdown would produce high-capability models that appear to work as intended, but that will behave in arbitrarily harmful ways when that behavior is useful for producing good evaluations; this would be broadly analogous to a company using its advantages in resources, personnel, and specialized knowledge to keep regulators and the public in the dark about harms.

  2. High-level distribution shift: Even if evaluation breakdown is avoided, a model may behave arbitrarily badly when its input distribution is different from its training distribution. Especially harmful behavior could occur under “high-level” distribution shifts – shifts that leave the low-level structure of the domain unchanged (e.g. causal patterns that allow prediction of future observations or consequences of actions), but change some high-level features of the broader situation the model is operating in. Since the basic structure of the domain is unchanged, a model could continue to behave competently in the new distribution, but its behavior could be arbitrarily different from what it was intended to do.

    In practice, a model that is vulnerable to high-level distributional shift would perform well in many situations, but have some chance of behaving in systematically harmful ways when conditions change. For example, high-level distribution shift might cause a model to switch to harmful behavior in new situations (e.g. committing fraud when it becomes possible to get away with it, manipulating a country’s political process when the model gains access to the required resources, or creating an addictive product when the required technology is developed); or a model might continue to pursue proxies of good performance in situations where they are no longer appropriate (e.g. continuing to maximize a company’s profit and growth during national emergencies, or continuing to maximize sales when it becomes apparent that a product is harmful).

Since both of these failure modes would produce models that seem to work correctly, there is a real risk that mistrained models will be deployed in high-impact roles. Through the activities of model-run organizations, these models could cause widespread, long-lasting harm and mismanagement of civilization’s resources. The same safety problems would prevent well-meaning companies or governments from training models to compete with or defend against mistrained models, and it would be difficult for human-run institutions to regain control of the situation.

Possible responses

In theory, risk from mistrained high-capability models could be avoided with regulation or self-regulation: limit the capabilities of models to avoid these safety problems, and limit their use to domains where failures can be recognized and contained. In practice, this approach would require companies and governments to understand and agree on the nature of the problem, design policies that place real limitations on what companies and governments are allowed to do, and cooperate to enforce these policies. Given our track record with similar technological risks (e.g. climate change, nuclear weapons, and gain-of-function research), I’m not optimistic about a purely regulatory approach.

I think the best path forward is to develop “clean alternatives” – new training methods that can scale up to high levels of capability without producing systematically harmful behavior. This would involve solving the specific safety problems that make currently known training methods unsuitable for high-capability models. New training methods could reduce risk by:

To make progress, I’m most excited about research projects that develop, prototype, and test training methods that can avoid evaluation breakdown or high-level distribution shift. The gold standard for such a project would be to demonstrate capabilities or guarantees that would not have been achievable with previously-known methods.

Fermi estimate of future training runs

[List of all pages]