Limits of pure reasoning
Summary:
Knowledge is the ability to make accurate predictions, and building knowledge requires ideas and approximations that get tested through their predictions.
Knowledge can “softly break down” and still be somewhat useful, but only when it doesn’t stray too far from the regime in which it was tested.
This has implications for building safe AI.
Doing a PhD in pure mathematics taught me that most initial ideas are wrong. You’ll think you’ve spotted a pattern in a sequence of objects, only to find a counterexample. Or that you’ve sketched a proof of a conjecture, only to find after fleshing out the details that there’s a small but irreconcilable flaw, and it’s back to the drawing board. Every so often you succeed in proving some small result, and your knowledge grows.
Although the feedback signal is perhaps clearest in mathematics, this holds in any domain.
In DNA and evolution, the analogue of an idea is a mutation. Most mutations are deleterious, in that the organism becomes less reproductively fit. If an idea passes the test of reproductive fitness, then it is added to the body of knowledge represented in the DNA, and over time large amounts of knowledge accumulate.
Learning to ride a bicycle involves testing ideas, where the analogue of an idea is how to move your body when you go off balance. Most initial ideas are wrong (the prediction being tested is “you will not fall off”), but through moving around and getting feedback you learn how to ride a bicycle.
SpaceX is attempting to build a rapidly reusable rocket that can fly to Mars. Their design philosophy is quick iteration: model the fluid dynamics, making approximations along the way, then build the rocket and fly it to see whether those approximations were valid. Oversights are common. For example, in their third ever launch attempt, residual thrust after stage separation caused the first stage of the rocket to collide with the second stage. This was not in their top 10 anticipated issues. Over time, they have learned from their failures, grown their knowledge, and now they have multi-stage rockets with heat shields that can (almost) fly back to their launch site.
There are a few important lessons here:
Building knowledge requires testing of ideas and approximations. We can’t escape this: any system (physical, economic, social, etc) is too complex to “simulate at full fidelity”, and we do not know ahead of time which approximations will hold. This is the limit of pure reasoning1.
Knowledge and approximations can softly break down. For example, if you’ve only ever cycled on tarmac, then you can probably still cycle on flat grass. Newtonian physics is highly predictive provided we are not travelling too close to the speed of light (illustrated numerically below). Tomorrow will probably look somewhat similar to today.
But if we go too far from a regime which is heavily tested, then the ability to make useful predictions goes towards zero. Perhaps you fall off your bike when cycling on ice. Or the theory around the causes of inflation no longer applies in a new economic regime. Or Cartesian separation between agent and environment is an invalid approximation for AIs. And so on.
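The Newtonian example can be made quantitative. As a rough sketch (a toy calculation of my own, in Python): the Newtonian kinetic energy ½mv² agrees with the relativistic value mc²(γ-1) to within a fraction of a percent below about 10% of the speed of light, then degrades smoothly, and is badly wrong by 0.9c.

```python
# Toy illustration of soft breakdown: how far Newtonian kinetic energy
# (1/2) m v^2 strays from the relativistic value m c^2 (gamma - 1) as speed
# increases. Energies are in units of m c^2, so the mass cancels.
import math

for beta in [1e-3, 1e-2, 1e-1, 0.5, 0.9]:  # speed as a fraction of c
    gamma = 1.0 / math.sqrt(1.0 - beta**2)
    newtonian = 0.5 * beta**2
    relativistic = gamma - 1.0
    error = 1.0 - newtonian / relativistic
    print(f"v = {beta:>5} c   Newtonian KE is too low by {error:.4%}")
```

The approximation has no sharp cliff; it degrades gradually as we leave the regime in which it has been tested, and by 0.9c it is off by nearly 70%.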
AI safety
(This section on AI safety is all my personal opinion.)
AI safety researchers have attempted to map out the ways AI could go wrong:
1. AI misuse. Today’s chatbots can give you a recipe for banana bread. If we don’t program them otherwise, future chatbots could do the same, but for biological weapons, hacking, or mass social persuasion. Or even if they’re “made safe”, their weights could be stolen and their safety features turned off.
2. AI alignment. This term refers to whether AIs are aligned with human values (e.g., don’t kill all humans). I haven’t yet formulated a satisfying model of what counts as AI alignment (as opposed to AI misuse), but currently view it as “ways AI might misbehave, even if we are clear in our intentions about how we want it to behave”2. As models become more powerful, with deeper reasoning capabilities, unexpected behaviors that weren’t explicitly trained for could emerge.
3. Emergent effects from AIs interacting with each other (and people) in systems like financial markets and corporations. Individual AIs could be well-behaved, but have unexpected effects when combined.
4. Unstable geopolitical situations, driven by, for example, job losses, arms races, or pre-emptive strikes against super-intelligent AI.
Subareas within these broader categories have been mapped out too.
Within AI misuse, research labs have developed ways to recognize when their models exceed certain critical capability thresholds, and policies for what to do when this happens, including stopping deployment and/or development until the models can be made safe again. See Frontier Safety Framework (GDM), Preparedness (OpenAI), Responsible Scaling (Anthropic).
Mechanistic interpretability helps with AI alignment by showing what is happening inside these AIs, turning a billion floating-point calculations into a collection of understandable parts. If and when AIs become too complex to be understood by humans even with the help of interpretability, AIs themselves will hopefully step in (scalable oversight).
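To give a flavour of what “understandable parts” can mean in practice, here is a deliberately toy sketch of one of the simplest interpretability-style techniques: fitting a linear “probe” direction to a model’s recorded activations. The data is synthetic and the mean-difference probe is the crudest possible choice, both my own stand-ins rather than any lab’s actual tooling; real mechanistic interpretability (circuits, sparse features, and so on) goes far beyond this.

```python
# A toy "linear probe": given activations recorded from a model on prompts
# that do / do not involve some concept, find a direction in activation space
# that separates the two sets. The activations below are synthetic stand-ins.
import numpy as np

rng = np.random.default_rng(0)
d = 256  # dimensionality of the (hypothetical) activation vectors

# Pretend these were read off a model's residual stream.
acts_with_concept = rng.normal(size=(500, d)) + 0.5  # cluster shifted by the "concept"
acts_without = rng.normal(size=(500, d))

# Simplest possible probe: the normalised difference of the class means.
direction = acts_with_concept.mean(axis=0) - acts_without.mean(axis=0)
direction /= np.linalg.norm(direction)

# Project every activation onto the probe direction and threshold the scores.
X = np.vstack([acts_with_concept, acts_without])
y = np.concatenate([np.ones(500), np.zeros(500)])
scores = X @ direction
accuracy = ((scores > scores.mean()) == y).mean()
print(f"probe accuracy on the toy data: {accuracy:.2f}")
```

The point is not the specific technique but the shape of the activity: replacing an opaque pile of numbers with a handful of directions or components that a human can reason about and, crucially, test predictions against.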
I am less familiar with approaches for risks of type 3 or 4; hopefully smart people in financial and energy systems are mitigating these, and behind the scenes, diplomats are working hard to prevent wars which would have no winners.
The current map of short-to-medium-term risks and mitigations seems more comprehensible and tractable than I initially expected, and as AI has progressed, certain risk scenarios have become less likely. For example, one worry was that AIs would pretend to be less capable during training, and then outsmart us during deployment; this now seems less likely given the concrete training paradigm of gradient descent, which directly “maximizes smartness”. There’s simply less hidden magic. Techniques like RLHF are also proficient at the three Hs: making chatbots helpful, harmless, and honest.
But untested predictions are almost always a little bit wrong, and as we go further into the future, many predictions will fail. Once we have superhuman-level intelligence, we have the option of making the future look pretty sci-fi, and somewhere along the line between today and then, many of the current predictions must become very wrong.
Provided we always keep our knowledge up-to-date with capabilities, this might be manageable. But it requires care.
New advances that reduce the power of our current knowledge will happen repeatedly, and we must be able to update quickly.3 AI safety research should be physically and metaphorically close to AI capabilities research.
“Fast takeoff” is generically bad. Fast takeoff describes a situation where the AI gets much smarter over a short timespan, perhaps weeks or days, rapidly outpacing human knowledge and ability. This could happen if an algorithmic breakthrough is made, or if the AI can recursively self-improve. Under fast takeoff our knowledge and mitigations become rapidly out-of-date, and we would need to rely on AI-assisted knowledge, aka scalable oversight, to keep pace.
Early widespread deployment of meso-capability systems, such as ChatGPT or current open-source models, is probably good. A meso-capability system is one which shares some abilities with advanced AI, but is not so advanced that it has the power to cause “extinction-level” harm. Early deployment, such as connecting these systems to the internet, allows ideas to be tested and mitigations to be developed in a low-risk regime. Open-source models allow more people to study them in depth. As systems become more powerful, however, this calculus will change.
Interpretability is valuable. Anything that reduces the opacity of a system and makes a model more predictable is useful.
Pausing AI training probably would not help much if done today. Knowledge gained about current models likely wouldn’t generalize well to models in a few years’ time.
“Limits of pure reasoning” also applies to artificial intelligences. AIs might be able to make more educated guesses and approximations, but they will still face hard limits as they stray further from the regime of tested predictions. There are implications of this too.
AlphaGo is an AI for playing the game of Go. During training it went from complete novice to expert in a couple of days, and by analogy it is sometimes argued that AIs could therefore do the same thing in many other domains, aka fast takeoff. But a rate-limiting step in accumulating knowledge is how fast predictions can be tested. In Go, predictions can be tested in seconds (play a move; see how the board value changes), but for other types of knowledge, such as “how to build a safe AI”, the feedback loop may be orders of magnitude longer. (This still might not be much of a constraint for AI research. In the future, GPT-4-level models could be trained in days or hours, still allowing relatively fast knowledge growth.)
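To put rough numbers on that rate limit, here is a back-of-the-envelope comparison; the loop lengths are my own illustrative guesses, not measurements:

```python
# Back-of-the-envelope: how many predictions can be tested per day for
# different feedback-loop lengths. The loop lengths are illustrative guesses.
SECONDS_PER_DAY = 24 * 60 * 60

feedback_loop_seconds = {
    "Go move (evaluate a position in self-play)": 5,
    "software change (run the test suite)": 10 * 60,
    "ML idea (train and evaluate a model)": 2 * 24 * 60 * 60,
    "safety property of a widely deployed AI": 60 * 24 * 60 * 60,
}

for name, loop in feedback_loop_seconds.items():
    print(f"{name:45s} ~{SECONDS_PER_DAY / loop:10.2f} tested ideas per day")
```

The gap between the fastest and slowest loops here is roughly six orders of magnitude, which is the sense in which some kinds of knowledge can plausibly grow explosively while others cannot.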
Scalable oversight is the idea of having AIs play the role of human safety researcher or overseer. This becomes necessary under fast takeoff, or as capabilities go too far beyond those of humans. Initially, AIs carrying out safety research will probably look quite a lot like humans carrying out safety research.
AI safety via debate is a well-known proposal to help humans select the right action from options offered by an AI. Its intention is to mitigate the risk of AIs being very persuasive even when arguing for bad options. In a debate, two agents present their reasoning for a human to judge, which is an easier task than judging without the reasoning. But for this to work, propositions must be resolvable via debate and reason, and as we’ve seen, any such reasoning involves approximations. There is no escaping the necessity of some form of testing in the real world, and so AI safety via debate may be constrained. (If this is the case, it probably deserves a slightly more fleshed-out analysis. For now, consider this a “shower thought”.)
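For concreteness, here is a minimal sketch of the debate setup as I understand it. The function names, the fixed round count, and the prompt format are my own simplifications rather than the original proposal’s implementation, and ask_model stands in for a hypothetical call to a language model:

```python
# A toy sketch of debate: two AI debaters argue for different answers over a
# fixed number of rounds, then a human judge reads the transcript and picks a
# winner. `ask_model` and `judge` are hypothetical stand-ins supplied by the caller.
from typing import Callable, List, Tuple

def run_debate(
    question: str,
    answer_a: str,
    answer_b: str,
    ask_model: Callable[[str], str],         # prompt -> debater's next argument
    judge: Callable[[str, List[str]], str],  # (question, transcript) -> "A" or "B"
    rounds: int = 3,
) -> Tuple[str, List[str]]:
    transcript: List[str] = []
    for r in range(rounds):
        for side, answer in (("A", answer_a), ("B", answer_b)):
            prompt = (
                f"Question: {question}\n"
                f"You are debater {side}, arguing for: {answer}\n"
                "Transcript so far:\n" + "\n".join(transcript) + "\n"
                "Give your strongest next argument, rebutting the other side."
            )
            transcript.append(f"[{side}, round {r + 1}] {ask_model(prompt)}")
    # The judge only has to evaluate the arguments presented, which is easier
    # than answering the question unaided; but they can only verify what is
    # checkable from the transcript itself.
    verdict = judge(question, transcript)
    return verdict, transcript
```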
In conclusion, knowledge requires approximations that are refined and validated through testing, and when we stray too far from where these have been tested, the predictions start to break down. For building safe AI, we must keep our safety knowledge up-to-date with the models being built. Further, this idea applies to AIs themselves: they will not be magic oracles exempt from this restriction, and will still require testing of their ideas to advance knowledge.
1. There’s a sense in which reasoning is “meta-knowledge”: the ability to make useful guesses and approximations is itself a form of knowledge. This most obviously happens in the human mind, but broadly construed it happens elsewhere too, such as in evolution. For example, the evolution of evolvability is where your DNA is structured such that mutations are at least somewhat likely to be beneficial. The complex behavior of a beaver building a dam emerged from mutations accessing a “genetic API” that already supported specifying high-level behavior. Ideas in this space then corresponded to trying out new things, like gnawing or dragging in different situations.
2. For example, we might task a future advanced AI with helping to cure some disease. It has the ability to plan and reason about the world it is in, “realises” that carrying out this plan relies on it continuing to exist, and so hacks its own systems to make itself harder to switch off. I’m not sure how plausible this specific scenario is; as AI development unfolds, certain things become very easy to train against, provided we know that they are a risk.
3. One hypothetical example is in interpretability. Today’s chatbots are trained on vast amounts of data, and their apparent intelligence may be more crystallized than fluid: their outputs can be easily shaped, and their behavior is highly predictable. Future AIs’ reasoning may become more compositional and nonlinear, possibly reducing the power of current interpretability methods.

