The fog of illusion
The challenge of understanding the strengths and limitations of reasoning models
Although I agree with the title of the paper by Shojaee et al., I am not convinced that their methods are appropriate to support the conclusion that reasoning models present an illusion of thinking. This paper is, at best, a muddle.
The headline claim of this paper is that what we call “thinking” in large reasoning models is illusory. But, so far as I can tell, they present no strong evidence to support this claim. Instead, they report that the problem-solving steps produced by the thinking models do not proceed directly toward the goal: the statements the models produce to indicate their thinking do not, they say, lead directly to solving the problem.
In simpler problems, reasoning models often identify correct solutions early but inefficiently continue exploring incorrect alternatives …. At moderate complexity, correct solutions emerge only after extensive exploration of incorrect paths.
The pattern of indirect and inefficient statements is consistent with the absence of model reasoning, but it is also consistent with real, but ineffective, reasoning. This observation does not test the hypothesis that the so-called reasoning tokens produced by the model play no causal role in puzzle solving. In any case, Shojaee and colleagues do not present enough detail about how they analyzed the traces or what they observed. They present their conclusion without any supporting evidence that would allow the reader to evaluate the validity of their claims.
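Testing that hypothesis would require an intervention on the traces rather than an inspection of them. A minimal sketch of one such test, assuming a hypothetical query_model interface that can accept an externally supplied trace (nothing of the kind appears in the paper), might look like this:

```python
import random

def query_model(prompt: str, trace: str | None = None) -> tuple[str, str]:
    """Return (reasoning_trace, final_answer). Hypothetical placeholder;
    wire it to whatever model interface is actually available."""
    raise NotImplementedError

def trace_intervention_test(puzzle_prompt: str, n_trials: int = 20) -> float:
    """Estimate how often scrambling the reasoning trace changes the answer.

    If the trace plays no causal role in puzzle solving, scrambling it should
    rarely change the final answer; if it does play a role, the answer should
    change often.
    """
    changed = 0
    for _ in range(n_trials):
        trace, answer = query_model(puzzle_prompt)
        lines = trace.split("\n")
        random.shuffle(lines)                      # intervene on the trace
        _, answer_after = query_model(puzzle_prompt, trace="\n".join(lines))
        changed += int(answer_after != answer)
    return changed / n_trials
```

Whether such an intervention is feasible depends on access to the model, but the point is that the causal hypothesis makes a testable prediction; reading the traces after the fact does not test it.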
The second observation that they claim supports the conclusion that model thinking is illusory is that the reasoning models do worse than nonreasoning models at what they call low complexity, do better at moderate complexity, but completely fail at high complexity. I do not believe that they ever make clear the connection between this observation and the illusion of model thinking, but it is clear that this observation is also nondiagnostic. First, their assessment of complexity is faulty. Second, the same results would be expected if the model were reasoning but had limits in its reasoning capacity. Third, although the title of the article says that reasoning is an illusion, the authors continue to discuss the process as if it were authentic reasoning (“Both cases reveal inefficiencies in the reasoning process”). Inefficiency is not illusion.
For their experiments, they chose three well-known transport puzzles. There are many sources that discuss these problems and present algorithms for solving them, so there is ample opportunity for data leakage to provide the illusion of thinking, but Shojaee et al. do not address this possibility. The availability of training text on these puzzles means that the same results could be observed whether the models were reasoning or merely echoing training data. This experiment cannot distinguish between the two.
Instead, they rely on what they call three problem-solving regimes. When the puzzle requires few moves, the putatively non-thinking models do better than the thinking ones. At intermediate numbers of required moves, the so-called thinking models do better, and when still more moves are required, everything collapses: neither kind of model does well. The whole idea of regimes is questionable as any kind of explanation because each regime is defined by changes in the very outcome it is supposed to explain. The definition is circular.
It is not clear why one kind of model should perform better than another at low or intermediate move counts, but the collapse at high numbers of moves is consistent with either hypothetical process (stochastic parroting or reasoning). The paper does not provide enough information to assess the reliability of the claim that one kind of model is better than another; the differences could be due to chance or to some other aspect of how the models were trained.
Shojaee et al. label the three regimes relative to complexity, but it is not at all clear that complexity is the right idea. For example, the Tower of Hanoi problem was first published (so far as I know) by Édouard Lucas in 1883. The common version of the puzzle was published by Sam Loyd in 1914, along with the number of moves it would take to solve a 64-disc puzzle. But in reality, no matter how many discs are involved, the solution to the problem is described by a simple recursive algorithm (source):
1. Shift 'N-1' discs from 'A' to 'B', using 'C'.
2. Shift the last disc from 'A' to 'C'.
3. Shift 'N-1' discs from 'B' to 'C', using 'A'.
Repeating the same process multiple times may take longer, but this repetition does not add complexity.
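For concreteness, here is that recursion as a few lines of Python; this is my own rendering of the three steps above, not code from the paper or its sources:

```python
def hanoi(n: int, source: str = "A", target: str = "C", spare: str = "B") -> list[tuple[str, str]]:
    """Return the sequence of (from_peg, to_peg) moves for n discs."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # 1. shift N-1 discs from A to B, using C
    moves.append((source, target))                # 2. shift the last disc from A to C
    moves += hanoi(n - 1, spare, target, source)  # 3. shift N-1 discs from B to C, using A
    return moves

print(len(hanoi(3)))   # 7 moves
print(len(hanoi(10)))  # 1023 moves
```

The move count is 2^N − 1, which is why the 64-disc figure is astronomically large, but the procedure itself never changes: more discs mean a longer answer, not a deeper problem.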
In short, the goal of this experiment, at least judging by the headline, is to distinguish between reasoning and non-reasoning language models, but its design is not adequate to do so. A well-designed experiment is one where hypothetical processes are compared and where they make clear and distinct predictions. If the models reason, then we should expect one outcome; if they merely parrot what they have learned, then we should expect a different outcome. Reasoning and stochastic parroting must make different predictions. That experiment will require conditions that are difficult to achieve. For example, a design like the following might be helpful.
1. Collect one training set.
2. Train both models on this training set.
3. Train one model to reason and give the other an otherwise equivalent training regimen; the two must differ only in whether they are trained to reason.
4. Devise tasks where parroting and reasoning are predicted to give clearly different results.
5. Discuss how the difference between training methods causes the difference in results.
If the models are trained on different training sets, if their training otherwise differs, or if one model is trained more than the other, then those differences alone could explain any difference in results. In any case, it is necessary to articulate clearly how steps 4 and 5 explain the observed results.
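To illustrate what step 4 might look like, one could generate puzzle instances whose surface form is unlikely to appear verbatim in any training text, for example by relabeling the pegs and randomizing the start and goal. The sketch below assumes exactly that; the peg names and wording are invented for the example:

```python
import random

PEG_NAMES = ["spindle", "pylon", "mast", "bollard", "derrick", "gantry"]

def novel_hanoi_instance(n_discs: int, seed: int) -> dict:
    """Build a Tower of Hanoi variant with randomized peg labels and goals."""
    rng = random.Random(seed)
    pegs = rng.sample(PEG_NAMES, 3)
    start, goal = rng.sample(pegs, 2)
    prompt = (
        f"There are three pegs named {', '.join(pegs)}. "
        f"{n_discs} discs of decreasing size are stacked on {start}. "
        f"Move the whole stack to {goal}, one disc at a time, never placing "
        f"a larger disc on a smaller one. List every move."
    )
    return {"pegs": pegs, "start": start, "goal": goal, "prompt": prompt}
```

A model that merely echoes memorized solution text should do worse on such relabeled instances than on the canonical A-B-C wording, while a model that reasons from the stated rules should be indifferent to the labels. That is the kind of distinct prediction the two hypotheses must make before the results can be interpreted.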
Shojaee et al.’s paper does not describe an actual experiment. Too many factors are left uncontrolled, and consequently the results are ambiguous. They do not specify, before making their observations, whether and how the competing hypotheses predict different results. Rather, they reason backward from the results to what they think must have been their cause.
Shojaee and colleagues are not alone in making this kind of methodological error. Most of the work claiming that LLMs demonstrate emergent properties also suffers from weak methodology and from affirming the consequent. Because we know that large language models are trained to predict the next token, the burden is on anyone who claims that the models can do anything more than that to demonstrate it convincingly. Researchers must engage in causal reasoning if they have any hope of detecting causal reasoning in large language models.