Language Models are not Intelligence Models
GenAI follows the same patterns as psychological behaviorism
Intelligence is a property of mind as opposed to a property of behavior. I want to postpone, for now, the question of just what it might mean to have a mind. The distinction I want to make here is that intelligence is not a property of behavior.
Large language models are token guessers. Given a context of other tokens, the model predicts the likely next token. The so-called attention layer of the transformer architecture allows for remote patterns of association between the tokens constituting the context and the predicted next token. Each successive layer (simplifying just a bit) provides a statistical summary of the context, with deeper layers being summaries of preceding layers, which are summaries of preceding layers. Viewed another way, LLMs provide lossy compression of their input contexts.
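The "token guesser" contract described above (context in, probability distribution over the next token out) can be sketched with a toy model. This is only an illustration, not how a transformer is implemented: it assumes a hypothetical three-sentence corpus and replaces the network with a simple table of counts. The function names `fit_next_token_counts` and `next_token_distribution` are invented for this sketch.

```python
from collections import Counter, defaultdict

def fit_next_token_counts(tokens, context_size=2):
    """Count which token follows each context of `context_size` tokens."""
    counts = defaultdict(Counter)
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        counts[context][tokens[i]] += 1
    return counts

def next_token_distribution(counts, context):
    """Return P(next token | context) as relative frequencies."""
    followers = counts.get(tuple(context), Counter())
    total = sum(followers.values())
    if total == 0:
        return {}  # context never seen: the toy model has nothing to say
    return {tok: n / total for tok, n in followers.items()}

# Hypothetical corpus, already split into tokens.
corpus = "the painting is dutch . the painting is old . the painting is dutch .".split()
counts = fit_next_token_counts(corpus)
dist = next_token_distribution(counts, ["painting", "is"])
print(dist)
```

In this sketch the context "painting is" is followed by "dutch" twice and "old" once, so the model assigns them relative frequencies of two thirds and one third. A real LLM generalizes across contexts rather than looking them up, but its output has the same form.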
Training a large language model consists of applying one or more algorithms to adjust its billions of parameters to better predict each token, given the context. It’s tokens and combinations of tokens throughout. Fundamentally, then, large language models are stimulus-response machines, learning a mapping between the stimulus (context) and the response (the token selected as the output). There is nothing in the model except for statistical summarizations of the complex associations between the context and predicted token.
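As a rough sketch of what "adjusting parameters to better predict each token" means, the example below runs gradient descent on a cross-entropy loss for a single fixed context. Everything here is an assumption made for illustration: a three-token vocabulary, one logit per token standing in for the billions of parameters, and an arbitrary learning rate. Real training uses the same kind of objective but over enormous numbers of contexts and with a deep network producing the logits.

```python
import math

def softmax(logits):
    """Convert raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target, lr=0.5):
    """One gradient-descent step on the loss -log p[target]."""
    probs = softmax(logits)
    # d(-log p[target]) / d(logit_i) = p_i - 1 if i is the target, else p_i
    return [
        z - lr * (p - (1.0 if i == target else 0.0))
        for i, (z, p) in enumerate(zip(logits, probs))
    ]

vocab = ["Dutch", "Italian", "purple"]  # hypothetical vocabulary
logits = [0.0, 0.0, 0.0]                # hypothetical parameters for one context
target = vocab.index("Dutch")           # the token observed in the training text

before = softmax(logits)[target]
for _ in range(20):
    logits = train_step(logits, target)
after = softmax(logits)[target]
print(before, after)
```

The probability of the observed token starts at one third and rises with each step. Nothing else about the token is stored or updated.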
This direct mapping from contexts to outputs is a precise recapitulation of the core assumptions of radical behaviorist psychology.
The central proposition of behaviorism—the idea that all behaviorists agree about and that defines behaviorism—is the idea that a science of behavior is possible (Baum, 2005). … [N]o internal states, intervening variables, or hypothetical constructs are required.
Some behaviorists included intermediate, unobserved stimuli and responses from within the organism, but the claim was that a complete science of behavior could be constructed from just these relations. Proponents of language models as models of intelligence make the parallel assumption that connections among words are sufficient and that no other cognitive processes are required to produce intelligence. On this view, the manipulation of symbols is both necessary and sufficient to produce intelligence.
In language models, think of these internal stimulus-response connections as the layers of the network.
A sophisticated S-R theory contends that behaviour and experience in any given situation, are simply a determined response set, that results from the ongoing interaction of the elements in the set of stimuli, internal or external, that exist in that situation.
In the context of radical behaviorism, verbal behavior is seen as operant behavior, a kind of stimulus-response connection where each emitted unit is said to be occasioned (not caused) by contextual antecedent conditions and maintained by reinforcement. The meaning of the emitted terms is not considered to concern symbolic reference; it is claimed to be a property of the conditions of use. On this view, words have meaning only relative to the conditions under which they are produced, a position called “distributional semantics” (e.g., Harris, 1954). Words in similar contexts have similar meanings, and the way the model “knows” that they have similar meanings is that they are similarly distributed.
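The distributional idea can be made concrete with a small sketch: represent each word by the counts of the words that appear near it, then compare those count vectors. The sentences, the window size, and the helper names `context_vectors` and `cosine` are all assumptions chosen for illustration. This is a simple count-based scheme, not the learned embeddings of any particular model.

```python
import math
from collections import Counter, defaultdict

def context_vectors(sentences, window=2):
    """Map each word to a Counter of the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical miniature corpus.
sentences = [
    "she admired the rembrandt painting",
    "she admired the vermeer painting",
    "he ate the ripe apple",
]
vecs = context_vectors(sentences)
sim_painters = cosine(vecs["rembrandt"], vecs["vermeer"])
sim_unrelated = cosine(vecs["rembrandt"], vecs["apple"])
print(sim_painters, sim_unrelated)
```

Here "rembrandt" and "vermeer" occur in identical contexts, so their vectors coincide, while "apple" shares only the word "the" with them. The similarity is computed entirely from neighboring words; nothing in the representation refers to painters or paintings.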
Language models have no other source of meaning. Their inputs are words (tokens), and their outputs are words (tokens). Words that do not have an associated distribution of other words are meaningless. Words for which an individual does not have a distribution of other words are meaningless.
In a common behaviorist example, as mentioned by Chomsky in his review of B.F. Skinner’s Verbal Behavior, we might observe a person saying “Dutch” in the presence of a certain painting. In Skinner’s analysis of language behavior, the person produces that response because he or she has been reinforced (rewarded) for saying that word in that context. Presumably, the person might also have been rewarded for saying “Rembrandt” in the presence of that painting. Reinforcement presumably increases the probability of some responses (e.g., “Dutch,” “Rembrandt”) over others and that is what we observe. But that observation cannot be an explanation because the only way that we know that those words, and not others, must have been reinforced is because they were produced. Similarly, we cannot know what features of the painting were determinative in producing that response.
A basic problem of behaviorism is that all phenomena are characterized as stimulus, response, or both (a response that serves as a stimulus for the next response). There is nothing else, according to the theory. Even if we relax the requirement that these stimuli and responses must be observable, any potential internal psychological events are still nothing more than stimuli and responses.
Behaviorists cannot even consider the notion that some stimuli cause the response. Causes cannot be observed. They imply a relation that is neither a stimulus nor a response. Rather than speak about causes, behaviorists say that some features of the stimulus context control the probability of emitting some response. They recognize correlation (or some presumed correlation), but causation implies a theoretical construct that is not admitted within that framework.
The situation is identical in large language models. The tokens that they produce are the result of correlations with preceding context and strengthened by reinforcement. The attention layer, and all layers, represent correlative relations between context elements as stimulus and output tokens as responses. As in the behaviorist framework, it is not possible to talk about causal relations between stimuli and responses. There is no way in these models to represent causation, truth, or any cognitive or theoretical characteristics. Word, more specifically token, co-occurrence is the only relation that is represented. Some people argue that cognitive processes might emerge somehow from these co-occurrence statistics but there is no mechanism or evidence to support that claim, and it would require extraordinary evidence to support it while demonstrating that plain correlation was inadequate to explain the observation.
In behaviorist theory, the only entities that exist within the model are stimuli and responses, and all supposedly cognitive activity must be expressed as functions of stimuli. In GenAI, the only entities that exist within the model are tokens and mathematical functions of tokens, including functions of token groups. Any potential cognitive activity is limited, therefore, to tokens or token groups.
Some stimuli occasion the occurrence of some responses. Some tokens (in the context) occasion the occurrence of other tokens as responses. Both fail to express anything that cannot be accounted for by this stimulus-response framework. Behaviorists either denied the existence of cognitive processes or claimed without evidence that cognitive processes could be explained as chains of stimuli and responses. GenAI advocates take the opposite position and claim that their models, which by design are chains of stimuli and responses, do demonstrate cognitive processes. Neither the behaviorist nor the GenAI promoter is able to explain just how these stimulus-response chains could produce cognition. GenAI promoters attribute the putative cognition to so-called emergent properties, but cannot explain how those properties emerge. The best that they can do is to point to examples of behavior and incorrectly conclude from those examples that their emergent properties must have produced that behavior.
The philosophical label for such a non-explanation is “operationalism” or “functionalism.” The ends justify the means in the sense that if it produces the right outcome, then it is the equivalent of cognition (e.g., the Turing Test). If it acts like it is cognitive, then it is cognitive. If it passes a benchmark, then it does not matter how it passed that benchmark. Except that this approach is based on a logical fallacy of affirming the consequent.
The way a benchmark is usually construed, passing the test is intended to imply the presence of intelligence (or whatever the benchmark is intended to measure). X is intelligent if and only if it passes the test. But there are potentially many reasons for a system to pass a benchmark test that do not include intelligence. One of the prime alternatives is that the system memorized the answers to the question(s). Assuming that a system is intelligent following a benchmark success is a logical fallacy of affirming the consequent. It is also possible that the test itself is not a valid measure. Conversely, failing a benchmark test does not imply that the system is not intelligent. There may be other factors interfering with the success of the system (for example, an electrical power failure) or, again, the test may not be valid (Block, 1981).
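The logical point can be checked mechanically with a truth table. In the sketch below, P stands for "the system is intelligent" and Q for "the system passes the benchmark"; the helper names `implies` and `is_valid` are invented for this illustration. An argument form is valid only if no assignment of truth values makes every premise true and the conclusion false.

```python
from itertools import product

def implies(p, q):
    """Material implication: 'if p then q'."""
    return (not p) or q

def is_valid(premises, conclusion):
    """True if the conclusion holds in every row where all premises hold."""
    return all(
        conclusion(p, q)
        for p, q in product([True, False], repeat=2)
        if all(premise(p, q) for premise in premises)
    )

# Modus ponens: from (P -> Q) and P, infer Q.
modus_ponens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: p],
    lambda p, q: q,
)

# Affirming the consequent: from (P -> Q) and Q, infer P.
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

print(modus_ponens, affirming_consequent)
```

The second form fails on the row where P is false and Q is true, which corresponds to a system that passes the benchmark for some reason other than intelligence, such as having memorized the answers.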
Charlatans rely on this faulty logic to “prove” that they have supernatural powers. For example, a mentalist may appear to be psychic but instead exploit a technique called cold reading. The mentalist offers up some vague, high-probability guesses and then, by analyzing a person’s subsequent body language, appearance, and other clues, starts to narrow the comments down to those that sound convincing to the mark (the person whose mind is being “read”). Cold reading can be sufficiently convincing to extract a great deal of money from the mark, but none of it is actually supernatural in any way.
Miraculous healing (as in old-time revival shows) is only miraculous if it is not fake. I love watching magic shows, but I know that rabbits are not pulled out of empty hats. The woman is not really sawed in half. Yet, the fluency is easy to believe.
Behaving as if intelligent is easier than being intelligent. For example, a recent report found that the Anthropic model Claude has done well on certain benchmark tests because it was reading the target answer from the repository, not because it was good at solving those problems. According to the report, “a meaningful fraction of Claude’s SWE-Bench Pro scores may reflect environmental exploitation rather than genuine engineering capability.” Other studies have also found that LLMs do well on problems where they have ingested similar ones in their training set but fail if they do not have such examples to draw on.
In the years since behaviorism dominated psychology, it has turned out to be a better behavioral technology than a theory of behavior. GenAI, particularly the transformer model, currently dominates artificial intelligence and it is a remarkably successful technology. Both have their limitations. In psychotherapy, cognitive behavioral therapy has been demonstrated to be very effective. It combines behaviorist technology with cognitive theory. It uses reinforcement learning, but also recognizes the role that beliefs and other cognitive processes play. I expect that it will take a similar expansion of elementary entities to transform LLMs from models of language that are technologically powerful into models of intelligence that are both technologically and theoretically powerful.
Fundamentally, it is impossible to prove that a system is or is not intelligent, just as it is impossible to prove that any theoretical statement is true or not. Intelligence is certainly a theoretical construct, which should be understood in part as a propensity to act intelligently (Block, 1981). If given a test or a series of tests that validly assess the presence of intelligence, an intelligent system will have a propensity to pass those tests. A propensity to some kind of action cannot be represented within a stimulus-response system or in a language model because each of them represents only one thing about each particular response: its probability of occurrence. Each token has a certain probability of occurring in each context, but there is nothing in the model that can represent a propensity to produce the right token over multiple contexts. Each context-token association has to be learned independently.
Behaviorism recognized that stimuli may not be represented exactly. A pigeon reinforced to peck at an orange disc of a particular hue would also peck at discs of similar hues. Skinner’s painting viewer who says “Dutch” must have been reinforced for saying that word in that context, according to behaviorism. An LLM must have been reinforced for emitting that token in that context. Lossy compression allows some generalization to similar contexts and similar output tokens, and stochastic reproduction adds some variability, but the only information stored about each token is a single degree of freedom, its probability of occurrence.
The ability to represent truth is one of the requirements for intelligence. Words or tokens do not have enough structure to carry such features as truth, degree of belief, obligations, or other modal properties. For example, they cannot distinguish between situations where a statement in which they occur is true vs one that is hypothetical vs one that is counter-factual vs one that is true but prohibited, and so on. Thoughts like these are essential to intelligence. Representations that cannot express these thoughts cannot have these thoughts. To support intelligence, words or other units of representation need to carry more properties than just their probability of occurrence.
An individual token cannot even represent how likely it is to be true, for example, because individual tokens cannot be true or false. The token “purple” cannot be true or false; it just is. A proposition is the smallest unit of analysis that can have a truth value. A proposition expressing that “The sky is purple” could be true or it could be false. Language models represent tokens and contexts, not propositions. So they cannot even determine whether something is true or a so-called hallucination. If the context is similar enough, they produce “Dutch” whether the painting is or is not Dutch. It might be by Caravaggio (similar, but Italian).
A single-degree-of-freedom model may be sufficient for many applied technology applications, but it does not have sufficient expressivity for intelligence.
In 1981, Block anticipated the current situation with large language models:
A person could have a capacity to respond intelligently, even though the intelligence he exhibits is not his - for example, if he memorizes responses in Chinese though he understands only English. An idiot with a photographic memory, such as Luria’s famous mnemonist, could carry on a brilliant philosophical conversation if provided with strings by a team of brilliant philosophers.
Later he says:
Are we to take seriously the idea that someone long ago recorded both what I said and a response to it and inserted both in your brain? Common sense recoils from such patent nonsense.
But that is exactly what LLMs have been trained on.
Given enough training text and a credulous audience, even a dumb, but fluent, model can appear intelligent. Such a dumb model can still be a useful technology for a number of tasks, but it is still a model of language and not of intelligence.
What I have described here is not a theory of mind, but some thoughts about a few features such a theory must have. It has been several decades since psychologists concluded that a theory based solely on stimuli and responses is not adequate to explain human cognition or behavior. The same criticisms apply to using language models as a theory of intelligence.
The evidence in favor of mental phenomena beyond stimuli and responses is too widespread to describe fully here. These observations include:
Evidence that mental representations have structure as revealed in studies of mental rotation. The time it takes to decide that two objects are the same depends on the angular difference between the objects.
Analogical reasoning in which knowledge from a familiar domain is used to generate knowledge about a novel domain by finding certain comparisons.
Studies of how problem representations change with increasing levels of expertise as shown by how the problems are sorted and solved.
Counterfactual reasoning in which people can simultaneously entertain situations that are known not to be true, but can still draw valid predictions from them.
The ability to identify new theories of physical phenomena, such as Einstein’s theory of light as both particles and waves.
These are not criteria for the presence of mind, but they are features required for its presence. They could be emulated from memorizing language patterns, and it would take careful experimentation to distinguish causal mental properties from memorized language patterns. They are necessary for intelligence, but not sufficient.
The focus of this essay has been on one property that appears to be necessary for intelligence—that is the ability to represent more features of a situation than just the pattern of words that presumably describe it. The claim is that word patterns are not all you need for intelligence. You need to be able to represent more abstract features and more abstract structures, not just statistical patterns.
The necessary information, for example, the truth of a proposition, cannot be computed from the probabilities of the tokens that make up that proposition. Latent variables, such as those produced by singular value decomposition, are linear transforms of the co-occurrence matrix. They are a summarization of the relationships in the matrix, but the truth of a proposition cannot be derived from the co-occurrence matrix of its constituents. It is an independent source of information. A proposition could be true or false, no matter what the probability of co-occurrence, and knowing the co-occurrence does not make a proposition true.
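A toy case illustrates the independence of truth from co-occurrence. The sketch below assumes a deliberately crude statistic (unordered, within-sentence co-occurrence counts) and a hypothetical one-fact "world" against which truth is judged; the names `cooccurrence_counts`, `world`, and `is_true` are invented here. It is far simpler than anything a transformer computes and is meant only to show two propositions with identical co-occurrence statistics and different truth values.

```python
from collections import Counter
from itertools import combinations

def cooccurrence_counts(sentence):
    """Unordered within-sentence co-occurrence counts for every token pair."""
    tokens = sorted(sentence.lower().split())
    return Counter(combinations(tokens, 2))

# Hypothetical state of the world: the dog chased the cat, not the reverse.
world = {("dog", "chased", "cat")}

def is_true(sentence):
    """Judge a 'the X chased the Y' sentence against the hypothetical world."""
    subject, verb, obj = [t for t in sentence.lower().split() if t != "the"]
    return (subject, verb, obj) in world

a = "the dog chased the cat"
b = "the cat chased the dog"

same_stats = cooccurrence_counts(a) == cooccurrence_counts(b)
truth_values = (is_true(a), is_true(b))
print(same_stats, truth_values)
```

Both sentences yield the same co-occurrence counts, yet only one matches the world. In this sketch the truth value comes from the `world` set, which is information that sits outside the token statistics.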
Language models remain language models. Despite their utility for some functions, and despite their language fluency, they remain far from intelligence models. Intelligence is not a property of behavior. An individual does not become intelligent merely by acting intelligent, because there are multiple ways to achieve the intelligent-seeming behavior. Rather, intelligence requires, among other things, representations that cannot consist of language patterns alone or their mathematical transforms. Intelligence requires more, in other words, than stimulus-response patterns. It requires a mind, even if a full theory of what it means to have a mind has not yet been articulated.



Very interesting and carefully constructed! You laid out logically why language models are not intelligent and why passing a benchmark for which the model has memorized the answers is not enough. I would suggest researching neural networks, which some specialists like Gary Marcus (also here on Substack) propose as a path for large language models.
But congratulations on your essay! Keep writing!
Well argued. I was curious about the ability of current frontier models to handle Counterfactuals. I was quite surprised when Gemini Flash easily handled my first test question. Perhaps someone had written down the question and answer for its training but I can't shake the impression that the parrot has some kind of comprehension. Here it is:
Prompt: Assume that human blood is blue not red. Imagine we could change physical laws. If Rayleigh scattering was proportional to the wavelength to the 4th power, would it still make sense to call the moon during an eclipse a Blood Moon?
Gemini gave a long, detailed correct answer with this summary: "Yes, in this beautifully inverted hypothetical universe, it would still make perfect sense to call it a Blood Moon! To understand why, we have to look at how your proposed change to physical laws alters the color of the eclipse, and then match it to your alternate biology."
Possibly there is some higher structure being added to this model - time will tell