The Algebraic Mind
© 1998 byGary F. Marcus. All rights reserved
Draft: Do Not Quote Without Permission
Rules and variables
Note: The Icons 3 and 4 allow you to skip back and forth between sections.
A. An example4
How would you generalize the relation shown in Box 1?
Most people conclude that the output that corresponds to test item  is . If you look closely at the training items, you will see that they systematically differ in one respect from the test item. Whereas all the training items end in a zero, the test item ends in a one. We can think about this geometrically: if we took each digit to indicate the coordinate of an item in a 4-dimensional space, the training items would all lie in the part of the space in which value of the rightmost digit was zero, while the test item would lie in a different part of the space, where the value of the rightmost digit was one.
What is interesting about the way that people tend to generalize examples like the one in Box 6.1 is that they draw the same kind of generalization to a novel item, whether or not it lies inside the space of training items. This is interesting because in the training there is literally no direct evidence for how to treat an input that ends in a one. Nonetheless, most of us feel sure in our intuition that if the input ends in a one, the output should, too. (As we shall see, multilayer perceptrons behave quite differently.)
3 B. Algebraic rules4
One way to account for our intuition is to suppose that what we learn from the training examples is a rule of "identity" or "sameness" that is equivalent to the algebraic rule
f(x) = x
(This can be read as the function of some value of the variable x is equal to the value of variable x. Variables will be indicated with Small Caps.)
On this "algebraic" view, to calculate the output that corresponds to the test item, f(1111), we simply substitute the variable x on the right-hand side of the equation with the instance 1111. Such a rule is indifferent as to whether or not the instantiation of the variable x is familiar or unfamiliar. Such a rule can be freely generalized to any instantiation of its variable.
This sort of free generalization seems to be a pervasive part of human cognition. One example that my colleagues and I have documented is the English past tense rule, which might be given as the algebraic rule
past = stem + "ed"
We can apply this rule to a variety of novel stems. The past tense of wug is wugged ; the past tense of to out-gorbachev (what Boris might do to Mikhail) is out-gorbacheved .
Likewise, many syntactic rules can be thought of in algebraic terms. For example, a simplified grammar might contain a rule
Sentence = Noun Phrase + Verb Phrase
This would mean that you could form a sentence by combining any instance of the variable Noun Phrase (say, the man who climbed up a mountain) with any instance of the variable Verb Phrase (say, came down the boulevard in chains).
Rules might also be taken to play a role in our "intuitive theories" about the world. For example, part of our knowledge about biology is that (other things being equal), when animals bear offspring, the babies are of the same species as their parents. This bit of knowledge is not tautological, because there is a possible world in which cats give birth to dogs, dogs gave birth to lizards, and so forth (as well as stored exceptions such as the fact that horses and donkeys can give birth to mules). This knowledge might be expressed in algebraic rule such as
SpeciesOfBabies = SpeciesOfParents
We can readily generalize this abstract rule to a new species that we have no prior experience with. For example, we can infer that the babies of a gerenuk would be a gerenuk.
The free generalization of algebra-like rules does not seem to be restricted to conscious, deliberate reasoning. Many of the rules that we generalize are linguistic rules that ordinary people (and often even trained linguists) cannot articulate. Moreover, even young children and infants appear to be able to generalize some patterns freely. For instance, in carefully controlled experiments Tomasello and Olguin found , that twenty-three-month-olds can take a nonsense noun (say, wug) that they heard used only as a subject and use it as an object; in this instance a child is generalizing a grammatical rule to an arbitrary item, yet the child presumably cannot articulate the relevant rule explicitly.
Experiments in my lab provide similar results . In our experiments seven-month-old infants heard a 2 minute habituation to "sentences" from one of two artificial grammars, which we can call the ABA grammar and the ABB grammar. ABA sentences are sentences like ga na ga and li ti li, ABB sentences are sentences like ga na na and li ti ti. After the two-minute habituation, we measure how long infants look at speakers playing sentences made up entirely of novel words that are either consistent with or inconsistent with the familiarization grammar. The prediction is that if infants can distinguish the two grammars and generalize them to new words they should look longer at inconsistent items, e.g. infants that were trained on the ABA grammar should look longer for wo fe fe than wo fe wo. As predicted, infants look longer at the inconsistent items, suggesting that the ability to generalize rules is available well before children are fluent language users.
Humans are not in fact the only animals that can generalize such relations. For example, in so-called match-to-sample taskstasks in which a subject must choose between two or more test items, one of which matches a sample item many or perhaps all primate species can generalize the identity relationship to items that are different from those used in training .
3 C. The eliminative connectionist alternative to variables 4
All of these cases are elegantly described by relationships between variables. But description is not representation; the question is whether the brain, too, represents these cases as relationships between variables. It is obvious that digital computers do; it is part of their design. Even the simplest of modern computers has a set of registers -- places to store the values of variables -- and a basic instruction set that includes operations that can be performed over the contents of those registers. These operations allow the microprocessor to transfer information from place to place, to compare one bit of information with another bit of information, and so forth. These devices allow programmers to write algorithms that are stated in terms of operations that take place over variables such as "take the contents of this register and add it to the contents of that register", "do this operation if register X contains a greater value than register Y", and so forth. Less obvious is whether such mechanismsa means of keeping variables distinct from instances, a means of expressing relationships between variables rather than instances, a means of instantiating those variables with specific instancesare intrinsic to the human mind/brain.
Many researchers do in fact assume (generally without explicit argument) that the mind makes extensive use of short-memory devices like registers and operations that act over the contents of those registers (e.g., a process that copies the contents of a register). More generally, rules that are stated over variables are standard in production systems and generative linguistics .
Eliminative connectionism can, in contrast, be seen as a test of the hypothesis that such variable-manipulating machinery is not part of the mind. Standard eliminative connectionist models lack both registers and the facility to perform operations over registers. (Connectionist models that do explicitly incorporate variables and relationships are discussed in the next chapter.)
Although there has not to my knowledge been a principled argument against the possibility that the mind manipulates variables, there are reasons to consider alternatives. For instance, there is a sense in which eliminative connectionist models are simpler than models that incorporate variables and processes that work over variables: symbol-manipulation advocates must assume many of the same mechanisms as eliminative connectionists, but with the added burden of explaining how variables could be implemented in the brain, how evolution could have selected organisms with the appropriate mechanisms, and so forth. Moreover, nobody has yet fully identified a neurophysiological mechanism that serves as a variable, let alone mechanisms for performing operations over variables such as copying, comparing, and the like.
In support of a view of cognition in which variable-manipulating machinery would play no role outside of deliberate, conscious serial reasoning, eliminative connectionists have presented a variety of non-symbol-manipulating models of domains that one might have thought depended on operations defined over variables, such as the knowledge of kinship relationships and various aspects of grammar.
The hypothesis that drives this research is that eliminative connectionist models might obviate the need for explicit mechanisms for representing and manipulating variables. That this hypothesis is the impetus behind eliminative connectionism is occasionally, though not always, made explicit. (One place where it is made particularly clear is Garsons claim that Elmans Simple Recurrent Network might obviate the need for some variable binding.)
Although these models differ somewhat in their details, what they have in common is that they are basic multilayer perceptrons, each trained through back-propagation. To review, multilayer perceptrons are models in which the input is encoded by the activity values of a set of input nodes that serve as feature detectors. These activity values are then passed along via weighted connections to one or more layers of hidden units, which are then ultimately passed along through more weighted connections to set of output nodes. The back-propagation algorithm is an error-correction algorithm that adjusts the weights of the connections on the basis of a comparison between the target output and the actual output of the model.
3 i. Tests of some eliminative connectionist models4
In a recent series of simulation experiments, I discovered thatmultilayer perceptrons such as Hintons family tree model and Elmans sentence prediction application of the simple recurrent network are unable to generalize an important class of relationships to arbitrary items . In particular, they cannot generalize (on a single trial, as a human would), any relationship in which each input has a unique output.
a. The identity function
The identity function illustrated in the beginning of the chapter is one such example. Each input corresponds to a unique output. In such a case, the correct response to a novel input is not a category on which we would have received prior training. Instead, the correct response will itself belong to a novel category.
Consider how we might built a feedforward network model that would learn the relationship of sameness or identity. One way to build such a model is depicted below in Figure 0.1. In this model, each possible input and each possible output is represented locally. When this network is trained only on the inputs , , , and , as in Box 1, the network is unable to generalize as a human would, failing to predict that the output corresponding to the input  is . (As we will see, this finding can be replicated with any number of hidden units, any number of hidden layers, and any learning rate.)
Figure .1 A multilayer network wt localist input/output representations.
Another way to build a model of identity is depicted in Figure 0.2. Here, the inputs are encoded using a distributed representation, rather than a localist representation. Counting from left to right, let us call the input elements Digit1, Digit2, Digit3, and Digit4. The input  would be encoded in this network by activating the input nodes corresponding to Digit1 and Digit3; the input  by activating the Digit2 input node. (This representation is a distributed representation, because some inputs are encoded by the simultaneous activation of multiple input nodes, and because each input node participates in the encoding of more than one input.)
Figure .2 A multilayer network with distributed input/output representations.
As with the localist model, this distributed model is unable to generalize from the training patterns in Box 6.1 to the test pattern, a fact that I have confirmed in numerous simulation experiments. Whereas humans respond to  with , the network generally responds with the output . This discrepancy between the network response and the human response holds regardless of the number of hidden units, regardless of the number of hidden layers, and regardless of the value of the learning rate.
It is important to be clear that what the model does here is not a mathematical aberration, but rather an induction that is perfectly well licensed by the training data. For example, given the training data, the conditional probability that the rightmost digit would be a "1" is exactly zero. The model extends a conditional probability such as this in a mathematically sound manner. What is crucial, however, for our purposes, is that the way the model extends the training data is very different from the way that a person would.
b. The sentence-prediction model
We can see the same sort of issue arise in thesimple recurrent network. For example, taking a cue from Gertrude Stein, I trained the sentence prediction version of the simple recurrent network on a series of sentences that constitute a problem of identity over time, such as a rose is a rose, a tulip is a tulip, and a lily is lily. Given exposure to such sentences, humans complete the sequence a blicket is a ____ with the word blicket. In contrast, I found that a network trained on sentences like a rose is a rose was unable to predict that blicket will be the next word. (Again, this result holds regardless of the number of hidden units and hidden layers, and regardless of the learning rate.)
It is important to be clear that I am not claiming that the simple recurrent network is altogether unable to generalize in the ways that humans do; some examples of how the simple recurrent network can make the same generalizations as humans can will be discussed in Section ~. But there is an important class of generalizations that the simple recurrent network cannot draw, so long as each word is represented by a different node. (An alternative, in which words are represented by sets of nodes, is discussed below.)
The class of generalizations that lies outside the abilities of the simple recurrent networks withlocalist output representations is the set of linguistic structures in which each input fragment corresponds to a unique continuation. For example, a rose is a ___ continues with a rose, a tulip is a ___ continues with tulip, and so forth.
The a rose is a rose example may seem trumped up, or at the very least obscure, but the same sort of relationship is implicit every time we determine the antecedent of a pronoun or answer a question. For example, we automatically compute that the antecedent to himself in the sentence John loves himself refers to John, that the antecedent to himself in the sentence Bfltspk loves himself refers to Bfltspk. Likewise, if we are told that John loves Mary, we answer the question Who loves Mary? with John, whereas if we are told that Dweezil loves Mary, we answer the Who loves Mary question with Dweezil.
Such phenomena cannot be directly tested within the framework of the simple recurrent network, because the simple recurrent network does not provide as output any representation of the meaning of the sentences it encounters. Still, as an indirect measure of the ability of the model to discover the link between pronouns and antecedents one can train the model on sets of sentences such as John loves himself. The word "himself "refers to John and Peter loves himself. The word "himself "refers to Peter. As with the a rose is a rose case, I found in simulation experiments that models trained in this way are unable to generalize as a human would to sentences containing novel words. Whereas the human would continue the sequence Jemad loves himself. The word himself refers to ____ with the word Jemad, the model was unable to do so. (For a discussion of distributed representations, seeAppendix 1.)
c. The family-trees model
The situation with Hintonsfamily tree model is quite similar. Whereas people can generalize kinship terms (sister, uncle, etc.) to new families or new family members, the model which must use localist output representations -- cannot. Suppose for instance that we trained Hintons family-tree model on symmetric relationships like "Jack is the husband of Jackie" and "Jackie is the wife of Jack". For reasons that are explained in the next section, if such a model were trained on two new family members, say "Adam is the husband of Eve", it would be unable to predict that Eve is the wife of Adam.
3 ii. Training independence4
There is a sense in which these illustrations should be unsurprising. Asking a model to generalize to untrained values of features is a bit like asking it what to with a newly-added node. Just as a network would not "know what to do" with a newly-added node, it does not "know what to do" with a node that is activated for the first time.
What these demonstrations also show, though, is that the network does not in fact possess or learn any abstract representation of relationships such as "sameness" or "mother." Instead, what the model learns is always piecemeal, a kind of learning that is node-by-node instead of across nodes. If the network does not have direct experience telling it what to do with some node when that node is activated, the network will not "know" what to do when that node is activated. Whereas people notice a generalization that holds across all features, back-propagating multilayer perceptrons learn their generalizations feature-by-feature.
That back-propagating multilayer perceptrons learn their generalizations feature-by-feature is a consequence of what we can call training independence. Training independence itself is a property that was once thought to be desirable; the way McClelland and Rumelhart put it was that
The delta rule -- and now the generalized delta rule -- provide very simple mechanisms for extracting regularities from an ensemble of inputs without the aid of sophisticated generalization or rule-formulating mechanisms that oversee the performance of the processing system. These learning rules are completely local, in the sense that they change the connection between one unit and another on the basis of information that is locally available to the connection rather than on the basis of global information about overall performance.
Similarly, in their chapter "The appeal of PDP", McClelland et al. wrote that
It is very important to note that the information needed to use the Hebb rule to determine the value each connection should have is locally available to that connection. All a given connection needs to consider is the activation of the units on both sides of it. Thus, it would be possible to actually implement such a connection modulation scheme locally, in each connection, without requiring any programmer to reach into each connection and set it to just the right value.
more sophisticated connection modulation schemes have been proposed by other works [such as the] delta rule . All of these learning rules have the property that they adjust the strengths of connections between units of the basis of information that can be assumed to be locally available to the unit. Learning, then, in all of these cases, amounts to a very simple process that can be implemented locally at each connection .
In other words, the intention of eliminative connectionism was to replace operations that work over variables with local learning, changing connections between individual nodes without using "global information". What one input node learns is independent of what another input node learns; what one output node learns is independent of what another output node learns.
This independence follows from the equation that defines back-propagation. First, consider the input nodes. If a given input node is not activated, the network cannot learn anything about that node; this is so because the equation that adjusts the weights leading from a given input node multiplies the activation of that input node by the learning rate and an error signal. Since the activation of an untrained input node is zero, and we are including that zero in our multiplication it follows that regardless of what the learning rate is and regardless of how the error signal is calculated, we wind up multiplying everything by zero, hence the weight change is zero. A similar kind of independence arises with respect to the connection weights that feed the output nodes. The equations that adjust the weights feeding an output unit j depend on the difference between the observed output for unit j and the target output for unit j, but they do not depend on the observed values or target values of any other unit. Thus the way the network adjusts the weights feeding output node j must be independent of the way the network adjusts the weights feeding output node k (assuming that nodes j and k are distinct).
Other standard connectionist learning algorithms behave in similar ways. For example, the Hebbian rule ensures that any weight that comes from an input unit that is set to zero will not change, since the weight change is calculated by multiplying the input units activation by the output units activations times some constant; again, multiplying by zero guarantees that no learning will take place. Likewise, in adjusting the weights feeding output node j, the activations for all nodes k are irrelevant (again, assuming that nodes j and k are distinct). So training independence is an inevitable consequence of Hebbian learning as well.
We can put this a bit more precisely, if we restrict ourselves to models in which the inputs are binary values of features (e.g., +animate would be one value of the feature± animate; -animate would be the other value of the feature ± animate). With respect to some function that we train the model on, any item that consists entirely of values of features that have appeared in the training set lies within the training space. Any item that contains one or more values of features that have not been exemplified in training lies outside the training space. Training independence guarantees that none of what the model learns about items inside the training space extends to items outside the training space. (For a discussion of what happens to items that are made up of known features that are new to some function, see Appendix 4.)
3 iii. What needs to be changed
We cannot remedy these limitations in any of the ways in which eliminative connectionists typically vary their networks. For example, adding hidden units does not make any difference, because the input and output nodes are stilled trained independently. Adding hidden layers likewise leaves training independence unaffected. Similarly, even if we change the learning rate, or the sequence in which we supply the models with training examples, training independence still applies.
What has to be changed is more fundamental. Any system that retains McClelland and Rumelharts idea of localism necessarily retains the piecemeal node-by-node way of learning. (For a discussion of "weight-sharing" models, seeAppendix 3.) As we will see in the next section, there are ways of building models that freely generalize, but each of them abandons eliminative connectionisms commitment to local learning in favor of systems can represent rules.
Front Page | Table of Contents | Excerpts | About the author | Discussion | MIT Press
Appendices| Glossary | References