Response to Jeff Elman's Commentary
on The Algebraic Mind
Gary F. Marcus
January 2, 1999
Department of Psychology
New York University
In excerpts from my forthcoming book The Algebraic Mind, I suggested that rules play a necessary (but not sufficient) role in human cognition. In the course of making this suggestion, I argued that Elman's (1990) simple recurrent network could not generalize to novel words so long as it used localist output representations. I further argued that the alternative, distributed representations, led to a different kind of problem, the "superposition catastrophe", a well-known difficulty that arises when multiple entities represented with overlapping features must be activated simultaneously.
In a commentary on those excerpts, Jeff Elman has argued that "localist representations are useful but not necessary to the connectionist models Marcus is concerned about", that simple recurrent networks can generalize to "gappy input", that "generalizations depend on the relative amount of data and experience" and that a "more reasonable interpretation of what the network is doing is exactly what humans and other animals do". Each of these arguments is worth considering; each turns out to rest on confusing different types of generalization.
Localist Representations vs. Distributed Representations
Is it true that "localist representations are useful but not necessary to the connectionist models Marcus is concerned about"? This rather technical issue is central because, as shown in Marcus (1998), models such as the simple recurrent network cannot generalize to novel words if those novel words are represented locally (i.e., by means of a single unit per word).
But Elman suggests that localist representations are used merely "for convenience"; I argued that they are more than a mere convenience. The weight of the connectionist literature is on my side, not Elman's. For example, the very researchers whom Elman cites as having shown that the superposition catastrophe "does not arise in practice with representations which are realistically rich" have themselves written that
distributed representations can only be activated simultaneously at a cost, and for some critical number of items multiple activation would become noisy or even break down altogether.
Elman never says exactly what he takes Gaskell and Marslen-Wilson's solution to be, and Gaskell and Marslen-Wilson themselves did not show how the SRN could be made compatible with both the sentence-prediction task and the a-rose-is-a-rose task that I described. In fact, there is good reason to expect that no single representation could be made compatible with both tasks.
To succeed in the sentence-prediction task, a model must activate all and only the correct continuations of a given sentence fragment; the correct target is easily represented in a system of localist representations but hard to represent in most distributed systems. For example, suppose that the set of words that can appear next is the set of singular nouns. Provided that each word is represented by a distinct node, a system can activate the nodes that represent the singular nouns without activating any other nodes.
But if the features that encoded nouns overlapped with the features that encoded verbs, there would be no way for the model to activate all and only the features that encode the nouns; inevitably, activating the features that encode the nouns would entail activating some or all of the features that encode the verbs, hence the system could no longer activate all and only the nouns.
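To make the superposition problem concrete, here is a minimal sketch in Python. The feature codes are my own invented toy examples, not drawn from any actual model: with localist one-unit-per-word codes, a blend of two nouns never spuriously activates a verb, but with overlapping distributed codes it can.

```python
# Toy illustration of the superposition catastrophe (hypothetical codes).
# Localist scheme: one dedicated unit per word.
localist = {"cat": [1, 0, 0], "dog": [0, 1, 0], "runs": [0, 0, 1]}

# Distributed scheme: words share features. Here "runs" shares feature 0
# with "cat" and feature 2 with "dog".
distributed = {"cat": [1, 1, 0, 0], "dog": [0, 1, 1, 0], "runs": [1, 0, 1, 0]}

def activate(code, words):
    """Superimpose the patterns for several simultaneously active words."""
    n = len(next(iter(code.values())))
    return [max(code[w][i] for w in words) for i in range(n)]

def seems_active(code, pattern, word):
    """A word looks active if every one of its features is switched on."""
    return all(pattern[i] >= 1 for i, f in enumerate(code[word]) if f)

# Try to activate "all and only" the two nouns under each scheme.
loc = activate(localist, ["cat", "dog"])
dist = activate(distributed, ["cat", "dog"])

print(seems_active(localist, loc, "runs"))     # False: blend stays unambiguous
print(seems_active(distributed, dist, "runs")) # True: the verb is spuriously on
```

In the distributed blend, the features of the two nouns jointly cover all the verb's features, so the verb cannot be kept off; this is the sense in which a distributed system cannot activate "all and only" the nouns.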
In fact, no bona fide distributed-representation version of the simple recurrent network has ever been shown to solve the sentence-prediction task. This has been true not only in my own tests of such models, but also in those of Elman himself. In an unpublished (1988) technical report, Elman described a version of the simple recurrent network in which the representations of nouns and verbs did overlap (all other reports of SRNs applied to sentence prediction have used localist output representations). The result? The "network's performance [on the sentence prediction task] at the end of training ... was not very good" (p. 17).
Contra Elman, localist representations aren't just a convenience. As Gaskell put it, "a distributed system cannot implement localist activation models literally" (1996, p. 286). If Elman doubts this, I challenge him to present a distributed-representation version of his simple recurrent network that can solve both my a-rose-is-a-rose task and his own sentence-prediction task.
"Networks can generalize in the face of gappy input"

The centerpiece of my argument was a demonstration that simple recurrent networks cannot generalize outside their training space: they cannot extend to a novel word (on a single trial, as a human would) any relationship in which each input has a unique output.
Elman's response was to show that "networks can generalize in the face of gappy input". The demonstration is legitimate, but the key move in his argument comes when he tries to equate generalizing outside the training space with the problem of gappy input; that is, he argues that the out-of-training-space question

can be posed, in specific form, thus: Can a network, trained only on sentences in which a given noun (e.g., "boy") appears only in subject position, deal appropriately with novel sentences in which that word appears in another syntactic context, e.g., object position?
But generalizing to gappy input is not the same thing as generalizing a one-to-one function outside the training space. As I pointed out in Appendix 2 of my excerpts ("Generalization within the simple recurrent network"), SRNs can generalize in the face of some types of gappy input, such as cases in which it suffices to confuse one word with another (e.g., boy and girl). But, crucially, they cannot extend one-to-one generalizations to a novel word. If every one of a set of sentences requires a different, unique continuation, the SRN is in trouble.
For example, consider a learner exposed to sentences like a rose is a rose and asked to generalize to sentence fragments like a fendle is a _____. The response you would give here is unique (i.e., fendle), yet you need not have heard the word fendle before in order to predict what comes next. Elman does not give a single example of a localist-output-representation SRN learning this kind of relationship -- because the simple recurrent network simply cannot account for it.
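The reason is easy to see in a stripped-down toy network (my own illustration, not Elman's actual SRN): with localist input coding, the weights leaving the unit for an unseen word like fendle are never touched during training, so the network has learned nothing it could apply to that word.

```python
# Toy one-layer network with localist coding, trained on the identity
# ("a rose is a rose") mapping. Word indices 0-2 are seen in training;
# index 3 (a novel word like "fendle") never appears.
SEEN, NOVEL = [0, 1, 2], 3
N = 4
W = [[0.0] * N for _ in range(N)]  # weights from input unit i to output unit j

def predict(word):
    # Localist input: only the active unit's outgoing weights matter.
    out = W[word]
    return max(range(N), key=lambda j: out[j])

# Delta-rule training on the identity mapping, seen words only.
for _ in range(50):
    for w in SEEN:
        target = [1.0 if j == w else 0.0 for j in range(N)]
        for j in range(N):
            W[w][j] += 0.5 * (target[j] - W[w][j])

print([predict(w) for w in SEEN])  # [0, 1, 2]: trained words are reproduced
print(W[NOVEL])                    # all zeros: nothing learned about unit 3
```

Whatever one-to-one regularity the network has mastered for the seen words leaves the novel word's weights untouched, so the training provides no basis for completing a fendle is a _____.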
"Generalizations depend on the relative amount of data and experience"
Elman and I agree that fitting real empirical data is important. Here are some data that my colleagues Sujith Vijayan, Shoba Bandi Rao, and Peter Vishton and I recently published in Science (Marcus, Vijayan, Bandi Rao, & Vishton, 1999). Seven-month-old infants -- too young to have learned anything through explicit verbal instruction -- were trained for two minutes on "sentences" following an ABA grammar (la ta la), an ABB grammar (la ta ta), or an AAB grammar (la la ta). Infants were then tested on sentences that were either consistent or inconsistent with the grammar on which they were trained. In a familiarization-preference task, they attended reliably longer to the inconsistent sentences than to the consistent sentences.
A localist version of an SRN could not account for these results -- a fact that I have confirmed in simulations. Because all the test sentences were made of novel words, each word would lie outside the training space of the one-to-one function implicit in our grammar. Furthermore, because we made sure that the phonetic features that varied in the test items did not vary in the habituation items, even an SRN that used a distributed representation made up of phonetic features would fail in this task. Thus an SRN simply could not explain how the children distinguished the consistent and inconsistent items.
(Elman told an ABC news reporter that I was "demonstrably wrong" about this, but has not thus far answered my query asking him to supply the demonstration.)
So far as we know, only a system that derived a genuinely abstract representation of the underlying structure could account for the infants' ability to generalize across the board -- as readily to words that overlap in phonetic features with the training items as to those that do not.
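A minimal sketch of what such an abstract representation might look like (my own illustration, not a claim about the infants' actual mechanism): a rule stated over variables, with syllables bound to those variables, applies to arbitrary novel syllables without any retraining.

```python
# A variable-binding account of the ABA/ABB/AAB grammars: abstract the
# structure of a sentence by mapping syllables to variables in order of
# first appearance, then compare that structure to the trained grammar.
def pattern(sentence):
    """'la ta la' -> 'ABA'; 'wo fe fe' -> 'ABB'."""
    names = {}
    out = []
    for syl in sentence.split():
        if syl not in names:
            names[syl] = chr(ord("A") + len(names))  # bind syllable to a variable
        out.append(names[syl])
    return "".join(out)

def consistent(sentence, grammar):
    return pattern(sentence) == grammar

# The rule generalizes across the board, even to wholly novel syllables:
print(consistent("wo fe wo", "ABA"))  # True
print(consistent("wo fe fe", "ABA"))  # False: this sentence is ABB
```

Because the rule is defined over variables rather than over particular syllables or their phonetic features, training items and test items need share nothing at all for the generalization to go through.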
"A more reasonable interpretation of what the network is doing is exactly what humans and other animals do"
Is it "good" that the model predicts certain failures to generalize? As Elman points out, cats raised in the absence of horizontal stripes have profound perceptual problems. But here Elman seems to want it both ways, asking us to believe that the simple recurrent network generalizes outside the training space when people do, but not when cats don't.
Contradictions aside, what really matters with respect to the cat stripe example is Elman's implication that novel words aren't represented as novel perceptual dimensions. Elman is probably right about this -- but rather than arguing in favor of the SRN, it argues against it. The problem (owing to the superposition catastrophe) is that if the SRN is to solve the sentence prediction task, it is committed to representing novel words as novel perceptual dimensions, the very thing to which it cannot generalize.
What we instead need is a model that can represent novel words using existing perceptual (or representational) dimensions -- without falling prey to the superposition catastrophe. This requires more than a tiny change to the SRN; it requires an altogether different way of representing the problem.
One possibility is "symbolic connectionist" models like Hummel and Holyoak's (1997, Holyoak and Hummel, 1999) LISA. Models like these escape the superposition catastrophe yet can freely generalize abstract relationships to novel items. How do they do it? By including machinery that permits the representation of abstract relationships between variables, and machinery for binding those variables to specific instances. The bad news, if you're inclined to take it that way, is that these models look a whole lot like symbol-manipulators. The good news, though, is that these more complex models can actually account for how humans generalize outside the training space.
Although Elman and I agree about much in terms of what counts as an interesting question, in the final analysis I find his answers too glib. For example, Elman asserts that "across-the-board generalizations ... are easy to account for in any system" -- an obvious-seeming point that simply turns out not to be true. The falsification comes from none other than the simple recurrent network, which, as we have seen, cannot account for the kind of across-the-board generalizations drawn by our infants.
The right answer is more subtle: only some systems can capture these across-the-board generalizations (e.g., those that implement abstract algebraic rules, such as Hummel and Holyoak's LISA), and it is those systems that we should be studying.
References

Elman, J. L. (1988). Finding structure in time (CRL Technical Report 8801). La Jolla, CA: UCSD.
Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Gaskell, M. G. (1996). Parallel activation of distributed concepts: Who put the P in the PDP? In G. W. Cottrell (Ed.), Proceedings of the eighteenth annual conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Gaskell, M. G., & Marslen-Wilson, W. D. (1997). Discriminating local and distributed models of competition in spoken word recognition. In M. G. Shafto & P. Langley (Eds.), Proceedings of the nineteenth annual conference of the Cognitive Science Society. Hillsdale, NJ: Erlbaum.
Holyoak, K. J., & Hummel, J. E. (1999). The proper treatment of symbols in a connectionist architecture. In E. Dietrich & A. Markman (Eds.), Cognitive dynamics: Conceptual change in humans and machines. Mahwah, NJ: Erlbaum.
Hummel, J. E., & Holyoak, K. J. (1997). Distributed representations of structure: A theory of analogical access and mapping. Psychological Review, 104, 427-466.
Marcus, G. F. (1998). Rethinking eliminative connectionism. Cognitive Psychology, 37(3), 243-282.
Marcus, G. F. (2000). The algebraic mind: Integrating connectionism and cognitive science. Cambridge, MA: MIT Press.
Marcus, G. F., Vijayan, S., Bandi Rao, S., & Vishton, P. M. (1999). Rule learning in 7-month-old infants. Science, 283, 77-80.