Commentary on The Algebraic Mind, by Gary Marcus, MIT Press
Department of Cognitive Science
Gary Marcus raises a number of important issues in his new book, The Algebraic Mind. More than that, he argues vigorously that symbol systems alone provide the requisite machinery for capturing essential cognitive phenomena. The primary phenomena Marcus focuses on include (1) the need to represent equivalence classes; (2) the need to represent general relationships and patterns which hold over members of such classes; (3) the need for representations which encode structural relationships among entities, some of which may be atomic and others molecular; and which (4) distinguish between the entity itself as a token, and the entity as a member of the class to which it belongs. Marcus's claim is that only symbol systems (or their functional equivalent, which may include "implementational connectionism") satisfy these needs. "Eliminativist connectionism" cannot.
In Chapter 1, Marcus presents several examples which are taken as illustrations of the failure of eliminativist connectionism. The basic failure, Marcus suggests, is that connectionist models of this sort (and from here on, I will use "connectionist" to refer to the eliminativist models Marcus objects to) cannot generalize outside their training space.
What does it mean to "fail to generalize outside the training space?" There are at least two possibilities.
The first is what I suspect is in most people's minds when they worry about generalization: Given a set of elements (e.g., words) which behave in some way, a learner can respond appropriately to those elements in novel contexts. For instance, having heard the word "zilp" for the first time, in direct object position, a listener would presumably be able to use this word as the subject of a sentence, know how to form its plural form, etc. I take this to be an extremely important issue, although also fairly complex (I shall return to this point later).
There is also a second kind of generalization, which is what many of Marcus's examples in Chapter 1 address: Having been exposed to a set of elements which are represented across a fixed set of perceptual dimensions, generalization would consist of being able to process novel elements represented along a new set of perceptual dimensions.
In the remainder of this commentary I would like to do two things. First, I wish to consider examples of this second category of generalization failure. I take such failures to be untroublesome and will suggest they are entirely appropriate in networks, and similar to failures we find in humans and other biological organisms.
Then I will return to the first sort of generalization problem, in which a network must deal with the novel use of inputs, for example, processing words which are encountered in syntactic contexts not seen during training. I believe this is in fact a very important issue about which we know some things, but for which there are many important questions to be studied. I will summarize several recent simulations which study the conditions under which networks generalize, given scant or gappy data.
Marcus points out something which has long been known to connectionists: If a network is trained on a set of stimuli in which one of the input lines is never 'exercised', the weight on that line will tend toward 0. In effect, that input line atrophies and the network cannot subsequently process new inputs which have information in that dimension.
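The atrophy of an unexercised input line can be illustrated with a minimal sketch. It assumes a single linear unit trained by gradient descent on squared error with a small weight-decay term; the decay term, the learning rate, and the toy training patterns are all assumptions made for illustration and are not taken from any of the simulations discussed here. Because the error-driven update for a weight is scaled by the activation on its input line, a line that is always 0 receives no error signal at all, and only the decay term acts on it.

```python
def train_linear_unit(samples, weights, lr=0.1, decay=0.01, epochs=200):
    """Online gradient descent on squared error for one linear unit,
    with weight decay (an assumption of this sketch)."""
    w = list(weights)
    for _ in range(epochs):
        for inputs, target in samples:
            output = sum(wi * xi for wi, xi in zip(w, inputs))
            error = output - target
            for i, xi in enumerate(inputs):
                # The error term is multiplied by the input activation, so a
                # line with xi == 0 gets no error-driven change; only the
                # decay term moves its weight, and it moves it toward 0.
                w[i] -= lr * (error * xi + decay * w[i])
    return w


# Hypothetical training set: the third input line is 0 in every pattern.
samples = [
    ((1.0, 0.0, 0.0), 1.0),
    ((0.0, 1.0, 0.0), -1.0),
    ((1.0, 1.0, 0.0), 0.0),
]
trained = train_linear_unit(samples, weights=[0.5, 0.5, 0.5])
print("exercised lines:", trained[0], trained[1])
print("unexercised line:", trained[2])
```

In this sketch the two exercised weights settle near the values that solve the task, while the third weight simply shrinks and never comes to encode anything about its dimension.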
This is a well-known consequence of back-propagation learning. What is interesting is that although Marcus takes this to be a crucial flaw in network learning, a more reasonable interpretation is that the network is doing exactly what humans and other animals do. Consider, for example, the visual system of cats. Blakemore & Cooper (1970) demonstrated many years ago that systematically depriving a cat of exposure to horizontal stripes during early life resulted in failure of neurons to develop which were sensitive to horizontal stripes (see Figure 1).
Figure 1. Blakemore & Cooper (1970) demonstrated that a cat, reared during its early life in an environment which contained no horizontal stripes, would fail to develop neurons in the visual system which respond to horizontal edges.
Similarly, young human infants appear to be sensitive to the full range of speech sounds found in all human languages, but at older ages discriminate only those phonemic contrasts found in their environment. The difficulty that Japanese speakers have in discriminating the r/l contrast, or the problem that English speakers have in distinguishing prevoiced from voiceless stops, is well known. The common result is essentially that, if perceptual experience is limited, either by evolution or learning, one will not be able to perceive things outside that experience. We do not perceive in the infrared, although pit vipers do. Thus, there is nothing either mysterious or even inappropriate about a network's loss of perceptual distinctions over time.
However, the problem appears to be a nasty one for networks in which the "perceptual dimensions" are equated with things such as words. These "localist representations" are often used in modeling for convenience. For example, if the network's output is a prediction of the next input, localist outputs make it possible to interpret the output vector as a probability distribution (which then allows comparison of the network's performance with other probability-based models). On the other hand, such representations are clearly limiting and unrealistic. We do not, after all, assign different regions of the auditory spectrum to different words (we would soon run out). Rather, words are distributed representations over lower-level inputs (e.g., phonemes) which are themselves distributed representations over even lower-level perceptual dimensions. Such representations are much richer because they possess similarity structure that supports analogy, whereas localist representations are all orthogonal. Distributed representations also make it possible to process novel words, whereas localist representations do not. (Marcus has suggested that networks will fall prey to a "superposition" problem if distributed representations are used. This problem arises when the representational space is both too crowded and lacks meaningful structure. The result is that for tasks such as prediction, it can be difficult to know when two different patterns are being predicted as alternatives, versus a third whose representation happens to be a blend of those two. Gaskell and Marslen-Wilson, 1995, 1997a, 1997b, have shown, however, that this problem does not arise in practice with representations which are realistically rich.)
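The contrast between the two coding schemes can be sketched in a few lines. The feature inventory below is invented purely for illustration (one made-up binary feature per sound segment) and is not a realistic phonological coding; the words and the helper names are likewise hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def one_hot(index, size):
    """Localist code: one dedicated unit per word."""
    return [1.0 if i == index else 0.0 for i in range(size)]


vocabulary = ["boy", "toy", "cat"]
localist = {w: one_hot(i, len(vocabulary)) for i, w in enumerate(vocabulary)}

# Hypothetical distributed code over the features [b, t, k, oy, a].
distributed = {
    "boy": [1, 0, 0, 1, 0],
    "toy": [0, 1, 0, 1, 0],
    "cat": [0, 1, 1, 0, 1],
}

# Localist codes are mutually orthogonal: no word resembles any other.
localist_sim = cosine(localist["boy"], localist["toy"])

# Distributed codes overlap wherever words share features ("oy" here).
distributed_sim = cosine(distributed["boy"], distributed["toy"])

# A novel word can be composed from features the network already has,
# whereas the localist scheme has no unit for it at all.
coy = [0, 0, 1, 1, 0]
novel_sim = cosine(coy, distributed["boy"])

print(localist_sim, distributed_sim, novel_sim)
```

The localist similarity is zero for every pair of distinct words, while the distributed code gives graded similarities and assigns a usable pattern to a word that was never in the vocabulary.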
So what do we make of Marcus's criticisms? The bottom lines are (1) localist representations are useful but not necessary to the connectionist models Marcus is concerned about; and (2) it is obviously true that distributed representations are more realistic; they afford a richer medium for representing the world, in which form-based similarity ("what you look like") can be used along with function-based similarity ("who you hang out with") to motivate generalization.
I would like to turn now to the first sense of "generalizing outside the training space" because I think this raises more interesting issues.
The question can be posed, in specific form, thus: Can a network, trained only on sentences in which a given noun (e.g., "boy") appears only in subject position, deal appropriately with novel sentences in which that word appears in another syntactic context, e.g., object position? This sort of generalization is an instance of what Hadley (1992) has called "strong systematicity", and it has been claimed that simple recurrent networks are not capable of this sort of generalization (Hadley, 1992; Marcus, 1998).
The question is an important one, because it is very likely that the natural language input which children (and even adults) hear is extremely limited in just this sort of way. Although it is definitely the case that children's linguistic input is both very rich and very extensive (Hart & Risley, 1995, estimate that by age 3, children hear between 10 and 30 million words), it is also almost certainly not the case that children hear all possible words in all possible syntactic contexts. Indeed, as one's vocabulary increases, the probability of encountering many words only a few times and in limited syntactic contexts increases.
Thus, the input from which we learn is both very extensive and very gappy. If networks cannot extrapolate the appropriate use of a word to a novel context from limited exposure in the way that humans do, we (modelers) are in serious trouble.
As it turns out, networks do generalize. However, the conditions which enable such generalization are interesting, and remind us that the phenomenon of generalization across gaps in the input is not quite as simple as has sometimes been implied in the discussion of systematicity and productivity. Thus, Marcus, as do many others, tends to emphasize the generality and productivity of cognitive processes without, in my view, paying sufficient attention to the equally important problem: What are the limits on generalization? Sometimes gaps are accidental, but other times they are systematic. How do we learn that "ungrasp" is not a word, or that the question "Which car do you see the books that are in?" is ungrammatical, even though both might be warranted as generalizations from similar words and questions?
The problem of generalization thus has two aspects. Indeed, the problem that many connectionists find with symbolic theories is that they tend to over-generate behaviors. There are across-the-board generalizations, to be sure. But these are easy to account for in any system. The difficult cases, which are more numerous than one might think, are instances of partial productivity and quasi-regularity.
I have claimed, however, that networks can generalize in the face of gappy input. Let me demonstrate now how that can occur, with the following simulation (which, by the way, used localist representations, just to make the point that for this issue nothing critically hinges on localist vs. distributed representations).
A simple recurrent network (Elman, 1990) was trained to process simple sentences, taking one word as input at a time, and predicting at each point in time what the next word would be.
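The architecture just described can be sketched as a forward pass only. This is a minimal illustration under stated assumptions, not the code used for the simulation: the vocabulary, the hidden-layer size, the random untrained weights, and the class name SimpleRecurrentNetwork are all hypothetical. What it shows is the defining feature of the architecture: at each step the hidden layer receives the current word together with a copy of its own previous state, and the output layer yields a distribution over possible next words.

```python
import math
import random


def softmax(values):
    """Turn arbitrary scores into a probability distribution."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SimpleRecurrentNetwork:
    """Forward pass of a simple recurrent network (no training shown)."""

    def __init__(self, vocab, hidden_size=8, seed=0):
        rng = random.Random(seed)
        self.vocab = vocab
        self.w_in = random_matrix(hidden_size, len(vocab), rng)
        self.w_context = random_matrix(hidden_size, hidden_size, rng)
        self.w_out = random_matrix(len(vocab), hidden_size, rng)
        self.context = [0.0] * hidden_size  # previous hidden state

    def step(self, word):
        """Consume one word; return predicted next-word probabilities."""
        x = [1.0 if w == word else 0.0 for w in self.vocab]  # localist input
        from_input = matvec(self.w_in, x)
        from_context = matvec(self.w_context, self.context)
        hidden = [1.0 / (1.0 + math.exp(-(a + b)))
                  for a, b in zip(from_input, from_context)]
        self.context = hidden  # copied back for use at the next time step
        return dict(zip(self.vocab, softmax(matvec(self.w_out, hidden))))


vocab = ["girl", "boy", "talks-to", "."]  # hypothetical toy vocabulary
net = SimpleRecurrentNetwork(vocab)
for word in ["girl", "talks-to"]:
    prediction = net.step(word)
print(prediction)
```

With untrained weights the predictions are of course meaningless; the point is only that the same input word produces different outputs depending on the context carried forward from earlier words.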
Words differed with regard to their frequency of occurrence in the grammar, and verbs differed with regard to possible arguments, as well as preferred arguments. There were a number of simple grammatical constructions, several of which are illustrated schematically in Figure 2 (font sizes indicate probability of occurrence in the corpus). A set of corpora was constructed, consisting of random sentences from this grammar and ranging in size from very small (20 sentences) to medium (1000 sentences) to large (5000 sentences).
Figure 2. Schematic representation of some of the grammatical constructions used to generate a training database. Colored lines connect words which form sample sentences; font size indicates probability of occurrence in the corpus.
Although there were only 1,30 different possible sentences in this language, the low frequencies of some of the nouns mean that not all possible sentences occurred, even in reasonably large samplings (e.g., 5,000 sentences). In fact, a random sample of approximately .5 million sentences is needed in order to ensure that all sentences are likely to appear.
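Why a skewed grammar leaves gaps even in large samples can be made concrete with a short calculation. If sentence type i has probability p_i and sentences are sampled independently, the chance that type i is missing from a corpus of n sentences is (1 - p_i)^n, so the expected number of distinct types seen is the sum over i of 1 - (1 - p_i)^n. The sketch below applies this to a hypothetical Zipf-like distribution over 1000 sentence types; that distribution is an assumption chosen for illustration and is not the distribution of the grammar used in the simulation.

```python
def expected_distinct(probabilities, sample_size):
    """Expected number of distinct types seen in `sample_size` independent draws."""
    return sum(1.0 - (1.0 - p) ** sample_size for p in probabilities)


def zipf_like(n_types):
    """Hypothetical skewed distribution: probability proportional to 1/rank."""
    weights = [1.0 / rank for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]


probs = zipf_like(1000)
for n in (20, 1000, 5000):
    seen = expected_distinct(probs, n)
    print(f"{n:>5} sentences: about {seen:.0f} of {len(probs)} types expected")
```

Under these assumed frequencies, a sizable share of the possible sentence types is still expected to be missing at the largest corpus size, because the rarest types each have only a modest chance of being drawn even once.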
Thus the data on which the network was trained are very gappy, which is probably a very realistic approximation of the situation in which children find themselves.
In deliberately constructing a training set for the network which is gappy, we are able to ask under what conditions the network generalizes across gaps (or doesn't). For example, in the network's artificial language, verbs of communication require human agents and direct objects. Thus, any of the set of words "girl", "boy", "adult", "Sue", or "Bob" must serve as the agent and direct object of the verb "talk-to." However, with small corpora, particularly given that not all of these words occur equally often, it is possible that one or more may never appear in the training set as either agent or direct object. In fact, it was easy to find a corpus of 1000 sentences in which "boy" never appears at all in direct object position (for any verb). The question is then, never having seen "boy" as a direct object (although it does occur in the corpus as agent), will the network be unable to predict "boy" as a possible direct object following the verb "talk-to"?
Perhaps surprisingly, the network does predict "boy" in this context. Figure 3 shows the activation of the word nodes for "boy" and the mean activation for two other classes of words: all non-human nouns, and all verbs. "Boy" is predicted more than the other categories, although none appear during training in the context "The girl talks to..."
Figure 3. Network's predictions of "boy", compared with mean activations for non-human nouns and for verbs, in the context "the girl talks to. . .".
Why does this occur? The answer is quite straightforward.
In this simulation, the network sees only a fraction of the possible sentences. But importantly, although "boy" is never seen in direct object position, it is seen in shared contexts with other human words. For example, humans (but not other animals, food, etc.) appear as the agent of verbs such as "eat", "give", "transfer." Conversely, humans (including "boy") do not appear as agents of other verbs (e.g., "terrify", "chase", which in this language require animal agents). The word "boy" shares more in common with other human words than it does with non-human nouns, or with verbs.
In networks, as for humans, similarity is the motive force which drives generalization. Similarity can be a matter of form ("who you look like") or behavior ("who you hang out with"). In this simulation, words were encoded with localist representations, so there was no form-based similarity. But as we have just seen, there were behavior-based similarities between "boy" and other "humans."
These more abstract similarities are typically captured in the internal representations that networks construct on their hidden layers, and they are what facilitate generalization. The overall behaviors which "boy" shares with "girl", "Sue", "Bob", etc., are sufficient to cause the network to develop an internal representation for "boy" which closely resembles that of the other human words. However, the internal representations for those other words must reflect the possibility of appearing in direct object position following communication verbs (since the network does see many of them occurring in that position). Since the representation for "boy" is similar, "boy" inherits the same behavior. The network's knowledge about what "boy" can do is very much affected by what other similar words can do.
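Behavior-based similarity of this kind can be illustrated without a network at all. The sketch below is a simple count-based stand-in for the hidden-layer representations, not a description of what the network computes: each noun is represented by the verbs for which it serves as agent, and nouns are compared by cosine similarity. The toy corpus of (agent, verb, object) triples is hypothetical, loosely modeled on the verb restrictions described above, and "boy" never occurs in it as a direct object.

```python
import math
from collections import Counter

# Hypothetical (agent, verb, object) triples; not the simulation's corpus.
corpus = [
    ("girl", "eat", "cookie"), ("boy", "eat", "cookie"),
    ("girl", "give", "cookie"), ("boy", "give", "cookie"),
    ("Sue", "eat", "cookie"), ("Sue", "talk-to", "girl"),
    ("girl", "talk-to", "Sue"), ("boy", "talk-to", "girl"),
    ("lion", "chase", "girl"), ("lion", "terrify", "Sue"),
    ("dragon", "chase", "lion"), ("dragon", "terrify", "girl"),
]


def agent_profile(word):
    """Counts of the verbs for which `word` appears as agent."""
    return Counter(verb for agent, verb, _ in corpus if agent == word)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in set(a) | set(b))
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# The gap: "boy" is never seen as a direct object.
boy_seen_as_object = any(obj == "boy" for _, _, obj in corpus)

# Yet as an agent, "boy" patterns with humans rather than with animals.
sim_to_girl = cosine(agent_profile("boy"), agent_profile("girl"))
sim_to_lion = cosine(agent_profile("boy"), agent_profile("lion"))

print(boy_seen_as_object, sim_to_girl, sim_to_lion)
```

Because "boy" keeps the same company as "girl" and "Sue" in agent position, any learner that groups words by such profiles has grounds for extending to "boy" what it knows about the object-position behavior of the other human words.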
Of course, if the examples are too scant, such generalizations are not made. With a very small corpus, the network of interlocking relationships which motivates the abstract categories is not revealed. This is very much in line with what Tomasello (among others) has noted with children. Categories such as "noun" and "verb" do not start out as primitives; rather they are accreted over time, and at intermediate stages different words are more or less assimilated to what will become adult categories (Olguin & Tomasello, 1993; Tomasello & Olguin, 1993).
As I pointed out earlier, however, there is a flip side to this coin, which is that sometimes gaps are intentional. Consider, for instance, the fact that "ungrasp" is not a possible word (although "unclench" is just fine; Li & MacWhinney, 1996), or that even though both "the ice melted" and "she melted the ice" are acceptable paraphrases, one can say only "the ice disappeared" and not "she disappeared the ice." Sometimes gaps are not accidental but systematic, even if exactly what is systematic about the gap is not obvious. Will the network always generalize through gaps?
The answer is no. Such generalizations depend on the relative amount of data and experience which are available. If the word "boy" appears overall with low probability, but there are sufficient other examples to warrant the inference that "boy" has properties similar to other words, the network will generalize to "boy" what it knows about the other words. However, if "boy" is a frequently occurring item, except in one context, the network is less likely to infer that a gap is accidental. It is as if the network realizes that the gap is not due to incomplete data (because the word is very frequent) and so must be the result of a systematic property of the word.
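One way to make this intuition concrete is a simple back-of-the-envelope calculation; it is offered as an analogy and not as a description of what the network computes. Suppose that, if a word behaved like the other members of its class, it would be expected to fill a share p of the slots in some context. The chance of seeing it there zero times in n opportunities is then (1 - p)^n. The shares and the number of opportunities below are hypothetical values chosen for illustration.

```python
def chance_of_accidental_gap(expected_share, n_opportunities):
    """Probability of zero occurrences in a context purely by chance."""
    return (1.0 - expected_share) ** n_opportunities


# Hypothetical numbers: 300 direct-object slots in the training corpus.
rare_word = chance_of_accidental_gap(expected_share=0.002, n_opportunities=300)
frequent_word = chance_of_accidental_gap(expected_share=0.05, n_opportunities=300)

print(f"rare word missing by chance:     {rare_word:.3f}")
print(f"frequent word missing by chance: {frequent_word:.2e}")
```

For the rare word a complete absence is unremarkable, so filling the gap by analogy is reasonable; for the frequent word an absence of the same size would be extremely unlikely to arise by chance, which is the situation in which treating the gap as systematic is warranted.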
I said at the outset that I think Gary Marcus's book is about important issues. I think in fact they are some of the most important issues in all of cognitive science. I disagree with Marcus about many of his conclusions, but I applaud and welcome his attempt to sharpen and clarify just what the questions are.
In this commentary I have tried to show that simple recurrent networks do in fact generalize to novel inputs and to novel uses of inputs. It is true that networks are subject, as are we, to perceptual limitations. A network which is not designed to perceive along a given input dimension, or which is systematically deprived of experience in that dimension, will fare no better than a human who has not evolved to see in the infrared, or a cat who is reared in an environment containing no horizontal stripes.
The more important observation is that the generalization process is complex, subtle, often partial, and rarely straightforward. There are critical effects of corpus size, corpus structure, and the time course of learning, and many open questions remain. For instance, not only are the size of a corpus and the frequency of exemplars important, but the way in which the data are structured can play a crucial role in categorization (e.g., Rodriguez, 1998). Some of the effects are counter-intuitive. But if the generalization process in networks is complex, subtle, often partial, and rarely straightforward, I believe this is true of humans as well.
Rethinking Innateness by J.L. Elman, E.A. Bates, M.H. Johnson, A. Karmiloff-Smith, D. Parisi, & K. Plunkett. Cambridge, MA: MIT Press.
Elman, J.L. (1998). Generalization, simple recurrent networks, and the emergence of structure. In M.A. Gernsbacher and S.J. Derry (Eds.) Proceedings of the Twentieth Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates.
Elman, J.L. (in press). Origins of language: A conspiracy theory. In B. MacWhinney (Ed.) Origins of Language. Hillsdale, NJ: Lawrence Erlbaum Associates.
Elman, J. L. (1995). Language as a dynamical system. In R.F. Port & T. van Gelder (Eds.), Mind as Motion: Explorations in the Dynamics of Cognition. Cambridge, MA: MIT Press. Pp. 195-223.
Elman, J. L. (1991). Distributed representations, simple recurrent networks, and grammatical structure. Machine Learning, 7, 195-224.
Blakemore, C., & Cooper, G. (1970). Development of the brain depends on the visual environment. Nature, 228, 477-478.
Elman, J.L. (1990). Finding structure in time. Cognitive Science, 14, 179-211.
Fodor, J., & Pylyshyn, Z. (1988). Connectionism and cognitive architecture: A critical analysis. Cognition, 28, 3-71.
Gaskell, M.G., & Marslen-Wilson, W.D. (1997). Discriminating local and distributed models of competition in spoken word recognition. In M.G. Shafto & P. Langley (Eds.) Proceedings of the Nineteenth Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum.
Gaskell, M.G., & Marslen-Wilson, W. (1995). Modeling the perception of spoken words. In G.W. Cottrell (Ed.) Proceedings of the Eighteenth Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum.
Gaskell, M.G., & Marslen-Wilson, W. (1997). Integrating form and meaning: A distributed model of speech perception. Language and Cognitive Processes, 12, 613-656.
Hadley, R.F. (1992). Compositionality and systematicity in connectionist language learning. In Proceedings of the 14th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates. Pp. 659-670.
Hart, B., & Risley, T. (1995). Meaningful differences in the everyday experience of young American children. Baltimore: Paul Brookes Publishing.
Li, P., & MacWhinney, B. (1996). Cryptotype, overgeneralization and competition: A connectionist model of the learning of English reversible prefixes. Connection Science, 8, 3-30.
Marcus, G. (1998). Symposium on Cognitive Architecture: The algebraic mind. In M.A. Gernsbacher & S. Derry (Eds.), Proceedings of the 20th Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates. P. 6.
Olguin, R., & Tomasello, M. (1993). Two-year-olds do not have a grammatical category of verb. Cognitive Development, 8, 245-272.
Rodriguez, P. (1998). Exploring gang effects by output node similarity in neural networks. In M.A. Gernsbacher & S. Derry (Eds.), Proceedings of the 20th Annual Conference of the Cognitive Science Society. Mahwah, NJ: Lawrence Erlbaum Associates. P. 1260.
Tomasello, M., & Olguin, R. (1993). Twenty-three-month-old children have a grammatical category of noun. Cognitive Development, 8, 451-464.