Appendices, Glossary, References

Appendix 1: Distributed representations

Appendix 2: Generalization within the simple recurrent network

Appendix 3: Weight-sharing

Appendix 4: What happens when old items are new to a function



 Book Home Page | Table of Contents | Excerpts | Discussion | About the author | MIT Press



Appendix 1: Distributed representations

There is a way to get the simple recurrent network to cope with this problem, but the solution is strictly limited and comes at a cost. The solution is to represent individual words with distributed representations. A version of the simple recurrent network that represents words with distributed representations can generalize from a rose is a rose to a blicket is a blicket, with one limitation. The limit is that the novel word that the network is to generalize to must not contain any feature values on which the network has not been trained. A novel feature value is one that sets a node to one where training always set it to zero, or vice versa. (I am assuming here that each word is encoded by a set of 1s and 0s.) For example, one node might represent the animacy of the input item; if all input items were -animate, the feature value +animate would be a feature value on which the network has not been trained. If a given novel word consists only of feature values on which the network has been trained, the network will be able to generalize the a rose is a rose relationship to that novel word. If a given novel word contains a feature value on which the network has not been trained, the network will not be able to generalize the "a rose is a rose" relationship to that feature value of the novel word.
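The condition on feature values can be made concrete with a short sketch. The feature names and codes below are hypothetical, chosen only to illustrate the test for novelty:

```python
# Sketch of the "novel feature value" condition. Each word is a vector of
# 1s and 0s; a value at some position is novel if no training word ever
# took that value at that position. Codes here are hypothetical.
training_words = {
    "rose":  [0, 1, 1],   # e.g. [-animate, +concrete, +singular]
    "tulip": [0, 1, 1],
    "book":  [0, 1, 1],
}

def novel_feature_values(word_code, trained_codes):
    """Positions where word_code takes a value never seen during training."""
    return [
        i for i, v in enumerate(word_code)
        if all(code[i] != v for code in trained_codes)
    ]

codes = list(training_words.values())
print(novel_feature_values([0, 1, 1], codes))  # [] -> generalization possible
print(novel_feature_values([1, 1, 1], codes))  # [0] -> +animate is novel
```

On this sketch, a blicket coded [0, 1, 1] falls entirely within the trained feature values, whereas a word whose code sets the first position to 1 contains a value the network has never been trained on.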

There is, however, a cost to adopting distributed representations, a problem known as the superposition catastrophe. The term refers to what happens when one tries to represent multiple entities simultaneously with the same set of resources. To take a simple example, suppose that we represented A as [1010], B as [1100], C as [0011], and D as [0101]. Given such a representational scheme, a single set of units would be unable to represent unambiguously the simultaneous activation of A and D, because the combination of the two, [1111], would also be the combination of B and C.
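The ambiguity can be reproduced in a few lines; the codes are the hypothetical four-bit ones from the example above:

```python
# A minimal illustration of the superposition catastrophe, using the
# hypothetical codes from the text: A=[1010], B=[1100], C=[0011], D=[0101].
codes = {
    "A": [1, 0, 1, 0],
    "B": [1, 1, 0, 0],
    "C": [0, 0, 1, 1],
    "D": [0, 1, 0, 1],
}

def superpose(x, y):
    """Represent two items at once by activating the union of their units."""
    return [max(a, b) for a, b in zip(x, y)]

a_and_d = superpose(codes["A"], codes["D"])
b_and_c = superpose(codes["B"], codes["C"])

# Both superpositions yield [1, 1, 1, 1]: nothing in the activation
# pattern distinguishes "A and D" from "B and C".
print(a_and_d, b_and_c, a_and_d == b_and_c)
```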



The reason that the superposition catastrophe matters with respect to the simple recurrent network is that the goal of the network is to represent a set of possible continuations, and the network needs to be able to do so unambiguously. For example, if the model were trained on the sentences cats chase mouse, cats chase mice, cats chase cat, cats chase cats, the optimal response to the sentence fragment cats chase ___ is to simultaneously activate mouse, mice, cat, and cats. If the output representations are localist, this is trivial: all a network needs to do is simultaneously activate the relevant nodes. But if the output representations are genuinely distributed, the problem becomes much more difficult, because the resources that represent nouns would overlap with the resources that represent verbs. For example, if the distributed representations encoded phonology, activating all the sounds that occur in nouns would be tantamount to activating all the sounds that occur in verbs. A model that represented words by phonological distributed representations thus could not in general keep nouns and verbs distinct.

The same holds even for far more arbitrary distributed representations. For example, in an unpublished (but widely circulated) technical report, Elman tested a version of the simple recurrent network that did use distributed output representations; each word was assigned a random 10-bit binary vector. For example, each instance of the word woman was assigned the code 0011100101, each instance of the word cat was assigned the code 0101110111, each instance of the word break was assigned the code 0111001010, and each instance of the word smell was assigned the code 1111001100.

Because the representations of different words overlap, it was not possible for the model to unambiguously represent all and only the possible continuations of a given string, regardless of what computations the model performs. The best the model could do is produce as an activation pattern the average of all the nouns, an activity pattern that is just as likely to correspond to some particular noun as to some verb. (Indeed, since the codes for words are randomly assigned, it is a consequence of the laws of probability that as the size of the vocabulary increases, the average of the nouns and the average of the verbs tend to become indistinguishable.) The practical consequence is that the sentence-prediction network could not distinguish between nouns and verbs if it used distributed output representations. Thus, it is no surprise that Elman {1988 #500, p. 17} reported that the distributed-output network's "performance at the end of training, measured in terms of performance, was not very good." At the end of 5 passes through 10,000 sentences, "the network was still making many mistakes". (A second "clustering" measure indicated that while this network could not learn to accurately predict word sequences, words of a common kind tended to elicit common patterns of hidden unit activation. This measure is, however, largely artifactual, as noted in Elman. The problem is that the clustering measure Elman used grouped together the contexts that the words appeared in, rather than the words themselves. Because the input sentences were randomly generated, each animate noun appeared in roughly the same set of circumstances, and each singular verb appeared in a different, largely shared set of circumstances. Even prior to training, it is likely that the mean of the hidden unit patterns elicited by contexts for verbs would differ from the mean of the hidden unit patterns elicited by contexts for nouns, leading to a clustering diagram in which clusters of nouns are separate from clusters of verbs. The clustering diagrams are thus artifactual, and do not show that the model has learned anything about the input grammar.)
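The probabilistic point about averages can be checked directly. The sketch below (not Elman's actual simulation) assigns random 10-bit codes to "nouns" and "verbs" and measures how far apart their average codes are as the vocabulary grows:

```python
# As vocabulary grows, the average of randomly assigned binary codes for
# "nouns" and for "verbs" both approach [0.5, 0.5, ...], so the two
# averages become indistinguishable. Illustrative, not Elman's simulation.
import random

random.seed(0)
BITS = 10

def random_code():
    return [random.randint(0, 1) for _ in range(BITS)]

def mean_code(codes):
    return [sum(bits) / len(codes) for bits in zip(*codes)]

gaps = {}
for vocab_size in (5, 50, 500):
    nouns = [random_code() for _ in range(vocab_size)]
    verbs = [random_code() for _ in range(vocab_size)]
    # Largest per-bit difference between the average noun and average verb
    gaps[vocab_size] = max(
        abs(n - v) for n, v in zip(mean_code(nouns), mean_code(verbs))
    )

print(gaps)  # the gap shrinks as the vocabulary grows
```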

To recapitulate, it is possible to build a simple recurrent network with distributed representations, and that network can generalize "a rose is a rose" to novel words (provided that the novel words do not contain novel feature values). But such a network loses the ability to keep categories of words distinct. In other words, the simple recurrent network is caught between two unsatisfying situations: capturing categories but losing linking relationships, or capturing linking relationships at the expense of capturing categories. Far from being a mere computational convenience, localist output representations are absolutely crucial for the simple recurrent network to be able to keep categories of words distinct. (An alternative would be to represent all nouns by activating one set of nodes and all verbs by activating a different set of nodes. Such a representational scheme would, however, build into its input representation the very distinction between nouns and verbs that the model was designed to learn.)


The situation with Hinton's family-tree model is similar. As in the case of the simple recurrent network, the superposition catastrophe renders the model incompatible with distributed output representations. Consider a query such as "Penny is the mother of X"; the response should be "Arthur AND Victoria". In the localist-output version of the family-tree model, this problem is solved naturally: the model simply activates both the Arthur node and the Victoria node simultaneously. In a distributed-output model, there would be no suitable target. Imagine, for instance, that Arthur was encoded by activating only nodes 1 and 2, Victoria by nodes 3 and 4, Penny by nodes 1 and 3, and Mike by nodes 2 and 4. To indicate that Arthur and Victoria were both children of Penny, this distributed input/output version of the family-tree model would need to activate nodes 1, 2, 3, and 4: exactly the same set of nodes as it would have used to indicate Penny and Mike. In short, distributed representations are not compatible with problems in which multiple outputs must be activated simultaneously. Some problems demand localist representations, but those representations lead to the problems of training independence described elsewhere in this chapter.
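The family-tree version of the catastrophe follows the same arithmetic as before; the node numbers are the hypothetical ones from the example above:

```python
# Sketch of the family-tree ambiguity. Node assignments are the
# hypothetical ones from the text: Arthur = {1, 2}, Victoria = {3, 4},
# Penny = {1, 3}, Mike = {2, 4}.
people = {
    "Arthur": {1, 2},
    "Victoria": {3, 4},
    "Penny": {1, 3},
    "Mike": {2, 4},
}

# Activating both of Penny's children...
children = people["Arthur"] | people["Victoria"]
# ...yields exactly the node set that would indicate Penny and Mike:
other_pair = people["Penny"] | people["Mike"]
print(children, other_pair, children == other_pair)
```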


Appendix 2: Generalization within the simple recurrent network

The fact that multilayer perceptrons trained with back-propagation cannot generalize outside the training space does not preclude them from generalizing within the training space; multilayer perceptrons can in some circumstances be quite good at within-training-space generalization.

For instance, in another experiment I trained a simple recurrent network on a series of sentences in which two items, John and Bill, appeared in identical circumstances: the sentences John loves Mary, John loves Susan, Bill loves Mary, and Bill loves Susan. Hence the weights leading from those two inputs into the hidden layer became nearly identical, and the two input units that encode John and Bill elicited nearly identical patterns of hidden unit activity. (Another way of putting this is that the items John and Bill cluster together in hidden unit space.) Similarly, given the same set of sentences, the output units corresponding to Susan and Mary came to be connected to the hidden units in nearly identical ways, despite the fact that each change of a weight leading to the Susan unit occurred independently of each change of a weight leading to the Mary unit.

Next I presented the network with a sentence fragment containing a novel word, John blickets __. Given that the Susan and Mary units were connected to the hidden units in nearly the same way, the network activated the continuations Susan and Mary approximately equally, to levels of about 0.70. Both units were activated far more than any other units, so it is reasonable to say that the network correctly predicted both as possible continuations, just as a human might in similar circumstances.

This turns out, however, to be a case of getting the right prediction for the wrong reason. In a second stage of training, I trained the network on John blickets Susan, but not on John blickets Mary. In this second stage, the network gradually increased the activation of Susan as a continuation for John blickets ____, but gradually decreased the activation of Mary as a possible continuation. Whereas continued experience with the sentence John blickets Susan would make a person more likely (or at least equally likely) to accept as grammatical the sentence John blickets Mary, greater experience causes the network to be less likely to predict (i.e., accept as grammatical) John blickets Mary.

Throughout this example the units representing Mary and Susan were always trained independently. What happened is that in the initial stage of training, the two units were trained in nearly identical circumstances; at the conclusion of this stage, the Mary unit and the Susan unit tended to respond in nearly identical ways to novel stimuli.

They were strongly activated by John blickets __ despite the novelty of blicket, in part because of regularities such as the fact that John is always followed two words later by either Mary or Susan. To the extent that Mary and Susan appear in different training sentences in the second stage of training, the output units that represent them tend to diverge. (Unless the network is stuck in a local minimum, this divergence must occur, because each time the network predicts Mary as a continuation to John blickets ____ when the actual continuation was Susan, back-propagation adjusts the connection weights leading into the Mary unit such that on the next presentation of the same input sentence, Mary will be even less strongly activated, while the unit representing Susan will be more strongly activated.)
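The divergence can be sketched with a deliberately minimal model: two sigmoid output units, "Susan" and "Mary", fed by a single fixed hidden activation and trained by gradient descent. All numbers (initial weight 0.85, learning rate 0.5, 50 passes) are illustrative, not drawn from the actual simulations:

```python
# Minimal sketch of training independence: two output units that start
# identical diverge once only one of them is the target.
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

hidden = 1.0              # fixed hidden activation for "John blickets __"
w_susan = w_mary = 0.85   # identical weights after stage-one training
lr = 0.5                  # learning rate

for _ in range(50):       # stage two: the continuation is always Susan
    out_susan = sigmoid(w_susan * hidden)
    out_mary = sigmoid(w_mary * hidden)
    # Cross-entropy gradient for a sigmoid unit: (target - output) * input.
    # Each unit's weight moves only on that unit's own error.
    w_susan += lr * (1.0 - out_susan) * hidden   # Susan's target is 1
    w_mary += lr * (0.0 - out_mary) * hidden     # Mary's target is 0

print(round(sigmoid(w_susan * hidden), 2), round(sigmoid(w_mary * hidden), 2))
```

With each pass, Susan's activation climbs toward 1 while Mary's falls toward 0, even though the two units began with identical weights; this mirrors the divergence described above.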

Training independence does not show that a network like the simple recurrent network can never generalize. Such networks can (given the right parameters and training regimen) often generalize within the training space. A model that can account for some but not all the data may well be worth pursuing, provided it is not incompatible with the remaining data. What we have seen in this section, however, is that multilayer perceptrons with localist output representations are incompatible with generalizing a particular class of universals, universally quantified one-to-one mappings. Universally quantified relations are relations that hold for all elements in some class; one-to-one relations are those in which every input has a unique output. When the simple recurrent network succeeds in generalizing to a new input, it succeeds by predicting which previously learned category (or categories) the new input belongs to. After the first stage of training in the John blickets Mary example, the network can predict that Mary is a possible continuation to John blickets __, because that sentence fragment resembles other sentence fragments (John loves, Bill loves) that the Mary node has been trained to respond to. But in universally quantified one-to-one functions, it does not suffice to assimilate each new response to an already known category, because each new input maps to a new output. The "right" answer for a new input in these functions will thus not be a category that the network has seen before. Since the network learns how to treat each category independently, prior experience with other categories does not enable the network to generalize to the new input-output pair. The localism that underlies back-propagation thus guarantees that multilayer perceptrons, in contrast to people, cannot generalize universally quantified one-to-one mappings to novel items.

Were universally quantified one-to-one mappings of little importance, a restriction to generalizing within the training space might be of little consequence. In fact, universally quantified one-to-one mappings are pervasive. For example, as we have seen, each stem form of a verb corresponds to a different past tense form; the past tense of the stem out-Gorbachev is out-Gorbacheved. Likewise, gerenuks give birth to baby gerenuks. These sorts of mappings can be captured in a system that has operations that manipulate variables instantiated with instances, but they cannot be captured by simple recurrent networks or feedforward networks.



Appendix 3: Weight-sharing

One might try to implement a learning algorithm that changes the weights leading to one output node in the same way as the weights leading to some other output node, a technique of "weight sharing" or "copying weights". In functions like identity, however, such a solution turns out to be inadequate, because the weights that are appropriate for activating the output node encoding (say) 1 are not appropriate for activating the output node encoding (say) 2. If the weights feeding the output node encoding 1 were identical to the weights feeding the output node encoding 2, the same set of inputs would activate both nodes. (For example, input 2 might activate both the "1 node" and the "2 node".) Thus, rather than learning the universally quantified one-to-one mapping f(x) = x for all x, the network would implement a many-to-fewer mapping, in which f(9) = f(10), leaving the problem of learning one-to-one mappings unsolved.
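The inadequacy of copied weights follows immediately from the arithmetic of a linear unit; the weight values below are illustrative:

```python
# Sketch: if the weights into the "1" output node are copied to the "2"
# output node, the two nodes compute the same function of every input,
# so the network cannot realize a one-to-one mapping.
shared_weights = [0.4, -0.2, 0.7]   # illustrative values

def activation(weights, inputs):
    """Net input to a linear output unit."""
    return sum(w * x for w, x in zip(weights, inputs))

for inputs in ([1, 0, 0], [0, 1, 0], [0, 0, 1]):
    out_1 = activation(shared_weights, inputs)   # node encoding "1"
    out_2 = activation(shared_weights, inputs)   # node encoding "2", same weights
    assert out_1 == out_2   # the two outputs can never be distinguished
```

Whatever the input, the two nodes rise and fall together, so the network can never activate the "2" node without equally activating the "1" node.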

Appendix 4: What happens when old items are new to a function

Training independence does not just limit the ability of networks to generalize to novel feature values; it also limits how models generalize new functions to known feature values that are novel with respect to the new function. For example, consider the model shown in Figure 0.1. Here, the network has to learn one of two tasks, either to copy the input or to invert each bit in the input (i.e., turn each 1 into a zero and each zero into a one, e.g., 1110 into 0001).

Figure 0.1 Model that performs either the copy function or the invert function, depending on the activity of the rightmost input node.
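The two target functions from Figure 0.1 can be stated directly; the task bit below plays the role of the rightmost input node in the figure:

```python
# The copy and invert functions from Figure 0.1, sketched directly.
def copy_bits(bits):
    return list(bits)

def invert_bits(bits):
    # Turn each 1 into a 0 and each 0 into a 1.
    return [1 - b for b in bits]

def target(bits, task_bit):
    # task_bit selects the function, like the rightmost input node.
    return invert_bits(bits) if task_bit else copy_bits(bits)

print(target([1, 1, 1, 0], task_bit=1))  # [0, 0, 0, 1]
print(target([1, 1, 1, 0], task_bit=0))  # [1, 1, 1, 0]
```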

In another experiment, I trained this network on inversion for all 16 possible inputs and then trained it on identity just for the inputs in which digit 4 equaled zero. As before, the network was unable to generalize identity to [1111], despite having had ample experience with the digit-4 input node in the inversion function. The problem of transferring from node to node is not restricted to untrained nodes: networks never transfer from node to node.

Similarly, I found that training the simple recurrent network on sentences such as the bee sniffs the blicket (as well as sentences such as the bee sniffs the rose, the bee sniffs the tulip, and so forth), does not help the network infer that the continuation to a blicket is a ____ is blicket. What matters is the training space with respect to some particular function. Any item that is outside the training space with respect to that function – even if it is within the training space of some other function – will not be generalized properly.

While simple recurrent networks and feedforward networks do not generalize to items that are familiar but new to some particular function, humans can. For example, consider the example given in Box 2, a novel function defined entirely over well-practiced letters.

W X H -> H X W [Read as "given the input W X H, the output is H X W"]

H K X -> X K H

Box 2

What output would correspond to the test item "S O C"? On virtually any account of how inputs are represented, the input "S O C" lies outside the training space of the function that is illustrated. For example, "S" would be outside the training space of the function whether "S" were encoded locally as +S, through sets of high-level distributed features such as +curved-line, or through sets of low-level distributed features such as +pixel-in-the-center-of-the-bottom-row.

Despite the fact that the input "S O C" lies outside the training space of the function that is illustrated, people readily infer that the corresponding output is "C O S". (People can even extend the reversal relation to novel squiggles that are not letters at all.) Indeed, human generalizations of the reversal pattern seem to be indifferent as to whether they are applied to items that are within the function's training space (more Ws, Xs, and so forth) or to items that are outside it (Ss, Os, Cs, etc.).
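The reversal relation that people extract from Box 2 is naturally stated as an operation over a variable, one that applies to any sequence of tokens, trained or novel:

```python
# The reversal relation from Box 2, stated over a variable: it applies
# to every sequence of tokens, regardless of whether the tokens appeared
# in the training examples.
def reverse(tokens):
    return tokens[::-1]

print(reverse(["W", "X", "H"]))  # ['H', 'X', 'W'] -- a training example
print(reverse(["S", "O", "C"]))  # ['C', 'O', 'S'] -- the novel test item
```

This is the sense in which a system with operations over variables captures the mapping for free, while a network that learns each output node independently does not.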

In sum, humans can generalize outside the training space of a particular function on the basis of limited experience, whereas neither feedforward networks nor simple recurrent networks can generalize outside the training space, regardless of how large the training space is. Although humans do have more learning experience than any particular network, the difference in how they generalize outside the training space is not due to differences in experience, but rather to differences in the computations that they perform.



Back-propagation algorithm. An error-correction algorithm that adjusts connection weights in multilayer perceptrons. Introduced by Rumelhart et al. {1986 #151}.

Distributed representation. A representation in which some or all possible inputs or outputs are encoded by the activation of more than one unit, and in which units participate in the encoding of more than one possible input. For example, the word cat might be represented by a set of units that encode parts of the sound or meaning of the word cat. See also localist representation.

Eliminative connectionism. The branch of connectionism that seeks to explain human cognition in terms of connectionist networks that do not implement (map onto) symbol-manipulating models. See also implementational connectionism.

Error-correction algorithm. A mechanism for adjusting the weights in a multilayer perceptron based on the discrepancy between a target output supplied by an external "teacher" and the output that the network actually produces, given some input.

Family-trees model. Introduced by Hinton, this model attempts to learn about family relationships using a multilayer perceptron with localist input and output representations, trained on examples of kinship relations drawn from two family trees. The model is illustrated here:

Implementational connectionism. The branch of connectionism that seeks to explain human cognition in terms of connectionist networks that implement symbol-manipulating systems. See also eliminative connectionism.

Localist representation. A representation in which each possible input or output is encoded by the activation of a single unit. For example, the word cat might be encoded by activating a single node. See also distributed representations.

Multilayer perceptron. A multilayer network that adds one or more hidden layers to the perceptron architecture. Interest in this architecture was renewed when Rumelhart et al. {1986 #151} introduced the back-propagation algorithm. A simple multilayer perceptron is illustrated here:

Perceptron. Introduced by Rosenblatt, this type of network has a layer of input units connected directly to a layer of output units. Criticized by Minsky and Papert.

Sentence-prediction model. A simple recurrent network with localist input and output representations that has been applied to the task of learning sequences of words. The input to the model is a sentence fragment such as cats chase ___; the output is the network’s predictions about what continuations are likely. A sketch is given here:


Simple recurrent network. Introduced by Elman, this is a type of connectionist network similar to a feedforward network but enhanced with a "context layer" that allows the model to keep track of temporal context and hence learn something about sequences. A sample model is illustrated above in the discussion of the sentence-prediction model.

Symbol-manipulation. A theory of human cognition that holds that the mind performs computations on mentally represented symbols, and has a means for representing and extending abstract relationships between variables, a way of distinguishing tokens (instances) of a class (type) from one another and from the types to which the tokens belong, and a way of representing complex, hierarchically structured combinations of symbols.

Symbols. Internal encodings of all members of an equivalence class.

