Lecturer's Précis - Seidenberg and McClelland (1989)
"A Distributed, Developmental Model of Word Recognition and Naming"
Copyright Notice:
This material was written and published in Wales by Derek J. Smith (Chartered Engineer). It forms part of a multifile e-learning resource, and subject only to acknowledging Derek J. Smith's rights under international copyright law to be identified as author may be freely downloaded and printed off in single complete copies solely for the purposes of private study and/or review. Commercial exploitation rights are reserved. The remote hyperlinks have been selected for the academic appropriacy of their contents; they were free of offensive and litigious content when selected, and will be periodically checked to have remained so. Copyright © 2004, Derek J. Smith (Chartered Engineer).
First published 09:00 GMT 30th November 2004
Although this paper is reasonably self-contained, it is primarily designed to be read as a subordinate file to our e-paper on "The History of the Psycholinguistic Flow Model". Readers new to dyslexia studies may prefer to familiarise themselves with the Marshall and Newcombe (1973) paper before proceeding (alternatively, use the [glossary] links as and when you come to them). Readers unfamiliar with the basic concepts of the science of neural networks should also pre-read our e-handout on "Connectionism", noting carefully the difference between "Perceptrons" and "neural networks".
About the Authors: Mark S. Seidenberg [homepage] is Hebb Professor of Psychology and Cognitive Neuroscience at the University of Wisconsin at Madison, and James L. McClelland is Professor of Psychology and Computer Science at Carnegie Mellon University [about James L. McClelland].
1 - Introduction
The authors began by pointing to the "immediate", or "on-line", nature of successful text processing. By this they meant that the mind's centres of higher understanding allow themselves to become more or less totally dedicated to the task of interpreting what the eyes are looking at, even "as the signal is perceived" (p523). When the cognitive system focuses its attention in this way, it allows approximately five words per second to be decoded, and it achieves this speed by simultaneously activating more than one type of encoding principle. The paper then went on to present and explore a "computational model" of that encoding.
2 - The Problem
Moving on to what they call "the scope of the problem", the authors reminded us just how complex are the tasks which need to be mastered by learners of English. These are .....
(1) The Alphabetic Principle:
This is the principle that "there are systematic correspondences between the spoken and written forms of words" (p524). Unfortunately, this correspondence is less than exactly consistent from word to word [for example, the words "have" and "gave" look as if they should be exact rhymes, but are not, because by convention the terminal -e on "have" does not lengthen /a/ to /A/ as it does in "rave" (/rAv/), "Dave" (/dAv/), and "wave" (/wAv/)].
(2) Orthographic Redundancy: This is the principle that not all combinations of the 26 letters of the alphabet are equally permissible. [Example: The letter string "ing" is seen more commonly in English than the string "xtr", whilst "xxw" does not exist at all.] As a result, word recognition processing can make good use of the resulting information "redundancy". [For more on the topic of redundancy, see our e-paper on "The Relevance of Shannonian Communication Theory to Biological Communication".]
(3) Morphological Redundancy: This is the principle that not all combinations of the morphemes [glossary] of the language are equally permissible. [Example: The morpheme "ing" is allowed to follow word-roots like "eat", but not the other way around.] Again, our word recognition processes can make good use of the resulting information redundancy.
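The idea of orthographic redundancy can be made concrete by counting three-letter strings in a word list. The following is a minimal sketch only: the word list is a made-up handful of words chosen for illustration, and the helper name `letter_trigrams` is hypothetical rather than anything taken from the paper.

```python
from collections import Counter

def letter_trigrams(word):
    """Return every run of three adjacent letters in a word."""
    return [word[i:i + 3] for i in range(len(word) - 2)]

# A tiny illustrative word list (an assumption, not a real corpus).
sample_words = ["eating", "sing", "ring", "extra", "thing", "wave", "having"]

# Tally how often each three-letter string occurs across the list.
counts = Counter(
    trigram
    for word in sample_words
    for trigram in letter_trigrams(word.lower())
)

for trigram in ("ing", "xtr", "xxw"):
    print(trigram, counts[trigram])
```

Even on so small a sample the counts are uneven, which is the redundancy a word recognition process could exploit; a realistic estimate would of course need a proper corpus.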
The theoretical issue was then how to identify what processes are responsible for coping with irregular and unfamiliar words, and the authors asserted that the connectionist approach was "ideally suited" to do this, thanks to its ability to reduce cognition to "weights on connections between simple processing units in a distributed processing network" (p525). A computational model was accordingly a "minimal model of lexical processing, in which as little as possible of the solution of the problem is built in and as much as possible is left to the mechanisms of learning" (p525). As to the proposed computational architecture, the authors went for "a single uniform procedure [note this phrase - Ed.] for computing a phonological representation from an orthographic representation that is applicable to irregular words and nonwords as well as regular words" (p525; bold emphasis added).
3 - Previous Models
Before proceeding, the authors reviewed previous models of the reading process, paying due credit to the classic "PDP Model", of which McClelland himself was co-author (McClelland and Rumelhart, 1986; Rumelhart and McClelland, 1986). They specifically mentioned Morton's (1969) "Logogen Model" and Coltheart's (1978) "Dual Route Model", but presented their own "single uniform procedure" as a "single route" alternative to the dual route option [Coltheart replies to this criticism in Coltheart, Curtis, Atkins, and Haller (1993), if interested].
ASIDE:
What we have here, therefore, is a major theoretical confrontation between the "connectionist" tradition and the considerably older "cognitive neuropsychological" tradition. The former approach has its roots in university computer science departments in the 1950s, and sets out to explain cognition as the workings of arrays of artificial neurons, while the latter emerged in hospital neurology wards around the middle of the 19th century, and sets out to decipher the mind's modular architecture. Seidenberg and McClelland are here playing down the importance of that modularity.
4 - The Model
Seidenberg and McClelland then presented their model. This was a "hidden unit" connectionist design in which an "interlevel" of storage units existed only to maintain "weighted" associations between two more functionally specific layers, rather like the cheese in a sandwich. The full system contained 400 orthographic units designed to encode word spellings, 460 phonological units designed to encode word pronunciations, and 200 "hidden units" designed to link the one to the other. As with all connectionist models of the late 1980s, what went on inside it and what gave it its ability to learn was a piece of software known as a "back propagation algorithm" .....
Key Concept - Back Propagation Algorithm:
A back propagation algorithm is the key to getting a hidden layer neural network to learn from experience. It is a set of rules built into the simulation which is invoked every time the network is given a training trial. The network is initially allowed to take a random guess at an output when given an input. The strengths of the output units for the guess are then deducted from the strengths for the known right answer. This gives an error score for each output unit, which is then fed back into the hidden layer to adjust the weightings of the units there. Output from the second trial is thus marginally closer to the known right answer. A second set of error scores is then computed, and again fed back into the hidden layer, resulting in further improvements to the weightings. This learning cycle may then be repeated until no further improvements are to be obtained.
In the present set-up, where the network was being used to simulate the act of reading out loud, the input array needed to be loaded with a representation of a written word at the same time that the output array was loaded with a representation of the corresponding sound patterns. Here is the authors' detailed explanation of what this entails .....
"Each word-processing trial begins with the presentation of a letter string, which the simulation program then encodes into a pattern of activation over the orthographic units [.....] Next, activations of the hidden units are computed on the basis of the pattern of activation at the orthographic level. For each hidden unit, a quantity called the net input is computed; this is simply the activation of each input unit [multiplied by] the weight on the connection from that input unit to the hidden unit, plus a bias term [..... which] may be thought of as an extra weight [.....]. The activation of the unit is then determined from the net input using a nonlinear function called the logistic function [.....] Once activations over the hidden units have been computed, they are used to compute activations for the phonological units and new activations for the orthographic units based on feedback from the hidden units." (p527; italics original.)
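The computation described in this passage (net input, bias term, logistic function, then feedforward and feedback activations) can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' simulation code: the layer sizes come from the text, but the starting weight range, the zero biases, the random input pattern, and all function and variable names are invented for the example.

```python
import math
import random

def logistic(net_input):
    """Squash a net input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net_input))

def layer_activations(inputs, weights, biases):
    """Compute the activations of one layer of receiving units.

    weights[j][i] is the weight from sending unit i to receiving unit j;
    biases[j] is the bias term (the "extra weight") of receiving unit j.
    """
    activations = []
    for unit_weights, bias in zip(weights, biases):
        # Net input: each sending activation times its weight, plus the bias.
        net_input = sum(a * w for a, w in zip(inputs, unit_weights)) + bias
        activations.append(logistic(net_input))
    return activations

def random_weights(n_receiving, n_sending, rng):
    # Small random starting weights; the range is an illustrative assumption.
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_sending)]
            for _ in range(n_receiving)]

rng = random.Random(0)
N_ORTH, N_HIDDEN, N_PHON = 400, 200, 460  # layer sizes given in the text

w_orth_to_hidden = random_weights(N_HIDDEN, N_ORTH, rng)
w_hidden_to_phon = random_weights(N_PHON, N_HIDDEN, rng)
w_hidden_to_orth = random_weights(N_ORTH, N_HIDDEN, rng)  # feedback pathway

# A made-up orthographic pattern (not a real word encoding).
orth_pattern = [1.0 if rng.random() < 0.1 else 0.0 for _ in range(N_ORTH)]

# Biases are set to zero here purely to keep the sketch short.
hidden = layer_activations(orth_pattern, w_orth_to_hidden, [0.0] * N_HIDDEN)
phon = layer_activations(hidden, w_hidden_to_phon, [0.0] * N_PHON)
orth_feedback = layer_activations(hidden, w_hidden_to_orth, [0.0] * N_ORTH)
```

With untrained weights the output pattern is meaningless; it is the repeated adjustment of the weights during training that turns this same computation into a pronunciation.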
The model was trained in this way on a vocabulary of 2,897 words, of which 13 were "homographs", that is to say, words like "wind" which have two pronunciations (/wind/ and /wInd/), and which therefore required two separate entries, namely <"wind"/wind/> and <"wind"/wInd/>. This meant that 2,884 unique input words needed to be paired with 2,897 unique output pronunciations. Not all words were presented on every trial, however, in an attempt to simulate the effects of word frequency upon vocabulary growth [see Gathercole (1990) on the relationship between working memory skills and vocabulary growth in children, if interested]. Once the network had completed its training regime, its performance was tested in a number of different ways .....
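The frequency-sensitive presentation schedule can be sketched as follows. The précis does not state the sampling rule, so the rule used here (presentation probability rising with the logarithm of word frequency) is an assumption made for illustration, as are the frequency counts, the four-word vocabulary, and the function names.

```python
import math
import random

def presentation_probability(frequency, max_frequency):
    """Chance that a word is presented in one pass through the vocabulary.

    Assumed rule: probability grows with log frequency, scaled so that
    the most frequent word is always presented.
    """
    return math.log(frequency + 2) / math.log(max_frequency + 2)

def sample_epoch(vocabulary, rng):
    """Pick the words to present on one pass, commoner words more often."""
    max_frequency = max(vocabulary.values())
    return [word for word, frequency in vocabulary.items()
            if rng.random() < presentation_probability(frequency, max_frequency)]

# Made-up frequency counts, for illustration only.
vocabulary = {"the": 50000, "have": 4000, "pint": 10, "brooch": 1}

rng = random.Random(0)
presentations = {word: 0 for word in vocabulary}
for _ in range(1000):
    for word in sample_epoch(vocabulary, rng):
        presentations[word] += 1

for word, count in presentations.items():
    print(f"{word}: presented on {count} of 1000 passes")
```

Under a scheme of this kind a rare word receives far fewer training trials than a common one, which is what allows the simulation to show frequency effects.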
5 - Results Obtained with the Model
NOTE: At this point the authors present some dozen pages of results, under as many subheadings. We comment here only on selected observations, and refer anyone interested in more than this briefest of summaries to the full-length original.
The first set of tests probed the network's ability to "read" written words out loud, and in general terms only 77 of the 2,897 input words turned out to have been learned incorrectly. Moreover, when these 77 errors were inspected, it turned out that 14 of them were down to human mis-keyings in the input data rather than any inadequacy on the part of the neural network. The remaining 63 errors were then classified as follows .....
Error Type | n | Examples
Regularisation Errors | 14 | "Brooch" rhymed with "hooch", etc.
Incorrect Vowels | 25 | "Beau" rhymed with "stew", "frost" with "bust", etc.
Incorrect Consonants | 24 | "gel", "gin", and "gist" all given a hard "G"
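The figures quoted above can be cross-checked with a little arithmetic. The snippet below uses only the numbers given in this section; the variable names are our own.

```python
total_words = 2897      # training vocabulary
total_errors = 77       # words learned incorrectly
keying_errors = 14      # traced to mis-keyed input data

error_types = {
    "regularisation errors": 14,
    "incorrect vowels": 25,
    "incorrect consonants": 24,
}

# Errors attributable to the network itself.
genuine_errors = total_errors - keying_errors

# The three categories in the table should account for all of them.
categories_total = sum(error_types.values())

percent_correct = 100 * (total_words - total_errors) / total_words
genuine_error_rate = 100 * genuine_errors / total_words

print(f"genuine errors: {genuine_errors} (table total: {categories_total})")
print(f"words learned correctly: {percent_correct:.1f}%")
print(f"genuine error rate: {genuine_error_rate:.1f}%")
```

The three error categories sum to the 63 residual errors, and the network's share of the errors comes to a little over 2% of the vocabulary.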
So what was it that the network had learned? Well, at the physical level, as we have seen, all that had changed were the weightings of individual connection pathways. These weightings had been recalculated for a given stimulus pair every time it had been presented, generally in the direction of a more precise memory trace. Not surprisingly, therefore, given that words were presented different numbers of times, the general trend here was that "the main influence on the phonological output is the number of times the model was exposed to the word itself" (p540). This mirrors real life, where "common familiar words yield faster naming latencies than do uncommon less familiar words" (p533).
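The link between number of exposures and precision of the memory trace can be illustrated with a toy network trained by textbook back propagation. This is a generic sketch, not the authors' implementation: the network is far smaller than theirs, the input and target patterns are made up, and the squared-error score, learning rate, and class name `TinyNetwork` are assumptions chosen for the example.

```python
import math
import random

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

class TinyNetwork:
    """One hidden layer, trained by textbook back propagation."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)

        def small():
            return rng.uniform(-0.5, 0.5)

        # The last weight in each row acts as the bias term.
        self.w_hidden = [[small() for _ in range(n_in + 1)]
                         for _ in range(n_hidden)]
        self.w_out = [[small() for _ in range(n_hidden + 1)]
                      for _ in range(n_out)]

    def forward(self, inputs):
        hidden = [logistic(sum(w * a for w, a in zip(row, inputs + [1.0])))
                  for row in self.w_hidden]
        outputs = [logistic(sum(w * a for w, a in zip(row, hidden + [1.0])))
                   for row in self.w_out]
        return hidden, outputs

    def error_score(self, inputs, target):
        """Sum of squared differences between right answer and guess."""
        _, outputs = self.forward(inputs)
        return sum((t - o) ** 2 for t, o in zip(target, outputs))

    def train_once(self, inputs, target, rate=0.5):
        hidden, outputs = self.forward(inputs)
        # Error per output unit: right answer minus guess, scaled by the
        # slope of the logistic function at that unit.
        out_deltas = [(t - o) * o * (1.0 - o)
                      for t, o in zip(target, outputs)]
        # Feed the output errors back to the hidden units through the
        # output weights as they stood before this update.
        hid_deltas = [h * (1.0 - h) * sum(d * row[j]
                                          for d, row in zip(out_deltas, self.w_out))
                      for j, h in enumerate(hidden)]
        for row, d in zip(self.w_out, out_deltas):
            for j, a in enumerate(hidden + [1.0]):
                row[j] += rate * d * a
        for row, d in zip(self.w_hidden, hid_deltas):
            for i, a in enumerate(inputs + [1.0]):
                row[i] += rate * d * a

net = TinyNetwork(n_in=4, n_hidden=3, n_out=2)
spelling = [1.0, 0.0, 1.0, 0.0]  # made-up input pattern
sound = [1.0, 0.0]               # made-up target pattern

# Record the error score after 0, 10, and 100 presentations of the pair.
errors_by_exposure = {}
for exposures in range(101):
    if exposures in (0, 10, 100):
        errors_by_exposure[exposures] = net.error_score(spelling, sound)
    net.train_once(spelling, sound)

for exposures, error in errors_by_exposure.items():
    print(f"after {exposures:3d} presentations: error score {error:.4f}")
```

The error score falls as the pair is presented more often, which is the toy-scale counterpart of the frequency effect described above.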
The model was also deployed in a particularly clever way to simulate human reading problems. By careful technical analysis of the 200-unit hidden layer array, those units were identified which were ON for the irregular word "pint". 22 such units were identified, 10 of which were activated by "pint" alone, 8 by "pint" and "mint", 1 by "pint" and "said", and 3 by "pint", "mint", and "said". The pattern here was described as follows .....
"These snapshots of the hidden units indicate that they reflect generalisms concerning the regularities in the lexicon encoded by the weights on connections. Similarly spelled rhymes activated the largest number of common units (LINT/MINT = 14), similarly spelled nonrhymes a smaller number of common units (PINT/MINT = 8), and unrelated words a smaller number still (LINT/SAID and PINT/SAID both = 1)." (p542)
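One way to reproduce the "pint" tallies is to treat each word's active hidden units as a set and count the units shared by particular combinations of words. In the sketch below the unit index numbers are hypothetical, chosen only so that the group sizes match the counts reported in the text, and only units that are ON for "pint" are modelled. The helper `shared_only_by` is our own naming.

```python
# Hypothetical unit indices, sized to match the counts given in the text.
pint_alone = set(range(0, 10))          # 10 units ON for "pint" alone
pint_and_mint = set(range(10, 18))      # 8 units ON for "pint" and "mint"
pint_and_said = {18}                    # 1 unit ON for "pint" and "said"
all_three = set(range(19, 22))          # 3 units ON for all three words

# Active units per word, restricted to units that are also ON for "pint".
active = {
    "pint": pint_alone | pint_and_mint | pint_and_said | all_three,
    "mint": pint_and_mint | all_three,
    "said": pint_and_said | all_three,
}

def shared_only_by(words, active_units):
    """Units ON for every listed word and for no other word in the table."""
    shared = set.intersection(*(active_units[w] for w in words))
    for other, units in active_units.items():
        if other not in words:
            shared -= units
    return shared

print(len(active["pint"]))                                  # total for "pint"
print(len(shared_only_by(["pint"], active)))                # "pint" alone
print(len(shared_only_by(["pint", "mint"], active)))        # pint + mint only
print(len(shared_only_by(["pint", "said"], active)))        # pint + said only
print(len(shared_only_by(["pint", "mint", "said"], active)))
```

Counting in this exclusive fashion gives the PINT/MINT and PINT/SAID figures of 8 and 1; counting every unit the two words have in common would add the three units shared by all three words to each figure.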
Additional experiments were carried out with cut-down variants of the original network, in an attempt to simulate a child with specific learning difficulties in learning to read. The general result was as follows .....
"Eliminating one half of the hidden units, then, produced a general decrement in performance; more important, higher frequency words produced the patterns associated with lower frequency words [and] it performed more poorly on words whose pronunciations are not entirely regular." (p547; italics original)
6 - The Authors' Conclusions
The authors' general conclusion was that high levels of performance could be obtained using their "single uniform procedure" simulation, implying that a workable form of lexical memory can be provided by the weightings of a single hidden layer. This is in stark contrast to the "cog neuro" tradition mentioned above, where activations are distributed between the processing modules identified by a particular model .....
ASIDE:
The task of reading out loud can be accomplished via three processing routes, namely (1) the "semantic lexical" route, involving word recognition, understanding, and lexical speech output, (2) the "direct lexical" route involving word recognition and lexical speech output, but lacking understanding, and (3) the "non-lexical" route involving only a "sounded-out" form of speech output. Interested readers may care to trace out these pathways on a typical cognitive neuropsychological model. The Ellis and Young (1988) model is entirely typical of the genre, and has the particular advantage of having numbered components. Print out the diagram, and then colour highlight the following three routes <5-6-7-4-8-10-9>, <5-6-14-8-10-9>, and <5-15-9>. Note how component #4 - understanding - is only involved in the semantic lexical route.
In other words, Seidenberg and McClelland's network is not built up from individual word nodes - "there are no logogens" (p560; bold emphasis added) .....
Key Concept - Logogens, Lexicons, and The Lexicon:
Seidenberg and McClelland are here using Morton's (1979) term for a modality-specific store of whole known words. This term (but not the concept to which it relates) has since declined in popularity in favour of the psycholinguistic usage of the term "lexicon", but even that needs using with care. The problem is that there is a major difference in usage between the psycholinguistic definition of a lexicon (as one specific word store among several) and the linguistic definition of the lexicon - the sum total of the mind's verbal knowledge. The received cog neuro model of the lexicon contains four specific lexicons clustered around a central semantic lexicon. This "four-plus-one" layout can clearly be seen in the Kussmaul (1878) processing hierarchy, and, more than a century later, remains at the heart of all modern psycholinguistic transcoding models (see, for example Ellis (1982) and the aforementioned Ellis and Young (1988)). [For a longer discussion of this point, compare the entries for logogen, lexicon (linguistic definition), and lexicon (psycholinguistic definition) in our Psycholinguistics Glossary, and for the developmental provenance of these models see our e-paper on "The History of the Psycholinguistic Flow Model".]
The authors gave especially detailed consideration to their lexical decision data because it is at the heart of the "dual route (DR) theory" of reading out loud [glossary], and in the light of their data were happy to discard the DR Model on the grounds that .....
"Our model, and others like it, offers an alternative that dispenses with this two-route view in favour of a single system." (p564)
7 - Evaluation
Here are the key arguments put forward in this paper, in revision point format .....
As for the theoretical confrontation between single route connectionism and dual route cognitive neuropsychology, it is not possible - as of 1989 - to issue final judgement. Students need firstly to consider the reply from Coltheart, Curtis, Atkins, and Haller (1993), which (a) fielded an upgraded dual route model [complete with "broadband" internal communication channels!], and (b) identified a number of areas which Seidenberg and McClelland's model had not catered for. Even more significant are papers on "modular connectionism" which started to emerge in the early 1990s. Norris (1991) and Hinton, Plaut, and Shallice (1993) are typical of this sub-genre of artificial intelligence studies, and both obtain significantly better simulations of real-life cognition when their neural networks are strapped together into higher order architectures.
8 - References
See the Master References List