



Sidney Fels and Geoffrey Hinton
Department of Computer Science
University of Toronto
Toronto, ON, Canada, M5S 1A4
ssfels@ai.toronto.edu, hinton@ai.toronto.edu
Glove-TalkII is a system which translates hand gestures
to speech through an adaptive interface. Hand gestures are mapped
continuously to 10 control parameters of a parallel formant speech
synthesizer.
The mapping allows the hand to act as an artificial vocal tract that
produces speech in real time. This gives
an unlimited vocabulary and multiple languages, in addition to direct
control of fundamental frequency and volume.
Currently, the best version of Glove-TalkII uses several input devices
(including a Cyberglove, a ContactGlove, a Polhemus sensor, and a foot pedal), a parallel formant speech
synthesizer and 3 neural networks. The gesture-to-speech task is
divided into vowel and consonant production by using
a gating network to weight the outputs of a vowel and a consonant
neural network.
The gating network and the consonant network are trained with examples from the
user.
The vowel network implements a fixed, user-defined relationship between
hand-position and vowel sound and does not require any training
examples from the user.
Volume, fundamental frequency
and stop consonants are produced with a fixed mapping from the input devices.
One subject has trained for about 100 hours to speak
intelligibly with Glove-TalkII.
He passed through eight distinct stages while learning to speak.
He speaks slowly, with speech quality similar to a text-to-speech synthesizer but with far more
natural-sounding pitch variations.
Keywords:
Gesture-to-speech device, gestural input, speech output, speech acquisition, adaptive interface, talking machine.
Many different possible schemes exist for converting hand gestures to speech. The choice of scheme depends on the granularity of the speech that you want to produce. Figure 1 identifies a spectrum defined by possible divisions of speech based on the duration of the sound for each granularity. What is interesting is that, in general, the coarser the division of speech, the smaller the bandwidth necessary for the user. In contrast, where the granularity of speech is on the order of articulatory muscle movements (i.e. the artificial vocal tract [AVT]), high bandwidth control is necessary for good speech. Devices which implement this model of speech production are like musical instruments which produce speech sounds. The user must control the timing of sounds to produce speech much as a musician plays notes to produce music. The AVT allows unlimited vocabulary, control of pitch and non-verbal sounds. Glove-TalkII is an adaptive interface that implements an AVT.
FIGURE 1: Spectrum of gesture-to-speech mappings based on the granularity of speech.
Translating gestures to speech using an AVT model has a long history beginning in the late 1700's. Systems developed include a bellows-driven hand-varied resonator tube with auxiliary controls (1790's [15]), a rubber-moulded skull with actuators for manipulating tongue and jaw position (1880's [1]) and a keyboard-footpedal interface controlling a set of linearly spaced bandpass frequency generators called the Voder (1940 [4]). The Voder was demonstrated at the World's Fair in 1939 by operators who had trained continuously for one year to learn to speak with the system. This suggests that the task of speaking with a gestural interface is very difficult and that the training times could be significantly decreased with a better interface. Glove-TalkII is implemented with neural networks, which allow the system to learn the user's interpretation of an articulatory model of speaking.
The obvious use of an AVT is as a speaking aid for speech-impaired people. Clearly, the difficulties encountered with this application include the extreme motor demands and time required to learn to use the device compared to other speech prostheses which require less control. Additionally, users must be able to hear in order to use the device effectively, which further limits the potential user group. Of course, care must be taken when considering these criticisms since AVTs potentially provide a much richer speech space than other coarse granularity systems, which may be preferable for some people. And, just as children who are learning to speak are willing to spend the large amount of time required to control their vocal tracts, it is not unreasonable to expect users to spend on the order of 100 hours to learn to speak with an AVT like Glove-TalkII. Besides the obvious application of Glove-TalkII, the neural network techniques used successfully can be applied to other complex interfaces where adaptation between a user's cognitive space and some objective space is required, for example, musical instrument design and telerobotics.
This paper first describes the Glove-TalkII system and then the experience of a single subject as he learned to speak with Glove-TalkII over 100 hours. Quantitative analysis of Glove-TalkII only provides a rough guide to the performance of the whole system. Observation of the single subject allows for qualitative analysis of Glove-TalkII to determine its effectiveness as a gesture-to-speech device.
The Glove-TalkII system converts hand gestures to speech, based on a gesture-to-formant model. The gesture vocabulary is based on a vocal-articulator model of the hand. By dividing the mapping tasks into independent subtasks, a substantial reduction in network size and training time is possible (see [5]).
Figure 2 illustrates the whole Glove-TalkII system. Important features include the three neural networks labeled vowel/consonant decision (V/C), vowel, and consonant. The V/C network is a 12--10--1 feed forward neural network with sigmoid activation functions (see Footnote 1).
FIGURE 2: Block diagram of Glove-TalkII: input from the user is measured by the Cyberglove, Polhemus, keyboard and foot pedal, then mapped using neural networks and fixed functions to formant parameters which drive the parallel formant synthesizer [00].
The V/C network is trained on data collected from the user to decide whether he wants to produce a vowel or a consonant sound. Likewise, the consonant network is trained to produce consonant sounds based on user-generated examples from an initial gesture vocabulary. The consonant network is a 12--15--9 feed forward network. It uses normalized radial basis function (RBF) [2] activations for the hidden units and sigmoid activations for the output units. In contrast, the vowel network implements a fixed mapping between hand-positions and vowel phonemes defined by the user. The vowel network is a 2--11--8 feed forward network. It also uses normalized RBF hidden units and sigmoid output units (see Footnote 2). As is typical with speech research, though, care must be taken when using quantitative analysis of the networks' performance to judge the performance of the whole system. For this reason, qualitative analysis of the single user is important.
Eight contact switches on the user's left hand designate the stop consonants (B, D, G, J, P, T, K, CH), because the dynamics of such sounds proved too fast to be controlled by the user. The foot pedal provides a volume control by adjusting the speech amplitude and this mapping is fixed. The fundamental frequency, which is related to the pitch of the speech, is determined by a fixed mapping from the user's hand height. The output of the system drives 10 control parameters of a parallel formant speech synthesizer every 10 msec. The 10 control parameters are: nasal amplitude (ALF), first, second and third formant frequency and amplitude (F1, A1, F2, A2, F3, A3), high frequency amplitude (AHF), degree of voicing (V) and fundamental frequency (F0).
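The weighting of the vowel and consonant network outputs by the V/C network can be illustrated with a minimal sketch. It assumes that the V/C output is a value in [0,1] where values near 1 favour the vowel network, and that both output vectors have been brought to the same length; the function name `blend_outputs` and the example values are hypothetical, and the treatment of the ninth consonant output is not covered here.

```python
def blend_outputs(gate: float, vowel_out: list[float], consonant_out: list[float]) -> list[float]:
    """Blend two formant parameter vectors with a scalar gate.

    Assumption (not stated in the paper): gate near 1.0 selects the vowel
    network, gate near 0.0 selects the consonant network.
    """
    if not 0.0 <= gate <= 1.0:
        raise ValueError("gate must lie in [0, 1]")
    if len(vowel_out) != len(consonant_out):
        raise ValueError("output vectors must have the same length")
    # Convex combination, parameter by parameter.
    return [gate * v + (1.0 - gate) * c for v, c in zip(vowel_out, consonant_out)]


# Hypothetical values for the eight shared formant parameters.
vowel = [40.0] * 8
consonant = [20.0] * 8
print(blend_outputs(0.25, vowel, consonant))
```

With a gate of 0.25 the result lies three quarters of the way towards the consonant vector, which is how intermediate hand shapes would produce intermediate sounds under this assumption.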
Once trained, Glove-TalkII can be used as follows: to initiate speech, the user forms the hand shape of the first sound she intends to produce. She depresses the foot pedal and the sound comes out of the synthesizer. Vowels and consonants of various qualities are produced in a continuous fashion through the appropriate co-ordination of hand and foot motions. Words are formed by making the correct motions; for example, to say ``hello'' the user forms the ``h'' sound, depresses the foot pedal and quickly moves her hand to produce the ``e'' sound, then the ``l'' sound and finally the ``o'' sound. The user has complete control of the timing and quality of the individual sounds.
The articulatory mapping between gestures and speech is decided a priori. The mapping is based on a simplistic articulatory phonetic description of speech [10]. The X,Y coordinates (measured by the Polhemus) are mapped to something like tongue position and height, producing vowels when the user's hand is in an open configuration (see Figure 3 for the correspondence and Table 1 for a typical vowel configuration). (In reality, the X,Y coordinates map more closely to changes in the first two formants, F1 and F2, of vowels. From the user's perspective, though, the link to tongue movement is useful.) Manner and place of articulation for non-stop consonants are determined by opposition of the thumb with the index and middle fingers. Table 1 shows the initial gesture mapping between static hand gestures and static articulatory positions corresponding to phonemes. The ring finger controls voicing. Only static articulatory configurations are used as training points for the neural networks, and the interpolation between them is a result of the learning but is not explicitly trained. For example, the vowel space interpolation allows the user to easily move within vowel space to produce diphthongs.
Ideally, the transitions should also be learned, but in the text-to-speech formant data we use for training [11] these transitions are poor, and it is very hard to extract formant trajectories from real speech accurately.
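The way normalized RBF units interpolate between static vowel targets can be sketched as follows. This is a simplified illustration, not the trained vowel network: it uses Gaussian units with a single shared width, omits the sigmoid output layer, and treats the target formant values directly as output weights. The two centres and their F1/F2 values are made up for the example.

```python
import math


def normalized_rbf(point, centers, width):
    """Gaussian RBF activations at `point`, normalized to sum to 1."""
    d2 = [(point[0] - cx) ** 2 + (point[1] - cy) ** 2 for cx, cy in centers]
    d2_min = min(d2)
    # Subtracting the smallest distance leaves the normalized values unchanged
    # and avoids dividing by zero when the point is far from every centre.
    raw = [math.exp(-(d - d2_min) / (2.0 * width ** 2)) for d in d2]
    total = sum(raw)
    return [r / total for r in raw]


def interpolate_formants(point, centers, targets, width=1.0):
    """Activation-weighted average of the per-centre target vectors."""
    acts = normalized_rbf(point, centers, width)
    n_out = len(targets[0])
    return [sum(a * t[i] for a, t in zip(acts, targets)) for i in range(n_out)]


# Hypothetical vowel centres (cm) and (F1, F2) targets in Hz.
centers = [(0.0, 0.0), (4.0, 0.0)]
targets = [(700.0, 1200.0), (300.0, 2300.0)]

# Moving the hand from one centre to the other glides between the targets.
for x in (0.0, 1.0, 2.0, 3.0, 4.0):
    print(x, interpolate_formants((x, 0.0), centers, targets))
```

Because the activations always sum to one, every hand position yields a weighted average of the static targets, which is the property that lets a smooth hand movement produce a diphthong-like glide.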
FIGURE 3: Hand-position to vowel sound mapping. The coordinates are specified relative to the origin at the sound A. The X and Y coordinates form a horizontal plane parallel to the floor when the user is sitting. The 11 cardinal phoneme targets are determined with the text-to-speech synthesizer.






TABLE 1: Examples of static gesture-to-consonant mapping. Note that each gesture corresponds to a static non-stop consonant phoneme generated by the text-to-speech synthesizer, and the neural networks provide the continuous interpolation.
The second input device is a Polhemus sensor which measures the X, Y, Z, roll, pitch and yaw of the hand relative to a fixed source. The small sensor is mounted on the back of the Cyberglove on the user's forearm; thus, the six parameters are independent of the user's wrist motion. The device measures the parameters at a frequency of 60 Hz.
The third input device is a ContactGlove. This device measures contact between points on the fingers and the thumb; these contacts are mapped to stop consonants.
The final input device used is a foot pedal. This device has a variable resistance which is an approximately linear function of foot depression. The variable resistance is used in a voltage divider circuit. The variable voltage is sampled by the A/D circuitry included with the computer at its lowest frequency of 8 kHz. Additionally, several elastic bands have been attached to the base to provide some force feedback and also to return the foot pedal to the fully undepressed position when the user's foot is lifted.
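The foot pedal signal path can be illustrated with a small sketch. It assumes a two-resistor divider in which the output is taken across the pedal resistance, and it assumes that the 8 kHz samples are averaged in blocks to give one value per 10 msec synthesizer frame; neither detail is specified above, and the component values are hypothetical.

```python
def divider_voltage(v_supply: float, r_fixed: float, r_pedal: float) -> float:
    """Output of a two-resistor divider, measured across the pedal resistance."""
    return v_supply * r_pedal / (r_fixed + r_pedal)


def frame_averages(samples, sample_rate_hz=8000, frame_rate_hz=100):
    """Average consecutive blocks of samples, one block per synthesizer frame."""
    per_frame = sample_rate_hz // frame_rate_hz  # 80 samples per 10 msec frame
    last_start = len(samples) - per_frame
    return [
        sum(samples[i:i + per_frame]) / per_frame
        for i in range(0, last_start + 1, per_frame)
    ]


# Hypothetical 5 V supply with a 1 kOhm fixed resistor.
print(divider_voltage(5.0, 1000.0, 1000.0))  # pedal resistance equal to the fixed one
print(len(frame_averages([0.0] * 8000)))     # one second of samples
```

Averaging is only one way to reduce the sample rate; taking the most recent sample in each frame would also fit the description.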
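The output device is a Loughborough Sound Images (LSI) parallel formant speech synthesizer. The device requires 16 speech parameters at 100 Hz to operate. The parameters are quantized to 6 bits (integer range [0,63]). The ten main parameters are: nasal amplitude (ALF); first, second and third formant frequency and amplitude (F1, A1, F2, A2, F3, A3); high frequency amplitude (AHF); degree of voicing (V); and fundamental frequency (F0).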
The first eight are called formant parameters and can be thought of as resonances of the vocal tract. The last two represent glottal controls. The parameters are sent to the synthesizer using a parallel port. Control of these parameters is sufficient to produce high quality speech. A text-to-speech synthesizer [11] is available which outputs formant parameters to drive the formant synthesizer and provide formant targets for training the neural networks.
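The 6-bit quantization of the synthesizer parameters can be sketched as below. The linear scaling from a physical range to [0,63] and the per-parameter ranges are assumptions made for illustration; the synthesizer's actual scaling tables are not given here. The parameter names follow the ten control parameters listed earlier.

```python
PARAMETER_NAMES = ["ALF", "F1", "A1", "F2", "A2", "F3", "A3", "AHF", "V", "F0"]


def quantize_6bit(value: float, lo: float, hi: float) -> int:
    """Map a value in [lo, hi] linearly onto the integers 0..63, clamping outliers."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    frac = (value - lo) / (hi - lo)
    frac = min(1.0, max(0.0, frac))
    return int(frac * 63 + 0.5)  # round to the nearest level


def quantize_frame(frame: dict, ranges: dict) -> list:
    """Quantize one 10 msec frame of parameters in a fixed order."""
    return [quantize_6bit(frame[name], *ranges[name]) for name in PARAMETER_NAMES]


# Hypothetical normalized ranges: every parameter scaled to [0, 1].
ranges = {name: (0.0, 1.0) for name in PARAMETER_NAMES}
frame = {name: 1.0 for name in PARAMETER_NAMES}
print(quantize_frame(frame, ranges))
```

Clamping keeps out-of-range network outputs from wrapping around, which matters when the values are written to a 6-bit field.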
All the software runs on a Silicon Graphics Personal Iris 4D/35. The Xerion Neural Network libraries simulate neural networks and run all the hardware devices [?]. After all the preprocessing and data collection, there is enough computing power remaining in each 10 msec interval to simulate networks with up to 1000 floating point weights, which is sufficient for Glove-TalkII to operate without significant interruption. Glove-TalkII requires about 200,000 floating point operations per second.
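As a rough consistency check, the weight counts implied by the three network architectures can be compared with the 1000-weight budget. The count below assumes one bias per sigmoid unit and, for each RBF hidden unit, one centre coordinate per input plus a single width; the actual parameterization may differ.

```python
def sigmoid_layer_params(n_in: int, n_out: int) -> int:
    """Weights plus one bias per unit (assumed)."""
    return n_in * n_out + n_out


def rbf_layer_params(n_in: int, n_hidden: int) -> int:
    """One centre per hidden unit plus one width per hidden unit (assumed)."""
    return n_in * n_hidden + n_hidden


vc_net = sigmoid_layer_params(12, 10) + sigmoid_layer_params(10, 1)    # 12--10--1
consonant_net = rbf_layer_params(12, 15) + sigmoid_layer_params(15, 9)  # 12--15--9
vowel_net = rbf_layer_params(2, 11) + sigmoid_layer_params(11, 8)       # 2--11--8
total = vc_net + consonant_net + vowel_net

flops_per_second = 200_000
frames_per_second = 100  # one frame every 10 msec
flops_per_frame = flops_per_second // frames_per_second

print(vc_net, consonant_net, vowel_net, total)  # 141 339 129 609
print(flops_per_frame)                          # 2000
```

Under these assumptions the three networks together use about 600 parameters, which is within the stated 1000-weight budget, and the quoted operation rate corresponds to about 2000 operations per frame.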
One subject has been trained extensively to speak with Glove-TalkII. The subject is an accomplished pianist who can speak. It was anticipated that his skill in forming finger patterns for playing the piano and his musical training would transfer positively to aid his learning to speak with Glove-TalkII. The subject went through 8 learning phases during speech acquisition. The phases are:
During his training, Glove-TalkII also adapted to incorporate changes required by the subject. Of course, his progression through the stages is not as linear as suggested by the above list. Some aspects of speaking were more difficult than others, so a substantial amount of mixing of the different levels occurred. Practice at the higher levels facilitated perfecting more difficult sounds that were still being practiced at the lower levels. Also, the stages are iterative; that is, at regular intervals the subject returns to lower levels to further refine his speech. An interesting research issue would be to determine how adaptation by the user interacts with adaptation by the interface.
The first training set consisted of 2830 examples of static consonants for training the consonant network and 3502 examples (2830 consonants and 672 vowels) to train the V/C network. These data were used to train Glove-TalkII's neural networks to map the subject's interpretation of the initial gesture vocabulary. During data collection, the subject memorized the static hand configuration to static consonant mapping. In addition, he provided hand configurations most suited to his hand that approximated the initial mapping. The simplicity of the data collection procedure and the ease with which the networks train are important for Glove-TalkII to be a useful adaptive interface.
Stages 3 and 4 are less distinct than suggested above. The subject practiced individual words and individual sounds simultaneously. This amalgamation became particularly prominent as the subject became more proficient with individual sounds. Data were collected for improving the consonant sounds during the many hours of practice during phases 3 and 4.
Glove-TalkII was retrained about 10 times during these initial phases, sometimes with more data for particular phonemes and other times with replacement data. For future subjects, good performance of the V/C network must be a key focus in the early stages of learning. Several retraining sessions were probably unnecessary, since the phoneme errors stemmed from mixtures of vowels and consonants caused by poor vowel/consonant distinctions.
Three more significant adjustments were made after the V/C network was performing properly. First, the I position on the vowel mapping was shifted to (5,0) from (4.5,1) which is midway between EE and E (see Figure 3). This modification was necessary because the subject had difficulty saying the I phoneme as in ``is'' reliably. This phoneme occurs frequently in English, so the difficulty caused significant intelligibility problems. The problem was probably due to the I and E sounds being placed relatively close to each other on the initial mapping; correspondingly, after the vowel network was trained, the area in the X-Y plane which produces the I sound was relatively too small. Second, the subject created another complete training set for every static phoneme sound once the V/C network performed well. The consonant network was trained with this new data set plus the data set used to train the good V/C network. Third, the entire vowel space was compressed by a factor of 0.75 since the subject found that he had to move his hand extensively in the X-Y plane to speak. A factor of 0.5 was also tried but was found to be too extreme. Another interesting attempt to provide a better vowel space was to form a radial representation of the static phonemes. Using A as the centre, the remaining 10 vowel phonemes were placed at equidistant positions along a ring 5 cm away from the A. Training data were generated by partitioning the plane into sectors formed by the mid-points between phonemes on the ring, and by specifying phoneme targets for each of the sectors sampled evenly with 60 points out to a radius of 10 cm. The subject found that, after 15 hours of training on the original vowel space, the new vowel space was already too different to integrate into his speech quickly. In comparison, shifting the I phoneme was easily integrated.
From this observation, it appears that users can adapt relatively quickly to the first mapping, after which it becomes difficult to alter the mapping radically without significant performance penalties.
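The two geometric manipulations of the vowel space described above, compression about the origin and the radial layout, can be sketched as follows. The sketch assumes that A sits at the origin, that the ten ring phonemes are spaced at equal angles starting on the positive X axis, and that each sector is bounded by the angular mid-points between neighbours; the ordering of phonemes around the ring and the exact sampling scheme are not specified and are left out.

```python
import math


def compress(point, factor=0.75):
    """Scale a vowel position towards the origin (the A sound)."""
    return (point[0] * factor, point[1] * factor)


def ring_positions(n=10, radius_cm=5.0):
    """Equally spaced positions on a ring around the origin."""
    step = 2.0 * math.pi / n
    return [(radius_cm * math.cos(k * step), radius_cm * math.sin(k * step)) for k in range(n)]


def sector_index(x, y, n=10):
    """Index of the ring position whose sector contains the point (x, y)."""
    step = 2.0 * math.pi / n
    angle = math.atan2(y, x) % (2.0 * math.pi)
    # Shift by half a sector so that boundaries fall midway between neighbours.
    return int((angle + step / 2.0) // step) % n


ring = ring_positions()
print([sector_index(x, y) for x, y in ring])  # each ring point lies in its own sector
print(compress((4.0, 2.0)))                   # (3.0, 1.5)
```

Training targets for the radial layout could then be generated by labelling sampled points with the phoneme of the sector that `sector_index` returns.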
At this point, Glove-TalkII was relatively stable, allowing the subject to produce static phonemes in sequence reliably. The subject could intelligibly say simple words that had been practiced. He was proficient at manipulating pitch within a word as well as producing difficult phoneme transitions, especially stop-to-vowel or stop-to-non-stop consonant transitions.
First, it is very important for vowel phonemes to sound correct to achieve proper enunciation of slow speech. With Glove-TalkII, it is difficult to know exactly which vowel will be produced until the foot pedal is depressed since there is poor absolute hand position feedback. Second, timing stop consonants is difficult since the stop phonemes are produced within 100 msec. Small timing errors produce unintelligible stop sounds. Third, if the R sound is sustained for too long, the speech produced sounds muffled and its intelligibility is impaired. Forty milliseconds should be a typical duration of the R sound, but this short timing is difficult to achieve since the static gesture required is hard to produce quickly (see Table 1).
Notice that when making the R sound, the index finger is very bent. To extend the finger requires a fairly large motion which must be made quickly to achieve the necessary transition. One technique to achieve the necessary transition speed is to form the R sound partially instead of completing the finger trajectory. This technique requires a large degree of finger control since the subject's index finger does not oppose the thumb in this case. Another alternative for some R sounds is to use one of the R sounding vowel sounds with a drop in pitch as in ``ar'' in the British pronunciation of ``farther''. Examples of R's that can be made in this fashion include UR, AR and ER as in ``curious'', ``are'', and ``curd'' respectively. This type of R sound is much easier to produce quickly. The difficulty for the subject is learning to know automatically which way to make the R sound. The subject uses a combination of a pitch drop and a short R burst as a safe alternative for unknown R contexts.
Would you like them in a house?
Would you like them with a mouse?
I would not like them in a house.
I would not like them with a mouse.
In addition, reading caused improvement in the three most difficult areas for producing intelligible speech: reliably producing vowel sounds, stop consonant clusters and R technique.
Several distinguishing features of the subject's speech were observed in informal listening tests. First, a strong contextual effect occurred. In particular, when a listener hears the subject speak for the first time, she sometimes does not understand a single word; rather, she perceives a long slurred speech-like utterance. However, once the listener is told what the utterance was and hears the subject say it again, the words become intelligible and distinguishable. Subsequent novel speech also becomes intelligible. This effect is similar to the adaptation people make when listening to speakers with strong accents or speech impairments. For familiar utterances, the subject's speech is very intelligible; for example, counting and saying the alphabet were never misunderstood even by listeners whose first language is not English. Second, the subject speaks slowly. Third, by using appropriate pitch control the subject produces some relatively natural-sounding speech compared to the text-to-speech synthesizer. As shown through interword pitch variation, proper control of pitch improves intelligibility of the subject's speech. Fourth, even with considerable practice some stops (i.e. P, T, K) are still difficult to discriminate in all contexts. While the R sound still sounds a bit muffled, after considerable practice (approximately 50 hours) the AR, ER, and UR sounds are made reliably in appropriate R-contexts, which alleviated the need for the consonant hand configuration for R to be used in these cases.
Some of the stages of learning the subject progressed through are similar to the stages encountered while learning to play a musical instrument. The stages can also be categorized according to Fitts' three stages of learning [8]: cognitive, associative and autonomous. Using Fitts' levels, stages 1--4 correspond to the cognitive level, stages 5--7 the associative level and stage 8 the autonomous level. One of the key features discovered while the subject was at levels 3 and 4 was that the V/C network must work well for the user to get adequate feedback about which phonemes he produces.
After 100 hours of practice the subject progressed from simple, barely speech-like noise to intelligible, somewhat natural-sounding speech. The subject exhibits two levels of performance, one for rehearsed speech and one for unrehearsed. Rehearsed speech sounds similar to slow text-to-speech synthesized speech with natural intonation contours. For unrehearsed speech the subject still has difficulty pronouncing polysyllabic words intelligibly. However, with a few tries he can say any utterance found in the English language. Additionally, he can sing and make non-vocal sounds. The subject can also speak other languages. Even though Glove-TalkII has been designed for English speech sounds, it is a relatively simple matter to modify Glove-TalkII to produce speech sounds from other languages.
The initial mapping for Glove-TalkII is loosely based on an articulatory model of speech.
An open configuration of the hand corresponds to an unobstructed vocal
tract, which in turn generates vowel sounds. Different vowel sounds are produced
by movements of the hand in a horizontal X-Y plane that corresponds to
movements of the first two formants which are roughly related to
tongue position.
Consonants other than stops are produced by closing the
index, middle, or ring fingers or flexing the thumb, representing constrictions
in the vocal tract.
Stop consonants are produced by pressing keys on the keyboard. F0
is controlled by hand height and speaking intensity by foot pedal depression.
Glove-TalkII learns
the user's interpretation of this initial mapping.
The V/C network and the consonant network learn the
mapping from examples generated by the user during phases of training.
The vowel
network is trained on examples computed from the user-defined mapping
between hand-position and vowels. The F0 and volume
mappings are non-adaptive.
One subject was trained to use Glove-TalkII. After 100 hours of practice
he is able to speak intelligibly. The subject passed through 8 distinct
stages while he learned to speak.
His speech is fairly slow (1.5 to 3 times slower than normal speech) and somewhat
robotic. It sounds similar to speech produced with a text-to-speech synthesizer
but has a more natural intonation contour which greatly
improves the intelligibility and naturalness of the speech.
Reading novel passages intelligibly usually requires several attempts,
especially with polysyllabic words. Intelligible spontaneous speech
is possible but difficult.
We thank Peter Dayan, Sageev Oore and Mike Revow for their contributions.
This research was funded by the Institute for Robotics and Intelligent Systems
and NSERC. Geoffrey Hinton is the Noranda fellow of the Canadian Institute
for Advanced Research.
Footnote 1: See [12] for an excellent introduction to neural networks and how they can be trained.
Footnote 2: Quantitative analysis of each of the various neural networks on typical training data can be found in [6].
Footnote 3: Calibration is performed infrequently due to the robustness of the Cyberglove sensors.