A scientist called Carol Fowler has apparently done research that mine is related to. I was told so by Dominic Massaro at the Visual Prosody workshop at the MPI in Nijmegen. He said that my findings on sign perception are similar to findings on speech perception. Specifically, the delay of about 90 ms between detection and recognition that I found for sign perception was also found for speech perception by Fowler. But which literature should I consult? At Haskins Laboratories she was part of a stream of research in the 1980s that treated speech as articulatory gestures:
Carol Fowler proposed a direct realism theory of speech perception: listeners perceive gestures not by means of a specialized decoder, as in the motor theory, but because information in the acoustic signal specifies the gestures that form it.
I am a little worried about the title though, in particular the phrase ‘Visual Prosody’. It appears to suggest that the main role of visual information in language is prosodic, which, at least for sign language and gestures, is not the case in my opinion. But the abstract does mention other aspects of visual information in language, so it must be all right if I add my perspective.
Some people are actively interested in the stuff I am doing in my PhD studies, or at least ask me questions about it. I usually tell them about my first experiment. That experiment was entirely about the difference between meaningless movements I call fidgeting and meaningful gestures, in this case sign language signs.
“Press the spacebar as soon as you see a sign”
It struck me then and it still strikes me that a bunch of people talking respond to each other so appropriately. Many, many times did I see people reacting to gestures of all sorts. Maybe just a little head nod or a palm-up gesture, or a raising of the eyebrows. And how often do you see anyone accidentally responding to a movement that was not intended to communicate after all?
Imagine the following chitchat:
You: “Nice weather huh?”
Her: “Yeah” (and makes some sort of movement)
You: “What do you mean, you think I am crazy?” (misinterpreting the movement)
Her: “I didn’t do anything, what are you talking about?” (now starts thinking you are crazy)
It just doesn’t happen.
No matter how much we talk and interact, it hardly ever goes wrong.
I will take the rare exceptions as proving the rule.
So, I set out to see if I could test this in a lab. How fast can people make judgements about the status of a movement? I used sign language signs and fidgeting, and told people to press a button as soon as they saw a sign.
And I found that people could do that very well and very fast. Even non-signers could do it. (In case you want to read more: the journal Gesture recently accepted my paper on these results, hooray!)
If you want you can repeat the experiment in real life whenever you (and a friend) watch a conversation. Just put up your finger as soon as you see the talking people make a gesture. I bet you will both skip the fidgeting and spot the gestures.
Now, imagine a gesture-recognizing computer trying to do the same trick and ignore fidgeting. Currently, computers that are programmed to recognize gestures simply assume any movement is a gesture candidate, and will try to classify it against their vocabulary. In speech recognition one might see a similar problem. People say things like “ehm” or “ehr..” during an utterance. They may also cough, sneeze or clear their throat. But is that really comparable to fidgeting?
I am tempted to think that they are quite different. Coughing or sneezing is a bodily function, whereas fidgeting is usually just a ritualized, watered-down version of some bodily function, if any. The reason behind it is quite different. Saying “ehm” is mostly a way to fill the gap, or keep the floor, in a poorly planned utterance. It is in a way as much a deliberate part of the communication as the words used. Nevertheless, the computer's task is more or less the same: it must withstand the disruptions and continue recognizing the words (or gestures) as if nothing happened. Both “ehm” and fidgeting should be ignored without damaging other processes. And that is quite a challenge as it is. In speech recognition several techniques have been invented to cope with “ehm” and out-of-vocabulary (OOV) words, most importantly ‘word spotting’ and ‘filler and garbage models’. Perhaps gesture recognition would do well to have a closer look at those techniques to start safely ignoring fidgeting?
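To make the garbage-model idea a bit more concrete, here is a minimal sketch in Python of how it might carry over to gestures. Everything in it is an assumption for illustration: the two-number "movement features", the gesture names, the Gaussian models and the `margin` threshold are all made up, and a real recognizer would more likely use sequence models. The point is only the decision rule: a movement counts as a gesture when some gesture model explains it clearly better than one broad "anything goes" model, and otherwise it is ignored as fidgeting.

```python
import math


def log_gaussian(x, mean, var):
    """Log-density of a diagonal Gaussian evaluated at feature vector x."""
    return sum(
        -0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


# Hypothetical gesture vocabulary: (mean, variance) per feature.
# The names and numbers are invented for illustration only.
GESTURE_MODELS = {
    "wave": ([1.0, 0.0], [0.05, 0.05]),
    "point": ([0.0, 1.0], [0.05, 0.05]),
}

# Hypothetical garbage model: one broad Gaussian that fits any movement
# moderately well, so it wins whenever no gesture model fits tightly.
GARBAGE_MODEL = ([0.5, 0.5], [1.0, 1.0])


def classify(x, margin=0.0):
    """Return the best gesture label, or None if the movement is rejected."""
    scores = {
        name: log_gaussian(x, mean, var)
        for name, (mean, var) in GESTURE_MODELS.items()
    }
    best = max(scores, key=scores.get)
    garbage_score = log_gaussian(x, *GARBAGE_MODEL)
    # Accept only if the best gesture beats the garbage model by the margin.
    if scores[best] - garbage_score > margin:
        return best
    return None


if __name__ == "__main__":
    print(classify([1.0, 0.05]))  # close to the "wave" model
    print(classify([0.5, 0.5]))   # fits no gesture well, so rejected
```

Raising `margin` makes the recognizer more conservative: it ignores more fidgeting, at the cost of missing more sloppy gestures.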