Wednesday, November 7, 2007

Paper #20 - Speech and Sketching: An Empirical Study of Multimodal Interaction

Paper:
Speech and Sketching: An Empirical Study of Multimodal Interaction
(Aaron Adler and Randall Davis)

Summary:
The problem this paper addresses is that a sketch alone has difficulty conveying the full meaning of a design in its early stages. The authors' goal is to create a more natural speech and sketch recognition system capable of understanding enough of the user's input. The limitations of current systems include: unnatural speech consisting of short commands, the system's inability to respond to or add to sketches, a lack of freehand drawing, and the requirement of knowing a fixed symbol library. A study was conducted with 18 students, who were asked to sketch a familiar floor plan, an AC/DC transformer, a full adder design, and their current digital design project for their class. The setup consisted of interactions between an experimenter and a participant sitting across from each other, each using a Tablet PC.

The focus of the analysis was how speech and sketching worked together when people interacted with each other. Observations fell into five categories. In the sketching observations, the analysis covered two aspects: stroke statistics and use of color. For the former, there were four types of strokes: creation, modification, selection, and writing. Percentage of ink turned out to be more reflective of the participants' drawing than raw stroke count, and even though selection strokes were the least common, they were still important because they were key to understanding the participant's input. In the language observations, speech tended to contain frequent word and phrase repetition; participants tended to respond to the experimenter's questions by repeating words used in the question, and speech utterances were also closely related to the sketch. In the multimodal observations, there were three variants of speech and sketch interaction: referencing a list of items, referencing written words, and coordination between the input modalities. In the question observations, participants would often make revisions or elaborate beyond what the question asked; questions caused participants to make the sketch more accurate, spurred explanations of other parts of the design, and encouraged more detailed explanations of the sketch. In the comment observations, participants' comments were not directly related to the sketch, yet were still valuable, since they would help a system understand the participants' actions.

The analysis also provided architectural implications for multimodal digital whiteboards: implementations should integrate knowledge of color and the significance of switching colors, and should take advantage of the relationship between sketch and speech.
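To make the stroke-statistics observation concrete, here is a minimal Python sketch (not from the paper) of how one might tally stroke counts versus percentage of ink per stroke type. The Stroke class, its fields, and the demo data are all hypothetical; the paper does not specify its data format.

```python
from collections import defaultdict
from dataclasses import dataclass
from math import hypot

# Hypothetical stroke record; kind is one of "creation", "modification",
# "selection", or "writing", and points are (x, y) pen samples.
@dataclass
class Stroke:
    kind: str
    points: list

def ink_length(stroke):
    """Total ink = sum of distances between consecutive pen samples."""
    return sum(hypot(x2 - x1, y2 - y1)
               for (x1, y1), (x2, y2) in zip(stroke.points, stroke.points[1:]))

def stroke_statistics(strokes):
    """Compare stroke-count percentage vs. ink percentage for each stroke type."""
    counts = defaultdict(int)
    ink = defaultdict(float)
    for s in strokes:
        counts[s.kind] += 1
        ink[s.kind] += ink_length(s)
    total_count = sum(counts.values()) or 1
    total_ink = sum(ink.values()) or 1.0
    return {kind: {"count_pct": 100.0 * counts[kind] / total_count,
                   "ink_pct": 100.0 * ink[kind] / total_ink}
            for kind in counts}

# A long creation stroke dominates the ink total even when short selection
# strokes are comparable in count, echoing the observation that ink percentage
# is more reflective of the drawing than raw stroke count.
demo = [Stroke("creation", [(0, 0), (100, 0), (100, 100)]),
        Stroke("selection", [(10, 10), (12, 12)]),
        Stroke("writing", [(0, 0), (5, 5), (10, 0)])]
print(stroke_statistics(demo))
```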

Discussion:
Like Oltmans' first paper (his master's thesis), this paper is also a multimodal paper combining speech and sketch, so a comparison of the two is naturally warranted. What I found most interesting about this paper is that it strives for natural speech, unlike the short-command speech used in Oltmans' first paper. What I didn't see on my first reading is a mechanism to handle the special case where a user misspeaks repeatedly and then eventually corrects himself. If a user erred in drawing, erasing functionality was available for correction, but there did not seem to be an analogous mechanism for correcting speech. A possible solution would be to have the system employ NLP techniques to separate the later, correct speech segments from the earlier, incorrect ones. The solution I propose is very vague in scope, though, and such parsing appears to be as difficult a problem as segmenting strokes in a stroke-based sketching approach.
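As a rough illustration of that vague idea (and definitely not anything from the paper), here is a naive Python sketch that keeps only the text after the last self-correction cue in a transcribed utterance. The cue list and the assumption that the final phrasing supersedes earlier segments are both hypothetical simplifications; real disfluency repair would need far more than a regular expression.

```python
import re

# Hypothetical cue phrases that often signal a spoken self-correction.
REPAIR_CUES = re.compile(r"\b(no wait|i mean|actually|sorry|rather)\b", re.IGNORECASE)

def resolve_self_correction(utterance):
    """Keep only the text after the last repair cue, assuming the speaker's
    final phrasing supersedes the earlier, misspoken segments."""
    parts = REPAIR_CUES.split(utterance)
    # re.split with a capturing group interleaves cues and segments;
    # the last element is whatever followed the final cue.
    return parts[-1].strip() if len(parts) > 1 else utterance.strip()

print(resolve_self_correction("the wire goes to the adder, no wait, I mean the multiplexer"))
# -> "the multiplexer"
```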

2 comments:

Grandmaster Mash said...

Speech errors are one of my main concerns with speech interfaces. I was in a user interface class a few years ago, and we created a speech interface using some sort of toolkit. When testing the interface, users whose speech was incorrectly recognized would say the exact same thing over and over again, either louder, slower, or more distinctly, to try to get the system to recognize the command. If they became frustrated, the constant "No!", "UGH!", and "#&*#!" would throw off the system even more.

Another issue is saying the wrong command/keyword. This happened a few times in my study; since Adler's system is supposed to be natural, the commands/keywords should be intuitive.

- D said...

So this comment is less mentally satisfying, but this reminds me of a time when I was trying to wade through those "talk to me" menus for my bank, trying to get to customer service. I got mad at the computer for just not putting me through to the help desk and called it retarded. It asked if I meant "Apply for a credit card." That made me laugh.

Speech recognition is relatively new as well, but it's getting good enough to put into phone systems. The more information you can get, the better you can make your system. All you have to do is perform a little better than chance, and the little bits here and there really start to add up.