CHI '95 Proceedings

Situated Facial Displays: Towards Social Interaction

Akikazu Takeuchi, Taketo Naito


Sony Computer Science Laboratory, Inc.
3-14-13 Higashi-Gotanda
Shinagawa-ku, Tokyo 141, Japan
TEL: +81-3-5448-4380
takeuchi@csl.sony.co.jp

Computer Science Department
Keio University
3-14-1, Hiyoshi, Kohoku-ku
Yokohama, Kanagawa 223, Japan
TEL: +81-45-560-1150
naito@mt.cs.keio.ac.jp

© ACM

Abstract

Most interactive programs assume interaction with a single user. We propose the notion of "Social Interaction" as a new interaction paradigm between multiple humans and computers. Social interaction requires, first, that a computer has a model of multiple participants; second, that its behaviors are not only determined by internal logic but also affected by perceived external situations; and finally, that it actively joins the interaction. An experimental system with these features was developed. It consists of three subsystems: a vision subsystem that processes motion video input to examine the external situation, an action/reaction subsystem that generates actions based on the internal logic of a task and situated reactions triggered by the perceived external situation, and a facial animation subsystem that generates a three-dimensional face capable of various facial displays. In an experiment using the system with a number of subjects, we found that subjects generally tried to interpret the facial displays of the computer. Such involvement prevented them from concentrating on the task. We also found that subjects never recognized situated reactions of the computer that were unrelated to the task, although they unconsciously responded to them. These findings seem to imply subliminal involvement of the subjects caused by facial displays and situated reactions.

Keywords:

User interface design, multimodal interfaces, facial expression, anthropomorphism, subliminal involvement

MOTIVATION AND INTRODUCTION

Recently, there has been growing interest in the agent/agency model [15]. Maes and her colleagues are developing interface agents [6] which are (semi-)automatic/autonomous and help users to perform various tasks. An autonomous system has the ability to control itself and to make its own decisions. Autonomy is essential to surviving in a dynamically changing world. It is the subject of research in many areas including robotics, artificial life and artificial ecosystems. However, although autonomy is vital to surviving in the real world, it is only concerned with "self." It is selfish by nature. It does not seem to work well in human society. Socialness is a higher-level concept defined above the concept of an individual, and is the style of interaction between the individuals in a group. Socialness can be applied to the interaction between humans and computers, and possibly to that between multiple computers. However, most interactive programs assume interaction with a single user. Programs of this type all share the following features:

The dialogical and transformational views are well suited to applications such as question-answering on databases and drawing. It is true that there are many domains in which single-user-oriented systems fit very well.

However, daily conversation is not always functional. One example is the co-constructive conversation studied by Chovil [2]. Co-constructive conversation is that of a group of individuals in which, say, people talk about the food they ate in a restaurant a month ago. There are no special roles (like a chairperson) for the participants to play. They all have the same role. All participants try to contribute to recalling the food by relating their memories about the food, adding comments, and correcting the others' impressions. Turn-taking is controlled by eye-contact, facial expression, body gestures, voice tones, and so on. The conversation includes many subconversations, some of them existing in parallel and dividing the group into subgroups. The conversation terminates only when all the participants are satisfied with the conclusion.

Co-constructive conversation closely approximates day-to-day conversation. Conversation is a social action. Suchman said that communication is not a symbolic process that happens to go on in real-world settings, but a real-world activity in which we make use of language to delineate the collective relevance of our shared environment [10]. Our research goal is to create a computer that can participate in social conversations such as the co-constructive conversation described above. To this end, we propose the notion of "Social Interaction" as a new interaction paradigm between humans and computers.

SOCIAL INTERACTION AND SITUATEDNESS

In contrast to the single-user-oriented view, social interaction has the following features:

Face-to-face communication is a real-world activity, in which we use various communication modalities, including verbal and nonverbal ones. Multimodal signals are consciously or unconsciously directed to other participants in social interaction. People can perceive these signals and utilize them to coordinate social interaction among a group. Participants' actions such as body gestures, eye-contact, facial expressions, and coughing are major resources in a situation. Social interaction is multimodal interaction, hence it is essentially situated interaction.

We are attempting to bring facial displays into computer-human interaction as a new modality that makes the interaction tighter and more social [12]. The system we developed has a synthesized face capable of various facial displays and shows an appropriate display depending on the current conversational situation. In this paper we describe a new system extended towards social interaction. Figure 1 illustrates an overview of the new system. The new system can handle a multiple-user model to some extent. It is situated in the sense that it can perceive the external world through a video camera, and its behaviors are determined by internal logic as well as by perceived situations. Its behaviors are expressed as facial displays, eyeball movements, and/or head motions. Since it uses a situated behavior model, its set of possible facial displays is called situated facial displays. The system is not a transformational question-answering system, but an autonomous system that observes users' actions and performs appropriate actions in appropriate situations.

FIGURE 1. Social Interaction

RELATED WORK

Within the context of CSCW, especially in videoconferencing research, social interaction among human participants has been studied, and the importance of gaze/eye-contact has been reported (for example [5]). There has also been an attempt to develop a videoconferencing system that supports local eye-contact between two participants [9] to promote social interaction. Our work runs in parallel with these activities, and tries to develop a synthetic agent capable of social interaction with humans and other agents.

Takeuchi et al. reported that communicative facial displays help interaction between a human and a computer [12]. Walker et al. studied how facial expression affected users' performance/productivity and reported that a stern face is good for productivity, but creates a bad impression [13]. This implies that handling a human face, or humanity in general, is not a simple task, since it brings up various unclear human factors at the same time. Brennan et al. reported that interaction with an anthropomorphized agent takes more effort [1]. Nass et al. reported that even computer experts respond to computers as if computers were social actors [7].

There are several application domains in which the notion of social interaction is essentially useful. These domains share the following features:

One example is a car navigation system where it is important for a computer to take an action when the driver takes a wrong path. There is also a clear distinction between the driver and the human navigator. "Backseat Driver" is a successful example in this application domain [8]. It talks to the driver whenever the car comes close to or passes by an intersection where the car should turn.

Another example is a tutoring system, where there is a clear distinction between the tutor and the student. A computer can actively help both of them by watching the situation and giving appropriate advice.

SYSTEM ARCHITECTURE

The new system was developed for verifying the notion of situated facial displays and social interaction. The system consists of three subsystems: a vision subsystem that processes motion video input, an action/reaction subsystem that determines the system's behavior based on internal logic and vision input, and a facial animation subsystem that generates a three-dimensional face with various facial displays (Figure 1). Currently, all subsystems are running on an SGI 320VGX equipped with VideoLab as a live video grabber.

Vision

The vision subsystem gets image data through a video camera that is placed on top of the display (see Figure 1). There is a steep trade-off between the processing speed and the content of the processing: the more information is extracted, the slower the processing becomes. Since slow reactions are essentially useless in a real-world setting, we have to force ourselves to reduce time-consuming processing. In the current implementation, the subsystem can detect and track several users' positions simultaneously.

The vision subsystem keeps as a reference frame one still frame, which is the image of the scene with no human. By detecting differences between incoming images and the reference frame, and segmenting detected regions, moving objects are extracted in real-time. Assuming that the only moving objects are humans in the room, we can determine 2D positions of the humans in the image. Using the camera position and direction, the positions are translated into 3D orientations, which can be applied to eyeball rotation and face rotation when drawing a 3D face. Since this process is fast enough, the subsystem can keep track of several users' positions at 30 frames per second.
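The reference-frame differencing described above can be sketched as follows. This is only a minimal illustration under stated assumptions, not the original implementation: the function names `detect_movers` and `pixel_to_angles`, the difference threshold, the minimum region size, the 4-connected flood fill used for segmentation, and the 60-degree horizontal field of view of a pinhole camera are all hypothetical choices that the paper does not specify.

```python
from collections import deque
import math


def detect_movers(frame, reference, threshold=30, min_pixels=2):
    """Return (row, col) centroids of regions that differ from the reference frame.

    `frame` and `reference` are equally sized 2D lists of grayscale values.
    The threshold and minimum region size are illustrative assumptions.
    """
    rows, cols = len(frame), len(frame[0])
    # Step 1: mark pixels whose brightness differs enough from the empty scene.
    changed = [[abs(frame[r][c] - reference[r][c]) > threshold
                for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    centroids = []
    for r in range(rows):
        for c in range(cols):
            if not changed[r][c] or seen[r][c]:
                continue
            # Step 2: flood-fill one 4-connected region of changed pixels.
            queue, pixels = deque([(r, c)]), []
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and changed[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            # Step 3: keep regions large enough to be a person; report the centroid.
            if len(pixels) >= min_pixels:
                cy = sum(p[0] for p in pixels) / len(pixels)
                cx = sum(p[1] for p in pixels) / len(pixels)
                centroids.append((cy, cx))
    return centroids


def pixel_to_angles(cy, cx, rows, cols, hfov_deg=60.0):
    """Convert an image position to (yaw, pitch) in degrees.

    Assumes a pinhole camera looking straight ahead with a hypothetical
    horizontal field of view; positive yaw is to the right of the image
    center and positive pitch is above it.
    """
    focal = (cols / 2) / math.tan(math.radians(hfov_deg) / 2)  # focal length in pixels
    yaw = math.degrees(math.atan((cx - (cols - 1) / 2) / focal))
    pitch = math.degrees(math.atan(((rows - 1) / 2 - cy) / focal))
    return yaw, pitch
```

In such a sketch, each centroid returned by `detect_movers` would be passed to `pixel_to_angles`, and the resulting angles could drive the eyeball and head rotation of the rendered face.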

Action/Reaction

The outputs of the vision subsystem are representations of external events observed visually, and these are sent to the action/reaction subsystem. Although the information obtained is limited, we can extract the following:

The system's behaviors are expressed in terms of facial displays, eyeball movements and head motions. The system only has a synthesized human head. The face can express various facial displays, including those listed in [12]. In addition, the eyeballs and the head itself can also express various communicative expressions such as gaze, wink, nod and so on. A "gaze" can establish eye-contact, which is a powerful communication signal. Eye-contact/gaze is especially important when several individuals are involved in social interaction, since eye-contact can indicate who a message is directed to.

In [12], the computer's facial displays are determined by a correspondence between conversational situations and communicative facial displays, where "conversational situations" means different logical contexts for processing a user's inquiry. The current system was extended to incorporate the idea of "situatedness." Instructions to the facial animation subsystem are first determined by the current logical context. This forms a basic behavior. The other factor influencing the system's behavior is the "physical situation." The physical situation comprises various kinds of information, including information about users' positions, users' gestures, and users' facial displays (not all of them are handled by the current system). For example, when one person moves quickly, the action "look at that person" is invoked. Actions invoked in this way are called reactions. In contrast, actions invoked based on logical contexts are simply called actions. In general, reactions can always override actions. Reactions are quick motions and never last long.
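The override rule between actions and reactions can be illustrated with a small arbiter. This is a sketch under assumptions rather than the system's actual code: the class and method names, the 0.5-second reaction duration, and the speed threshold are hypothetical, chosen only to show a short-lived reaction taking priority over the basic action and then expiring.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Behavior:
    name: str
    target: Optional[str] = None  # e.g. which player to look at


class BehaviorArbiter:
    """Pick the behavior to display: a live reaction overrides the basic action."""

    def __init__(self, reaction_duration: float = 0.5):
        # How long a reaction lasts, in seconds (hypothetical value).
        self.reaction_duration = reaction_duration
        self._action = Behavior("neutral")
        self._reaction: Optional[Behavior] = None
        self._reaction_expires = 0.0

    def set_action(self, behavior: Behavior) -> None:
        """Basic behavior derived from the current logical context."""
        self._action = behavior

    def observe_motion(self, person: str, speed: float, now: float,
                       speed_threshold: float = 1.0) -> None:
        """Physical situation: quick movement triggers 'look at that person'."""
        if speed > speed_threshold:
            self._reaction = Behavior("look_at", person)
            self._reaction_expires = now + self.reaction_duration

    def current(self, now: float) -> Behavior:
        """Return the reaction while it is still live, otherwise the action."""
        if self._reaction is not None and now < self._reaction_expires:
            return self._reaction
        self._reaction = None
        return self._action
```

A caller would set the action whenever the logical context changes, feed vision events to `observe_motion`, and ask `current` once per rendered frame for the behavior to send to the facial animation subsystem.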

Facial Animation

The face is modeled three-dimensionally. The current face is composed of approximately 500 polygons. The face may be rendered using a skin-like surface material by applying a texture map taken from a photograph or a video frame. In 3D computer graphics, a facial display is realized by the local deformation of the polygons representing the face. Waters showed that deformation that simulates the action of the muscles underlying the face looks more natural [14]. We use the numerical equations defined by Waters to simulate muscle dynamics. Currently, 16 muscles and 10 parameters (controlling mouth opening, jaw rotation, eye movement, eyelid opening, and head orientation) are incorporated. The facial modeling and animation system is based on Takeuchi and Franks' work [11].
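The idea of a muscle locally deforming nearby polygon vertices can be sketched as below. This is deliberately not a reproduction of the equations in [14]: it is a simplified stand-in that assumes a straight muscle running from a fixed attachment point (`head`) to a skin insertion point (`tail`), a cone-shaped zone of influence, and hypothetical linear and sine falloff terms. The function name and all parameters are illustrative assumptions.

```python
import math


def contract_muscle(vertices, head, tail, contraction, influence_deg=40.0):
    """Pull vertices near a muscle toward its fixed attachment point `head`.

    vertices: list of (x, y, z) tuples; head/tail: (x, y, z) muscle endpoints.
    contraction: fraction (0..1) by which a fully affected vertex moves toward head.
    influence_deg: half-angle of the cone around the muscle axis (assumed value).
    """
    axis = [t - h for t, h in zip(tail, head)]
    length = math.sqrt(sum(a * a for a in axis))
    cos_limit = math.cos(math.radians(influence_deg))
    moved = []
    for v in vertices:
        offset = [vi - hi for vi, hi in zip(v, head)]
        dist = math.sqrt(sum(o * o for o in offset))
        if dist == 0.0 or dist > length:
            moved.append(tuple(v))  # outside the radial zone of influence
            continue
        cos_angle = sum(o * a for o, a in zip(offset, axis)) / (dist * length)
        if cos_angle < cos_limit:
            moved.append(tuple(v))  # outside the angular cone
            continue
        # Falloffs: strongest along the muscle axis and near the insertion end,
        # fading to zero at the cone boundary and at the attachment point.
        angular = (cos_angle - cos_limit) / (1.0 - cos_limit)
        radial = math.sin((dist / length) * math.pi / 2)
        pull = contraction * angular * radial
        moved.append(tuple(vi - pull * oi for vi, oi in zip(v, offset)))
    return moved
```

Applying several such muscles in sequence to the face mesh, each with its own contraction value, is one way a facial display could be composed from a small set of parameters.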

Gaze control is also implemented in the facial animation subsystem using a video camera fixed on top of the display. Using this, eye-contact between a user and the computer is achieved, although it is not real three-dimensional eye-contact since the computer face is projected onto a 2D display screen. Real eye-contact is selective in the sense that only one pair of participants gets the feeling of eye-contact. In contrast, eye-contact with a computer face on the screen is like eye-contact with a TV news announcer. As long as the announcer looks straight ahead, every person watching the TV screen has the feeling of eye-contact.

EXPERIMENT

Method

Using the current prototype system, we performed the experiment under several different conditions to investigate the effect of situatedness and to study how much socialness was achieved.

Task. The task of the system is to watch a game and give the participants some information nonverbally. The selected game is a card matching game similar to "Concentration," in which sixteen cards are arranged arbitrarily with their faces turned down. Each card has a pattern with a different shape and color on the underside. Two human players play alternately. In one turn, a player turns over two cards in sequence using a mouse pointer. When the shapes on the underside of the two cards match, the player gets a point and the cards are removed. Otherwise the cards are turned face down again in the same positions and the next player takes a turn. Color Plate 1 shows a screen image of this game.

COLOR PLATE 1. A scene image of the card game experiment with a synthesized face.

At the beginning of the game, the computer chooses one player as its ally and the other as its opponent. As the computer is watching the game, it performs two types of behaviors. The first is a game action: this is an action that is associated with the current logical context of the game. The second is a situated reaction: this is a reaction triggered by the current physical situation. Table 1 lists both game actions and situated reactions implemented in the current system.

TABLE 1. The behaviors of the current system: the first column lists names of actions and reactions; the second lists their descriptions; the third shows how they were recognized by the subjects as a percentage.

Subjects. The prototype system was tested on 7 volunteer subjects from a group of university-level computer science students. The average age of the subjects was 22.

Other conditions. In addition to the experimental condition described so far, which is denoted by SF, three other conditions, NF, SA and NA were prepared. They are all listed in Table 2. The intention in designing these four conditions is first to examine the effect of situatedness by turning on and off the situated reactions. In conditions SF and SA, situated reactions were on, but in NF and NA they were turned off. The second is to examine the effect of a realistic face by comparing it with an arbitrary 3D object, in this case, a 3D arrow. A 3D arrow was considered as equivalent to a face without expressions (Color Plate 2). This arrow could perform actions mainly expressed by the head's direction such as "looking at" and "tracking." One exception is "smiling at a card." In this case, the arrow can flash its color to indicate whether or not an ally touches the correct card. SF and NF denote the conditions with a face, while SA and NA denote the conditions with an arrow.

COLOR PLATE 2. A scene image of the card game experiment with a 3D arrow.

TABLE 2. Four conditions

Seven subjects were grouped into four couples (one subject played twice). Each couple was requested to perform the experiment under the four different conditions. The experiment was recorded by a video camera so that we could extract scenes in which a subject looked at the computer face or reacted to one of the computer's actions/reactions. Figure 2 illustrates the video recording setup. With a half mirror, we could obtain the view from the center of the screen, which made analysis of the subjects' eye movements easy.

FIGURE 2. Video recording configuration

Result

Although the number of subjects is too small to draw statistical conclusions, we obtained a number of interesting observations. The computer shows various actions and reactions. The third column in Table 1 shows how they were recognized by the subjects. It shows that the subjects noticed relatively well those actions/reactions which were closely related to the game, such as "Happy," "Shrug" and "Smiling at a card," while they never noticed some of the situated reactions, although they unconsciously responded to those reactions, as shown in Color Plate 3.

COLOR PLATE 3. Comparison of time charts of the two games. Left: a situated face. Right: a situated arrow.

Color Plate 3 shows two diagrams which illustrate the behaviors of the human players and the computer face during the experiment under the SF and SA conditions. They were created by analyzing a video recording of the game. In the picture, the vertical axis is the time axis. Horizontal gray stripes indicate opponent turns, except the first one, which is the opening scene of the game, while white zones between gray zones are ally turns. In each diagram, there are three columns: blue, red, and green with yellow. The blue column represents the ally's behavior, where short branches from a fat trunk indicate periods when the ally looked at the computer's face. Similarly, the red column represents the opponent's behavior, with left branches indicating periods when the opponent looked at the computer's face. The green column represents the computer face's behavior, especially the direction in which it was looking. It consists of three vertical subcolumns, which represent "looking at the ally," "looking at cards," and "looking at the opponent," from left to right, respectively. Yellow branches included in the left and right subcolumns indicate "Looking at a player" reactions. The picture shows eye-contact as line-contact between blue and green (yellow) branches, or red and green (yellow) branches. By close examination of the picture, we found that:

Table 3 shows subjects' evaluation of usefulness and entertainment factors for each condition. Smaller numbers in each cell indicate better scores. As the table indicates, an arrow was recognized as a useful and reliable tool, while a face was accepted for entertainment or fun.

TABLE 3. Usefulness and entertainment evaluation

DISCUSSION AND FUTURE DIRECTIONS

From the analysis of the results of the experiment, we have the following observations:

It is interesting that subjects never recognized situated reactions that were not directly related to the game, although they unconsciously responded to them. This implies a subliminal involvement. People respond to a system with a human-like face in a clearly different way from a system with an arrow. The difference is in their attitude. When people face a human-like system, they try to read subtle signals, interpret them, and respond to them. All this happens both consciously and unconsciously.

Facial displays are subtle expressions by nature. People naturally try to interpret them. This causes slow responses and unconscious reactions when people use systems with human-like faces. Similar evidence was reported by Brennan [1]. Anthropomorphism has been criticized for being inefficient and requiring more effort [3]. However, this is because people appreciate human images and try to interpret them. Such involvement is not a negative effect. We surmise that once people are accustomed to synthesized faces, performance becomes more efficient, and a long partnership further improves performance. Human-like characterization is one good form for autonomous agents, because people are accustomed to interacting with other humans.

It is clear that facial displays are useful in entertainment applications. In addition, facial displays can be used in more subtle, complex and therefore sophisticated situations. The GUIDES system is a successful example [4]. Although it uses the whole human video image, it exemplifies the usage of human images as indications of cultural background.

The present paper suggests that people's attitude towards computers with human-like faces is completely different from their attitude towards ordinary computers with desktop-style interfaces. Understanding this difference is important when designing a computer system capable of social interaction with humans. Situated reaction seems to affect users' behavior and to get participants involved in interaction, but we do not know how it can contribute to the content of interaction. We need more research on application domains.

Acknowledgments

We thank Steve Franks for his early contribution to the facial animation subsystem. Special thanks to Keith Waters for his original animation system. We thank Mario Tokoro and our colleagues at Sony CSL for their encouragement and discussion.

References

  1. Brennan, S. E. and Ohaeri, J. O. Effect of Message Style on Users' Attributions toward Agents, In CHI'94 Conference Companion Human Factors in Computing Systems (Boston, April 24-28, 1994), ACM Press, pp. 281-282.
  2. Chovil, N. Discourse-oriented facial displays in conversation. Research on Language and Social Interaction, 25 (1991/1992) 163-194.
  3. Don, A. and Brennan, S. and Laurel, B. and Shneiderman, B. Anthropomorphism: from Eliza to Terminator 2, In Proc. CHI'92 Human Factors in Computing Systems (Monterey, May 3-7, 1992), ACM Press, pp. 67-70.
  4. Don, A. and Oren, T. and Laurel, B. GUIDES 3.0, In Proc. CHI'91 Human Factors in Computing Systems (New Orleans, April 27-May 2, 1991), ACM Press, pp. 447-448.
  5. Ishii, H. and Kobayashi, M. and Grudin, J. Integration of Interpersonal Space and Shared Workspace: ClearBoard Design and Experiments. ACM Trans. on Information Systems, 11, 4 (Oct. 1993) 349-375.
  6. Maes, P. and Kozierok, R. Learning interface agents, In Proc. AAAI'93 (1993), MIT Press, Cambridge, pp. 459-465.
  7. Nass, C. and Steuer, J. and Tauber, E. R. Computers are Social Actors, In Proc. CHI'94 Human Factors in Computing Systems (Boston, April 24-28, 1994), ACM Press, pp. 72-78.
  8. Schmandt, C. M. and Davis, J. R. Synthetic speech for real time direction giving, In Digest of Technical Papers, IEEE ICCE, 1989, pp. 288-289.
  9. Sellen, A. and Buxton, B. Using Spatial Cues to Improve Videoconferencing, In Proc. CHI'92 Human Factors in Computing Systems (Monterey, May 3-7, 1992), ACM Press, pp. 651-652.
  10. Suchman, L. A. Plans and Situated Actions, Cambridge University Press, Cambridge, 1987.
  11. Takeuchi, A. and Franks, S. A Rapid Face Construction Lab. Tech. Report. SCSL-TR-92-010, Sony Computer Science Laboratory, Inc., Tokyo, 1992.
  12. Takeuchi, A. and Nagao, K. Communicative Facial Displays as a New Conversation Modality, In Proc. INTERCHI'93 Human Factors in Computing Systems (Amsterdam, April 24-29, 1993), ACM Press, pp. 187-193.
  13. Walker, J. and Sproull, L. and Subramani, R. Using a Human Face in an Interface, In Proc. CHI'94 Human Factors in Computing Systems (Boston, April 24-28, 1994), ACM Press, pp. 85-91.
  14. Waters, K. A muscle model for animating three-dimensional facial expression, In Computer Graphics 21, 4 (July 1987), 17-24.
  15. Special Issue on Intelligent Agents, Communications of the ACM 37, 7 (July 1994).