



Akikazu Takeuchi, Taketo Naito
Sony Computer Science Laboratory, Inc.
3-14-13 Higashi-Gotanda
Shinagawa-ku, Tokyo 141, Japan
TEL: +81-3-5448-4380
takeuchi@csl.sony.co.jp
Computer Science Department
Keio University
3-14-1, Hiyoshi, Kohoku-ku
Yokohama, Kanagawa 223, Japan
TEL: +81-45-560-1150
naito@mt.cs.keio.ac.jp
Most interactive programs assume interaction with a single user. We propose the notion of "Social
Interaction" as a new interaction paradigm between multiple humans and computers. Social interaction
requires, first, that a computer has a model of multiple participants; second, that its behaviors are
determined not only by internal logic but also by perceived external situations; and finally, that it
actively joins the interaction. An experimental system with these features was developed. It consists
of three subsystems: a vision subsystem that processes motion video input to examine the external
situation, an action/reaction subsystem that generates actions based on the internal logic of a task
and situated reactions triggered by the perceived external situation, and a facial animation subsystem
that generates a three-dimensional face capable of various facial displays. From an experiment using
the system with a number of subjects, we found that subjects generally tended to try to interpret the
facial displays of the computer. Such involvement prevented them from concentrating on the task. We
also found that subjects never recognized situated reactions of the computer that were unrelated to
the task, although they unconsciously responded to them. These findings seem to imply subliminal
involvement of the subjects caused by facial displays and situated reactions.
Recently, there has been growing interest in the agent/agency model [15]. Maes and her colleagues are
developing interface agents [6] which are (semi-) automatic/autonomous and help users to perform
various tasks. An autonomous system has the ability to control itself and to make its own decisions.
Autonomy is essential to surviving in a dynamically changing world. It is the subject of research in
many areas, including robotics, artificial life and artificial ecosystems.
However, although autonomy is vital to surviving in the real world, it is only concerned with "self." It
is selfish by nature. It does not seem to work well in human society. Socialness is a higher-level
concept defined above the concept of an individual, and is the style of interaction between the individuals
in a group. Socialness can be applied to the interaction between humans and computers, and possibly to
that between multiple computers.
However, most interactive programs assume interaction with a single user. Programs of this
type all share the following features:
The dialogical and transformational views are well suited to applications such as question-answering on
databases and drawing. It is true that there are many domains in which single-user-oriented systems fit
very well.
However, daily conversation is not always functional. One example is the co-constructive conversation
studied by Chovil [2]. Co-constructive conversation is that of a group of individuals in which, say,
people talk about the food they ate in a restaurant a month ago. There are no special roles (like a
chairperson) for the participants to play. They all have the same role. All participants try to
contribute to recalling the food by relating their memories about the food, adding comments, and
correcting the others' impressions. Turn-taking is controlled by eye-contact, facial expression,
body gestures, voice tones, and so on. The conversation includes many subconversations, some of them
existing in parallel and dividing the group into subgroups. The conversation terminates only when all
the participants are satisfied with the conclusion.
Co-constructive conversation closely approximates day-to-day conversation. Conversation is a social
action. Suchman said that communication is not a symbolic process that happens to go on in real-world
settings, but a real-world activity in which we make use of language to delineate the collective
relevance of our shared environment [10]. Our research goal is to create a computer that can
participate in social conversations such as the co-constructive conversation described above. To this
end, we propose the notion of "Social Interaction" as a new interaction paradigm between humans and
computers.
In contrast to the single-user-oriented view, social interaction has the following features:
Face-to-face communication is a real-world activity, in which we use various communication
modalities, including verbal and nonverbal ones. Multimodal signals are consciously or unconsciously
directed to other participants in social interaction. People can perceive these signals and utilize
them to coordinate social interaction among a group. Participants' actions such as body gestures,
eye-contact, facial expressions, and coughing are major resources in a situation. Social interaction
is multimodal interaction, hence it is essentially situated interaction.
We are attempting to bring facial displays into computer-human interaction as a new modality that
makes the interaction tighter and more social [12]. The system we developed has a synthesized face
capable of various facial displays and shows an appropriate display depending on the current
conversational situation. In this paper we describe a new system extended towards social interaction.
Figure 1 illustrates an overview of the new system. The new system can handle the multiple user model
to some extent. It is situated in the sense that it can perceive the external world through a video
camera and its behaviors are determined by internal logic as well as perceived situations. Its
behaviors are expressed as facial displays, eyeball movements, and/or head motions. Since it uses a
situated behavior model, the set of possible facial displays is called situated facial displays. The
system is not a transformational question-answering system, but an autonomous system that observes
users' actions and performs appropriate actions in appropriate situations.
FIGURE 1.
Social Interaction
Takeuchi et al. reported that communicative facial displays help interaction between a human and a
computer [12]. Walker et al. studied how facial expression affected users' performance/productivity and
reported that a stern face is good for productivity, but creates a bad impression [13]. This implies that
handling a human face, or humanity in general, is not a simple task since it brings up various unclear
human factors at the same time. Brennan et al. reported that interaction with an anthropomorphized
agent takes more effort [1]. Nass et al. reported that even computer experts respond to computers as if
computers were social actors [7].
There are several application domains in which the notion of social interaction is essentially useful.
These domains share the following features:
One example is a car navigation system where it is important for a computer to take an action
when the driver takes a wrong path. There is also a clear distinction between the driver and
the human navigator. "Backseat Driver" is a successful example in this application domain [8]. It
talks to the driver whenever the car comes close to or passes by an intersection where the car should
turn.
Another example is a tutoring system, where there is a clear distinction between the tutor and the
student. A computer can actively help both of them by watching the situation and giving appropriate
advice.
The new system was developed for verifying the notion of situated facial displays and social
interaction. The system consists of three subsystems: a vision subsystem that processes motion video
input, an action/reaction subsystem that determines the system's behavior based on internal logic and
vision input, and a facial animation subsystem that generates a three-dimensional face with various
facial displays (Figure 1). Currently, all subsystems are running on an SGI 320VGX equipped with
VideoLab as a live video grabber.
The vision subsystem gets image data through a video camera that is placed on top of the display (see
Figure 1). There is a steep trade-off between the processing speed and the content of the processing.
The more information, the slower the processing. Since slow reactions are essentially useless in a
real-world setting, we have to force ourselves to reduce time-consuming processing. In the current
implementation, the subsystem can detect and track several users' positions simultaneously.
The vision subsystem keeps as a reference frame one still frame, which is the image of the scene with
no human. By detecting differences between incoming images and the reference frame, and segmenting
detected regions, moving objects are extracted in real-time. Assuming that the only moving objects are
humans in the room, we can determine 2D positions of the humans in the image. Using the camera
position and direction, the positions are translated into 3D orientations, which can be applied to
eyeball rotation and face rotation when drawing a 3D face. Since this process is fast enough, the
subsystem can keep track of several users' positions at 30 frames per second.
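The processing chain described above (difference against a reference frame, segmentation of the
detected regions, and conversion of image positions into orientations) can be sketched as follows.
This is only an illustrative sketch, not the implementation used in the system: the function names,
the grey-level threshold, the 4-connected flood fill, the minimum region size, and the pinhole camera
model with a 60-degree horizontal field of view are all assumptions introduced here.

```python
import math
from collections import deque


def foreground_mask(frame, reference, threshold=30):
    """Mark pixels whose grey level differs from the reference frame.

    The threshold value is an assumption made for this sketch.
    """
    return [[abs(p - r) > threshold for p, r in zip(frow, rrow)]
            for frow, rrow in zip(frame, reference)]


def segment_regions(mask, min_pixels=4):
    """Group 4-connected foreground pixels and return one centroid per region."""
    height, width = len(mask), len(mask[0])
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Flood-fill one connected region starting at (x0, y0).
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            pixels = []
            while queue:
                x, y = queue.popleft()
                pixels.append((x, y))
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            if len(pixels) >= min_pixels:  # ignore tiny regions (noise)
                cx = sum(p[0] for p in pixels) / len(pixels)
                cy = sum(p[1] for p in pixels) / len(pixels)
                centroids.append((cx, cy))
    return centroids


def pixel_to_orientation(cx, cy, width, height, horizontal_fov_deg=60.0):
    """Convert an image position to (yaw, pitch) in degrees.

    Assumes a pinhole camera looking straight ahead; the field of view
    is a hypothetical value, not a parameter reported for the system.
    """
    focal = (width / 2) / math.tan(math.radians(horizontal_fov_deg) / 2)
    yaw = math.degrees(math.atan2(cx - width / 2, focal))
    pitch = math.degrees(math.atan2(height / 2 - cy, focal))
    return yaw, pitch
```

Under these assumptions, each centroid returned by segment_regions would be passed to
pixel_to_orientation, and the resulting yaw and pitch could drive eyeball and head rotation.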
The system's behaviors are expressed in terms of facial displays, eyeball movements and head motions.
The system only has a synthesized human head. The face can express various facial displays, including
those listed in [12]. In addition to them, the eyeballs and the head itself can also express various
communicative expressions such as gaze, wink, nod and so on. A "gaze" can establish eye-contact, which
is a powerful communication signal. Eye-contact/gaze is especially important when several individuals
are involved in social interaction, since eye-contact can indicate who a message is directed to.
In [12], the computer's facial displays are determined by a correspondence between conversational
situations and communicative facial displays, where "conversational situations" means different
logical contexts for processing a user's inquiry. The current system was extended to incorporate the
idea of "situatedness." Instructions to the facial animation subsystem are first determined by the
current logical context. This forms a basic behavior. The other factor influencing the system's
behavior is the "physical situation." The physical situation comprises various kinds of information,
including information about users' positions, users' gestures, and users' facial displays (not all of
them are handled by the current system). For example, when one person moves quickly, the action "look
at that person" is invoked. Actions invoked in this way are called reactions. In contrast, actions
invoked based on logical contexts are simply called actions. In general, reactions can always override
actions. Reactions are quick motions and never last long.
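The relation between actions and reactions can be illustrated with a small sketch. It is not the
system's code: the class and method names, the speed threshold, and the fixed reaction duration are
hypothetical. It only shows the two properties stated above, namely that a reaction overrides the
current action and that it expires quickly, after which the action resumes.

```python
class BehaviorSelector:
    """Chooses between a logical-context action and a short-lived reaction."""

    def __init__(self, reaction_duration=0.5, speed_threshold=1.0):
        # Both values are assumptions made for this sketch.
        self.reaction_duration = reaction_duration
        self.speed_threshold = speed_threshold
        self.action = "neutral"        # behavior set by the logical context
        self.reaction = None           # (name, expiry time) or None

    def set_action(self, name):
        """Record the behavior chosen from the current logical context."""
        self.action = name

    def on_motion(self, person, speed, now):
        """Invoke the 'look at' reaction when a person moves quickly."""
        if speed > self.speed_threshold:
            self.reaction = ("look at " + person, now + self.reaction_duration)

    def current(self, now):
        """Return the behavior to display at time `now` (in seconds)."""
        if self.reaction is not None:
            name, expires_at = self.reaction
            if now < expires_at:
                return name            # a live reaction overrides the action
            self.reaction = None       # reactions never last long
        return self.action
```

For instance, with the default values, a quick movement at time 1.0 makes current(1.1) return the
"look at" reaction, while current(2.0) returns the underlying action again.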
Gaze control is also implemented in the facial animation subsystem using a video camera fixed
on top of a display. Using this, eye-contact between a user and a computer is achieved,
although it is not real three-dimensional eye-contact since the computer face is projected
onto a 2D display screen. Real eye-contact is selective in the sense that only one pair of
participants gets the feeling of eye-contact. In contrast, eye-contact with a computer face on the
screen is like eye-contact with a TV news announcer. As long as the announcer looks straight ahead,
every person watching the TV screen has the feeling of eye-contact.
Task. The task of the system is to watch a game and nonverbally give some information to the
participants. The selected game is a card matching game similar to "Concentration," where sixteen
cards are arranged arbitrarily with their faces turned down. Each card has a pattern with a different
shape and color on the underside. Two human players play alternately. In one turn, a player turns
over two cards in sequence using a mouse pointer. When the shapes on the underside of the two cards
match, the player gets a point and the cards are removed. Otherwise the cards are turned over again
and placed in the same positions, and the next player takes a turn. Color Plate 1 shows a screen
image of this game.
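The rules of the game can be summarized in a short sketch. It is given only for illustration and is
not taken from the system: the class name, the player labels, and the use of shape names as card
identities are hypothetical. It also assumes that each shape appears on exactly two cards and that
the turn passes to the other player after every turn, including after a match; the description above
does not state this explicitly.

```python
import random


def shuffled_layout(shapes, seed=None):
    """Return an arbitrary arrangement in which every shape appears twice."""
    cards = list(shapes) * 2
    random.Random(seed).shuffle(cards)
    return cards


class MatchingGame:
    """Minimal model of the two-player card matching game."""

    def __init__(self, layout, players=("player 1", "player 2")):
        self.layout = list(layout)     # shape per position, None once removed
        self.players = players
        self.scores = {p: 0 for p in players}
        self.turn = 0

    @property
    def current_player(self):
        return self.players[self.turn % len(self.players)]

    def play(self, first, second):
        """Turn over two cards; return True if their shapes match."""
        if first == second:
            raise ValueError("two different cards must be chosen")
        a, b = self.layout[first], self.layout[second]
        if a is None or b is None:
            raise ValueError("card already removed")
        matched = a == b
        if matched:
            self.scores[self.current_player] += 1
            self.layout[first] = self.layout[second] = None  # cards removed
        # Assumption: players alternate after every turn.
        self.turn += 1
        return matched

    def finished(self):
        return all(card is None for card in self.layout)
```

A sixteen-card game corresponds to calling shuffled_layout with eight shape names. The return value
of play is the kind of logical context on which a game action such as "smiling at a card" could
depend.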
COLOR PLATE 1.
A scene image of the card game experiment with a synthesized face.
At the beginning of the game, the computer chooses one player as its ally and the other as its
opponent. As the computer is watching the game, it performs two types of behaviors. The first is a
game action: this is an action that is associated with the current logical context of the game.
The second is a situated reaction: this is a reaction triggered by the current physical situation.
Table 1 lists both game actions and situated reactions implemented in the current system.
TABLE 1.
The behaviors of the current system: the first column lists names of actions and reactions;
the second lists their descriptions; the third shows how they were recognized by the subjects as a
percentage.
Subjects. The prototype system was tested on 7 volunteer subjects from a group of university-level
computer science students. The average age of the subjects was 22.
Other conditions. In addition to the experimental condition described so far, which is denoted
by SF, three other conditions, NF, SA and NA, were prepared. They are all listed in Table 2. The four
conditions were designed with two intentions. The first is to examine the effect of situatedness by
turning the situated reactions on and off. In conditions SF and SA, situated reactions were on, but in
NF and NA they were turned off. The second is to examine the effect of a realistic face by comparing
it with an arbitrary 3D object, in this case a 3D arrow. A 3D arrow was considered equivalent to a
face without expressions (Color Plate 2). This arrow could perform actions mainly expressed by the
head's direction, such as "looking at" and "tracking." One exception is "smiling at a card." In this
case, the arrow can flash its color to indicate whether or not an ally touches the correct card. SF
and NF denote the conditions with a face, while SA and NA denote the conditions with an arrow.
COLOR PLATE 2.
A scene image of the card game experiment with a 3D arrow.
TABLE 2.
Four conditions
Seven subjects were grouped into four couples (one subject played twice). Each couple was asked
to perform the experiment under the four different conditions. The experiment was recorded by a video
camera so that we could extract scenes in which a subject looked at the computer face or reacted to
one of the computer's actions/reactions. Figure 2 illustrates the video recording setting. With a half
mirror, we could obtain the view from the center of the screen, which made analysis of the subjects'
eye movements easy.
FIGURE 2.
Video recording configuration
Although the number of subjects is too small to draw statistical conclusions, we obtained a number
of interesting observations.
The computer shows various actions and reactions. The third column in Table 1 shows how they were
recognized by the subjects. It shows that the subjects noticed relatively well those actions/reactions
which were closely related to the game, such as "Happy," "Shrug" and "Smiling at a card," while they
never noticed some of the situated reactions, although they unconsciously responded to those
reactions, as shown in Color Plate 3.
COLOR PLATE 3.
Comparison of time charts of the two games. Left: a situated face. Right: a situated arrow.
Color Plate 3 shows two diagrams which illustrate the behaviors of the human players and the computer
face during the experiment under the SF and SA conditions. They were created by analyzing a video
recording of the game. In the picture, the vertical axis is the time axis. Horizontal gray stripes
indicate opponent turns, except the first one, which is the opening scene of the game, while the white
zones between gray zones are ally turns. In each diagram, there are three columns: blue, red, and
green with yellow. The blue column represents the ally's behavior, where short branches from a fat
trunk indicate periods when the ally looked at the computer's face. Similarly, the red column
represents the opponent's behavior, with left branches indicating periods when the opponent looked at
the computer's face. The green column represents the computer face's behavior, in particular the
direction in which it was looking. It consists of three vertical subcolumns, which represent "looking
at the ally," "looking at cards," and "looking at the opponent," from left to right, respectively.
Yellow branches included in the left and right subcolumns indicate "Looking at a player" reactions.
The picture shows eye-contact as line-contact between blue and green (yellow) branches, or red and
green (yellow) branches. By close examination of the picture, we found that:
Table 3 shows subjects' evaluation of usefulness and entertainment factors for each condition. Smaller
numbers in each cell indicate better scores. As the table indicates, an arrow was recognized as a useful
and reliable tool, while a face was accepted for entertainment or fun.
TABLE 3.
Usefulness and entertainment evaluation
Abstract
Keywords:
User interface design, multimodal interfaces, facial expression, anthropomorphism, subliminal involvement
MOTIVATION AND Introduction
SOCIAL INTERACTION AND SITUATEDNESS
RELATED WORK
Within the context of CSCW, especially in videoconferencing research, social interaction among human
participants is studied, and the importance of gaze/eye-contact is reported (for example [5]). There
is also an attempt to develop a videoconferencing system that can support local eye-contact between
two participants [9] to promote social interaction. Our work parallels these activities, and tries to
develop a synthetic agent capable of social interaction with humans and other agents.
SYSTEM ARCHITECTURE
Vision
Action/Reaction
The outputs of the vision subsystem are representations of external events observed visually, and these
are sent to the action/reaction subsystem. Although the information obtained is limited, we can extract
the following:
Facial Animation
The face is modeled three-dimensionally. The current face is composed of approximately 500 polygons.
The face may be rendered using a skin-like surface material by applying a texture map taken from a
photograph or a video
frame. In 3D computer graphics, a facial display is realized by the local deformation of the polygons
representing the face. Waters showed that deformation that simulates the action of muscles underlying
the face looks more natural [14]. We use the numerical equations defined by Waters to simulate muscle
dynamics. Currently, 16 muscles and 10 parameters, which control mouth opening, jaw rotation, eye
movement, eyelid opening, and head orientation, are incorporated. The facial modeling and animation
system is based on Takeuchi and Franks' work [11].
EXPERIMENT
Method
Using the current prototype system, we performed the experiment under several different conditions to
investigate the effect of situatedness and to study how much socialness was achieved.
Result