Communication Institute for Online Scholarship

Dialogue Management for Computer-mediated Spoken Bilingual Dialogue
The Electronic Journal of Communication / La Revue Electronique de Communication
******* MILLER ********** EJC/REC Vol. 6, No. 3, 1996 ******


Keith Miller
Language Analysis Systems and Georgetown University

Susann LuperFoy
MITRE Corporation and Georgetown University

Esther Kim
Massachusetts Institute of Technology

David Duff
MITRE Corporation

        Abstract.  The empirical study reported on here
     is an effort to extract design constraints for the
     dialogue manager for a complex genre of computer-
     mediated communication:  real-time, spoken
     language dialogue between two telephone
     conversants speaking two different languages in a
     voice-only exchange.  This family of applications
     for bilingual spoken communication is termed the
     interpreting telephone (IT).  We sought to
     discover desiderata for the discourse module of
     such a system, abstracting away from all aspects
     of the problem other than those related to
     discourse processing.  To simulate a zero
     computational cost for machine translation and
     speech processing, a "Wizard of Oz" study was
     designed.  All input and output language was in
     the native language of the wizard, so that
     "translation" was from English to English.

        Even with the cost of all other computation
     reduced to near zero, lengthy delays entailed by
     half-duplex message transmission and the absence
     of system status indicators caused users to become
     impatient and easily distracted by other
     activities, and to lose track of whose turn it was
     to speak.  This article describes the data
     collection procedure and our analysis of the
     discourse results, including the discovery of
     several inadequacies of the standard system design
     for interpreting telephone systems.  We conclude
     by proposing a design for a multi-modal system to
     address these inadequacies.

                Introduction and Background

     As people exploit the ever-increasing capacity of
global communication networks, the need arises for discourse
processing technology to automate the mediation of the
interchange.  Tools for electronic communication in use
today offer limited dialogue mediation or no mediation at
all.  Turn-taking, adjudication of conflicts, and discourse
history tracking continue to be handled by the humans
involved.  However, as collaborative groups expand to
include more participants and multiple modes of interaction
(e.g., speech, keyboard text), the task of dialogue
management becomes increasingly unwieldy, and the need for
an automated solution overwhelming.

     This article examines the role of dialogue management
in computer-mediated voice machine translation (MT) systems,
and reports on a study of one such system.  After a brief
description of the MT project and the discourse requirements
of the voice-to-voice MT application, we describe the
empirical study and results obtained.  The study seeks to
extract design constraints for the dialogue manager for a
complex genre of mediated communication:  that of the
Interpreting Telephone (described below).  The study is also
intended to aid in the definition of parameters to be
measured in subsequent investigations into computer
mediation of bilingual interactions.  More generally, the
goal of our project is greater understanding of the task of
mediating electronic interaction.

Voice-to-Voice Machine Translation

     The Interpreting Telephony (IT) research community
(Morimoto, et al., 1989; Waibel, 1991) is in the process of
developing a device to enable real-time computer-mediated
conversations between monolingual speakers of different
languages.  The application calls for spoken language
understanding and generation, as well as bi-directional
translation, all with real-time operation.  Initially, the
subject matter is restricted in these studies to the narrow
domain of conference registration.  One caller is the
registrar for an international conference, the other caller
is a prospective conference attendee seeking registration
information.  In each of the sample dialogues currently
under analysis, an English speaker, E, phones the office of
the registrar for an international conference and is greeted
by a Japanese speaker, J. E solicits information concerning
deadlines, venue, submission requirements, and excursions
associated with the conference.  J answers these questions
and requests the caller's name and mailing address so that
registration materials can be shipped to E.

     The assumptions of cooperative dialogue and limited
subject domain are shared by other voice-to-voice MT
systems, including, most notably, Verbmobil (Kay et al.,
1993; Wahlster, 1993; Quantz et al., 1994; Maier, 1996).
This large scale research project is developing a spoken-
language MT system that enables English dialogue between two
non-native speakers of English who are scheduling a date and
time for a future meeting.  However, Verbmobil differs from
the Interpreting Telephone in three significant ways.
First, it is being designed for face-to-face rather than
telephone dialogue, which means that speakers can exchange
visual information.  Second, the Verbmobil dialogue is
conducted monolingually in English with the Verbmobil system
assisting in conversion of words, phrases or utterances into
English from some other language.  In contrast, the IT
system is for bidirectional translation alternating between
Japanese-to-English and English-to-Japanese after each turn
in the dialogue.  Third, unlike the Interpreting Telephone,
Verbmobil is not required to translate every utterance but
is invoked only when one of the speakers has difficulty
formulating an English utterance.  This can complicate the
Verbmobil dialogue management task.  Both the
Verbmobil and Interpreting Telephony projects give rise to a
new set of requirements on discourse processors for
automated dialogue mediation systems.

Discourse for Interpreting Telephony

     In this study, we focus on the module that enables the
IT to perform discourse processing, the generic dialogue
processor.  The dialogue processor extracts and accumulates
information from the spoken context, maintaining an updated
representation of the ongoing discourse.  It uses that
representation to produce in-context interpretation of input
utterances and to generate appropriate context-sensitive
output utterances.  The discourse module can also supply
context information to other software components, to improve
their performance and facilitate user-system natural
language interaction in a multi-user environment.

     Our dual objectives were to demonstrate the value of
the discourse module, and to develop a set of discourse
design principles that could be generalized to other multi-
user systems mediating real-time or "near real-time" human
interchange and/or bidirectional translation.  Linguists
have begun to look at the discourse structure and content
that is produced in actual spontaneous computer-mediated
human interaction in text-based systems.  These data come
from electronic mailing lists (Herring, 1996), chat systems
(Werry, 1996), electronic bulletin boards (Collot & Belmore,
1996), and MUDs, or multi-user domains (Cherny, 1995).
Preliminary results indicate the need for a dialogue manager
to track parallel discussion threads that emerge in
spontaneous exchanges among large groups of people on
arbitrary topics (Robin, 1995).  Additional interesting
results often stem from the unique discourse characteristics
of the conversational environment itself, e.g., the metaphor
of a virtual room in opposition to the real world (Cherny,
1995).

     Dialogue Manager Tasks.  Real-time, interactive spoken
dialogue interpretation requires multiple discourse
processing strategies for the different tasks with which it
is faced.  Similar to a human translator performing bi-
directional simultaneous interpretation between two mono-
lingual clients, the dialogue manager has three types of
discourse processing tasks.

     First, the dialogue manager for real-time voice-to-
voice MT systems mediates a potentially complex exchange
between two clients.  Our dialogue manager enforces a simple
turn-taking protocol so that it can impose a total ordering
on utterances and partition input speech into non-
overlapping turns.  This ensures its own ability to record
an accurate history of the dialogue utterances as they
occur.  This process yields one bilingual user-user
dialogue, consisting of an alternating sequence of Japanese
and English utterances.  The bilingual dialogue relies
crucially on speech acts and speaker intentions.  One client
user will direct rejections, requests for confirmation or
clarification, apologies, warnings, etc. to the other
client, independent of the fact that their dialogue is known
to be mediated.  The mediator must consider the unfolding
collaborative dialogue (Grosz & Sidner, 1986) in order to
discern speaker intentions, speech acts, and rhetorical
moves exchanged by the two human users.

     The bilingual dialogue defined by the sequence of user
input utterances above can only occur in the presence of a
proficient interpreter, and is experienced by the
interpreter only.  That is, neither of the clients
experiences this bilingual dialogue.  Rather, each speaks in
their own language and hears only their own language through
the mediator's speech synthesizer.  Thus, the second type of
discourse processing task handled by the dialogue manager
involves the tracking of a distinct level of discourse
consisting of two separate monolingual dialogues, in this
case an English-English dialogue between E and interpreting
telephone (IT), and a Japanese-Japanese dialogue between J
and IT.  While participating in these two monolingual
dialogues, the mediator must obey the respective set of
rules for cooperative dialogue behavior for each.  In each
of these dialogues, half of the surface forms are input
utterances from one user in his native language, and half
are translations into that user's native language, generated
by the mediating MT system.  In this role, IT can be viewed
as a conduit, simply passing (the translation of) an
utterance from one client to the other.

     The third and final type of discourse processing task
performed by the dialogue manager is that of tracking two
separate monolingual dialogue histories in which IT is an
agent.  These dialogue histories are monolingual,
untranslated user-interface dialogues on a topic other than
the domain issue of conference registration.  The dialogues
take place between IT and one of the clients, in the
client's native language, and may be initiated by either the
client or IT.  For example, client-initiated, client-IT
dialogues with IT as agent may occur when one of the client
users wishes to address the interpreting telephone device to
issue a request or to elicit clarification from the IT,
without involving the other client.  This entails a
departure from the main dialogue to replay some past parts
of the conversation ("What was it that he said?"), to
request logistical information ("Has the person answered the
phone?", "How long have we been talking?", "Who is paying
for this call?"), to ask for information relevant to the
phone's task ("What time is it in Tokyo?"), to control the
overall dialogue ("Hang up," "Dial 555-1234"), or for
clarification ("Was that 'seven ninety-two'?").
Alternatively, IT-initiated, client-IT dialogues with IT as
agent may occur when the device needs to address the user
because of trouble during what Clark and Schaefer (1987)
call the acceptance phase of each input contribution.  For
example, if the parser or the machine translation operation
fails, the IT dialogue manager needs to ask the user to
rephrase the last input.  If J is the one addressed by the
dialogue manager, the exchange is not translated or
transmitted to E.
E hears only silence while waiting for the English
translation of J's next communication directed at E.  This
task adds two additional dialogue histories that the
mediator must track.  This facility is important for
avoiding prolonged confusion as the result of a minor
communication failure between one of the users and the

     Unique discourse phenomena come into play with respect
to the second type of dialogue (with IT as conduit).
Because unmediated bilingual dialogues do not occur in
normal human experience, conventions for dialogue
interaction or rhetorical structure have not been developed.
The only strategy for J to employ is to observe the
conventions of Japanese dialogue.  This follows naturally
from the fact that J is not experiencing a bilingual
dialogue, but a monolingual (J-J) dialogue, as illustrated
above.  The translation system will be most intuitive for J
to use if it adheres to those Japanese dialogue conventions
when it generates Japanese utterances.  It will likewise
benefit in interpreting J's input utterances by
acknowledging J's adherence to rules of Japanese discourse.
For example, Japanese requires morphological marking of
honorific cues.  Hobbs and Kameyama (1990) point out that
while the English speaker may have no intention of conveying
honorific effect, it may be important for the translator to
generate some reasonable approximation of the honorifics of
a Japanese dialogue partner, to avoid an unintended and
undesirable message through the appearance of rudeness.  The
best guidance for morphological marking in the Japanese
Target Language (TL) output form will come not from the
English Source Language (SL) form of the input utterance
which presumably lacks that information, but from the
previous Japanese input.  In order for the IT system to
reciprocate the formality and humility markings of J, the
discourse component must store both input and output
Japanese utterances in the dialogue history.  It is worth
noting that this study assumes that the IT discourse module
would be able to determine the honorific markings for
generated Japanese utterances by examining solely the (one-
sided) input of the native Japanese speaker.  Although this
assumption is irrelevant to the present study in that we are
intentionally isolating the dialogue processing task (to the
exclusion of other Natural Language Processing tasks), it
has not been proven, and would be an important area for
further research.

     In summary, the system keeps one bilingual dialogue
representation where speech acts and intentions are encoded.
It keeps two monolingual representations of that same
dialogue for understanding and generation of language-
specific discourse phenomena such as Japanese honorific
morphology and other language-specific phenomena that rely
on relations between surface forms, for instance, 'one'
anaphora and reflexive pronouns in English.  Finally, it
keeps two separate monolingual records of any meta-dialogues
between the system and either client.
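
     The bookkeeping summarized above can be pictured with a
short sketch.  The Python below is illustrative only and is
not drawn from the system described in this article; the
names Utterance, DialogueState, record_turn and record_meta
are hypothetical, translated text is passed in as a
placeholder string rather than produced by an MT component,
and speech acts and intentions are not modeled.

```python
from dataclasses import dataclass, field

# Hypothetical language codes for the two clients in the study scenario.
LANG = {"E": "en", "J": "ja"}


@dataclass
class Utterance:
    seq: int        # position in the mediator's total ordering
    speaker: str    # "E", "J", or "IT"
    language: str   # "en" or "ja"
    text: str


@dataclass
class DialogueState:
    # One bilingual record of the source utterances of the main dialogue.
    bilingual: list = field(default_factory=list)
    # Two monolingual records: what each client said and heard.
    monolingual: dict = field(default_factory=lambda: {"E": [], "J": []})
    # Two records of untranslated client-IT meta-dialogues.
    meta: dict = field(default_factory=lambda: {"E": [], "J": []})
    _seq: int = 0

    def _next_seq(self) -> int:
        self._seq += 1
        return self._seq

    def record_turn(self, speaker: str, source_text: str, translated_text: str) -> None:
        """Store one translated turn of the main dialogue."""
        listener = "J" if speaker == "E" else "E"
        seq = self._next_seq()
        source = Utterance(seq, speaker, LANG[speaker], source_text)
        output = Utterance(seq, "IT", LANG[listener], translated_text)
        self.bilingual.append(source)
        self.monolingual[speaker].append(source)
        self.monolingual[listener].append(output)

    def record_meta(self, client: str, speaker: str, text: str) -> None:
        """Store one untranslated utterance exchanged between a client and IT."""
        entry = Utterance(self._next_seq(), speaker, LANG[client], text)
        self.meta[client].append(entry)
```

     In this sketch a single translated turn is written to
three records (the bilingual record and both monolingual
records), while a meta-dialogue utterance is written only to
the record of the client involved, so the other client's
records never contain it.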

     In earlier work (Luperfoy, 1991), we described the
discourse framework that underlies our dialogue manager.  It
originated as an attempt to implement File Change Semantics
(Heim, 1982) and Discourse Representation Theory (Kamp,
1981).  Augmentations to the original theory were required
to support discourse processing for the domain of the
original implementation:  knowledge-based natural language
processing of human input to a conversational (dialogue-
based) interface.  The resulting framework is especially
well suited for spoken dialogue, where errors can arise
during the course of speech recognition which can only be
resolved with the help of contextual information.  Our
framework is designed to handle partial or incorrect input
robustly, and it can improve the performance of other
components of the software system by advising them with
contextual information.

                  The Observational Study

Experimental design

     At the present time, the interpreting telephone is a
research system exclusively.  In contrast to researchers
studying text-based computer-mediated communication systems
(MUDs, newsgroups, e-mail discussion lists, etc.), IT system
designers lack the benefit of a user community that can be
observed or surveyed as a source of data regarding human
behavior with respect to a specific software system.  Under
such circumstances, researchers can obtain approximations of
the data on target human behavior through what are known as
"Wizard of Oz" studies.  In these experiments, a human
surrogate (the wizard) acts in the place of the
computational system, interpreting complex user input beyond
the capacity of the current user-interface system.  The
intercepted input is converted by the wizard into legal
commands for the backend software application and submitted
to the real system.  The eventual system response is
displayed for the user as if there were no intervention by a
mediator.  In this way, human-computer interface issues can
be studied for complex systems before they are implemented.

     In our wizard study, the interpreting telephone wizard
used an electronic voice modulator to make his speech
resemble synthesized speech, and he "translated" English
utterances to English utterances by placing one caller on
hold and simply repeating or paraphrasing for the other
subject.  The wizard's task required no bilingual effort,
just the need to retain the content of one turn long enough
to repeat or paraphrase it.  In this way, we were able to
isolate the discourse processing requirements of the system
from issues of speech processing, translation, and NL
understanding, and thereby simulate the highly idealized
condition in which these technologies have been completely
solved, i.e., they impose a near zero run-time computational
cost while being essentially error free.  This follows an
established methodology in MT research of monolingual
"translation" to isolate components of a model for testing
(cf.  Brown et al., 1990).

     In most cases, the subject in the experiment (the
"user") is led to believe that he is interacting with the
computer system, and thus is unaware of the human wizard
"behind the curtain".  In our study, the users were aware
that they were interacting with a simulated system;
nevertheless, the system was designed in such a way as to
have many characteristics in common with a "real"
interpreting telephone.  Examples of such characteristics
are communication latency, the requirement to signal the end
of a dialogue turn by pushing a button on the phone keypad,
and the requirement to distinguish between utterances
directed to the user on the other end of the phone and those
directed to IT as agent, as described in the previous
section.  In these exchanges, IT is analogous to an
electronic operator.  As such, when operating in this mode,
the IT must perform deep Natural Language Understanding
(e.g., resolving pronouns).  Just as callers might address
their needs to the operator in a normal telephone
conversation (e.g., "Operator, could you try the other
number?"), participants were required to initiate all such
exchanges by prefixing their request with the vocative
"Phone, ....".

     Moreover, the wizard, when engaged in a sidebar
dialogue with one of the subjects, adhered to a restricted
grammar of spoken behavior, as would a real IT.  The wizard
spoke from a small set of scripted utterances to handle
simple repair situations only:  "Are you finished?", "Would
you repeat that please?", "Yes," "No," and recitation of
canned instructions.  This was done in order to discourage
three-agent behavior (E or J's treating the IT mediator as a
party to the main interaction) during the main dialogue.
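
     The two conventions imposed on subjects, the end-of-turn
key and the vocative "Phone, ....", amount to a simple
routing rule.  The sketch below is a hypothetical
illustration in Python, not the procedure followed by the
wizard; it assumes that a punctuated transcript of the turn
is available, that the pound key is delivered as the
character "#", and that the function name route_turn and the
labels "IT" and "PARTNER" are our own inventions.

```python
import re

# Assumption: the vocative is the word "Phone" followed by a comma or colon.
# A spoken system would have no punctuation and would need other cues.
VOCATIVE = re.compile(r"^\s*phone\s*[,:]\s*", re.IGNORECASE)
END_OF_TURN_KEY = "#"


def route_turn(transcript: str, key: str):
    """Return (destination, content), or None while the turn is still open."""
    if key != END_OF_TURN_KEY:
        return None  # the speaker has not yet signaled the end of the turn
    match = VOCATIVE.match(transcript)
    if match:
        # Addressed to the device: handled locally and never translated.
        return ("IT", transcript[match.end():].strip())
    # Otherwise the turn belongs to the main dialogue and is passed on.
    return ("PARTNER", transcript.strip())
```

     Because the rule is applied to each turn in isolation,
every utterance meant for the device must carry the vocative;
nothing in the rule marks where a sidebar exchange ends.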

Data Collection and Analysis

     Data were collected as videotaped conversations between
two technologically sophisticated adult male subjects
speaking in their native language (English), mediated by an
experimenter acting as wizard in the role of interpreting
telephone.  Subjects were sequestered in their respective
offices for the 45-minute data collection session so that
all interaction was through the mediator.  Subjects sat at
their own desks and, using their normal office telephones,
carried on conversations in which they acted out the process
of registering for an international conference.  One subject
played the role of the conference registrar and the other
subject acted as a potential attendee of the conference who
elicited information about dates, times, submission
requirements, excursions and lodging, and requested written
materials to be mailed to him.  Since the offices were in
separate parts of the building, and each subject was placed
on hold while the mediator interacted with the other,
neither subject heard the other's voice nor saw the other at
any time during the conversation.  This disallowed exchange
of non-linguistic data directly between subjects without the
knowledge of the wizard mediator.  This arrangement created,
in essence, a system with a half-duplex transmission mode,
since only one of the two subjects could speak and be heard
at one time.  This is in direct contrast with full-duplex
devices, as exemplified by ordinary telephones.

     In all, approximately 20 minutes of data were recorded
for each participant.  Since the subjects were videotaped in
separate offices during the experiment, and the wizard was
recorded as well, this provided approximately 60 minutes of
tape to be analyzed.  Experimenters also videotaped wizard
and subjects during a post-session interview.

     Because of the preliminary nature of this research, the
video data were reviewed in two passes, and observations
were made informally on a broad range of discourse
phenomena.  The experimenters focused their attention
especially on behavior surrounding repair dialogues, delays,
pauses, transmission of symbolic information (names,
addresses and the like), and on IT-as-agent dialogues.


     The data revealed four kinds of unexpected discourse
behavior.  The first group of these behaviors was a direct
result of the half-duplex transmission mode of the system.
While waiting for a response from the wizard, the subjects
often became bored with the task, and engaged in other
activities on their computers or in conversations with the
operator of the video camera.  Thus, the current design
fails to satisfy what can be assumed to be the clients'
original motivation for employing a real-time voice-based
system instead of a facsimile or e-mail, i.e., the desire
for a simplified, expedient transaction.

     Impatience was not the only problem with the half-
duplex transmission.  Subjects also had frequent difficulty
tracking the dialogue, as long pauses left them unprepared
for resumption of their turn in the main dialogue.  It is
likely that their awkwardness and lack of preparation
stemmed from the violation of (monolingual) rules regarding
average pause length in normal Japanese or English
dialogues, rules which at least partially have their basis
in the cognitive limitations on short term memory required
to retain the most recent utterance.

     The third observation has to do with sidebar
discussions.  Our voice-only wizard design provided help
dialogues, but repair dialogues themselves were problematic.
Subjects easily lost track of the main context during
sidebar discussions with the wizard and were unable to
gracefully terminate repair dialogue sequences.  For
example, in a preliminary dry run of our experimental set-
up, one subject asked the mediator, "Do I have to hit the
pound key?", and the wizard replied, "Yes, please press the
pound key when you are finished."  At that point, there was
no good way for the speaker to indicate the closure of her
meta-dialogue with the mediator or to signal that her next
utterance was intended as part of the main conversation with
the other subject -- an "ok" followed by the pound key could
have been directed at the mediator or at the other subject.
A further problem related to sidebar sequences was that
since the IT wizard's voice was the same during translation
output and sidebar dialogue, there were no overt cues to
tell the user whether the source of an utterance was the
other user or the translation device itself.  This last
problem can be easily remedied by use of multiple
synthesized voices, one for the output of translation of
input utterances and another for utterances generated by the
IT device.

     Finally, we observed that the voice-only system was
inefficient in transmitting symbolic information such as
names, addresses, phone, ID and other numbers.  Such
information is often a stumbling block even in monolingual
dialogues involving two speakers conversing in their native
language.  Names and other symbolic information must often
be spelled or broken down into the appropriate atomic units
in order to ensure accurate transmission of the information.
In a voice-to-voice MT system, these problems are
multiplied, even for two languages that use roman script.
When the additional issues involved with transcription or
transliteration are introduced, the likelihood of successful
accurate voice communication of this information decreases
dramatically.  Such information is easily conveyed through
modes other than speech, however, such as the touch tone

Design Considerations

     These findings led us to two conclusions regarding
proper design of the IT dialogue manager.  First, even at
its unrealistic fastest, the half-duplex transmission mode
was unacceptably slow.  Full-duplex transmission, on the
other hand, would present an even more complex dialogue
management scenario.  A second conclusion was that voice-
only interaction would be inadequate for many user-system
dialogue or multilingual tasks, such as conference
registration.  In a voice-only system, visual information is
lost, and transmission of information that could more
effectively be communicated via another channel is hampered.
A mixed-modality interface that provided visualization of
the dialogue setting would help users track the main
dialogue and distinguish human-human from human-system

     In response to these findings, we produced a conceptual
design of a multimodal system that would provide users with
keyboard input for entering symbolic data that did not
require translation, such as postal addresses.  This system
allows a user to input textual information in the native
script, thereby vastly improving the quality of the
information transmitted.  This is true whether the eventual
goal is the storage or transmission of the information in
the native script, or the storage or transmission of a form
that has been automatically transliterated or transcribed by
another intelligent agent.

     In this design, various sorts of contextual information
are distinguished visually, and the mode of interaction
varies with the information stream:  immediate status of the
system (text),
history of the dialogue (text), visual image of the other
user (video or still images), and user interface dialogue
that could be carried on locally to one user's machine
(text) and in parallel to the human-human bilingual dialogue
(voice).  This design modification deviates radically from
the implicit design constraints of the interpreting
telephone project in that it requires a computer screen and
non-voice input devices (keyboard, mouse) for graphical and
typed input and output, whereas the interpreting telephone
is constrained to use voice only.

     From the observational results of our wizard study, we
also produced a new set of parameters to be measured in a
subsequent empirical study.  These parameters include such
measures as frequency of user distraction and the user's
success at tracking the dialogue, and the accuracy of
transmission of (symbolic) information, as well as the
correlation of all of these measures with pause length.
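
     One conventional way to quantify the correlation
mentioned above is the Pearson coefficient.  The Python
sketch below is offered only as an assumption about how such
an analysis might be set up; the article does not specify a
statistic, the function name pearson is hypothetical, and no
values from the study are used or implied.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)
```

     Here xs might hold the pause length preceding each turn
and ys a per-turn measure such as a count of observed
distractions; both would have to come from coded session
data.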

                  Summary and Implications


     In this article, we described the Interpreting
Telephone application which enables computer-mediated
bilingual spoken dialogue between two humans.  Our project's
goal was to demonstrate improved performance reflected in
the machine translation component as a direct result of
assistance from the discourse module and its exploitation of
information available in the IT application but not in other
MT applications.  Real-time voice-to-voice MT, unlike most
MT systems for translating text documents, has the advantage
of two human monolingual post-editors reacting immediately
to each utterance produced.  Our proposed dialogue manager
has specific feedback cues to watch for from the target
language user to indicate if the translation did not make
sense.  The nature of this feedback on the output is useful
for confirming or diagnosing and repairing the system's

     In this application, the mediator has three discourse
tasks and therefore requires multiple language processing
strategies.  First, it controls the exchange between the two
conversants, in this case the two human clients speaking
different languages.  Second, it interprets each input
source language utterance in context for translation to the
target language.  It consults its stored discourse
representation to augment sentence analyses with context
information, and updates the discourse representation with
the new utterance.  Third, when necessary, the device must
engage in a separate, untranslated dialogue with one of the
two clients to discuss logistics, e.g., status of the
telephone connection, requests for repetition of a spoken
input utterance, or responses to a user's request for
clarification or confirmation.

     With respect to the first and second tasks, it was
discovered that the half-duplex design of the system led to
problems involving the attention span of the subjects,
short-term memory limitations, and generally awkward
conversational exchanges.  The untranslated human-machine
dialogues often further interfered with the main dialogue.
Furthermore, the transitions into and out of these dialogues
were seldom smooth.  These facts led to the conclusion that
a half-duplex voice-only IT system would not be the best
design for a system intended to manage multilingual dialogue
interactions.  Adapting the IT concept to incorporate a
multimodal design would go a long way towards overcoming the
weaknesses of voice-only systems identified as a result of
this study.

     From the observational results of our wizard study, we
also produced a new set of parameters to be measured in a
subsequent empirical study, including such measures
as frequency of user distraction and the user's success at
tracking the dialogue.

Implications for Other CMC Systems

     Computer-mediated human communication of all types
requires a dialogue manager that can switch from human-human
to human-machine dialogue, and this switch must be apparent
to the user.  With respect to text mode systems, this can be
likened to the human-human and human-machine dialogues that
take place on electronic mailing lists managed by a
listserver.  Participants in these exchanges may address a
message to the listserver (the equivalent of initiating
user-IT dialogues by use of the vocative "Phone,....") or to
the other participants (the equivalent of IT as conduit, as
outlined in the section entitled 'Discourse for Interpreting
Telephony').  In addition, the IT dialogue manager must
switch between two kinds of processing:  shallow
interpretation with broad coverage, sufficient to assist
source-language to target-language conversion of any
utterance the human may issue, and deep Natural Language
Understanding (NLU) with narrow coverage, which supports the
discourse module as an agent speaking and understanding
utterances within the very restricted domain covering
operation of the device itself.
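
     The switch between the two dialogues can be illustrated
with a minimal sketch in Python.  It assumes, for
illustration only, that a leading vocative "Phone," is the
sole cue that an utterance is addressed to the device; the
names route and Routed and the regular expression are
hypothetical and are not part of the system studied here.

```python
import re
from typing import NamedTuple

# Assumption: a leading vocative "Phone," (or "Phone:") marks an
# utterance as addressed to the device, not to the other conversant.
VOCATIVE = re.compile(r"^\s*phone\s*[,:]\s*", re.IGNORECASE)


class Routed(NamedTuple):
    mode: str  # "human-machine" or "human-human"
    text: str  # utterance with any leading vocative removed


def route(utterance: str) -> Routed:
    """Decide which of the two dialogues an utterance belongs to."""
    match = VOCATIVE.match(utterance)
    if match:
        # This branch would be handed to deep, narrow-coverage NLU.
        return Routed("human-machine", utterance[match.end():].strip())
    # This branch would be handed to shallow, broad-coverage
    # interpretation in support of translation.
    return Routed("human-human", utterance.strip())
```

     Under these assumptions, an utterance such as "Phone,
please repeat that." is assigned to the human-machine
dialogue with the vocative removed, whereas an utterance
lacking the vocative is passed along for translation.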

     Text-based modes of communication are free from many of
the shortcomings of voice-only systems.  For one thing, the
interaction of system lag time and limitations of short-term
memory does not come into play, since the last utterance,
indeed the entire discourse history, is in principle
available to all participants throughout the course of the
exchange.  Similarly, the textual version of the IT dialogue
could be annotated with extra information, such as speaker
identity, a time stamp, or other relevant material.  This
would serve to reduce the potential for confusion
surrounding the differentiation between human-human and
human-machine communication, and would also facilitate the
inclusion of more than two human parties in the interaction.
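
     As a rough illustration of such an enhanced textual
record, the following Python sketch attaches a speaker
label, a time stamp, and an addressee marker to each
utterance.  The class TranscriptEntry, its fields, and the
display format are assumptions made for this example, not a
description of an implemented system.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class TranscriptEntry:
    speaker: str         # hypothetical display name of the contributor
    addressee: str       # "party" (human-human) or "system" (human-machine)
    text: str
    timestamp: datetime


def render(entry: TranscriptEntry) -> str:
    """Format one entry as a single line of a text transcript."""
    clock = entry.timestamp.strftime("%H:%M:%S")
    # Mark human-machine turns so that they are not mistaken for
    # contributions to the main, human-human dialogue.
    marker = " [to system]" if entry.addressee == "system" else ""
    return f"[{clock}] {entry.speaker}{marker}: {entry.text}"
```

     Because every line carries its own speaker label, the
same format extends without change to exchanges involving
more than two human parties.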

     The final point concerns multiple-party interaction,
which is common in text-based computer-mediated modes, but
presents problems for voice-only mediation systems.  Whereas
a voice-only IT system might need to simulate a distinct
voice for each conversational participant in order to
indicate the source of each synthesized output utterance, in
a multimodal version of the system, a simpler textual
identification or a still or video image of the contributor
of each discourse event could accompany the translated
output utterance.  Such a system is similar to a text-based
MUD or chat system in which the system automatically
attaches identifying source names to each discourse
contribution (Cherny, 1995; Werry, 1996).  As in text-based
systems, our mediator imposes a total ordering on
contributions, despite the fact that they may have been
generated more or less simultaneously.  In these respects,
one might say that the IT system we envisage amounts to a
multilingual chat system in which the primary mode of
exchange is spoken language.  However, in our proposed
system, as in chat systems, two or more sets of exchanges
may be interleaved in situations where this would be
inappropriate (and probably checked) in face-to-face
communication, where one set of conversants would cede the
floor to another on becoming aware that both were talking at
the same time.  A system incorporating full-duplex
transmission with a multimodal user interface would
ease the task of the dialogue manager by allowing
simultaneous two-way communication among the human
participants, thereby bringing the interaction closer to
face-to-face communication, while still allowing linguistic
and extralinguistic information to be carried by channels
other than voice.
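
     The total ordering described above can be sketched as
follows.  This is a minimal Python illustration, assuming
that the mediator orders contributions by their arrival and
labels each with its source; the class Mediator and its
methods are hypothetical names introduced for the example.

```python
import itertools
from typing import List, Tuple


class Mediator:
    """Serializes contributions in the order they reach the mediator."""

    def __init__(self) -> None:
        self._counter = itertools.count(1)
        self.log: List[Tuple[int, str, str]] = []

    def submit(self, source: str, text: str) -> int:
        # The sequence number, not the moment of speaking, fixes the
        # order, so near-simultaneous contributions never tie.
        seq = next(self._counter)
        self.log.append((seq, source, text))
        return seq

    def transcript(self) -> List[str]:
        # Each line carries the source name, as in a chat system.
        return [f"{seq:03d} <{source}> {text}"
                for seq, source, text in self.log]
```

     A serializer of this kind makes the order of
contributions unambiguous, but it does nothing to prevent
two sets of exchanges from being interleaved; that problem
remains with the dialogue manager.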


References

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra,
     V. J., Jelinek, F., Lafferty, J. D., Mercer, R. L., &
     Roossin, P. S. (1990).  A statistical approach to
     machine translation.  Computational Linguistics, 16(2),
     79-85.

Cherny, L. (1995).  The MUD register:  Conversational modes
     of action in a text-based virtual reality.  Ph.D.
     thesis, Stanford University, Department of Linguistics.

Clark, H., & Schaefer, E. (1987).  Collaborating on
     contributions to conversations.  Language and Cognitive
     Processes, 2(1), 19-41.

Collot, M., & Belmore, N. (1996).  Electronic Language:  A
     new variety of English.  In S. Herring (Ed.), pp. 13-

Grosz, B., & Sidner, C.L.  (1986).  Attention, intention and
     the structure of discourse.  Computational Linguistics,
     12 (3), 175-204.

Heim, I. (1982).  The semantics of definite noun phrases.
     Ph.D. thesis, University of Massachusetts, Department
     of Linguistics.

Herring, S. (1996).  Two variants of an electronic message
     schema.  In S. Herring (Ed.), pp. 81-106.

Herring, S. (Ed.)  (1996).  Computer-mediated communication:
     Linguistic, social and cross-cultural perspectives.
     Amsterdam:  John Benjamins.

Hobbs, J., & Kameyama, M. (1990).  Translation by abduction.
     In Proceedings of COLING-90, Vol. 3 (pp. 155-161).
     [SRI International Technical Note 484, 1990.]

Kamp, H. (1981).  A theory of truth and semantic
     representation.  In J.A.G. Groenendijk, T. Janssen &
     M. Stokhof (Eds.), Formal methods in the study of
     language, Part 1. Amsterdam:  Mathematisch Centrum.

Kay, M., Gawron, J.M., & Norvig, P. (1993).  Verbmobil:  A
     translation system for face-to-face dialog.  Stanford:
     Center for the Study of Language and Information.

LuperFoy, S. (1991).  Discourse pegs:  A computational
     analysis of context-dependent referring expressions.
     Ph.D. thesis, University of Texas, Department of

Maier, E. (1996).  Context construction as subtask of
     dialogue processing:  The Verbmobil case.  Twente
     Workshop on Dialogue Management in Natural Language
     Systems (TWLT11), Twente Workshop Series on Language
     Technology, pp. 113-122.

Morimoto, T., Ogura, K., et al.  (1989).  Spoken language
     processing in SL-TRANS.  ATR Symposium on Basic
     Research for Telephone Interpretation, Tokyo.

Quantz, J.J., Gehrke, M., et al.  (1994).  The Verbmobil
     domain model.  Projektgruppe KIT at Technische
     Universitaet Berlin.  Technical Report 122.

Robin, J. (1995).  Turn-taking in a Cyberian pub:  The
     coordination of discourse on IRC.  Paper presented at
     GURT presession on Computer-Mediated Discourse
     Analysis, Georgetown University, March 8.

Wahlster, W. (1993).  Verbmobil:  Translation of face to
     face dialogues.  Technical Report from the German
     Research Centre for Artificial Intelligence (DFKI).

Werry, C. (1996).  Linguistic and interactional features of
     Internet Relay Chat.  In S. Herring (Ed.), pp. 47-63.
                      Copyright 1996
   Communication Institute for Online Scholarship, Inc.

     This file may not be publicly distributed or reproduced
without written permission of the Communication Institute
for Online Scholarship, P.O.  Box 57, Rotterdam Jct., NY
12150 USA (phone:  518-887-2443).