[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

Ted L. Chen tedlchen at yahoo.com
Wed May 8 20:52:05 New Zealand Standard Time 2002

Hello to everyone,

Here's my first useful (hopefully) contribution to this list.  This
is primarily a brainstorming and partial feeler post about the
subject of voice in MO* environments and more specifically,
discussion about a possible technical direction for implementing
it.  This desired feature has popped up several times in the past
(Where are We Now? - May 2001).

As with all features, we should ask why we want it included before
we even look at how.  At the risk of repeating something that's
already been said, I'll list it out again here for completeness'
sake:

  Why We Want It:
    rate of input (some people type really slow <g>)
    inflection is subtle (mostly) and a very powerful tool
  Minimum Basic Requirements:
    imposes low server bandwidth
    imposes low server cpu processing overhead
  Highly Desirable Features:
    voice disguising
    imposes low client bandwidth use
    imposes low-med client cpu processing overhead
  Desirable Features:
    allows multiple channels (background chatter, focal chatter)
    translation of voice into text and vice versa

Mind you, the list above is biased by the fact that I generated it
after arriving at the solution :) Please feel free to
add to it and see how the implementation below still holds up.

Note that in the requirements I listed, I placed "translation of
voice into text" under desirable features.  In our ideal envisioned case,
most of us probably assume this would be the case, to encompass the
preferences of all players (much like how captions exist in say a
DVD).  However, the processing requirements of speech recognition
currently make that possibility unlikely in the near future.
Likewise, sending the sound as raw data (with or without modulation)
is prohibitive in terms of bandwidth.  I admit, this does seem like
a case of the 'requirements' chasing the solution.  But I'll go
through the implementation and let you judge whether or not this is
a viable tradeoff.

The idea centers around the assumption that the analysis of phonemes
in a sound file is orders of magnitude faster than actual
translation into meaningful text.  In the latter case, we have
dictionary searches and heuristics that require cpu processing power
to piece the phonemes into something intelligible.  So, if we stop
at the phoneme level and send that information, it is assumed that
we can cut down on the bandwidth required.  In fact, this can be
considered a mild form of compression.
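
To put rough numbers on the bandwidth claim, here is a
back-of-envelope sketch.  Every figure in it is my own assumption
for illustration (8 kHz 8-bit mono as the "raw" baseline, about 12
phonemes per second of speech, one byte per phoneme symbol); none
of it is measured data.

```python
# Rough bandwidth comparison.  All constants are assumptions chosen
# for illustration, not measurements.
RAW_SAMPLE_RATE_HZ = 8000   # telephone-quality mono audio (assumed)
RAW_BYTES_PER_SAMPLE = 1    # 8-bit samples (assumed)
PHONEMES_PER_SECOND = 12    # ballpark conversational rate (assumed)
BYTES_PER_PHONEME = 1       # one byte can index up to 256 symbols


def raw_bytes_per_second():
    """Bandwidth of the uncompressed audio baseline."""
    return RAW_SAMPLE_RATE_HZ * RAW_BYTES_PER_SAMPLE


def phoneme_bytes_per_second(extra_bytes_per_phoneme=0):
    """Bandwidth of a phoneme stream, optionally with extra
    per-phoneme bytes for inflection or tempo data."""
    return PHONEMES_PER_SECOND * (BYTES_PER_PHONEME + extra_bytes_per_phoneme)


if __name__ == "__main__":
    raw = raw_bytes_per_second()
    bare = phoneme_bytes_per_second()
    tagged = phoneme_bytes_per_second(extra_bytes_per_phoneme=2)
    print(f"raw audio:         {raw} bytes/s")
    print(f"phonemes only:     {bare} bytes/s")
    print(f"phonemes + 2 tags: {tagged} bytes/s")
```

Under these assumptions the phoneme stream stays in the range of
ordinary text chat even after a couple of bytes of expressive data
are attached to each phoneme.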

However, for an expressive system, this saving will likely be offset
by encoded inflection, tempo data and anything else required to
maintain enough of the original speech pattern.  A good starting
point for how to encode this may be gathered from the following:

  Microsoft Agent Speech Output Tags


  Speech Synthesis Markup Language Specification
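
As a strawman for what the encoded stream might look like on the
wire, here is a minimal sketch.  The three-byte record (phoneme id,
pitch offset, duration) and all names in it are hypothetical and
invented for this example; it is not the tag format of either
specification listed above.

```python
import struct
from typing import List, NamedTuple

# Hypothetical wire format: 3 bytes per phoneme, network byte order.
#   B = phoneme id (0..255)
#   b = pitch offset in semitones from the speaker's baseline (-128..127)
#   B = duration in centiseconds (0..255)
_RECORD = struct.Struct("!BbB")


class PhonemeEvent(NamedTuple):
    phoneme_id: int
    pitch_semitones: int
    duration_cs: int


def encode(events: List[PhonemeEvent]) -> bytes:
    """Pack a list of phoneme events into a byte string."""
    return b"".join(_RECORD.pack(*event) for event in events)


def decode(payload: bytes) -> List[PhonemeEvent]:
    """Unpack a byte string produced by encode()."""
    if len(payload) % _RECORD.size:
        raise ValueError("truncated phoneme stream")
    return [PhonemeEvent(*fields) for fields in _RECORD.iter_unpack(payload)]


if __name__ == "__main__":
    utterance = [PhonemeEvent(17, 2, 9), PhonemeEvent(4, -3, 12)]
    wire = encode(utterance)
    print(len(wire), "bytes on the wire")
    print(decode(wire))
```

A fixed-size record keeps parsing trivial, which matters for the
low-cpu requirement; a real format would presumably want optional
tags rather than paying for pitch and duration on every phoneme.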


On the client end, it is reconstructed through a low-level
phoneme-to-speech synthesizer.  This method does impose some hefty
(but hopefully not prohibitive) client side cpu processing,
especially if we allow mixing of multiple channels (cf. multiple
speakers on vicinity chat for instance).  However, server side
processing and bandwidth use will be on the same order as current
chat systems.  It is suggested that modulation can be implemented by
altering the encoded tags or switching the 'speaker' in the
text-to-speech synthesizer.
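
To illustrate the "altering the encoded tags" idea, here is a small
sketch of voice disguising done purely on the encoded stream.  The
event layout (phoneme symbol, pitch offset in semitones, duration in
milliseconds) and the function name are assumptions made up for this
example, not part of any existing synthesizer API.

```python
from typing import List, Tuple

# Hypothetical event: (phoneme symbol, pitch offset in semitones,
# duration in milliseconds).
Event = Tuple[str, int, int]


def disguise(events: List[Event], pitch_shift: int = 0,
             tempo_scale: float = 1.0) -> List[Event]:
    """Return a copy of the stream with every pitch offset shifted
    and every duration divided by tempo_scale (values above 1.0
    speed the speech up).  The phonemes themselves are untouched,
    so the words stay the same while the delivery changes."""
    if tempo_scale <= 0:
        raise ValueError("tempo_scale must be positive")
    disguised = []
    for phoneme, pitch, duration_ms in events:
        new_duration = max(1, round(duration_ms / tempo_scale))
        disguised.append((phoneme, pitch + pitch_shift, new_duration))
    return disguised


if __name__ == "__main__":
    spoken = [("HH", 0, 60), ("AY", 3, 180)]
    print(disguise(spoken, pitch_shift=-5, tempo_scale=1.5))
```

Since this only touches a few integers per phoneme, it could run on
either the sending or the receiving client at negligible cost;
switching the synthesizer's 'speaker' would happen separately at
playback time.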

The only desired feature that it fails to address (so far) is the
translation of the speech into text.  The converse translation of
text into speech is possible but will lack any custom inflections.
Most text-to-speech systems like Lernout & Hauspie's TruVoice
generate phonemes along with inflection based on very simple
heuristics.  Conceivably, one could allow the player to append
encoding into the chat stream akin to the current use of smileys or

So, after all is said and done, has anyone attempted to do anything
similar to this or thought about it in depth?  My personal
background comes more from the text-to-speech side than anything
specifically related to MO*'s so I might have missed something that
someone who has a deeper familiarity with the technical side of
MO*'s would catch.

With a TTS, it is quite possible to expressively generate
synthesized speech but it currently requires hand coding a lot of
tags into the stream and at the phoneme level.  Automatically
generating this data from the user for the express purpose of
pumping it back out through a TTS synthesizer is something that I'm
not sure anyone has focused on.  Phonemes in speech processing have
mainly been applied to compressing real voice data.  These
compression techniques, however, aim to preserve the voice, which is
ironically something MO*'s may not want to preserve.  The MO* needs
to use only a small (and somewhat established) subset of this
research.

Granted, there are some other issues with including voice in MO*'s,
primary among which is the cost of admittance.  Having someone
closer to the MO* community than I am comment on this would be
helpful.


Other useful resources:

  Comp.speech FAQ


  English to Phoneme Translation


MUD-Dev mailing list
MUD-Dev at kanga.nu
