[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

John Buehler johnbue at msn.com
Thu May 23 01:38:55 New Zealand Standard Time 2002

Ted L. Chen writes:

> To support the notion of being able to distinguish voices in a
> multiple person environment, I just played around with OnLive
> Traveller which uses real speech in a VRML world.  It's basically
> a talker.

> One of the things that allows voice to work in OnLive is that it
> does lip synching (which aids in determining who is/are speaking).
> It also incorporates a few other tricks such as 3D positional
> audio and distance attenuation so even in a crowded room, it's
> quite easy to carry on a conversation, especially given that the
> software also focuses on sound sources coming from directly in
> front of your first person field of view.

> So, I guess spatial sound cues are very important for this to be
> pulled off correctly.  Granted, in a text MUD, this might be
> problematic since there's no real spatial data associated with
> characters.

Yeah, I'd never attempt the whole voice thing in a text-only
environment.  As you say, spatial cues are important.  For current
games, where you can't easily see mouths, the cue I'd use is a
pulsing icon or aura on the character, much like the visualizations
in Microsoft's Media Player for audio files.  The pulsing color
patterns track the audio, so the end user can correlate each voice
with the speaking character.  The other cues that you mention, such
as distance attenuation, will help too.
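
As a rough illustration of those cues, here is a minimal Python
sketch.  It is not how OnLive Traveller or any particular game does
it; the function names, the 2D positions, the inverse-distance
falloff and the front_boost factor are all assumptions made for the
example.  It computes a playback gain from distance and facing, and a
0..1 pulse intensity for the aura from the loudness of one audio
frame.

```python
import math

def voice_gain(listener_pos, listener_facing, speaker_pos,
               ref_distance=1.0, front_boost=1.5):
    """Return a playback gain for one speaker (all parameters hypothetical)."""
    dx = speaker_pos[0] - listener_pos[0]
    dy = speaker_pos[1] - listener_pos[1]
    dist = math.hypot(dx, dy)
    # Inverse-distance attenuation, clamped so nearby voices are not amplified.
    attenuation = ref_distance / max(dist, ref_distance)
    fx, fy = listener_facing
    flen = math.hypot(fx, fy)
    if dist == 0.0 or flen == 0.0:
        # No meaningful direction; skip the front-of-view emphasis.
        return attenuation
    # Cosine of the angle between the facing direction and the speaker.
    cos_angle = (dx * fx + dy * fy) / (dist * flen)
    # Blend from 1.0 (beside or behind) up to front_boost (directly ahead).
    focus = 1.0 + (front_boost - 1.0) * max(cos_angle, 0.0)
    return attenuation * focus

def aura_intensity(samples, full_scale=32768.0):
    """Map one frame of 16-bit PCM samples to a 0..1 pulse intensity (RMS)."""
    if not samples:
        return 0.0
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return min(rms / full_scale, 1.0)
```

The client would call voice_gain once per speaker per frame and drive
the aura's brightness from aura_intensity, so the loudest pulsing
character is also the one you hear most clearly.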

> As for people generally speaking one at a time, it's not required
> when in a conversation, but it does occur out of courtesy much as
> it does IRL.  Sometimes two people would start to talk at the same
> time, but I could still easily make out what was being said.

Yeah, I was just toying around with some synthesized speech and a
simple playback tool.  Just firing up four copies of the tool, each
playing a different speech file so that the voices overlapped,
produced perfectly reasonable results.
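
For what it's worth, overlapping playback amounts to adding the
streams together.  The following Python sketch shows the idea under
the assumption that each voice arrives as a list of 16-bit PCM
samples; mix_streams is a hypothetical name, and a real client would
more likely leave this to the sound card or audio library.

```python
def mix_streams(streams, limit=32767):
    """Sum 16-bit PCM sample lists of unequal length, clipping the result."""
    length = max((len(s) for s in streams), default=0)
    mixed = []
    for i in range(length):
        # Streams that have already ended contribute silence.
        total = sum(s[i] for s in streams if i < len(s))
        # Hard-clip to the signed 16-bit range to avoid wraparound.
        mixed.append(max(-limit - 1, min(limit, total)))
    return mixed
```

Hard clipping is the crudest option; scaling each voice by a gain
before summing would reduce distortion when several people talk at
once.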

>> Speech input permits two control channels, versus the single
>> channel of the keyboard.  With keyboard-only control, I have to
>> slice up its use between verbal statements and character control.
>> In times of intense character control, I don't say much.  Nobody
>> does.  Just running along in the wilderness in a game can be
>> dangerous if you choose to make a joke to somebody that takes
>> more than a few seconds to complete.  You're running in a
>> straight line all that while and you could easily run off a cliff
>> or into a wandering monster.

> One thing to note though is that although speech is more
> hands-free than having to type text, the practicalities of
> microphones, breathing, and background noise make auto-detection
> methods rather frowned upon.  They tend to transmit your asthma :)

> So, at the very least - to be socially accepted (and conserve
> bandwidth) - you'll still need to hold down a key, much like a
> walkie talkie.  That is, of course, unless you're roleplaying
> Darth Vader.

I don't doubt that there are practical considerations, but there's
no need to literally hold down a key.  I can tap a key to indicate
that I'm speaking significant words and then tap a separate key to
indicate that I'm done.  The critical point here is that I'm not
tying up the command and control input devices (keyboard and mouse)
while I'm communicating.  And communication is something that should
be easily accessible all the time.  In time, quality microphones and
speech-to-text (STT) software will alleviate this problem to the
point where there's no need to fool with a key at all.  Simple
tonality can cue the computer in to how you want to deliver your
speech.
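
A minimal Python sketch of the tap-to-start, tap-to-stop idea, as
opposed to walkie-talkie style hold-to-talk.  The class name, the
"F1" and "F2" key names and the event tuples are all invented for the
example; the point is only that the gate is a two-state toggle and
every other key stays free for controlling the character.

```python
class TalkToggle:
    """Tap-to-start, tap-to-stop voice gate (key names are hypothetical)."""

    def __init__(self, start_key="F1", stop_key="F2"):
        self.start_key = start_key
        self.stop_key = stop_key
        self.transmitting = False

    def on_key_tap(self, key):
        """Update state on a key tap; other keys are ignored here."""
        if key == self.start_key:
            self.transmitting = True
        elif key == self.stop_key:
            self.transmitting = False
        return self.transmitting

    def gate(self, events):
        """Yield only the audio frames captured while transmitting.

        events is an iterable of ("key", name) or ("audio", frame) tuples.
        """
        for kind, payload in events:
            if kind == "key":
                self.on_key_tap(payload)
            elif kind == "audio" and self.transmitting:
                yield payload
```

Frames captured before the start tap or after the stop tap (the
breathing and background noise) never leave the client, which also
takes care of the bandwidth concern.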

Now that I say that, I see no reason that I can't issue spoken
commands to say that I'm talking in-character.  Literally, I could
say "start in character" and "end in character" to ensure that the
computer knows when I'm trying to generate in-character speech.  Or I
could say "Startskies" and "Endskies" or any other pair of words.  I
recall a science fiction story where the speaker alerted the computer
that a verbal command was coming by saying a word in Russian.  The
assumption was that the speaker would never use the Russian word
normally - thus my choice of silly words.
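
Sketched in Python, and assuming some speech-to-text layer already
hands the client a stream of recognized words (that layer is not
shown, and split_in_character is a made-up name), the delimiter words
reduce to a simple switch:

```python
def split_in_character(words, start="startskies", end="endskies"):
    """Separate recognized words into in-character and out-of-character lists.

    The start and end delimiter words are consumed, not spoken.
    """
    in_character = False
    ic_words, ooc_words = [], []
    for word in words:
        token = word.lower()
        if token == start:
            in_character = True
        elif token == end:
            in_character = False
        elif in_character:
            ic_words.append(word)
        else:
            ooc_words.append(word)
    return ic_words, ooc_words
```

The sillier the delimiter words, the less likely they are to turn up
in ordinary speech and flip the switch by accident.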


MUD-Dev mailing list
MUD-Dev at kanga.nu
