[MUD-Dev] [TECH] Voice in MO* - Phoneme Decomposition and Reconstruction

Jon Leonard jleonard at slimy.com
Sun May 26 12:28:31 New Zealand Standard Time 2002


On Fri, May 24, 2002 at 01:35:40PM -0500, Eli Stevens wrote:
> From: "Ted L. Chen" <tedlchen at yahoo.com>
 
>> Ugh, I hate replying to my own posts but I guess I should just
>> stick a foot in my mouth :) In my own defense, "phoneme" didn't
>> pop up any good hits on the archive search engine so I thought
>> this topic wasn't discussed before.  But going backwards through
>> the archives manually (good reading) I finally got to the short
>> sub-thread:
 
>>   http://www.kanga.nu/archives/MUD-Dev-L/2001Q2/msg01688.php
 
>> Sorry for the noise, but hopefully at least the Comp.speech FAQ
>> and English to Phoneme Translation resources might be of some
>> interest to people on that older thread.

> Heh heh, noise.  :P I believe that your use of "phoneme" was the
> first time I have encountered the term.  Learn something new every
> week.  ;)

[snip proposed scheme for compressing speech (any audio, really) by
searching for similar waveforms]

It seems to me that the real challenge is to throw away information
that's uninteresting, and transmit the rest.  The point is mostly
to conserve bandwidth, but lossy compression can also allow for some
other interesting effects if it's done in the right way.

The sort of information we want to throw away comes in several
classes:

  -- Stuff people simply can't hear.

    This includes things like sounds over 20 kHz, and most phase
    information.

  -- Stuff that's uninteresting to the listeners.

    For speech processing, this might include speaker-specific
    details of the voice, at least after the initial information has
    been sent.

  -- Stuff that's excluded by the domain.  
 
    For example, if there's a consistent 60 Hz hum, that's not part
    of the speech (more likely power interfering with the sound
    card), and it should be excluded.
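
As a rough illustration of the "excluded by the domain" case, here is
a minimal sketch of stripping a steady 60 Hz hum before any further
encoding.  It assumes a standard second-order (biquad) notch filter,
which is one common way to do this and not something the thread
specifies.  The function names, the 8 kHz sample rate, the 440 Hz
stand-in for "speech", and the Q value are all made up for the
example.

```python
import math

def notch_coefficients(hum_hz, sample_rate, q=10.0):
    """Normalized biquad notch coefficients centered on hum_hz (assumed design)."""
    w0 = 2.0 * math.pi * hum_hz / sample_rate
    alpha = math.sin(w0) / (2.0 * q)
    a0 = 1.0 + alpha
    b = (1.0 / a0, -2.0 * math.cos(w0) / a0, 1.0 / a0)
    a = (-2.0 * math.cos(w0) / a0, (1.0 - alpha) / a0)
    return b, a

def remove_hum(samples, hum_hz, sample_rate, q=10.0):
    """Run the samples through the notch filter (direct form I)."""
    (b0, b1, b2), (a1, a2) = notch_coefficients(hum_hz, sample_rate, q)
    x1 = x2 = y1 = y2 = 0.0
    out = []
    for x in samples:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out

def tone_level(samples, freq_hz, sample_rate):
    """Estimate the amplitude of one frequency by correlating with sin/cos."""
    n = len(samples)
    re = sum(x * math.cos(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(samples))
    im = sum(x * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
             for i, x in enumerate(samples))
    return 2.0 * math.hypot(re, im) / n

SAMPLE_RATE = 8000  # hypothetical sample rate for the example

# Two seconds of fake input: a 440 Hz "voice" tone plus a 60 Hz hum.
noisy = [
    math.sin(2.0 * math.pi * 440.0 * i / SAMPLE_RATE)
    + 0.5 * math.sin(2.0 * math.pi * 60.0 * i / SAMPLE_RATE)
    for i in range(2 * SAMPLE_RATE)
]
cleaned = remove_hum(noisy, 60.0, SAMPLE_RATE)

# Skip the first second so the filter's start-up transient has died away.
settled = cleaned[SAMPLE_RATE:]
print("hum level after filtering:", tone_level(settled, 60.0, SAMPLE_RATE))
print("440 Hz level after filtering:", tone_level(settled, 440.0, SAMPLE_RATE))
```

A narrow notch like this only helps when the hum really is steady at
one frequency.  Harmonics of the hum (120 Hz, 180 Hz, ...) would each
need their own notch.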

The challenge is to figure out some basis for representing the
sounds that makes it easy to separate the stuff of interest from most
of the rest of the data.  (It's ok to send some stuff that's not of
interest; the point is to not send all of it.)

I'd recommend looking at the Fourier transform of the data, or some
similar transformation, to get rid of the (inaudible) phase half of
the data.  At a more sophisticated level, the MP3 encoding scheme is
the result of a lot of study into what's audible.  Reviewing that
research would be a good start.
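
To make the "phase half" concrete, here is a small sketch that takes
a discrete Fourier transform of one frame and keeps only the
magnitudes.  It uses a naive O(n^2) DFT from the standard library so
it runs anywhere; real code would use an FFT.  The frame size, sample
rate, test tone, and function names are assumptions for the example,
not anything from the thread.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive DFT of a real frame, keeping magnitudes and discarding phase."""
    n = len(frame)
    mags = []
    # A real-valued signal only needs bins 0 .. n/2; the rest are mirror images.
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(frame):
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        mags.append(abs(acc))
    return mags

def tone(freq_hz, phase, sample_rate, n):
    """Generate n samples of a sine wave with the given starting phase."""
    return [math.sin(2.0 * math.pi * freq_hz * t / sample_rate + phase)
            for t in range(n)]

SAMPLE_RATE = 8000  # hypothetical sample rate
FRAME = 256         # hypothetical frame length

# The same 500 Hz tone with two different starting phases.
a = magnitude_spectrum(tone(500.0, 0.0, SAMPLE_RATE, FRAME))
b = magnitude_spectrum(tone(500.0, 1.0, SAMPLE_RATE, FRAME))

peak_bin = max(range(len(a)), key=a.__getitem__)
peak_hz = peak_bin * SAMPLE_RATE / FRAME
largest_difference = max(abs(x - y) for x, y in zip(a, b))

print("peak frequency (Hz):", peak_hz)
print("largest magnitude difference between the two phases:", largest_difference)
```

The two frames differ sample by sample, but their magnitude spectra
match to within rounding error, which is the sense in which the phase
can be dropped.  A decoder then has to invent a plausible phase when
it rebuilds the waveform, and how audible that is across frame
boundaries is exactly the kind of question the MP3 research covers.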

I have less of an idea how to proceed for reducing speech into its
component parts (phonemes, as an approximation), but there's a lot
of good research there too.

Jon Leonard


