[Student-projects] Language model and Acoustic model for Malayalam (Deepa P.Gopinath) as GSOC Project

karan singla ksingla025 at gmail.com
Mon Mar 10 04:23:05 PDT 2014


Hello Deepa,

I agree that the IIIT speech data is neither good enough nor large
enough for building an ASR system, but to my knowledge it is the only
freely available data.

We can start with whatever data we have. Meanwhile, within the scope of
this project, I propose that in the first phase we start re-recording
the IIIT data with 3 native, fluent Malayalam speakers. We can also
look into transcribing the news speech corpora available online.

In the meantime, we can build a baseline with this small amount of
available data (even if the model is not ready for practical use).

We can retrain the system as the data grows.

To my knowledge, obtaining the phoneme sequence is much simpler for
Malayalam because of the phonetic nature of the language: the majority
of the letters map directly to phonemes. Unlike English words, almost
all written Malayalam words can be read in only one way.

The Speech Lab at IIIT-Hyderabad uses a similar technique to build a
phonetic dictionary for Telugu, and the results are quite good.

http://cpansearch.perl.org/src/SYAMAL/Unicode-Indic-0.01/keymap.pdf
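
A rough sketch of the letter-to-phoneme idea in Python. It is only an
illustration: the tables cover a handful of letters, the phone labels
("k", "aa", "L", ...) are placeholders rather than any established
phone set, and the helper names `ch` and `to_phones` are made up here.
It is not the IIIT-H tool. The rule it applies is the usual one for
Indic scripts: a consonant carries an inherent "a" unless a vowel sign
or a virama follows it.

```python
import unicodedata


def ch(name):
    """Return a Malayalam character from the tail of its Unicode name."""
    return unicodedata.lookup("MALAYALAM " + name)


# Toy tables (assumption): a real dictionary needs every letter,
# plus rules for chillu letters, conjuncts and other exceptions.
CONSONANTS = {
    ch("LETTER KA"): "k",
    ch("LETTER MA"): "m",
    ch("LETTER LA"): "l",
    ch("LETTER YA"): "y",
    ch("LETTER LLA"): "L",
}
VOWEL_SIGNS = {
    ch("VOWEL SIGN AA"): "aa",
    ch("VOWEL SIGN I"): "i",
    ch("VOWEL SIGN U"): "u",
}
INDEPENDENT_VOWELS = {
    ch("LETTER A"): "a",
    ch("LETTER AA"): "aa",
    ch("LETTER I"): "i",
}
MARKS = {ch("SIGN ANUSVARA"): "m"}
VIRAMA = ch("SIGN VIRAMA")


def to_phones(word):
    """Map one written word to a list of phone labels."""
    phones = []
    i = 0
    while i < len(word):
        c = word[i]
        nxt = word[i + 1] if i + 1 < len(word) else ""
        if c in CONSONANTS:
            phones.append(CONSONANTS[c])
            if nxt in VOWEL_SIGNS:
                phones.append(VOWEL_SIGNS[nxt])  # sign replaces inherent "a"
                i += 1
            elif nxt == VIRAMA:
                i += 1  # virama suppresses the inherent vowel
            else:
                phones.append("a")  # inherent vowel
        elif c in INDEPENDENT_VOWELS:
            phones.append(INDEPENDENT_VOWELS[c])
        elif c in MARKS:
            phones.append(MARKS[c])
        else:
            raise ValueError("no rule for %r" % c)
        i += 1
    return phones


# The word "Malayalam" as written in Malayalam script.
word = "".join(ch(n) for n in [
    "LETTER MA", "LETTER LA", "LETTER YA",
    "VOWEL SIGN AA", "LETTER LLA", "SIGN ANUSVARA",
])
print(" ".join(to_phones(word)))  # prints: m a l a y aa L a m
```

Running this over every word in the transcription files would give a
first-pass pronunciation dictionary, which would still need manual
checking for the exceptions.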

For language modelling, there is a freely distributed monolingual text
corpus of approximately 60,000 sentences from LTRC, IIIT-H:

http://ltrc.iiit.ac.in/showfile.php?filename=ltrc/internal/nlp/corpus/index.html

One can also get the Malayalam Wikipedia dump.
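
To make the language-modelling step concrete, here is a minimal Python
sketch of a bigram model with add-one smoothing, trained on
one-sentence-per-line text. The function names and the romanized toy
sentences are assumptions for illustration only; a real system would
use an existing language-modelling toolkit and a much larger corpus
(such as the text sources mentioned above).

```python
from collections import Counter


def train_bigram(sentences):
    """Count bigrams and their histories over whitespace-tokenized text."""
    history_counts = Counter()
    bigram_counts = Counter()
    vocab = {"</s>"}  # words that can be predicted, incl. end marker
    for sentence in sentences:
        words = sentence.split()
        vocab.update(words)
        tokens = ["<s>"] + words + ["</s>"]
        history_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return history_counts, bigram_counts, vocab


def bigram_prob(prev, word, history_counts, bigram_counts, vocab):
    """P(word | prev) with add-one (Laplace) smoothing."""
    return (bigram_counts[(prev, word)] + 1) / (history_counts[prev] + len(vocab))


# Toy corpus (placeholder tokens, not real Malayalam text).
corpus = ["njan veettil pokunnu", "njan schoolil pokunnu"]
hist, bi, vocab = train_bigram(corpus)
print(bigram_prob("njan", "veettil", hist, bi, vocab))
print(bigram_prob("njan", "pokunnu", hist, bi, vocab))
```

The seen bigram gets a higher probability than the unseen one, and
smoothing keeps the unseen one above zero, which matters with a corpus
this small.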

We can't aim for a really good model, but we can aim for one that
understands simple sentences.

Regards,
Karan




On Mon, Mar 10, 2014 at 3:03 PM, Deepa P.Gopinath
<deepapgopinath at gmail.com> wrote:
>
> Hello Karan,
>
> I have heard the IIIT speech database. It contains 1000 selected sentences, spoken as separate sentences. It is good for a TTS system, but I think it might not be very good for ASR, for 2 reasons:
> 1) Since it consists of isolated sentences, a model trained on it may only be able to recognize speech in isolated sentences; in other words, the input speech must have enough pause between sentences. 2) The articulation is very slow and the pronunciation very clear and good, so it differs slightly from the normal Malayalam reading style. For an ASR system we need a speech database that resembles typical Malayalam speech.
>
> For ASR, the training database is very important. The results depend on this.
>
> As you said, Malayalam has similarities with Telugu, so the phonetic dictionary available for Telugu can be adapted for Malayalam.
>
> As far as I know, a standard text corpus is not readily available for Malayalam.
>
> regards
>
>
> On Mon, Mar 10, 2014 at 4:30 AM, karan singla <ksingla025 at gmail.com> wrote:
>>
>> Hello Deepa,
>>
>> I am Karan, working at LTRC, IIIT-Hyderabad. I have also worked on a project co-funded by AT&T to build an ASR system for Hindi, and have tried adaptive acoustic modelling for Kannada and Malayalam (the results were not great).
>>
>>
>> As you suggested, we can begin with a small speech corpus that is freely available for Malayalam:
>>
>> http://festvox.org/databases/iiit_voices/
>>
>> This is not sufficient, but it is enough to begin with. We will need to record more data in the future.
>>
>> For Acoustic Modelling:
>>
>> There is a freely available phonetic dictionary for Hindi in which Hindi graphemes have been mapped to the American English phone set, since Sphinx is built for the English phone set and we don't have enough speech data to create a new model. So only adaptation is possible at first.
>>
>> Malayalam is a Dravidian language, and I believe a phonetic dictionary for Telugu is available in the speech lab at my university, but I need to check whether they can share it. Adapting from Telugu would then be a better option, as Telugu is "closer" to Malayalam than Hindi is.
>>
>> After building a model with this dictionary, one needs to generate phonetic mappings for all the words in the transcription files of the speech corpus.
>>
>> For Language Modelling:
>> The transcriptions will certainly be included. I am not aware of any raw text available in Malayalam. Is there raw data available?
>>
>> Am I thinking along the right lines?
>>
>> Hoping for a reply soon,
>> Karan Singla
>> LTRC, IIIT-Hyderabad
>>
>> _______________________________________________
>> Student-projects mailing list
>> Student-projects at lists.smc.org.in
>> http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in
>>
>
>
>
> --
> Dr. Deepa P.Gopinath
> Lecturer in Electronics and Communication
> Department of  Electronics Engg.
> College of Engineering Thiruvananthapuram
> Kerala, India
> Mobile- +919446583466


