[Student-projects] GSoC project on Language model and Acoustic model for Malayalam language for speech recognition system in CMU Sphinx

Khyathi Chandu khyathiraghavi at gmail.com
Thu Mar 20 23:46:36 PDT 2014


Hi sir,
Thank you for the suggestions. I really appreciate point b) about the
mapping between graphemes and the phoneme set, since not much annotated
audio data is available.
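
To make the grapheme-to-phoneme idea concrete, here is a minimal sketch in
Python. It is only an illustration under stated assumptions: the tables
cover just the letters of one word, the phone labels ("m", "aa", "M", ...)
are hypothetical placeholders rather than an agreed phone set, and the
function name graphemes_to_phones is made up for this example. It emits
one phone per sound, so the output is finer-grained than the
syllable-level example quoted below.

```python
# Hypothetical, deliberately tiny tables; a real mapping would cover the
# whole script and use the phone labels chosen for the acoustic model.
# Escapes are used because combining signs render poorly on their own.
CONSONANTS = {
    "\u0d2e": "m",  # MALAYALAM LETTER MA
    "\u0d32": "l",  # MALAYALAM LETTER LA
    "\u0d2f": "y",  # MALAYALAM LETTER YA
    "\u0d33": "L",  # MALAYALAM LETTER LLA
}
VOWEL_SIGNS = {
    "\u0d3e": "aa",  # MALAYALAM VOWEL SIGN AA
    "\u0d3f": "i",   # MALAYALAM VOWEL SIGN I
}
ANUSVARA = "\u0d02"  # MALAYALAM SIGN ANUSVARA
VIRAMA = "\u0d4d"    # MALAYALAM SIGN VIRAMA (suppresses the inherent vowel)


def graphemes_to_phones(word):
    """Convert a word to a list of phone labels using the tables above."""
    phones = []
    i = 0
    while i < len(word):
        ch = word[i]
        if ch in CONSONANTS:
            phones.append(CONSONANTS[ch])
            nxt = word[i + 1] if i + 1 < len(word) else ""
            if nxt in VOWEL_SIGNS:
                # An explicit vowel sign replaces the inherent "a".
                phones.append(VOWEL_SIGNS[nxt])
                i += 1
            elif nxt == VIRAMA:
                # Virama: bare consonant, no vowel follows.
                i += 1
            else:
                # No sign follows, so the consonant keeps its inherent "a".
                phones.append("a")
        elif ch == ANUSVARA:
            phones.append("M")
        else:
            raise ValueError(f"no mapping for character U+{ord(ch):04X}")
        i += 1
    return phones


word = "\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"  # the word "Malayalam"
print(" ".join(graphemes_to_phones(word)))  # m a l a y aa L a M
```

The resulting labels would still have to be mapped onto whatever phone set
is finally used with Sphinx; that step is not shown here.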

As for the remaining points:
a)  Only a limited audio database is available. To start with, we can use
the sample set from LDC-IL (Linguistic Data Consortium for Indian Languages) (
http://www.ldcil.org/resourcesSampleSpeechCorp.aspx) and the annotated
speech data available from the Speech and Vision Lab of IIIT-H (
http://speech.iiit.ac.in/index.php/research-svl/69.html). I also think that
recording some additional speech and transcribing it manually would
strengthen the project.

c) Another challenge is the lack of large Malayalam text corpora that could
be used for language modeling. My idea for compiling the data is to use
Wikipedia pages, reliable online newspapers such as Manorama (
http://www.manoramaonline.com/cgi-bin/MMOnline.dll/portal/ep/home.do?tabId=0)
and Deshabhimani (http://www.deshabhimani.com/home.php), and the LDC-IL
dataset (http://www.ldcil.org/Corpora/text/Malayalam/MAL1.pdf).
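
As a rough sketch of how text gathered from such sources could be prepared
for N-gram modeling, the Python below strips everything outside the
Malayalam Unicode block and counts N-grams with sentence boundary markers.
The function names (clean_line, ngram_counts) and the "<s>"/"</s>" marker
strings are assumptions made for this illustration, and the step of
downloading pages and stripping markup is not shown.

```python
import re
from collections import Counter

# Keep the Malayalam block (U+0D00-U+0D7F), the zero-width joiners that
# Malayalam text may contain (U+200C, U+200D), and whitespace.
NON_MALAYALAM = re.compile(r"[^\u0d00-\u0d7f\u200c\u200d\s]")


def clean_line(text):
    """Replace non-Malayalam characters with spaces and collapse whitespace."""
    return " ".join(NON_MALAYALAM.sub(" ", text).split())


def ngram_counts(sentences, n):
    """Count word N-grams over cleaned sentences, padding each sentence
    with one start marker and one end marker."""
    counts = Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


raw = ["Kerala: \u0d2e\u0d32 \u0d2f\u0d3e 2014!", "\u0d2e\u0d32 \u0d2f\u0d3e"]
cleaned = [clean_line(line) for line in raw]
bigrams = ngram_counts(cleaned, 2)
print(bigrams.most_common(3))
```

Counts like these only show the idea; the actual language model would be
estimated with smoothing by whichever toolkit is chosen for the project.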


The updated proposal is here:

http://wiki.smc.org.in/User:Ragha

Feedback and suggestions are highly valued and appreciated.

Thank you


On Fri, Mar 21, 2014 at 4:09 AM, Kartik A <kartik.a9111 at gmail.com> wrote:

> Hi Khyati,
>
> A few queries about your plan of action. Please correct me if I am wrong.
>
> a) Data compilation: For an acoustic model, audio data is a very
> significant requirement. Do you have a plan for which databases
> you will focus on? You mentioned transcribing the audio data. If
> you plan to use audio data that is 4 hours long, will it be manually
> transcribed? I think resources need to be set up before one
> can even think of training the Sphinx model.
>
> b) I guess a huge amount of annotated audio data cannot be gathered for
> Malayalam, so one has to look into adaptive acoustic modelling. For that you
> have to make a grapheme-to-phoneme mapping, which should look like this:
>     മ ല യാ ളം   :  ma la ya La aM
> and then map it to the phone set Sphinx supports.
>
> c) Language model: There are various straightforward approaches, and
> yes, I agree N-gram is still the best among them. But what about
> compiling data for language modelling, such as a large raw dataset for
> Malayalam? Is there any such dataset available other than the
> transcriptions of the audio data?
>
>
>
> On Fri, Mar 21, 2014 at 12:06 AM, Deepa P.Gopinath <
> deepapgopinath at gmail.com> wrote:
>
>> Hello,
>>
>> The timeline is better now. I feel the end deliverable can itself be
>> 'Language and acoustic model'. A speech recognition system can be
>> developed within the constraints of time.
>>
>> regards
>>
>>
>> On Thu, Mar 20, 2014 at 8:04 PM, Khyathi Chandu <khyathiraghavi at gmail.com
>> > wrote:
>>
>>> Ma'am,
>>>
>>> I have updated the project proposal based on your suggestions. I have
>>> mentioned the details of data compilation and modified the time frame. Here
>>> is the link:
>>>
>>> http://wiki.smc.org.in/User:Ragha
>>>
>>> I am ready to dedicate as much time as needed and to work out the details
>>> as best I can. I look forward to your feedback.
>>>
>>> Thank you
>>>
>>>
>>>
>>> On Thu, Mar 20, 2014 at 1:16 PM, Deepa P.Gopinath <
>>> deepapgopinath at gmail.com> wrote:
>>>
>>>> Hello,
>>>>
>>>> To develop a language and acoustic model, we need to compile a sufficient
>>>> database. You haven't considered this in your proposal. *I feel you
>>>> have to reframe your timeline.* It seems to be a bit ambitious. After
>>>> the project we should be able to contribute a good database and a language
>>>> and acoustic model.
>>>>
>>>> Do contact me after modifying your proposal.
>>>>
>>>> regards
>>>>
>>>>
>>>> On Wed, Mar 19, 2014 at 1:09 PM, Khyathi Chandu <
>>>> khyathiraghavi at gmail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> Here is a link describing how I would like to proceed with the project.
>>>>> I want to work on the project "Language model and Acoustic model for
>>>>> Malayalam language for speech recognition system in CMU Sphinx".
>>>>>
>>>>> http://wiki.smc.org.in/User:Ragha
>>>>>
>>>>> It would be very helpful if someone could give feedback and
>>>>> suggestions.
>>>>>
>>>>> Thank you
>>>>>
>>>>> _______________________________________________
>>>>> Student-projects mailing list
>>>>> Student-projects at lists.smc.org.in
>>>>> http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Deepa P.Gopinath
>>>> Lecturer in Electronics and Communication
>>>> Department of  Electronics Engg.
>>>> College of Engineering Thiruvananthapuram
>>>> Kerala, India
>>>> Mobile- +919446583466
>>>>
>>>
>>>
>>
>>
>>
>>
>
>
> --
> Thanks & Regards,
> Kartik A.
>
>