[Student-projects] Language and acoustic model - GSoC project

Khyathi Chandu khyathiraghavi at gmail.com
Tue Mar 18 10:25:52 PDT 2014


Hello Ma'am,

This is my proposal for the project "Language model and Acoustic model for
Malayalam language for speech recognition system in CMU Sphinx". Requesting
your early feedback.

The wiki page I have created for the same is:
http://wiki.smc.org.in/User:Ragha#PROJECT_EXECUTION_TIME_LINE


What is the work about?

The main objective of a speech recognition system is to transcribe speech
into text. Speech recognition systems can be broadly classified into two
categories based on how they are built. The first type has a limited
vocabulary, where the basic unit is a single word; these are also known as
word recognition systems. The second category performs recognition over a
large vocabulary. The main issue we face there is that the system has no
prior knowledge of word boundaries. Even if this knowledge is provided,
there would be several choices from which the algorithm has to pick the
appropriate one, based either on context or on statistical likelihood. The
idea is that the system needs to check at every instant whether that point
is a word boundary.


SYNOPSIS

Language models that are sophisticated enough to provide adequate context
or semantics of a word are required to disambiguate between all the
shortlisted hypotheses. Another issue that needs to be addressed is
co-articulation: a particular sound in a word is affected by its
predecessor and/or successor. In natural conversational speech such effects
are strong, and Indic languages provide several instances that show how
deep this problem runs. The effectiveness of speech recognition relies on
four primary factors, and I explain below how I plan to address each of
them in the implementation details. The first is how the data is handled,
i.e. the size of the vocabulary and the speaker-independence criterion,
which have to be decided based on the application. The second is acoustic
modeling, which involves the choice of features for each frame or set of
frames. Next is the modeling of the language. Finally, there is the search
problem. Broadly, to date there have been two schools of speech recognition
technology: the HMM-based statistical model and the neural network model. I
am interested in using the statistical model and implementing the
recognition search with the Viterbi algorithm.

IMPLEMENTATION DETAILS

1) Data handling

2) Acoustic modeling

3) Language modeling

4) Search using Viterbi


1) Data Handling and Acoustic Model:

The choice of an appropriate unit of speech is a very important parameter
that determines the quality of a speech recognition system. Since Indian
languages are syllabic, the unit could be the syllable; words are made up
of sequences of syllables. Another issue that the algorithm should address
is co-articulatory effects. With an increase in vocabulary size, the
confusion caused by these effects increases. An appropriate n-gram of units
can be chosen in the HMM model to address this issue, and the contexts can
further be clustered into equivalence classes; such a context-dependent
model is to be developed (most systems take trigrams).

First, the speech input is sampled and pre-processed to obtain a feature
vector for each frame. Sphinx characterizes a frame by four features:
cepstra, del(cepstra), del(del(cepstra)), and power (where del denotes the
differential).
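As a rough illustration of this front end, here is a minimal sketch of
per-frame feature extraction using the python_speech_features package. The
package is only an assumption for illustration; in practice the Sphinx
front end (sphinx_fe) produces the cepstral features itself.

# Sketch: per-frame features (cepstra, del(cepstra), del(del(cepstra)), energy).
# python_speech_features is an illustrative stand-in for the Sphinx front end,
# not the tool that would actually be used in training.
import numpy as np
from scipy.io import wavfile
from python_speech_features import mfcc, delta

def extract_features(wav_path):
    rate, signal = wavfile.read(wav_path)        # 16 kHz mono audio assumed
    cep = mfcc(signal, samplerate=rate,
               winlen=0.025, winstep=0.01,
               numcep=13, appendEnergy=True)     # 13 cepstra, c0 carries log energy
    d_cep = delta(cep, 2)                        # first differential
    dd_cep = delta(d_cep, 2)                     # second differential
    return np.hstack([cep, d_cep, dd_cep])       # one 39-dimensional vector per frame

# feats = extract_features("sample.wav")         # shape: (num_frames, 39)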

The training data includes multiple samples of each word from different
speakers so that the resulting system is speaker-independent. From an
implementation point of view, this can be achieved by modeling the
constituent syllables that make up the word.

In the training phase, randomized searches have been reported to give
better results than exhaustive search. I am also willing to learn about and
work on incorporating genetic algorithms to optimize this part.

2) Language model: Since explicit word boundaries are not present, the
machine has to make a selection from a large number of word-sequence
hypotheses, and an alternate hypothesis may also be syntactically correct.
The idea of the language model is to select the most likely sequence from
all the options the system has. An n-gram model can be applied here, but
memory becomes an issue because the number of possible n-grams is very
large. Language modeling can be done in three primary ways:

1) Context-free grammars: this model is highly restrictive and must follow
the prescribed grammar, so it is not a good choice for large-vocabulary
systems.

2) N-gram models: an n-gram model need not contain the probabilities of all
possible n-word sequences. Instead, a back-off technique that assigns
weights can be applied when the required n-gram is not present.

3) Class n-gram models: these are similar to the n-gram model, with the
difference that the tokens are word classes such as months, named entities,
digits, etc. Experiments in the past have shown effective results using
trigrams.
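To make the back-off idea concrete, here is a minimal sketch of a trigram
model with a simple back-off weight, assuming a pre-tokenised Malayalam
corpus. In practice the ARPA-format model consumed by Sphinx would be built
with a language-model toolkit rather than hand-rolled like this.

# Sketch: trigram counts with a simple back-off weight (a "stupid backoff"
# style score, not a normalised probability). The corpus is assumed to be
# a list of already-tokenised sentences.
from collections import defaultdict

class BackoffTrigramLM:
    def __init__(self, alpha=0.4):
        self.alpha = alpha                      # back-off weight
        self.uni = defaultdict(int)
        self.bi = defaultdict(int)
        self.tri = defaultdict(int)
        self.total = 0

    def train(self, sentences):                 # sentences: lists of tokens
        for sent in sentences:
            toks = ["<s>", "<s>"] + sent + ["</s>"]
            for t in toks:
                self.uni[t] += 1
                self.total += 1
            for i in range(1, len(toks)):
                self.bi[(toks[i - 1], toks[i])] += 1
            for i in range(2, len(toks)):
                self.tri[(toks[i - 2], toks[i - 1], toks[i])] += 1

    def score(self, w1, w2, w3):                # score of w3 given history (w1, w2)
        if self.tri[(w1, w2, w3)] > 0:
            return self.tri[(w1, w2, w3)] / self.bi[(w1, w2)]
        if self.bi[(w2, w3)] > 0:
            return self.alpha * self.bi[(w2, w3)] / self.uni[w2]
        # add-one smoothed unigram as the last resort
        return self.alpha * self.alpha * (self.uni[w3] + 1) / (self.total + len(self.uni))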

Finally, everything rests on the search: finding the best word sequence
given the complete speech input. Viterbi decoding or the A* algorithm can
be used for this task. It can be implemented by processing each frame (or a
fixed set of frames together) and making the required updates up to that
point, i.e. time-synchronous processing. As discussed earlier, the
implementation can proceed with stack decoding or with the
dynamic-programming approach. A brief working of stack decoding is
explained here.

3) Stack decoding: The possible hypotheses and their respective
probabilities are stored, and the best hypothesis is picked at each step.
If it is complete, it is the output; otherwise it is expanded by every
candidate next word and the extensions are placed back on the stack for
further checking. Since this can take a lot of time depending on the
complexity, it is better to proceed with the dynamic-programming-based
Viterbi approach.
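The loop described above can be sketched as follows (a minimal sketch only;
expand_words, score_extension and is_complete are hypothetical helpers
standing in for the acoustic and language model scores).

# Sketch of the stack (best-first) decoder loop using a priority queue.
# Costs are negative log-probabilities, so the smallest cost is the best
# hypothesis; expand_words/score_extension/is_complete are placeholders.
import heapq

def stack_decode(initial_hyp, expand_words, score_extension, is_complete):
    stack = [(0.0, initial_hyp)]                 # (cost, word sequence)
    while stack:
        cost, hyp = heapq.heappop(stack)         # best hypothesis so far
        if is_complete(hyp):
            return hyp, cost                     # first complete pop wins
        for word in expand_words(hyp):           # one-word extensions
            heapq.heappush(stack, (cost + score_extension(hyp, word),
                                   hyp + [word]))
    return None, float("inf")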

4) Viterbi algorithm: It is based on the Hidden Markov Model, whose states
are traversed in a dynamic-programming fashion. In a trellis (which can be
held as a list of dictionaries), the t-th entry of the s-th dictionary
stores the probability of the best state sequence leading from the initial
state at time 0 to state s at time t. In the training phase, the
classification task could be improved using SVMs (Support Vector Machines).
To optimize the Viterbi search further, I also propose to use an elitist
model to decide the direction during the search phase. I am planning to
implement the solution in Python. I have also built my own POS tagger using
the Viterbi algorithm, and some of those modules can be adapted here as
well.

Let N be the total number of states and T be the total duration, i.e. the
number of frames being checked. Then the complexity of Viterbi decoding is
O(N^2 T).
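A minimal sketch of the time-synchronous Viterbi recursion in log space is
given below; the transition and emission scores are assumed to come from
the trained HMM and acoustic model, and the two nested loops over states
make the O(N^2 T) cost visible.

# Sketch: time-synchronous Viterbi decoding in log space. log_init[s],
# log_trans[sp][s] and log_emit(s, t) are assumed to come from the trained
# HMM / acoustic model. The nested loops over states give O(N^2 T).
def viterbi(num_states, num_frames, log_init, log_trans, log_emit):
    NEG_INF = float("-inf")
    # delta[s][t]: best log score of any path ending in state s at frame t
    delta = [[NEG_INF] * num_frames for _ in range(num_states)]
    back = [[0] * num_frames for _ in range(num_states)]

    for s in range(num_states):                          # initialisation
        delta[s][0] = log_init[s] + log_emit(s, 0)

    for t in range(1, num_frames):                       # recursion
        for s in range(num_states):
            best_prev, best_score = 0, NEG_INF
            for sp in range(num_states):
                score = delta[sp][t - 1] + log_trans[sp][s]
                if score > best_score:
                    best_prev, best_score = sp, score
            delta[s][t] = best_score + log_emit(s, t)
            back[s][t] = best_prev

    # backtrace from the best final state
    last = max(range(num_states), key=lambda s: delta[s][num_frames - 1])
    path = [last]
    for t in range(num_frames - 1, 0, -1):
        path.append(back[path[-1]][t])
    return list(reversed(path)), delta[last][num_frames - 1]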

Tree-structured lexicon: A direct implementation of Viterbi decoding still
remains expensive. The search space can be reduced by using lexical trees,
exploiting the fact that many words share the same prefix, so the model for
the shared prefix can also be shared.
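A minimal sketch of such a lexical tree over syllable sequences is shown
below; the syllabification itself is assumed to be done elsewhere (for
example by a rule-based Malayalam syllabifier), and the example entries are
purely illustrative transliterations.

# Sketch: a prefix tree (trie) over syllable sequences, so that words sharing
# an initial run of syllables also share the corresponding model states
# during the search.
class LexNode:
    def __init__(self):
        self.children = {}      # syllable -> LexNode
        self.word = None        # set when a complete word ends at this node

class LexTree:
    def __init__(self):
        self.root = LexNode()

    def add(self, word, syllables):
        node = self.root
        for syl in syllables:
            node = node.children.setdefault(syl, LexNode())
        node.word = word

# Illustrative use (hypothetical transliterated syllables):
# tree = LexTree()
# tree.add("malayalam", ["ma", "la", "ya", "la", "m"])
# tree.add("malar",     ["ma", "lar"])
# Both words share the branch for "ma", so its acoustic model is shared.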

PROJECT EXECUTION TIME LINE

INITIALIZING WORK AND COMMUNITY BONDING WITH MENTORS: (Attain knowledge of
some specifics of the Malayalam language.) (I have my end-semester exams
from 19th to 30th April, so it would be preferable to start the work from
2nd May.)

May 02 - May 06 : Take suggestions from the mentors and incorporate them
into the implementation plan. I would also like to study SVMs to improve
pattern-recognition accuracy, and if so, I want to discuss with the mentors
how to incorporate the changes in the algorithm.

May 07 - May 10 : Design exact details of data structures to be used that
are thought of till then to optimally implement the algorithm. Planning out
the exact input and output details and intermediate deliverables that are
expected.

May 11 - May 18 : Getting feedback from the mentoring community to initiate
the implementations.

CODING PERIOD:

May 19 - May 24 : Handle the CMU Sphinx dataset to be worked upon and
extract feature vectors for each frame. The cepstra, del(cepstra),
del(del(cepstra)), and power are to be used as the features.

May 25 - May 30 : Finalize the exact language model to be adopted, based on
feedback from the mentoring community.

May 31 - June 19 : Implement the Viterbi algorithm to select the best
possible word hypotheses for a given input.

June 20 - June 23 : Test the consistency and efficiency of the models used.

June 24 - June 27 : Discuss with the mentors regarding the progress of the
work to get their feedback and incorporate the suggested changes. Planning
to make further error corrections to improve upon the accuracy.

MID TERM DELIVERABLES:

June 28 - July 11 : Develop application-specific features (if any) that are
required, in consultation with the mentors.

July 12 - July 20 : Do some proactive further reading on whether the
application or the system could be improved in any way.

July 21 - August 1 : Get suggestions from the mentors regarding possible
errors and correct them to improve performance and consistency.

August 2 - August 5 : Document the details of the code. Test the model
repeatedly and correct errors.

August 5 - August 22 : Backup time for delays not anticipated.

END DELIVERABLE: A comprehensive, working speech recognition system for the
Malayalam language in CMU Sphinx.

Post GSoC : Stay in touch with the mentors and the linguistics community to
be a part of further projects and actively contribute towards them.

About me

I am a student of IIIT-H (International Institute of Information
Technology, Hyderabad). I am an integrated dual-degree student currently
pursuing a B.Tech in Computer Science and an MS by Research in
Computational Linguistics. I am working under the guidance of Dr. Kishore
Prahallad for my research and MS. I have currently begun studying deep
neural network based speech segmentation for my research.

Previous experience in the fields of Speech technologies and Computational
Linguistics:

1) Earlier I worked on text-to-speech conversion on mobile platforms for
Indian languages, specifically Telugu and Kannada, and I am also in the
process of developing an Android application for the same. Back-off
techniques were implemented to improve accuracy and consistency.

2) I had the opportunity to learn from Professor Lakshmi Bhai and worked on
a project on the etymological reconstruction of proto-forms of Dravidian
languages by comparing Malayalam and Telugu.

3) I have previously implemented my own POS tagger based on the Viterbi
algorithm, and also developed an unsupervised tagging model as part of a
course project.

4) Developed a plugin for Domain specific morph analysis.

5) Built an intra-chunk expansion tool for English: developed a tool to
mark intra-chunk dependencies of words in English with their expansions
from the Shakti Standard Format (SSF).

I have submitted a paper on "Domain Adaptation in Morphological Analysis"
to ICON-2013: 10th International Conference on Natural Language Processing
organized by CDAC. I have also submitted a paper titled "A dynamic
programming based approach for generating syllable level templates in
statistical parametric synthesis" to IEEE International Conference on
Acoustics, Speech, and Signal Processing (ICASSP) 2014.


Requesting your early feedback.
Thank you,
Khyathi


On Sun, Mar 16, 2014 at 1:59 PM, Deepa P.Gopinath
<deepapgopinath at gmail.com> wrote:

> Hello,
>
> Please go through the link
> http://wiki.smc.org.in/SoC/2014. It has general links on the GSOC program
>
> For your project, you may want to create a page in our wiki
> http://wiki.smc.org.in like other students did, and add details about the
> project as much as you can and prepare a project execution timeline.
>
> The student applications window is open now, so you should prepare your
> application and submit.
>
> Once you explain clearly the algorithm/approach in the wiki page, let us
> know.
>
> Once you submit your application, there will be a review by SMC mentors
> and rest of the GSOC selection process will follow.
> On Mar 15, 2014 12:43 PM, "Khyathi Chandu" <khyathiraghavi at gmail.com>
> wrote:
>
>> Hi,
>>
>> I am interested in working on Language model and Acoustic model for
>> Malayalam language for speech recognition system in CMU Sphinx. I have to
>> my level best tried to understand the background reading papers mentioned.
>> How should I proceed further?
>>
>> Thank you!
>>
>> _______________________________________________
>> Student-projects mailing list
>> Student-projects at lists.smc.org.in
>> http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in
>>
>>
> _______________________________________________
> Student-projects mailing list
> Student-projects at lists.smc.org.in
> http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in
>
>

