[Student-projects] Proposal: Sandhi Splitter for Malayalam

Jerin Philip monu1618 at gmail.com
Fri Mar 11 20:22:34 PST 2016


Hello,

I'm putting forward a new idea for the organization here.

The idea is a statistical sandhi splitter for Malayalam, based on a
research paper[1] from LTRC, IIIT Hyderabad. LibIndic doesn't have a module
of this kind yet. Apart from being a standalone module on the site, I
believe it could have applications such as reducing corpus sizes and
increasing learning efficiency for varnam, as the stemmer implemented in a
previous GSoC project did.

We estimate how likely each character position is to be a split point from
how often it occurs as one in a training dataset. The approach takes into
account the characters that come before and after a possible split point
by the use of conditional probability, and learns the patterns from the
data over iterations, without having to be explicitly programmed. The paper
claims an efficiency close to 90%. False positives can be reduced further
by post-processing the output of the statistical sandhi splitter with a
rule-based system.
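
To make the counting step concrete, here is a minimal sketch in Python. It
is my own illustration, not the paper's exact model or the authors' Java
code: the function names, the one-character context window (k=1), the 0.5
threshold and the toy training strings are all assumptions. It also treats
each Unicode code point as one character, which is a simplification for
Malayalam text.

```python
from collections import defaultdict


def context(word, i, k):
    """Return (k chars before position i, k chars from position i on)."""
    return (word[max(0, i - k):i], word[i:i + k])


def train(examples, k=1):
    """Count, per context, how often a position was a split point.

    examples: iterable of (word, split_positions), where position i means
    a split between word[i-1] and word[i]. (Hypothetical data format.)
    """
    counts = defaultdict(lambda: [0, 0])  # context -> [splits, total]
    for word, splits in examples:
        for i in range(1, len(word)):
            entry = counts[context(word, i, k)]
            entry[1] += 1
            if i in splits:
                entry[0] += 1
    return counts


def split_probability(counts, word, i, k=1):
    """Relative-frequency estimate of P(split | surrounding characters)."""
    entry = counts.get(context(word, i, k))
    if entry is None:
        return 0.0  # unseen context: assume no split (an assumption)
    return entry[0] / entry[1]


def candidate_splits(counts, word, threshold=0.5, k=1):
    """Positions whose estimated split probability reaches the threshold."""
    return [i for i in range(1, len(word))
            if split_probability(counts, word, i, k) >= threshold]


# Toy, made-up training strings (not real Malayalam data).
examples = [("abcd", {2}), ("xbcy", {2}), ("abce", set())]
counts = train(examples)
print(candidate_splits(counts, "qbcz"))
```

In the toy data the context ("b", "c") is a split point in two of its
three occurrences, so position 2 of the unseen string "qbcz" is proposed
as a split. A rule-based post-processing step would then filter such
candidates.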

The proof-of-concept code, implemented in Java by the original authors, is
hosted on GitHub [2]. I'm in touch with two of the authors of the paper,
who have offered their help and advice in case I have trouble with the
theoretical side.

The idea could be extended by coupling it with varnam or the Python port of
libvarnam from your existing ideas page.

I'd appreciate early feedback on the feasibility of this idea, the
availability of mentors, suggestions for enhancing it, and any questions
that need clarification, so that I can improve the proposal.

[1] http://ltrc.iiit.ac.in/icon2015/icon2014_proceedings/papers/File71-p164.pdf
[2] https://github.com/Devadath/Malayalam_Sandhi_Splitter

Thanks.
--
Jerin Philip