[Student-projects] Proposal: Sandhi Splitter for Malayalam

Aboobacker MK aboobackervyd at gmail.com
Fri Mar 11 23:47:40 PST 2016


Interesting project! It has a wide variety of applications. We already have
a listed idea, "Spell checker which understands inflections", which requires
a sandhi splitter. I worked on a similar project in GSoC 2014, but it used a
rule-based approach.
On 12 Mar 2016 09:53, "Jerin Philip" <monu1618 at gmail.com> wrote:

> Hello,
>
> I'm putting forward a new idea for the organization here.
>
> The idea is a statistical sandhi splitter for Malayalam, based on a
> research paper[1] from LTRC, IIIT Hyderabad. LibIndic doesn't have a module
> of this kind yet, and I believe that apart from being a standalone module
> on the site, it could have applications like reducing corpus sizes and
> increasing learning efficiency for varnam, much as the stemmer implemented
> in an earlier GSoC project did.
>
> We estimate the chance of a character being a split point from the
> frequency with which it is a split point in a training dataset. The
> approach takes into account the characters that come before and after a
> possible split point through conditional probability, and over iterations
> it recognizes patterns by itself, without having to be explicitly
> programmed. The paper claims an efficiency close to 90%. False positives
> can be further reduced by post-processing the output of the statistical
> sandhi splitter with a rule-based system.
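A minimal Python sketch of the counting idea described above, not the paper's actual method. It assumes a context of just one character on each side of a candidate split point, training items with '+' marking the split, and made-up toy strings rather than real Malayalam data. The function names and the 0.5 threshold are hypothetical, and it only locates split points without restoring the characters that sandhi changes at the boundary.

```python
from collections import Counter


def train(annotated_words):
    """Count how often each (left char, right char) context is a split point.

    Each training item marks split points with '+', e.g. 'ab+cd'.
    """
    split_counts = Counter()
    total_counts = Counter()
    for item in annotated_words:
        chars = item.replace("+", "")
        # Record the indices (in the unmarked word) where a split occurs.
        split_positions = set()
        offset = 0
        for ch in item:
            if ch == "+":
                split_positions.add(offset)
            else:
                offset += 1
        # Every gap between two adjacent characters is a candidate position.
        for i in range(1, len(chars)):
            context = (chars[i - 1], chars[i])
            total_counts[context] += 1
            if i in split_positions:
                split_counts[context] += 1
    return split_counts, total_counts


def split_probability(context, split_counts, total_counts):
    """Estimate P(split | left char, right char) by relative frequency."""
    total = total_counts[context]
    return split_counts[context] / total if total else 0.0


def split_word(word, split_counts, total_counts, threshold=0.5):
    """Cut the word wherever the estimated split probability exceeds threshold."""
    parts, start = [], 0
    for i in range(1, len(word)):
        context = (word[i - 1], word[i])
        if split_probability(context, split_counts, total_counts) > threshold:
            parts.append(word[start:i])
            start = i
    parts.append(word[start:])
    return parts


# Toy, made-up training data: 'b' followed by 'c' is a split in 2 of 3 cases.
split_counts, total_counts = train(["ab+cd", "xb+cy", "bcz"])
print(split_word("mbcn", split_counts, total_counts))  # ['mb', 'cn']
```

Unseen contexts get probability 0 here, so the sketch never splits at them; a real system would need longer contexts and some smoothing.
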
>
> The proof-of-concept code, implemented in Java by the original authors, is
> hosted on GitHub [2]. I'm in touch with two of the authors of the paper,
> who have offered their help and advice in case I have trouble with the
> theoretical side.
>
> The idea could be extended by coupling it with varnam or the Python port
> of libvarnam from your existing ideas page.
>
> I'd appreciate a response at the earliest on the feasibility of this idea
> and the availability of mentors, along with suggestions for enhancing it
> and any requests for clarification, so that I can improve the proposal.
>
> [1]
> http://ltrc.iiit.ac.in/icon2015/icon2014_proceedings/papers/File71-p164.pdf
> [2] https://github.com/Devadath/Malayalam_Sandhi_Splitter
>
> Thanks.
> --
> Jerin Philip