[Student-projects] Varnam can now stem

Santhosh Thottingal santhosh.thottingal at gmail.com
Sat Jun 28 23:14:47 PDT 2014


On Wednesday, June 25, 2014, Rajeesh K Nambiar <rajeeshknambiar at gmail.com>
wrote:

>
>> It looks like a reasonably good approach, provided the level 1 & 2 rules
> are robust.
> Santhosh, could you take a better look as well?
>

Two or three years back I tried this approach of suffix pattern matching and
conversion in Silpa[1]. It won't be 100% accurate to call it a
stemmer. For certain use cases it might work, but we cannot claim any
accuracy or correctness in academic terms. The linguistic rules of
Malayalam are not simple enough to capture with a 'replace with' operation.

A classic example is 'മകള്‍'. If we apply this algorithm and mark 'കള്‍'
as a plural suffix pattern to be removed to get the stem (as in
മുറികള്‍ => മുറി or പേനകള്‍ => പേന), we are obviously doing it wrong. മകള്‍ is
not a plural form in the first place, and മ is not its stem.
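
To make the failure concrete, here is a minimal sketch of such a suffix
stripper in Python. It is not the Silpa or Varnam code: naive_stem, RULES and
PLURAL_SUFFIX are hypothetical names, and the single rule is an assumption
made for illustration, not copied from stemmer_ml.rules. The words are written
as Unicode escapes so the suffix (ka + lla + virama + ZWJ) is unambiguous.

```python
# Hypothetical rule table mapping a suffix to its replacement.
# Illustration only; not taken from stemmer_ml.rules.
PLURAL_SUFFIX = "\u0d15\u0d33\u0d4d\u200d"  # the plural-looking suffix 'കള്‍'
RULES = {PLURAL_SUFFIX: ""}


def naive_stem(word, rules=RULES):
    """Apply the longest matching suffix rule; return the word unchanged if none match."""
    for suffix in sorted(rules, key=len, reverse=True):
        # Require something to be left over after stripping the suffix.
        if word.endswith(suffix) and len(word) > len(suffix):
            return word[: -len(suffix)] + rules[suffix]
    return word


MURI = "\u0d2e\u0d41\u0d31\u0d3f"  # മുറി
PENA = "\u0d2a\u0d47\u0d28"        # പേന
MA = "\u0d2e"                      # മ

# True plurals: the rule recovers the stem.
murikal_ok = naive_stem(MURI + PLURAL_SUFFIX) == MURI
penakal_ok = naive_stem(PENA + PLURAL_SUFFIX) == PENA

# 'മകള്‍' only looks like a plural: the rule still fires and returns 'മ',
# which is not its stem.
false_stem = naive_stem(MA + PLURAL_SUFFIX)
```

The rule cannot tell the two cases apart because it only sees the surface
string; telling them apart needs a lexicon or an exception list.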

But if our use case is a predictive entry system, the above approach might
not be disastrous, and all these special cases can be tolerated. If I am not
mistaken, a false stem is acceptable in that use case.


[1]
https://github.com/diadara/silpa-stemmer/blob/master/indicstemmer/stemmer_ml.rules
 It was originally in the Silpa code; last year's GSoC split it into modules
and separate repos.

Santhosh