[Student-projects] Varnam can now stem
Santhosh Thottingal
santhosh.thottingal at gmail.com
Sat Jun 28 23:14:47 PDT 2014
On Wednesday, June 25, 2014, Rajeesh K Nambiar <rajeeshknambiar at gmail.com>
wrote:
> It looks like a reasonably good approach, provided the level 1 & 2 rules
> are robust.
> Santhosh, could you take a better look as well?

Two or three years back I tried this approach of suffix pattern matching
and conversion in Silpa[1]. It would not be 100% accurate to call it a
stemmer. For certain use cases it might work, but we cannot claim any
accuracy or correctness in academic terms. The linguistic rules of
Malayalam are not simple enough to capture with a 'replace with' operation.
A classic example is 'മകള്'. If we apply this algorithm and mark 'കള്'
as a plural suffix pattern to be removed to get the stem (as in
മുറികള് => മുറി or പേനകള് => പേന), we get an obviously wrong result: മകള്
is not a plural form in the first place, and മ is not its stem.
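
To make the failure concrete, here is a minimal sketch of that kind of
suffix stripping. It is not the Silpa or Varnam code; the single rule and
the names PLURAL_SUFFIX and naive_stem are made up for illustration, and the
words are spelled exactly as in this mail.

```python
# Hypothetical single rule: treat 'കള്' as a plural suffix and drop it.
# This only illustrates blind suffix stripping; it is not a real rule set.
PLURAL_SUFFIX = "കള്"


def naive_stem(word: str) -> str:
    """Remove the plural suffix if the word ends with it, else return the word."""
    if word.endswith(PLURAL_SUFFIX) and len(word) > len(PLURAL_SUFFIX):
        return word[: -len(PLURAL_SUFFIX)]
    return word


if __name__ == "__main__":
    # The first two are genuine plurals; the third is a singular noun
    # that merely ends in the same characters.
    for stem in ("മുറി", "പേന", "മ"):
        word = stem + PLURAL_SUFFIX
        print(word, "=>", naive_stem(word))
```

The rule has no way to tell the third word from the first two, so it
strips the ending from all three. Avoiding that would need an exception
list or a dictionary check, which is more than a plain 'replace with'
rule can express.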
But if our use case is a predictive entry system, the above approach might
not be disastrous, and these special cases can be tolerated. If I am not
mistaken, a false stem is acceptable in that use case.
[1]
https://github.com/diadara/silpa-stemmer/blob/master/indicstemmer/stemmer_ml.rules
It was originally part of the Silpa code; last year's GSoC split it into
modules in separate repos.
Santhosh