[Student-projects] Varnam can now stem

Sun Jun 29 00:25:07 PDT 2014

>
>
> A classic example is 'മകള്‍', if we attempt this algorithm and mark 'കള്‍'
> as a plural suffix pattern to be removed to get the stem(as in
> മുറികള്‍=>മുറി or പേന - പേനകള്‍ ), we are obviously doing wrong. മകള്‍ is
> not a plural form first of all and മ is not its stem.
>

Exceptions such as these are more common with shorter Malayalam words. If
കള്‍ is treated as a suffix, then the word after stemming will be just
മ. That is, the result contains just one syllable. If the algorithm is
modified to stem only when original_word - suffix is greater than 2
syllables, I guess a lot of the error cases can be handled. However, this
would be more of a hack than a proper approach.
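
A minimal Python sketch of the guard described above (not varnam's actual
code). The syllable counter is a rough heuristic of my own: it counts letters
that are not silenced by a following virama. The function names and the
min_stem_syllables parameter are hypothetical, and the default of 2 is an
assumption chosen so that മുറികള്‍ still stems to മുറി; tune it as needed.

```python
import unicodedata

VIRAMA = "\u0d4d"  # Malayalam sign virama
ZWJ = "\u200d"     # zero-width joiner, used in old-style chillu encoding

# The suffix കള്‍ written as ka + lla + virama + ZWJ (assumed encoding).
PLURAL_SUFFIX = "\u0d15\u0d33" + VIRAMA + ZWJ


def rough_syllable_count(text):
    """Heuristic: count letters that are not followed by a virama."""
    count = 0
    for i, ch in enumerate(text):
        if unicodedata.category(ch) != "Lo":
            continue  # skip vowel signs, the virama itself, and joiners
        if i + 1 < len(text) and text[i + 1] == VIRAMA:
            continue  # consonant with its vowel suppressed
        count += 1
    return count


def strip_suffix_guarded(word, suffix=PLURAL_SUFFIX, min_stem_syllables=2):
    """Strip the suffix only if the remaining stem is long enough."""
    if not word.endswith(suffix):
        return word
    stem = word[: -len(suffix)]
    if rough_syllable_count(stem) < min_stem_syllables:
        return word  # stem too short, leave the word untouched
    return stem


makal = "\u0d2e" + PLURAL_SUFFIX                      # മകള്‍
murikal = "\u0d2e\u0d41\u0d31\u0d3f" + PLURAL_SUFFIX  # മുറികള്‍
penakal = "\u0d2a\u0d47\u0d28" + PLURAL_SUFFIX        # പേനകള്‍

for w in (makal, murikal, penakal):
    print(w, "=>", strip_suffix_guarded(w))
```

With these settings മകള്‍ is left alone because the stem would be a single
syllable, while the other two words lose the suffix.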

I went through the SILPA stemmer rules and they seem more robust than mine
(I'm still improving my rules). I would like to know how one can test
the accuracy of the stemmer. The obvious way is to supply small chunks of
text and count the number of misses. I just want to make sure that no other
'easy' ways exist before I sit down and start counting.
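
One way the counting could be automated, sketched in Python: keep a
hand-labelled list of (word, expected stem) pairs and let a script report the
misses. Everything here is an assumption for illustration: evaluate_stemmer
and toy_stemmer are hypothetical names, and the words are romanized
placeholders rather than real test data.

```python
def evaluate_stemmer(stem, gold_pairs):
    """Return (accuracy, misses) for a stemmer over (word, expected) pairs."""
    misses = []
    for word, expected in gold_pairs:
        actual = stem(word)
        if actual != expected:
            misses.append((word, expected, actual))
    if not gold_pairs:
        return 0.0, misses
    accuracy = 1 - len(misses) / len(gold_pairs)
    return accuracy, misses


def toy_stemmer(word):
    """Placeholder stemmer: blindly strips a trailing 'kal'."""
    return word[:-3] if word.endswith("kal") else word


# Romanized placeholders standing in for a hand-labelled word list.
gold = [
    ("murikal", "muri"),
    ("penakal", "pena"),
    ("makal", "makal"),  # not a plural, so it should be left alone
]

accuracy, misses = evaluate_stemmer(toy_stemmer, gold)
print("accuracy:", accuracy)
for word, expected, actual in misses:
    print("miss:", word, "expected", expected, "got", actual)
```

The labelled list still has to be built by hand once, but after that every
change to the rules can be re-scored without recounting.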

Also, this paper [1] claims a Malayalam stemmer (called STHREE) with better
accuracy than SILPA. The difference between my approach and theirs is that I
do not limit the number of passes to three (though cases where more passes
are needed are rare). Apart from minor changes, all three (mine, SILPA and
STHREE) use the same suffix stripping/modifying approach. So the only area
where STHREE could possibly improve over SILPA is in the stem rules and
handling of exceptions. If these exceptions can be specified somehow in
varnam's language scheme file, I think it can improve accuracy to a great
extent.
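
A rough Python sketch of suffix stripping with an unbounded number of passes
plus an exception list. This is not varnam's, SILPA's or STHREE's code, and it
says nothing about the scheme file syntax. stem_multipass, the rule tuples and
the romanized words are all hypothetical placeholders.

```python
def stem_multipass(word, rules, exceptions=frozenset(), max_passes=None):
    """Apply (suffix, replacement) rules repeatedly until nothing changes.

    max_passes=None means no limit on passes; words listed in exceptions
    are never stemmed further.
    """
    seen = {word}  # guards against rule sets that loop forever
    passes = 0
    while max_passes is None or passes < max_passes:
        if word in exceptions:
            break
        candidate = None
        for suffix, replacement in rules:
            # Never strip an empty suffix or the whole word.
            if suffix and word.endswith(suffix) and len(word) > len(suffix):
                candidate = word[: -len(suffix)] + replacement
                break
        if candidate is None or candidate in seen:
            break
        seen.add(candidate)
        word = candidate
        passes += 1
    return word


# Romanized placeholder rules and exceptions, for illustration only.
rules = [("ilude", ""), ("kal", "")]
exceptions = frozenset({"makal"})

print(stem_multipass("murikalilude", rules, exceptions))
print(stem_multipass("makalilude", rules, exceptions))
print(stem_multipass("murikalilude", rules, exceptions, max_passes=1))
```

The first call needs two passes, the second stops as soon as it reaches the
exception, and the third shows what a pass limit would cut off.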

> But if our use case is a predictive entry system, the above approach might
> not be disastrous and all these special cases can be tolerable. If I am not
> mistaken a false stem is acceptable in that use case.
>

Since varnam will be using the stemmer to improve its predictions, a
reasonable amount of error can be tolerated. Stemming മകള്‍ to മ (assuming
കള്‍ represented the plural form) will not lead to any problems because
varnam already knows the token മ. However, an example of an unacceptable
stem result (as per my current stem rules) would be the stemming of
സുഹൃുത്തിലൂടെ to സുഹൃ്. I hope this can be handled soon.

Yesterday Abobacker pointed me to a morphological analyzer [2] that he is
using. He demonstrated it by supplying the word മരിചു and then predicting
മരിക്കുക from it. Though it is more like reverse stemming, varnam could
really benefit from such an approach.

[1]
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6731640&queryText%3Dmalayalam+stemmer
[2] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/