[Student-projects] Varnam can now stem

Sun Jun 29 04:41:33 PDT 2014

This is what we discussed in chat yesterday; I thought I would post it to
the mailing list as well :-)

Take two similar words, ചിരിക്കുക and ഇരിക്കുക. If you stem them, the
output will be ചിര and ഇര, but the past tenses of these words are ചിരിച്ചു
and ഇരുന്നു respectively. How, then, can the stem be used for prediction?
ുന്നു is not a suitable suffix for ചിര, and ിച്ചു is not suitable for ഇര.
In Malayalam, verbs alone have about 30 different suffix patterns (or
paradigms).
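
One way to picture the problem is a paradigm table: a stem is only useful
for prediction if it carries the name of the paradigm that says which
suffixes it takes. The sketch below is only an illustration. The paradigm
names, the table layout and the function `inflect` are all made up for this
mail; none of it is varnam's or SILPA's actual data model.

```python
# Hypothetical paradigm table: each paradigm maps a verb form to the
# suffix that a stem of that paradigm takes.
PARADIGMS = {
    "chirikkuka-type": {"infinitive": "ിക്കുക", "past": "ിച്ചു"},
    "irikkuka-type": {"infinitive": "ിക്കുക", "past": "ുന്നു"},
}

# Hypothetical lexicon: the stemmer would have to store the paradigm
# together with every stem it learns.
STEM_PARADIGM = {
    "ചിര": "chirikkuka-type",
    "ഇര": "irikkuka-type",
}


def inflect(stem, form):
    """Return stem + the suffix its paradigm prescribes, or None if unknown."""
    paradigm = STEM_PARADIGM.get(stem)
    if paradigm is None:
        return None
    suffix = PARADIGMS[paradigm].get(form)
    return None if suffix is None else stem + suffix


for known_stem in STEM_PARADIGM:
    print(known_stem, "->", inflect(known_stem, "past"))
```

Both stems share the same infinitive suffix, so the stem alone cannot tell
you which past-tense suffix to predict; the paradigm label is what carries
that information.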

Nouns show a similar problem:
തിരുവനന്തപുരം -> തിരുവനന്തപുരത്ത്
മരം->മരത്തില്‍ (not മരത്തില്‍ )

On Sun, Jun 29, 2014 at 12:55 PM, Kevin Martin <youcancallmekevin at gmail.com>
wrote:

>
>> A classic example is 'മകള്‍'. If we apply this algorithm and mark
>> 'കള്‍' as a plural suffix pattern to be removed to get the stem (as in
>> മുറികള്‍ => മുറി or പേനകള്‍ => പേന), we are obviously doing it wrong.
>> മകള്‍ is not a plural form in the first place, and മ is not its stem.
>>
>
> Exceptions such as these are more common with shorter Malayalam words. If
> കള്‍ is treated as a suffix, the word after stemming is just മ. That is,
> the result contains just one syllable. If the algorithm is modified to
> stem only when original_word-suffix is greater than 2 syllables, I guess a
> lot of error cases can be handled. However, this would be more of a hack
> than a proper approach.
>
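
A minimal sketch of the length guard described above, under some loud
assumptions: `count_syllables` is a crude approximation that counts
independent vowels and consonants not followed by a virama, and both
function names are invented here. The quoted mail suggests requiring more
than two syllables; `min_syllables=2` is only an illustrative default so
that the മുറികള്‍ example from this thread still gets stemmed.

```python
VIRAMA = "\u0d4d"  # Malayalam sign virama (chandrakkala)


def count_syllables(text):
    """Rough syllable count: vowels plus consonants that keep their vowel."""
    count = 0
    for i, ch in enumerate(text):
        is_vowel = "\u0d05" <= ch <= "\u0d14"
        is_consonant = "\u0d15" <= ch <= "\u0d39"
        followed_by_virama = text[i + 1 : i + 2] == VIRAMA
        if is_vowel or (is_consonant and not followed_by_virama):
            count += 1
    return count


def strip_suffix_guarded(word, suffix, min_syllables=2):
    """Strip suffix only if the remaining stem is long enough."""
    if not suffix or not word.endswith(suffix):
        return word
    stem = word[: len(word) - len(suffix)]
    if count_syllables(stem) < min_syllables:
        return word  # stem too short, probably not a real suffix here
    return stem


PLURAL = "കള്‍"
print(strip_suffix_guarded("മ" + PLURAL, PLURAL))     # left unchanged
print(strip_suffix_guarded("മുറി" + PLURAL, PLURAL))  # suffix stripped
```

The threshold is a parameter precisely because it is a heuristic; it would
need tuning against real text.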
> I went through the SILPA stemmer rules and they seem more robust than mine
> (I'm still improving the rules). I would like to know how one can test
> the accuracy of the stemmer. The obvious way is to supply small chunks of
> text and count the number of misses. I just want to make sure that no other
> 'easy' way exists before I sit down and start counting.
>
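
The counting can at least be automated once a hand-checked list of
(word, expected stem) pairs exists. The sketch below assumes such a list;
`evaluate`, `naive_plural_stemmer` and the three sample pairs are made up
for illustration and are not part of varnam or SILPA.

```python
def evaluate(stem_fn, gold_pairs):
    """Return (accuracy, misses) for stem_fn over (word, expected) pairs."""
    misses = []
    for word, expected in gold_pairs:
        got = stem_fn(word)
        if got != expected:
            misses.append((word, expected, got))
    total = len(gold_pairs)
    accuracy = (total - len(misses)) / total if total else 0.0
    return accuracy, misses


PLURAL = "കള്‍"


def naive_plural_stemmer(word):
    """Toy stemmer that always strips the plural suffix."""
    if word.endswith(PLURAL):
        return word[: len(word) - len(PLURAL)]
    return word


# Tiny hand-made gold list using the words from this thread.
gold = [
    ("മുറി" + PLURAL, "മുറി"),
    ("പേന" + PLURAL, "പേന"),
    ("മ" + PLURAL, "മ" + PLURAL),  # not a plural, must stay intact
]

accuracy, misses = evaluate(naive_plural_stemmer, gold)
print(accuracy)
for word, expected, got in misses:
    print("miss:", word, "expected", expected, "got", got)
```

Keeping the misses, not just the score, makes it easier to see which rules
or exceptions are missing.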
> Also, this paper [1] claims a Malayalam stemmer (called STHREE) with more
> accuracy than SILPA. The difference between my approach and theirs is that
> I do not limit the number of passes to three (though cases where more
> passes are needed are rare). Apart from minor changes, all three (mine,
> SILPA and STHREE) use the same suffix stripping/modifying approach. So the
> only area where STHREE could possibly improve over SILPA is in the stem
> rules and the handling of exceptions. If these exceptions can be specified
> somehow in varnam's language scheme file, I think it can improve accuracy
> to a great extent.
>
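
A rough sketch of suffix stripping/modifying with an exception list and no
fixed pass limit, as described above. Everything here is an assumption made
for illustration: the function `stem`, the rule format (suffix,
replacement), and the two sample rules. A plain Python set stands in for
whatever form exceptions would take in the scheme file.

```python
def stem(word, rules, exceptions=frozenset(), max_passes=None):
    """Apply the first matching (suffix, replacement) rule repeatedly.

    Stops when the word is an exception, no rule matches, a form repeats,
    or max_passes is reached (None means no limit).
    """
    seen = {word}
    passes = 0
    while max_passes is None or passes < max_passes:
        if word in exceptions:
            break
        for suffix, replacement in rules:
            if suffix and word.endswith(suffix) and len(word) > len(suffix):
                candidate = word[: len(word) - len(suffix)] + replacement
                break
        else:
            break  # no rule matched on this pass
        if candidate in seen:
            break  # guard against rules that cycle
        seen.add(candidate)
        word = candidate
        passes += 1
    return word


PLURAL = "കള്‍"
LOCATIVE = "ത്തില്‍"
RULES = [(PLURAL, ""), (LOCATIVE, "ം")]  # hypothetical sample rules
EXCEPTIONS = {"മ" + PLURAL}              # looks plural, but is not

print(stem("മ" + PLURAL, RULES, EXCEPTIONS))     # exception, unchanged
print(stem("മുറി" + PLURAL, RULES, EXCEPTIONS))  # plural suffix removed
print(stem("മര" + LOCATIVE, RULES, EXCEPTIONS))  # suffix replaced with ം
```

Passing `max_passes=3` would mimic a three-pass limit, which makes it easy
to compare the two behaviours on the same rule set.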
>
>> But if our use case is a predictive entry system, the above approach might
>> not be disastrous and all these special cases can be tolerated. If I am
>> not mistaken, a false stem is acceptable in that use case.
>>
>
> Since varnam will be using the stemmer to improve its predictions, a
> reasonable amount of error can be tolerated. Stemming മകള്‍ to മ (assuming
> കള്‍ represented the plural form) will not lead to any problems because
> varnam already knows the token മ. However, an example of an unacceptable
> stem result (as per my current stem rules) would be the stemming of
> സുഹൃുത്തിലൂടെ to സുഹൃ്. I hope this can be handled soon.
>
> Yesterday Aboobacker pointed me to a morphological analyzer [2] that he is
> using. He demonstrated it by supplying the word മരിചു and predicting
> മരിക്കുക from it. Though this is more like reverse stemming, varnam can
> really benefit from such an approach.
>
>
> [1]
> http://ieeexplore.ieee.org/xpl/articleDetails.jsp?tp=&arnumber=6731640&queryText%3Dmalayalam+stemmer
> [2] http://www.ling.helsinki.fi/kieliteknologia/tutkimus/hfst/

-- 
Aboobacker MK
GSoC Student
twitter.com/abvayad