[Student-projects] Varnam can now stem

Navaneeth K N nkn at riseup.net
Sun Jun 29 19:55:36 PDT 2014


Hello,

On Sunday 29 June 2014 11:40 PM, aboobacker sidheeque mk wrote:
> On Sun, Jun 29, 2014 at 8:52 PM, Kevin Martin <youcancallmekevin at gmail.com>
> wrote:
> 
>>
>>
>>
>> On Sun, Jun 29, 2014 at 6:53 PM, Vasudev Kamath <kamathvasudev at gmail.com>
>> wrote:
>>
>>>
>>> Off topic, not related to this discussion.
>>>
>>> aboobacker sidheeque mk <aboobackervyd at gmail.com> writes:
>>>
>>>> (Translated from Malayalam) We discussed this in chat yesterday, so I
>>>> thought I would post it to the mailing list too :-)
>>>
>>> Can you please translate this? I would suggest you refrain from
>>> writing comments or replies in Malayalam; there are mentors on this list
>>> who don't understand Malayalam.
>>>
>>
>>>>
>>>> Take two similar words, ചിരിക്കുക and ഇരിക്കുക. If you stem these, the
>>>> output will be ചിര and ഇര, but the past tenses of these words are ചിരിച്ചു
>>>> and ഇരുന്നു respectively. Then how do you use this stem for prediction?
>>>> ുന്നു is not suitable for ചിര and ിച്ചു is not suitable for ഇര. In
>>>> Malayalam, verbs alone have ~30 different suffix patterns (or paradigms).
>>>>
>>>
>> I thought about what you said yesterday. Strictly speaking, the goal of
>> the stemmer is not to find the past tense. But it is true that if
>> ചിരിക്കുക stems to ചിര then it wouldn't benefit varnam at all.
>>
>>>> Similar case with nouns:
>>>> തിരുവനന്തപുരം -> തിരുവനന്തപുരത്ത്
>>>> മരം->മരത്തില്‍ (not മരത്തില്‍ )
>>>
>>>
>> I did not understand the example about മരം->മരത്തില്‍. I do not think any
>> stemmer can stem nouns properly, as nouns can have foreign roots.
>> When tested on this article [1], the stemmer reaches an accuracy of
>> 89%. However, that figure comes from leaving words alone when no stemming
>> is needed rather than from stemming correctly where it is needed. But I
>> noted that Malayalam nouns are usually (not always) stemmed correctly,
>> e.g. കോഴിക്കോട്ടെ : കോഴിക്കോട്
>>
> My question was not limited to the stemmer :-). You have to use this stem in
> the prediction list by adding suffixes, and at that point you can't add ില്‍ to
> തിരുവനന്തപുരം or ത്ത് to മരം. My question was how you are going to handle this
> diversity :-).

This won't be a problem as far as varnam is concerned, because varnam
doesn't do predictions this way. It has different tokenizers built in.
When it knows only part of a word, it can tokenize that part, treat the
remainder as a separate word, and combine the results. So the current
implementation (not the stemmer, but the varnam tokenizer) can easily
predict തിരുവനന്തപുരത്ത് if it knows തിരുവനന്തപുരം.
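
To make the idea concrete, here is a rough Python sketch of "match the known
part, tokenize the rest, combine the results". It is not varnam's actual code
or API. The table contents, the romanized keys and the function names are all
made up for illustration, and it assumes that learning a word also stores its
shorter prefixes.

```python
# Hypothetical data for illustration only, not varnam's real tables.
# Assumption: learning a word also stores its shorter prefixes.
LEARNED_PATTERNS = {
    "thiruvananthapuram": "തിരുവനന്തപുരം",
    "thiruvananthapura": "തിരുവനന്തപുര",
}

# Assumption: a tiny symbol table used for the part no learned pattern covers.
SYMBOLS = {"thth": "ത്ത്", "th": "ത്"}


def tokenize_known_prefix(text):
    """Return (output, rest) for the longest learned prefix of text."""
    for end in range(len(text), 0, -1):
        output = LEARNED_PATTERNS.get(text[:end])
        if output is not None:
            return output, text[end:]
    return "", text


def tokenize_symbols(text):
    """Greedy longest match over SYMBOLS; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for end in range(len(text), i, -1):
            symbol = SYMBOLS.get(text[i:end])
            if symbol is not None:
                out.append(symbol)
                i = end
                break
        else:
            # No symbol starts here, so keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def predict(text):
    """Combine the known part with the separately tokenized remainder."""
    known, rest = tokenize_known_prefix(text)
    return known + tokenize_symbols(rest)


print(predict("thiruvananthapurathth"))
```

Here the known part comes from the learned table and only the trailing
"thth" goes through the symbol table, so no per-noun suffix rule is needed.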



> 
>>
>> [1] ml.wikipedia.org/wiki/തച്ചോളി_ഒതേനൻo
>>
>>>
>>> --l
>>>
>>> Vasudev Kamath
>>> http://copyninja.info
>>> Connect on ~friendica: copyninja at samsargika.copyninja.info
>>> IRC nick: copyninja | vasudev {irc.oftc.net | irc.freenode.net}
>>> GPG Key: C517 C25D E408 759D 98A4  C96B 6C8F 74AE 8770 0B7E
>>>
>>> _______________________________________________
>>> Student-projects mailing list
>>> Student-projects at lists.smc.org.in
>>> http://lists.smc.org.in/listinfo.cgi/student-projects-smc.org.in
>>>
>>>
>>

-- 
Cheers,
Navaneeth
