[smc-discuss] Tool to identify Malayalam text which we can normalize

Santhosh Thottingal santhosh.thottingal at gmail.com
Wed Dec 9 21:11:08 PST 2015


On Thu, Dec 10, 2015 at 9:30 AM, Baiju Muthukadan <baiju at muthukadan.net>
wrote:

> Hi,
>
> I have a huge Malayalam text where vowel forms are encoded in different
> ways.
>


This should work. NFC normalization composes the split vowel signs into
their single code point form:

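#!/usr/local/bin/python
# -*- coding: utf-8 -*-
import unicodedata

# KA + VOWEL SIGN EE + VOWEL SIGN AA: the split encoding of "കോ".
# Escapes keep the sequence intact even if an editor normalizes the file.
unicode_string = u"\u0d15\u0d47\u0d3e"
print([unicodedata.name(c) for c in unicode_string])
# NFC composes the two vowel signs into the single VOWEL SIGN OO.
normalized = unicodedata.normalize('NFC', unicode_string)
print([unicodedata.name(c) for c in normalized])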

Output:

['MALAYALAM LETTER KA', 'MALAYALAM VOWEL SIGN EE', 'MALAYALAM VOWEL SIGN AA']
['MALAYALAM LETTER KA', 'MALAYALAM VOWEL SIGN OO']
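
To find which words in a large text are affected, the same comparison can
be applied word by word. This is only a minimal sketch: find_unnormalized
is a hypothetical helper name (not part of any library), and splitting on
whitespace is an assumption about what counts as a word in your text.

```python
# -*- coding: utf-8 -*-
import unicodedata


def find_unnormalized(text):
    """Return (index, word, nfc_word) for each whitespace-separated
    word that NFC normalization would change."""
    hits = []
    for index, word in enumerate(text.split()):
        nfc_word = unicodedata.normalize('NFC', word)
        if nfc_word != word:
            hits.append((index, word, nfc_word))
    return hits


# First word: KA + VOWEL SIGN EE + VOWEL SIGN AA (split form).
# Second word: KA + VOWEL SIGN OO (already composed).
sample = u"\u0d15\u0d47\u0d3e \u0d15\u0d4b"
for index, word, nfc_word in find_unnormalized(sample):
    print(index,
          [unicodedata.name(c) for c in word],
          [unicodedata.name(c) for c in nfc_word])
```

Only the first word of the sample should be reported, since the second is
already in composed form. For a real file you would read it as UTF-8 and
pass the decoded text to the function.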



-- 
Santhosh Thottingal
http://thottingal.in