[smc-discuss] Updated GLibc Collation rules for Malayalam

Sun Dec 11 01:59:31 PST 2011

Hi,

A few months back, the glibc locale for Malayalam was updated to use
the ₹ currency symbol [1]. Now I am planning to update the collation
rules for Malayalam to cover the latest Unicode characters and make
them compatible with the CLDR collation definitions as far as possible.

The collation rules we have in glibc for Malayalam are documented in my
blog [2] and available as a PDF on our website [3]. The following
changes are made in the new version:

1. Add the 5 atomic chillu characters to the collation rules, and define
their collation weights.
       1.a) A chillu character (atomic or non-atomic) is primary
equivalent to its dead consonant form. That is, the collation weights
of ന്‍ / ൻ and ന് are the same: the combined collation weights of
consonant + virama (chandrakkala).
       1.b) Atomic and non-atomic chillus are primary equivalent, but
their second-level collation weights are not the same, so an
application can still break ties while sorting. That is, atomic and
non-atomic chillus are first-level equivalent and second-level different.
       1.c) In a non-atomic chillu, the ZWJ has zero collation weight
and is therefore ignorable.
2. ൗ and ൌ were already primary equivalent in glibc. I added a tertiary
differentiator for tie-breaking while sorting, which makes ൗ appear
first in sorted data.
3. DOT Repha, <0D4E>, is primary equivalent to the RA + VIRAMA sequence.
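
To make the chillu rules in item 1 concrete, here is a minimal Python
sketch of a two-level sort key. The weights and the ELEMENTS table are
invented for illustration only; they are not the values or the syntax
used in iso14651_t1_common, and only the letter NA is covered.

```python
# Toy two-level collation keys for the NA chillu forms.
# All weights are hypothetical and chosen only to mirror rules 1.a to 1.c.
ZWJ, VIRAMA, NA, CHILLU_N = "\u200d", "\u0d4d", "\u0d28", "\u0d7b"

# character -> list of (primary, secondary) collation elements
ELEMENTS = {
    NA: [(20, 1)],
    VIRAMA: [(10, 1)],
    ZWJ: [],                       # 1.c: ignorable, contributes no weights
    CHILLU_N: [(20, 1), (10, 2)],  # 1.a/1.b: same primaries as NA + VIRAMA,
                                   # but a different secondary weight
}

def sort_key(text):
    """Return (primary weights, secondary weights) for a string."""
    elems = [ce for ch in text for ce in ELEMENTS[ch]]
    primary = tuple(p for p, _ in elems)
    secondary = tuple(s for _, s in elems)
    return (primary, secondary)

dead = NA + VIRAMA              # dead consonant
non_atomic = NA + VIRAMA + ZWJ  # pre-5.1 chillu
atomic = CHILLU_N               # atomic chillu

# All three forms share the same primary weights ...
assert sort_key(dead)[0] == sort_key(non_atomic)[0] == sort_key(atomic)[0]
# ... but the atomic form differs at the second level.
assert sort_key(atomic)[1] != sort_key(non_atomic)[1]
```

Comparing the primary tuples first and the secondary tuples only on a
tie is what lets the three forms sort next to each other while still
having a stable order among themselves.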

If you look at the latest CLDR definition for Malayalam [4], these are
the rules that exist there:

1. Archaic and modern AU signs differ only at the tertiary level. - The
primary equivalence exists in glibc already, but the tertiary
difference was not defined; I have added it now. ൗ and ൌ will be sorted
adjacent to each other, with ൗ having a lower collation weight than ൌ.
2. Anuswara is primary equal to MA_dead. - This rule is already present
in glibc. മ് = ം
3. Pre-5.1 chillus are secondary equal to 5.1 chillus. Chillus are
primary equal to their consonant_dead form. - We have added this now.
4. /nta/ is sorted as <NA, Virama, RRA>. - Unless one writes ന്റ using
<0D7B, 0D4D, 0D31>, i.e. CHILLU N + VIRAMA + RRA, this rule is present
in glibc already.
5. Avagraha and Visarga are primary ignorables. - I did not understand
why they are ignorable; at the same time, if they are not ignorable,
I do not know what their collation position should be. Not implemented
in glibc.

The major difference between CLDR and glibc collation for Malayalam
as of now is in the sort order of CONSONANT and CONSONANT + VIRAMA.
Glibc sorts the dead consonant, i.e. CONSONANT + VIRAMA, before
CONSONANT: ക് comes first, ക second. The logic follows from the
definition: ക is ka with the inherent vowel a, and ക് is the vowelless
consonant ka.

So in glibc, ക് , ക്‍, ൿ[atomic] , ക , കാ , കി, കു്[Samvruthokaram] ,
കു , കൂ, കൗ, കൌ  is the sorted list, while in CLDR it is ക ,
കു്[Samvruthokaram] ,  കാ , കി,  കു , കൂ, കൗ, കൌ , ക് , ക്‍, ൿ[atomic]
If anybody wants to experiment with the CLDR collation rules, use the
ICU Locale Explorer [5].
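
The difference between the two orderings can be reproduced with a small
Python sketch. The rank tables below are hypothetical stand-ins that
only encode the relative position of the virama described above; they
are not taken from the glibc or CLDR data, and the sketch handles only
a consonant followed by at most one sign.

```python
# Toy comparison of where CONSONANT + VIRAMA sorts relative to CONSONANT.
KA, VIRAMA, AA, I = "\u0d15", "\u0d4d", "\u0d3e", "\u0d3f"

# Rank of whatever follows the consonant ("" means the bare consonant).
GLIBC_LIKE = {VIRAMA: 0, "": 1, AA: 2, I: 3}  # dead consonant first
CLDR_LIKE = {"": 0, AA: 1, I: 2, VIRAMA: 3}   # dead consonant last

def order(syllables, ranks):
    """Sort syllables by consonant, then by the rank of the trailing sign."""
    return sorted(syllables, key=lambda s: (s[0], ranks[s[1:]]))

syllables = [KA + I, KA, KA + VIRAMA, KA + AA]

print(" ".join(order(syllables, GLIBC_LIKE)))
print(" ".join(order(syllables, CLDR_LIKE)))
```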

The updated collation definition file for glibc, iso14651_t1_common,
is placed in our git repository [6].

If anybody wants to try these rules on their computer, use the
following steps on Debian-based distros. (Please do not try this unless
you know what you are doing.)
1. Back up your existing collation file:
/usr/share/i18n/locales/iso14651_t1_common
2. Replace it with the new file.
3. Run sudo locale-gen. If you have not configured your computer for
the ml_IN locale, use dpkg-reconfigure locales before doing this step.
4. Create a plain text file with some Malayalam words you want to
test and sort it. You can use gedit -> Edit -> Sort or the sort command.
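
As an alternative for step 4, the sorting can also be checked from
Python, which uses the C library's collation through the standard
locale module. This is only a sketch: the helper name sort_with_locale
is my own, and the default locale name ml_IN.UTF-8 assumes that this
locale has been generated on your system (otherwise setlocale raises
locale.Error).

```python
import locale

def sort_with_locale(words, locale_name="ml_IN.UTF-8"):
    """Sort words using the collation rules of the given system locale."""
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm converts each word into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling sort_with_locale on the words read from your test file before
and after replacing the collation file should show whether the new
rules change the order.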

I have not filed the patch to glibc yet; I will do that after more
testing and once I get feedback.

References:

[1] http://sourceware.org/bugzilla/show_bug.cgi?id=12541
[2] http://thottingal.in/blog/2009/01/01/malayalam-locale/
[3] http://smc.org.in/doc/malayalam-collation.pdf
[4] http://unicode.org/cldr/trac/browser/trunk/common/collation/ml.xml
[5] http://demo.icu-project.org/icu-bin/locexp?_=ml&d_=en&x=col
[6] http://git.savannah.gnu.org/cgit/smc.git/tree/collation

Thanks
Santhosh Thottingal