[smc-discuss] [Git][smc/hyphenation][master] Improvements on Hindi hyphenation rules
Santhosh Thottingal
gitlab at mg.gitlab.com
Fri Mar 11 21:20:45 PST 2016
Santhosh Thottingal pushed to branch master at SMC / Hyphenation
Commits:
55046f1d by Santhosh Thottingal at 2016-03-12T10:49:42+05:30
Improvements on Hindi hyphenation rules
1. ZWNJ - Do not break on both sides, since ZWNJ does not make sense as a
standalone block. Break only on the right side.
2. Make sure a break can happen on both sides of an independent vowel. The
comment was correct, but the explicit left-side break rule was missing.
3. Simplify the combining mark rules for bindu etc. Explicitly define the
no-break on the left and leave the right side to contextual rules.
4. Simplify the virama rule. Since vowel sign + virama does not occur in Hindi,
a no-break rule on the right side is enough. Preceding consonants have no
explicit right-side rule.
Copy these changes to Marathi too.
Thanks to Eric Muller (emuller at amazon.com) for the suggestions.
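
For readers unfamiliar with the pattern notation used in the .dic files, the
following is a minimal sketch of how such patterns are commonly interpreted
(Liang-style: the digit in each gap between characters is a priority, the
highest digit wins, odd allows a break and even forbids it). It is not the
hyphenation library's actual code. The function names parse_pattern and
break_points are made up for illustration, and word-boundary markers and
minimum prefix/suffix lengths are deliberately ignored. The patterns are
taken from the diff in this commit.

```python
DIGITS = "0123456789"

def parse_pattern(pattern):
    """Split a pattern such as '1X1' into its characters and gap priorities."""
    chars = []
    gaps = [0]                      # gaps[i] is the priority before chars[i]
    for ch in pattern:
        if ch in DIGITS:
            gaps[-1] = int(ch)
        else:
            chars.append(ch)
            gaps.append(0)
    return "".join(chars), gaps

def break_points(word, patterns):
    """Return the indexes inside `word` where a break is allowed."""
    levels = [0] * (len(word) + 1)  # levels[i] is the gap before word[i]
    for chars, gaps in map(parse_pattern, patterns):
        start = word.find(chars)
        while start != -1:
            for offset, value in enumerate(gaps):
                pos = start + offset
                levels[pos] = max(levels[pos], value)
            start = word.find(chars, start + 1)
    # Odd priority means a break is allowed; never break at the word edges.
    return [i for i in range(1, len(word)) if levels[i] % 2 == 1]

# Patterns copied from the diff, written as escapes so that the combining
# marks stay unambiguous.
NEW_RULES = [
    "1\u09051",   # independent vowel A: break on both sides
    "1\u09061",   # independent vowel AA: break on both sides
    "1\u09081",   # independent vowel II: break on both sides
    "1\u0939",    # consonant HA: break before
    "\u093E1",    # vowel sign AA: break after
    "2\u0901",    # chandrabindu: no break before
    "2\u0902",    # anusvara: no break before
    "\u094D2",    # virama: no break after
]
OLD_II_RULE = ["\u09081"]           # before this commit: break after II only

aham = "\u0905\u0939\u0902"         # A + HA + anusvara
haan = "\u0939\u093E\u0901"         # HA + vowel sign AA + chandrabindu
kai = "\u0915\u0908"                # KA + independent vowel II

print(break_points(aham, NEW_RULES))    # [1]
print(break_points(haan, NEW_RULES))    # []
print(break_points(kai, OLD_II_RULE))   # []
print(break_points(kai, NEW_RULES))     # [1]
```

In the second word the "2" before chandrabindu overrides the "1" after the
vowel sign, so no break is produced. The last two lines illustrate point 2 of
the commit message: without the explicit left-side "1", no break is produced
between a consonant and a following independent vowel.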
- - - - -
2 changed files:
- hi_IN/hyph_hi_IN.dic
- mr_IN/hyph_mr_IN.dic
Changes:
=====================================
hi_IN/hyph_hi_IN.dic
=====================================
--- a/hi_IN/hyph_hi_IN.dic
+++ b/hi_IN/hyph_hi_IN.dic
@@ -24,23 +24,23 @@ UTF-8
% GENERAL RULE
% Do not break either side of ZERO-WIDTH JOINER (U+200D)
22
-% Break on both sides of ZERO-WIDTH NON JOINER (U+200C)
-11
+% Break after ZERO-WIDTH NON JOINER (U+200C)
+1
% Break before or after any independent vowel.
-अ1
-आ1
-इ1
-ई1
-उ1
-ऊ1
-ऋ1
-ॠ1
-ऌ1
-ॡ1
-ए1
-ऐ1
-ओ1
-औ1
+1अ1
+1आ1
+1इ1
+1ई1
+1उ1
+1ऊ1
+1ऋ1
+1ॠ1
+1ऌ1
+1ॡ1
+1ए1
+1ऐ1
+1ओ1
+1औ1
% Break after any dependent vowel but not before.
ा1
ि1
@@ -92,11 +92,11 @@ UTF-8
1ह
% Do not break before chandrabindu, anusvara, visarga, avagraha
% and accents.
-2ँ1
-2ं1
-2ः1
-2ऽ1
-2॑1
-2॒1
+2ँ
+2ं
+2ः
+2ऽ
+2॑
+2॒
% Do not break either side of virama (may be within conjunct).
-2्2
+्2
=====================================
mr_IN/hyph_mr_IN.dic
=====================================
--- a/mr_IN/hyph_mr_IN.dic
+++ b/mr_IN/hyph_mr_IN.dic
@@ -24,23 +24,23 @@ UTF-8
% GENERAL RULE
% Do not break either side of ZERO-WIDTH JOINER (U+200D)
22
-% Break on both sides of ZERO-WIDTH NON JOINER (U+200C)
-11
+% Break after ZERO-WIDTH NON JOINER (U+200C)
+1
% Break before or after any independent vowel.
-अ1
-आ1
-इ1
-ई1
-उ1
-ऊ1
-ऋ1
-ॠ1
-ऌ1
-ॡ1
-ए1
-ऐ1
-ओ1
-औ1
+1अ1
+1आ1
+1इ1
+1ई1
+1उ1
+1ऊ1
+1ऋ1
+1ॠ1
+1ऌ1
+1ॡ1
+1ए1
+1ऐ1
+1ओ1
+1औ1
% Break after any dependent vowel but not before.
ा1
ि1
@@ -92,11 +92,11 @@ UTF-8
1ह
% Do not break before chandrabindu, anusvara, visarga, avagraha
% and accents.
-2ँ1
-2ं1
-2ः1
-2ऽ1
-2॑1
-2॒1
+2ँ
+2ं
+2ः
+2ऽ
+2॑
+2॒
% Do not break either side of virama (may be within conjunct).
-2्2
+्2
View it on GitLab: https://gitlab.com/smc/hyphenation/commit/55046f1d2e983d640c3fe92f54cd6a1f22bd99bb