[smc-discuss] [Git][smc/hyphenation][master] Improvements on Hindi hyphenation rules

Santhosh Thottingal gitlab at mg.gitlab.com
Fri Mar 11 21:20:45 PST 2016


Santhosh Thottingal pushed to branch master at SMC / Hyphenation


Commits:
55046f1d by Santhosh Thottingal at 2016-03-12T10:49:42+05:30
Improvements on Hindi hyphenation rules

1. ZWNJ - Avoid breaking on both sides, since ZWNJ does not make sense as a
  standalone block. Break only on the right side.
2. Make sure a break can happen on both sides of an independent vowel. The
  comment was correct, but the explicit left-side break rule was missing.
3. Simplify the combining mark rules for bindu etc. Explicitly define the
  no-break on the left side and leave the right side to contextual rules.
4. Simplify the virama rule - since vowel sign + virama never occurs in Hindi,
  a no-break rule on the right side alone is enough. Preceding consonants have
  no explicit right-side rule.

Copy these to Marathi too.

Thanks to Eric Muller (emuller at amazon.com) for the suggestions.
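
For readers unfamiliar with the pattern syntax used in the .dic files, the
following is a minimal Python sketch of how such patterns are commonly
interpreted: a digit sits between two characters, an odd value allows a break,
an even value forbids it, and the highest value at a position wins. This is
not the hyphenator's actual code. The function names are hypothetical, the
consonant patterns 1क and 1द are assumed to follow the same form as the 1ह
rule in the file, and word-boundary markers and minimum prefix/suffix lengths
are ignored.

```python
DIGITS = "0123456789"


def parse_pattern(pattern):
    """Split a pattern such as '1ई1' into its characters and the
    weights at each position around them (0 where no digit is given)."""
    letters, weights = [], [0]
    for ch in pattern:
        if ch in DIGITS:
            weights[-1] = int(ch)
        else:
            letters.append(ch)
            weights.append(0)
    return "".join(letters), weights


def break_weights(word, patterns):
    """Return the highest weight found at every position of the word."""
    weights = [0] * (len(word) + 1)
    for pattern in patterns:
        letters, pattern_weights = parse_pattern(pattern)
        start = word.find(letters)
        while start != -1:
            for offset, weight in enumerate(pattern_weights):
                pos = start + offset
                weights[pos] = max(weights[pos], weight)
            start = word.find(letters, start + 1)
    return weights


def break_points(word, patterns):
    """Return the inner positions where the winning weight is odd."""
    weights = break_weights(word, patterns)
    return [i for i in range(1, len(word)) if weights[i] % 2 == 1]


# Item 2: consonant followed by an independent vowel (क + ई).
kaii = "\u0915\u0908"
old_vowel_rules = ["1\u0915", "\u09081"]    # 1क, ई1
new_vowel_rules = ["1\u0915", "1\u09081"]   # 1क, 1ई1
print(break_points(kaii, old_vowel_rules))  # [] (no break before ई)
print(break_points(kaii, new_vowel_rules))  # [1] (break before ई)

# Item 3: anusvara forbids a break on its left only (ह ि ं द ी).
hindi = "\u0939\u093f\u0902\u0926\u0940"
rules = ["1\u0939", "\u093f1", "2\u0902", "1\u0926"]  # 1ह, ि1, 2ं, 1द
print(break_points(hindi, rules))           # [3], i.e. after the anusvara
```

In the second example the 2 before the anusvara overrides the 1 that the
dependent vowel rule places after ि, while the break after the anusvara comes
from the following consonant's rule rather than from the anusvara pattern
itself. A real hyphenator would also enforce minimum fragment lengths, so a
two-character word would normally not be split at all.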

- - - - -


2 changed files:

- hi_IN/hyph_hi_IN.dic
- mr_IN/hyph_mr_IN.dic


Changes:

=====================================
hi_IN/hyph_hi_IN.dic
=====================================
--- a/hi_IN/hyph_hi_IN.dic
+++ b/hi_IN/hyph_hi_IN.dic
@@ -24,23 +24,23 @@ UTF-8
 % GENERAL RULE
 % Do not break either side of ZERO-WIDTH JOINER  (U+200D)
 2‍2
-% Break on both sides of ZERO-WIDTH NON JOINER  (U+200C)
-1‌1
+% Break after ZERO-WIDTH NON JOINER  (U+200C)
+‌1
 % Break before or after any independent vowel.
-अ1
-आ1
-इ1
-ई1
-उ1
-ऊ1
-ऋ1
-ॠ1
-ऌ1
-ॡ1
-ए1
-ऐ1
-ओ1
-औ1
+1अ1
+1आ1
+1इ1
+1ई1
+1उ1
+1ऊ1
+1ऋ1
+1ॠ1
+1ऌ1
+1ॡ1
+1ए1
+1ऐ1
+1ओ1
+1औ1
 % Break after any dependent vowel but not before.
 ा1
 ि1
@@ -92,11 +92,11 @@ UTF-8
 1ह
 % Do not break before chandrabindu, anusvara, visarga, avagraha
 % and accents.
-2ँ1
-2ं1
-2ः1
-2ऽ1
-2॑1
-2॒1
+2ँ
+2ं
+2ः
+2ऽ
+2॑
+2॒
 % Do not break either side of virama (may be within conjunct).
-2्2
+्2


=====================================
mr_IN/hyph_mr_IN.dic
=====================================
--- a/mr_IN/hyph_mr_IN.dic
+++ b/mr_IN/hyph_mr_IN.dic
@@ -24,23 +24,23 @@ UTF-8
 % GENERAL RULE
 % Do not break either side of ZERO-WIDTH JOINER  (U+200D)
 2‍2
-% Break on both sides of ZERO-WIDTH NON JOINER  (U+200C)
-1‌1
+% Break after ZERO-WIDTH NON JOINER  (U+200C)
+‌1
 % Break before or after any independent vowel.
-अ1
-आ1
-इ1
-ई1
-उ1
-ऊ1
-ऋ1
-ॠ1
-ऌ1
-ॡ1
-ए1
-ऐ1
-ओ1
-औ1
+1अ1
+1आ1
+1इ1
+1ई1
+1उ1
+1ऊ1
+1ऋ1
+1ॠ1
+1ऌ1
+1ॡ1
+1ए1
+1ऐ1
+1ओ1
+1औ1
 % Break after any dependent vowel but not before.
 ा1
 ि1
@@ -92,11 +92,11 @@ UTF-8
 1ह
 % Do not break before chandrabindu, anusvara, visarga, avagraha
 % and accents.
-2ँ1
-2ं1
-2ः1
-2ऽ1
-2॑1
-2॒1
+2ँ
+2ं
+2ः
+2ऽ
+2॑
+2॒
 % Do not break either side of virama (may be within conjunct).
-2्2
+्2



View it on GitLab: https://gitlab.com/smc/hyphenation/commit/55046f1d2e983d640c3fe92f54cd6a1f22bd99bb