[Student-projects] GSoC'14 Project: Heavily Enhanced Indic IME's, with a Preliminary Focus on Malayalam

SANDEEP KALYAN SUBRAMANIAN sandsub95 at berkeley.edu
Wed Mar 19 03:43:22 PDT 2014


Dear SMC GSoC Mentors,

My name is Sandeep Subramanian, and I am an undergraduate student at the
University of California, Berkeley. I am currently studying computer
science in addition to chemical biology and environmental & transportation
engineering. I have a very strong background in Indic scripts, being
personally able to read and write Tamil, Malayalam, Kannada, Tulu, Grantha,
Telugu, and Devanagari, among others (including Arabic, Urdu, Korean, and
Chinese).

I would like to work on developing an IME for Malayalam using an algorithm
based on character frequency and movement of hands, similar to the QWERTY
format. Here is an outline of my proposal and why I would really like to
pursue this project.

My proposal is a little late. I apologize for this, and I would greatly
appreciate any feedback on it. Please find my proposal below.


Problem:

The computer keyboard was not designed for languages other than English,
and adapting it to other languages, particularly for Indic scripts, has
proven exceedingly difficult. Existing input method editors (IME) for Indic
scripts typically follow three formats: 1) All of the vowel forms and
diacritics are on the left side and the consonant forms are on the right
side of the keyboard; 2) The user types using the Roman letters found on
the keyboard and the IME attempts to transliterate it into native
characters; 3) Common syllables are each represented by a character on the
keyboard, and combining them in certain ways produces other characters.

All three of these methods are problematic and not suited for Indic
scripts, particularly those such as Malayalam that have a large inventory
of consonants and vowels on top of special ligatures for specific
combinations, samyuktaksharas, and chillaksharas. In the first, the hands
need to alternate, which slows down the pace of typing considerably. This
layout is essentially equivalent to arranging the QWERTY keyboard as an
ABCDEF keyboard with the letters in order. This is not efficient for
typing. Vowels and consonants should also be interspersed on the keyboard,
as this more naturally reflects the optimal positions for the movement of
fingers while typing.

In the second, many Indic characters cannot be typed efficiently: Indian
languages have about twice as many vowels and many more consonants than
English does, and current methods of Latin-based typing fail to adequately
reflect the frequency of certain characters in Malayalam as opposed to
Latin. For example, the first word in "Thattathin Marayathu" (തട്ടത്തിൻ
മറയത്ത്) can be written in several different ways based on the Roman string
"Thattathin": തടതിൻ, ഥത്തത്ഥിണ്‍, ഥട്ടത്തിണ്‍, all of which are incorrect.
Roman transliteration is not only ambiguous but also not reflective of the
frequency of Malayalam characters. For example, ക്ക may appear more
frequently than ഗ, but it would be easier to write "ga" on the Roman
keyboard than "kka." This is especially true of the aspirated consonant ഭ
over ബ, and of the diphthong ഐ, which appears very frequently in words of
Tamil origin. Also, the location of the "ka" letter on the Roman keyboard
may not reflect its frequency in Malayalam, which means its optimal
position may be somewhere else. As such, the Roman keyboard layout is not
suited for typing Malayalam and is ambiguous.
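
To make the ambiguity concrete, here is a minimal Python sketch that
enumerates the Malayalam spellings a naive transliterator could produce
from one Roman string. The way "Thattathin" is split into chunks and the
candidate table are illustrative assumptions, not the rules of any existing
IME.

```python
from itertools import product

# Hypothetical candidate table: each Roman chunk maps to several plausible
# Malayalam renderings. The entries are illustrative only.
CANDIDATES = {
    "tha": ["ത", "ഥ"],
    "tta": ["ട്ട", "ത്ത", "ട"],
    "thi": ["തി", "ത്തി", "ഥി"],
    "n": ["ൻ", "ൺ", "ന്"],
}

def spellings(chunks):
    """Return every Malayalam string obtainable from the Roman chunks."""
    options = [CANDIDATES[chunk] for chunk in chunks]
    return ["".join(parts) for parts in product(*options)]

results = spellings(["tha", "tta", "thi", "n"])
# The intended word, built from one candidate per chunk.
intended = CANDIDATES["tha"][0] + CANDIDATES["tta"][0] + CANDIDATES["thi"][1] + CANDIDATES["n"][0]

print(len(results))         # 2 * 3 * 3 * 3 = 54 competing spellings
print(intended in results)  # True: the intended word is only one of them
```

Even with this small table, the user or the IME has to pick one spelling
out of dozens, which is the ambiguity described above.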

The third layout, based on character combinations (as in Chinese
radical-based typing), usually follows a fixed order that is not conducive
to efficient typing.

These problems contribute to the stark lack of Indic script material on the
web. There is a plethora of resources that could be put on the web, but
they simply are not there yet. Indian users often prefer to type native
languages using the Roman keyboard, which leads to ambiguity and inability
to communicate at length on digital media using Indian languages. This
poses a threat to the survival of Indian languages in the new digital age.


Planned Solution:

The QWERTY layout of the keyboard accounts for the movement of the hands
more closely than would an ABCDEF layout that is arranged by the order of
the English alphabet. Similarly, I would like to analyze the letter
frequency of Malayalam characters in various settings: education, media,
literature, science, etc. Based on an evaluation of the frequency of
Malayalam syllables and diacritics that appear in printed Malayalam, these
frequencies can be mapped to frequencies on the QWERTY keyboard layout, or
any other efficient keyboard algorithm that exists for modeling typing
patterns.

One of the primary problems of typing Indic scripts is the sheer number of
characters present in comparison to the 26-character Roman script. This
should be tackled on a similar basis. Vowels should be prioritized based on
those that occur most frequently in the language, and those that occur less
frequently should be relegated to less frequented positions on the keyboard
or as capitalized letters. Consonants should be arranged based on their
frequency and occurrence in samyuktakshara combinations or chillaksharas
and viramas. All of these will be analyzed from Malayalam sources, such as
media, education, and literature. They will be processed, and their
frequencies will be stored in data files.
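
The counting step described above could look roughly like the following
Python sketch. It counts individual code points in the Unicode Malayalam
block (U+0D00 to U+0D7F) and writes them to a JSON file. The function
names, the sample text, and the output file name are placeholders of my
own; a real run would read text collected from the sources listed above.

```python
import json
import os
import tempfile
from collections import Counter

# Bounds of the Unicode Malayalam block.
MALAYALAM_START, MALAYALAM_END = 0x0D00, 0x0D7F

def malayalam_frequencies(text):
    """Count each code point in the Malayalam block, ignoring everything else."""
    return Counter(ch for ch in text if MALAYALAM_START <= ord(ch) <= MALAYALAM_END)

def save_frequencies(counts, path):
    """Write the counts to a JSON file, most frequent character first."""
    ordered = dict(counts.most_common())
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(ordered, handle, ensure_ascii=False, indent=2)

# Placeholder input; punctuation, spaces, and Latin letters are skipped.
sample = "\u0D24\u0D1F\u0D4D\u0D1F\u0D24\u0D4D\u0D24\u0D3F\u0D7B (Thattathin)"
counts = malayalam_frequencies(sample)
output_path = os.path.join(tempfile.gettempdir(), "malayalam_freq.json")
save_frequencies(counts, output_path)
print(counts.most_common(3))
```

Counting code points treats vowel signs and the virama as separate
symbols, which suits a key-per-symbol layout; counting whole syllables
would need an extra segmentation step.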

As developing a simulator to model hand movement on the keyboard is quite
difficult to do without a large sample size, the QWERTY keyboard format and
corresponding letter frequencies for each key position will be assumed and
will be applied to the eventual Malayalam IME developed.
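
As a rough illustration of this assumption, the Python sketch below pairs
the most frequent Malayalam characters with the keys of the most frequent
English letters. The English frequency ordering is a commonly cited one and
should be treated as an assumption, the Malayalam counts are made-up
placeholders, and the function name is hypothetical.

```python
# English letters from most to least frequent (a commonly cited ordering;
# the exact order is an assumption and varies by corpus).
ENGLISH_BY_FREQUENCY = "etaoinshrdlcumwfgypbvkjxqz"

def assign_keys(malayalam_counts):
    """Pair the i-th most frequent Malayalam character with the key of the
    i-th most frequent English letter. Characters beyond the 26th are left
    unassigned here; they would go to a shifted layer."""
    ranked = sorted(malayalam_counts, key=malayalam_counts.get, reverse=True)
    return dict(zip(ENGLISH_BY_FREQUENCY, ranked))

# Placeholder counts, not measured data.
example_counts = {"ക": 50, "അ": 30, "ന": 80}
layout = assign_keys(example_counts)
print(layout)  # the most frequent character lands on the "e" key
```

This rank-for-rank pairing only inherits whatever efficiency QWERTY has for
English; the simulator is what would show whether that carries over.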

A simulator will be developed to estimate the typing efficiency of
different keyboard methods, using the efficiency of typing on QWERTY as the
baseline. It will be based on simulator algorithms already available on the
web and on the measured letter frequencies.
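
A very small version of such a simulator is sketched below in Python. It
scores a layout by how far each finger has to travel from its home key,
which is only one of many possible cost models and is my own simplifying
assumption. Row stagger, hand alternation, and shifted keys are ignored,
and the two example layouts are hypothetical.

```python
from math import hypot

# Three letter rows of a QWERTY keyboard; row stagger is ignored.
ROWS = ["qwertyuiop", "asdfghjkl;", "zxcvbnm,./"]
HOME_ROW = 1
# Home column of the finger assumed to cover each column (touch typing).
HOME_COLUMN = [0, 1, 2, 3, 3, 6, 6, 7, 8, 9]

POSITION = {key: (r, c) for r, row in enumerate(ROWS) for c, key in enumerate(row)}

def key_cost(key):
    """Distance a finger travels from its home key to reach `key`."""
    row, col = POSITION[key]
    return hypot(row - HOME_ROW, col - HOME_COLUMN[col])

def typing_cost(text, layout):
    """Sum the key costs of every character in `text` that `layout` maps to a key."""
    return sum(key_cost(layout[ch]) for ch in text if ch in layout)

# Hypothetical layouts placing the same two characters on different keys.
layout_a = {"\u0D15": "f", "\u0D24": "j"}  # both on home keys
layout_b = {"\u0D15": "q", "\u0D24": "p"}  # both on the top row
text = "\u0D15\u0D24\u0D15"
print(typing_cost(text, layout_a), typing_cost(text, layout_b))  # 0.0 3.0
```

Running the same text through several candidate layouts and comparing the
totals gives a first, crude ranking before any testing with real typists.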

Based on the simulator and the measured letter frequencies, IMEs will be
developed and tested to make Malayalam easier to type.

Some features will need to be handled manually: easy entry of double
consonants (kka, tta, nna, etc.); chillaksharas, which exist only for a
small set of consonants; and the choice of which letters are treated as
capitalized, along with their relationship to the lowercase form. The
algorithm used will need to account for the occurrence of capitalized
letters as well. These features will deviate from the ideal keyboard to
account for how people interpret letters and their associations, and they
will need to be tested.
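
Two of these manual rules can be sketched in Python as follows. The code
points are the standard Unicode ones for the virama and the atomic chillu
letters, but the function names and the idea of exposing them as separate
IME actions are assumptions made for illustration.

```python
VIRAMA = "\u0D4D"  # Malayalam sign virama (chandrakkala)

# Consonants with a commonly used atomic chillu form in Unicode.
CHILLU = {
    "\u0D23": "\u0D7A",  # NNA -> chillu NN
    "\u0D28": "\u0D7B",  # NA  -> chillu N
    "\u0D30": "\u0D7C",  # RA  -> chillu RR
    "\u0D32": "\u0D7D",  # LA  -> chillu L
    "\u0D33": "\u0D7E",  # LLA -> chillu LL
}

def double_consonant(consonant):
    """Build a doubled consonant such as kka: consonant + virama + consonant."""
    return consonant + VIRAMA + consonant

def chillu(consonant):
    """Return the chillu form of a consonant, or None if it has none here."""
    return CHILLU.get(consonant)

KA, NA = "\u0D15", "\u0D28"
print(double_consonant(KA))  # kka
print(chillu(NA))            # chillu N
print(chillu(KA))            # None: no chillu entry for KA in this table
```

An IME could bind `double_consonant` to a modifier or a repeated key press,
so that frequent conjuncts cost no more keystrokes than rare single letters.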

Ultimately, a very Malayalam-specific keyboard/IME should be developed by
the end of this process that is suitable for typing Malayalam as used in
media, education, literature, and other digital forms widely found today.
It should promote the use of Malayalam script in Malayalam typed on the
Internet, and the approach should be readily applicable to other languages
as well.


General Implementation Schedule (Tentative):

Weeks 1 & 2: (May 19 - June 1): Exploring the algorithms associated with
quick typing and collecting data on a representative set of Malayalam
sources to process (media, education, literature). I also plan to review
the Python, Java, and XML code specifically required to process Malayalam
characters. Deliverable: Code that can process a website's characters and
store them in a data file for easy graphical representation. I also plan to
present the QWERTY format and its arrangement of keys according to letter
frequency. Systems of typing in other languages will also be analyzed
(continuing into Week 3).

Weeks 3 & 4: (June 2 - June 15): Processing of Malayalam characters, and
storage in data file. Analysis and graphical representation. Algorithm
developed to process two-letter combinations and three-letter combinations,
as appear in English; apply to QWERTY, and begin collecting data for such
in Malayalam. QWERTY analysis and typing algorithms should be thoroughly
analyzed and complete by Week 4.

Weeks 5 & 6: (June 16 - June 29): Analysis of two-letter and three-letter
combination for syllables in Malayalam; displayed in graphical form.
Development of Malayalam typing simulator to show the time and ease
required to type in Malayalam, using existing keyboards. Analysis of
Malayalam character frequency should be completed by Week 6.

Weeks 7 & 8: (June 30 - July 13): Development of the actual IME. Ensuring
accurate processing of Unicode characters. Refining of typing simulator
based on mentor feedback and tests on existing keyboards. Simulator should
be finalized by Week 8.

Weeks 9 & 10: (July 14 - July 27): Running the simulator on the newly
developed IME and comparing it to existing keyboards. Refining based on
results. Show results in tabular form for comparison.

Weeks 11 & 12: (July 28 - August 10): Continue refining IME and running on
simulator. IME should be finalized by end of Week 12. Collaboration with
other language processing coders possible, and possible testing in
Malayalam community to observe reaction.

Week 13: (August 11 - August 18): Account for any possible backlog; final
submission and tweaking; outreach to members of the community for testing
of the new keyboard with feedback; closer collaboration with other GSoC
members on projects concerning language processing (specifically those that
are designing software to help predict Malayalam characters - similar to
AutoCorrect). Also possible collaboration with people designing an Android
layout.
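
The two-letter and three-letter combination counts planned for Weeks 3
through 6 could start from something like the Python sketch below. It
counts n-grams of code points inside each whitespace-separated word, which
is a simplification: it does not group code points into syllables, and the
function name is my own.

```python
from collections import Counter

def ngram_counts(text, n):
    """Count runs of n consecutive code points within each word of `text`."""
    counts = Counter()
    for word in text.split():
        for start in range(len(word) - n + 1):
            counts[word[start:start + n]] += 1
    return counts

# Works the same way for English and Malayalam input.
english = ngram_counts("the then other", 2)
malayalam = ngram_counts("\u0D15\u0D4D\u0D15 \u0D15\u0D4D\u0D15", 3)  # kka twice

print(english.most_common(2))
print(malayalam.most_common(1))
```

Frequent pairs and triples found this way indicate which keys should sit
close together or on alternating hands in a candidate layout.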


Impact:

Typing in Indic languages should be intuitive and second-nature. It should
not have to go through another script or language (Roman and English) to
achieve the same effect. This is like being able to speak a language only
by translating it from another one first, or like having to breathe air
that has already been processed by someone else.

Currently, people in India, especially youth, type in Indian languages
using the Roman script because it's more readily available. If this
keyboard proves to be efficient, Malayalam typing can be used on any
digital device with ease, and typing in Malayalam will be just as easy as
typing in English. This ensures non-ambiguity of language and propagates
use of the language in the digital age.

It will also prove to be a very useful tool for digitization of Malayalam
texts. If Malayalam can be typed easily, more people will have the
incentive and the means to contribute to the Malayalam web and researchers
will be able to convert Malayalam palm-leaf manuscripts to a
word-searchable form much quicker.

This project is also quite compatible with other projects recommended for
Google Summer of Code '14. Word processing in Malayalam can be used to
optimize the keyboard, and Android implementation of the keyboard will
broaden its impact.

The code that is developed as a result of this will revolutionize the
structuring of keyboards for other Indic scripts, and the algorithm can be
easily implemented to develop efficient typing keyboards in other
languages. The code will not have to be duplicated; this project can serve
as a base for further language-specific keyboard and IME development in
other languages.

As electronic equipment becomes more accessible to rural India, the
challenge of needing to learn the English alphabet for use of a computer
and for use of the web is an obstacle that needs to be overcome for
effective transmission and development of digital media in India. Not only
can more language-specific materials be developed, but also more people can
participate in cyberspace if they are taught with a keyboard specific
to Malayalam - not based on methods that were designed for another
language or script pattern.


If you have any further questions regarding my proposal, I would be happy
to discuss them; please feel free to send me an email at
sandsub95 at berkeley.edu. Thank you so much for considering my project; I am
very grateful for the opportunity to share my ideas with your organization.
Looking forward to your feedback and response.

Best regards,
Sandeep Subramanian
സന്ദീപ്‌ സുബ്രമണ്യൻ
University of California, Berkeley '17