[smc-discuss] Free and open Malayalam dictionary dataset

Anivar Aravind anivar.aravind at gmail.com
Wed May 22 06:41:15 PDT 2013


On Wed, May 22, 2013 at 11:29 AM, Kailash Nadh <kailash.nadh at gmail.com> wrote:

>  Hello all,
> I've just been able to publish the semanticised version of Datuk's original
> ASCII Malayalam-Malayalam dictionary digitisation work.
> => http://olam.in/open/datuk
>
> "The Datuk Corpus" is a human readable, parse-ready, Unicode dictionary
> dataset with over 83,000 Malayalam words and over 106,000 definitions. It's
> been in development for over two years. The dataset is an evolution of
> Datuk's original work, and has undergone extensive refinement, corrections,
> and structuring, amounting to tens of thousands of changes. The GitHub
> repository for the project contains the full text corpus, an SQL dump, and
> a couple of Python scripts for parsing and conversion.
>
> This is the same dataset that powers Olam's Malayalam-Malayalam dictionary
> that went live two days ago. Also, Datuk's original work constitutes a
> substantial portion of the Malayalam Wiktionary.
>
>
> Sample entries from the dataset:
>
> ച	ചക്രാംഗി	സം. -അംഗീ	_   36953
> 	നാ.	അരയന്നപ്പിട
> 	നാ.	ചക്രവാകപ്പിട
> 	നാ.	മഞ്ചട്ടി
> 	നാ.	കക്കടകശൃംഗി
>
> പ	പരോക്ഷം	_	_	57697
> 	നാ.	മറവ്
> 	നാ.	പരോക്ഷജ്ഞാനം
> 	നാ.	പ്രത്യക്ഷമല്ലാത്തത്
>
>
> The dataset is licensed under the ODbL <http://opendatacommons.org/licenses/odbl/>,
> inspired by the OpenStreetMap project.
>
> Hope this is all useful.
>
> Thanks
>

Great work, Kailash :-) This is indeed a great release. When publicly funded
projects are wasting money creating unreleased datasets (like this one:
http://tools.malayalam.kerala.gov.in/), it is very heartening to see this
structured dataset released. I hope you will periodically update the release
with new contributions.

Now we need people to package this for dictd and to integrate it with Silpa's
Jabberbot.
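
For whoever picks up the packaging, the first step would probably be reading
the corpus into headword/definition pairs. Here is a rough sketch of that step
only, assuming the layout visible in Kailash's two sample entries: a header
line whose second tab-separated field is the headword, followed by
tab-indented lines holding a tag and a definition. The names parse_entries and
SAMPLE are made up for illustration, and the real corpus may differ; the
parsing scripts in the project repository are the authoritative reference.

```python
# Hypothetical sketch: the field layout is inferred from the two sample
# entries in the announcement, not from the corpus documentation.

# One sample entry, written with explicit tab separators (an assumption).
SAMPLE = (
    "പ\tപരോക്ഷം\t_\t_\t57697\n"
    "\tനാ.\tമറവ്\n"
    "\tനാ.\tപരോക്ഷജ്ഞാനം\n"
    "\tനാ.\tപ്രത്യക്ഷമല്ലാത്തത്\n"
)


def parse_entries(lines):
    """Group definition lines under the headword that precedes them.

    Assumed layout:
      header line:      <letter> TAB <headword> TAB ... <numeric id>
      definition line:  TAB <tag> TAB <definition>
    Returns a dict mapping headword -> list of (tag, definition) tuples.
    """
    entries = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # blank lines only separate entries
        if line[0].isspace():
            # Indented line: a definition for the current headword.
            if current is None:
                continue  # definition with no header before it; skip
            fields = line.strip().split("\t", 1)
            if len(fields) == 2:
                tag, text = fields
            else:
                tag, text = "", fields[0]
            entries[current].append((tag.strip(), text.strip()))
        else:
            # Unindented line: a new header; the headword is field two.
            fields = line.split("\t")
            if len(fields) < 2:
                continue  # not a header in the assumed layout
            current = fields[1].strip()
            entries.setdefault(current, [])
    return entries
```

Calling parse_entries(SAMPLE.splitlines()) gives a dict with one headword and
its three definitions; writing the dictd index and dict files from that would
be a separate step.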

BTW, just thinking about another project: can anybody extend Artha
(http://artha.sourceforge.net/wiki/index.php/Artha:About), the best GTK
thesaurus application, to support the dictd format? As of now it only supports
WordNet, and there is no WordNet for Malayalam.


 ~ Regards
Anivar