[smc-discuss] Free and open Malayalam dictionary dataset

Manilal K M libregeek at gmail.com
Tue May 21 22:43:19 PDT 2013


---------- Forwarded message ----------
From: Kailash Nadh <kailash.nadh at gmail.com>
To: smc-discuss at googlegroups.com
Cc:
Date: Wed, 22 May 2013 10:18:16 +0530
Subject: Free and open Malayalam dictionary dataset
 Hello all,
I've just been able to publish the semanticised version of Datuk's original
ASCII Malayalam-Malayalam dictionary digitisation work.
=> http://olam.in/open/datuk

"The Datuk Corpus" is a human-readable, parse-ready, Unicode dictionary
dataset with over 83,000 Malayalam words and over 106,000 definitions. It's
been in development for over two years. The dataset is an evolution of
Datuk's original work, and has undergone extensive refinement, correction,
and structuring, amounting to tens of thousands of changes. The GitHub
repository for the project contains the full text corpus, an SQL dump, and
a couple of Python scripts for parsing and conversion.

This is the same dataset that powers Olam's Malayalam-Malayalam dictionary
that went live two days ago. Also, Datuk's original work constitutes a
substantial portion of the Malayalam Wiktionary.


Sample entries from the dataset:

ച	ചക്രാംഗി	സം. -അംഗീ	_	36953
	നാ.	അരയന്നപ്പിട
	നാ.	ചക്രവാകപ്പിട
	നാ.	മഞ്ചട്ടി
	നാ.	കക്കടകശൃംഗി

പ	പരോക്ഷം	_	_	57697
	നാ.	മറവ്
	നാ.	പരോക്ഷജ്ഞാനം
	നാ.	പ്രത്യക്ഷമല്ലാത്തത്
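
For anyone who wants to poke at the text corpus, here is a rough Python
sketch of how entries shaped like the samples above could be read. It is
not one of the scripts from the repository. The field names (letter, word,
notes, id) and the reading of "_" as an empty placeholder are my own
guesses from the two samples, and parse_entries is a hypothetical helper.
It assumes a headword line starts in column one and definition lines are
indented, and it tolerates a run of spaces where a tab is expected.

```python
import re

# Sample text copied from the post; "\t" marks the tab separators.
SAMPLE = (
    "ച\tചക്രാംഗി\tസം. -അംഗീ\t_   36953\n"
    "\tനാ.\tഅരയന്നപ്പിട\n"
    "\tനാ.\tചക്രവാകപ്പിട\n"
    "\tനാ.\tമഞ്ചട്ടി\n"
    "\tനാ.\tകക്കടകശൃംഗി\n"
    "\n"
    "പ\tപരോക്ഷം\t_\t_\t57697\n"
    "\tനാ.\tമറവ്\n"
    "\tനാ.\tപരോക്ഷജ്ഞാനം\n"
    "\tനാ.\tപ്രത്യക്ഷമല്ലാത്തത്\n"
)


def parse_entries(text):
    """Parse headword lines and their indented definition lines.

    Field names are assumptions made for illustration only.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # blank lines separate entries
        if line[0].isspace():
            # Indented line: "<tag> <definition>" for the current headword.
            if not entries:
                continue  # definition with no headword; ignore it
            parts = line.strip().split(None, 1)
            tag, definition = parts if len(parts) == 2 else (None, parts[0])
            entries[-1]["definitions"].append((tag, definition))
        else:
            # Headword line: fields split on tabs or runs of 2+ spaces, so a
            # single space inside a field (e.g. "സം. -അംഗീ") is preserved.
            fields = re.split(r"\t+| {2,}", line.strip())
            if len(fields) < 3 or not fields[-1].isdigit():
                raise ValueError("unexpected headword line: %r" % line)
            entries.append({
                "letter": fields[0],
                "word": fields[1],
                # Middle columns, with "_" assumed to mean "empty".
                "notes": [f for f in fields[2:-1] if f != "_"],
                "id": int(fields[-1]),
                "definitions": [],
            })
    return entries


if __name__ == "__main__":
    for entry in parse_entries(SAMPLE):
        print(entry["id"], entry["word"], len(entry["definitions"]))
```

Each entry comes back as a plain dict, so it is easy to dump to JSON or
load into a database from there.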


The dataset is licensed under the
ODbL <http://opendatacommons.org/licenses/odbl/>,
a choice inspired by the OpenStreetMap project.

Hope this is all useful.

Thanks

Kailash




-- 
Manilal K M | മണിലാല്‍ കെ എം.
http://libregeek.blogspot.com