[Student-projects] added datuk word corpus and ml_IN.dict to varnam

kiran ps pskirann at gmail.com
Thu Mar 13 08:27:49 PDT 2014


Currently there are 317169 words in the varnam word corpus. I have managed
to extract 78855 words from the Datuk word corpus published by olam. The
ml_IN.dict dictionary used by the spellchecker has around 142591 words.
Together, 116461 of these words are new to varnam. Some of the words, around
856, have brackets () or a colon : in them. I think they belong to Sanskrit,
so I put them into another file.
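
A minimal sketch of how the bracket and colon entries could be set aside,
assuming a plain list of words. The function name split_special is
hypothetical and this is not necessarily the script that was actually used:

```python
def split_special(words):
    """Split words into (clean, with_brackets, with_colon).

    Assumption: a word with a bracket goes to the bracket list even if it
    also contains a colon; blank entries are skipped.
    """
    clean, with_brackets, with_colon = [], [], []
    for word in words:
        word = word.strip()
        if not word:
            continue  # ignore empty lines
        if "(" in word or ")" in word:
            with_brackets.append(word)
        elif ":" in word:
            with_colon.append(word)
        else:
            clean.append(word)
    return clean, with_brackets, with_colon
```

The three lists can then be written to separate files, one word per line.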

no. of words in varnam     = 317169
no. of words in olam       = 78855
no. of words in ml_IN.dict = 142591
olam ∩ ml_dict             = 8583
new words to varnam        = 116461
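
The counts above are set operations on the three word lists. If each count is
of distinct words, olam and ml_IN.dict together hold 78855 + 142591 - 8583 =
212863 distinct words. A rough sketch of the calculation, assuming each source
is available as a collection of words (the function name corpus_stats and the
dictionary keys are hypothetical):

```python
def corpus_stats(varnam_words, olam_words, ml_dict_words):
    """Return overlap and new-word statistics for the three word lists."""
    varnam = set(varnam_words)
    olam = set(olam_words)
    ml_dict = set(ml_dict_words)

    overlap = olam & ml_dict        # words present in both olam and ml_IN.dict
    candidates = olam | ml_dict     # every distinct word from the two sources
    new_words = candidates - varnam # words varnam does not have yet

    return {
        "varnam": len(varnam),
        "olam": len(olam),
        "ml_dict": len(ml_dict),
        "overlap": len(overlap),
        "new": len(new_words),
        "new_words": new_words,
    }
```

Using sets keeps each lookup cheap, so the comparison stays fast even with a
few hundred thousand words.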

The varnam corpus is based mainly on material collected from pages on the
World Wide Web. Using the synchronization tool, we can upload words from
offline IMEs to the online repository more easily. I think the data we have
collected needs to be reviewed; by doing so we can create a better corpus.
The corpus will help track and record the very latest developments in the
language today. By analyzing the corpus with special software, we can see
words in context and find out how new words and senses are emerging, as well
as spot other trends in usage, spelling and so on. The corpus will also help
to create a better dictionary. The spellchecker that we are currently using
has only 150000 words, while varnam has 400000+. The corpus will be helpful
to almost every project that we have.

*Attachments*

newtovarnam <https://drive.google.com/file/d/0B-5aQGt-4wv7VVhkS0VoNkZ4UkE/edit?usp=sharing>
- words new to varnam
varnam <https://drive.google.com/file/d/0B-5aQGt-4wv7T2VWcUZlZURVXzQ/edit?usp=sharing>
- varnam word corpus
datukextracted <https://drive.google.com/file/d/0B-5aQGt-4wv7QS1vMWxKV25UZDg/edit?usp=sharing>
- words extracted from datuk
datuk brackets <https://drive.google.com/file/d/0B-5aQGt-4wv7d2JkUGk0NWhfS2s/edit?usp=sharing>
- words having brackets
datuk colon <https://drive.google.com/file/d/0B-5aQGt-4wv7cjhMZUNydF93dW8/edit?usp=sharing>
- words having a colon
datuk <https://drive.google.com/file/d/0B-5aQGt-4wv7OTcwdFk4M1RmNTg/edit?usp=sharing>
- datuk corpus
ml_IN.dict <https://drive.google.com/file/d/0B-5aQGt-4wv7TVZfSEpxY0toQ3M/edit?usp=sharing>
- Malayalam dictionary used in spellcheckers