<html>

  <head>

    <meta content="text/html; charset=ISO-8859-1"

      http-equiv="Content-Type">

  </head>

  <body text="#000000" bgcolor="#FFFFFF">

    <div class="moz-cite-prefix">Thanks @Anivar :)<br>

      <br>

      Yes, the dataset will be updated periodically. Since it's on

      GitHub, anyone can contribute to the main branch any time. If

      anyone's interested in contributor/moderator access to the

      repository, please let me know.<br>

      <br>

      In addition, I'm also working on making Olam's English-Malayalam

      dataset public.<br>

      <br>

      Kailash<br>

      <br>

      On 22/05/2013 7:11 PM, Anivar Aravind wrote:<br>

    </div>

    <blockquote

cite="mid:CA+nuCJbuV6z+=AsW0+RQjDwboZQt-0AKDFqVC8-p2XfYUMC1Tg@mail.gmail.com"

      type="cite">

      <div dir="ltr"><br>

        <div class="gmail_extra"><br>

          <br>

          <div class="gmail_quote">On Wed, May 22, 2013 at 11:29 AM,

            Kailash Nadh <span dir="ltr">&lt;<a moz-do-not-send="true"
                href="mailto:kailash.nadh@gmail.com" target="_blank">kailash.nadh@gmail.com</a>&gt;</span>

            wrote:<br>

            <blockquote class="gmail_quote" style="margin:0px 0px 0px

              0.8ex;border-left:1px solid

              rgb(204,204,204);padding-left:1ex">

              <div text="#000000" bgcolor="#FFFFFF">

                <div lang="x-western"> Hello all,<br>

                  I've just been able to publish the semanticised version

                  of Datuk's original ASCII Malayalam-Malayalam

                  dictionary digitisation work.<br>

                  => <a moz-do-not-send="true"

                    href="http://olam.in/open/datuk" target="_blank">http://olam.in/open/datuk</a><br>

                  <br>

                  "The Datuk Corpus" is a human readable, parse-ready,

                  Unicode dictionary dataset with over 83,000 Malayalam

                  words and over 106,000 definitions. It's been in

                  development for over two years. The dataset is an

                  evolution of Datuk's original work, and has undergone

                  extensive refinement, corrections, and structuring,

                  amounting to tens of thousands of changes. The GitHub

                  repository for the project contains the full text

                  corpus, an SQL dump, and a couple of Python scripts for

                  parsing and conversion.<br>

                  <br>

                  This is the same dataset that powers Olam's

                  Malayalam-Malayalam dictionary that went live two days

                  ago. Also, Datuk's original work constitutes a

                  substantial portion of the Malayalam Wiktionary.<br>

                  <br>

                  <br>

                  Sample entries from the dataset:<br>

                  <pre>ച    ചക്രാംഗി        സം. -അംഗീ   _   36953

        നാ. അരയന്നപ്പിട

        നാ. ചക്രവാകപ്പിട

        നാ. മഞ്ചട്ടി

        നാ. കക്കടകശൃംഗി</pre>

                  <pre>പ    പരോക്ഷം        _       _       57697

        നാ. മറവ്

        നാ. പരോക്ഷജ്ഞാനം

        നാ. പ്രത്യക്ഷമല്ലാത്തത്</pre>

                  <br>
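                  As a rough illustration of the layout, here is a
                  minimal sketch of a parser for entries like the two
                  above. It is not one of the scripts in the repository.
                  The field names (letter, word, origin, extra, id) are
                  hypothetical, and the sketch assumes that header
                  fields are separated by a tab or by runs of spaces and
                  that "_" marks an empty field; both are guesses based
                  only on these samples.<br>

```python
import re

# Assumption: header fields are separated by a tab or by two or more
# whitespace characters; a single space may occur inside a field.
FIELD_SEPARATOR = re.compile(r"\s{2,}|\t")


def parse_entries(text):
    """Parse entries shaped like the two samples into a list of dicts."""
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines between entries
        if line[0].isspace():
            # Indented line: a definition with a leading tag, e.g. "നാ. മറവ്".
            if not entries:
                raise ValueError("definition before any header line")
            tag, _, definition = line.strip().partition(" ")
            entries[-1]["definitions"].append((tag, definition))
        else:
            fields = FIELD_SEPARATOR.split(line.strip())
            if len(fields) != 5:
                raise ValueError("unexpected header layout: " + line)
            # Hypothetical field names; the fourth field's meaning is unknown.
            letter, word, origin, extra, entry_id = fields
            entries.append({
                "letter": letter,
                "word": word,
                # "_" appears to mark an empty field (assumption).
                "origin": None if origin == "_" else origin,
                "extra": None if extra == "_" else extra,
                "id": int(entry_id),
                "definitions": [],
            })
    return entries
```

                  Run on the first sample, this gives the word ചക്രാംഗി
                  with id 36953 and four definitions, each tagged നാ.<br>
                  <br>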

                  The dataset is licensed under the <a

                    moz-do-not-send="true"

                    href="http://opendatacommons.org/licenses/odbl/"

                    target="_blank">ODbL</a>, inspired by the
                  OpenStreetMap project.<br>

                  <br>

                  Hope this is all useful.<br>

                  <br>

                  Thanks<span class=""></span></div>

              </div>

            </blockquote>

            <div><br>

            </div>

            <div>Great work, Kailash :-) This is indeed a great release.
              When publicly funded projects are wasting money on
              creating unreleased datasets (like this one: <a
                moz-do-not-send="true"
                href="http://tools.malayalam.kerala.gov.in/">http://tools.malayalam.kerala.gov.in/</a>),
              it is very heartening to see this structured dataset
              release. I hope you will periodically update the release
              with new contributions.<br>

              <br>

            </div>

            <div>Now we need people to package this for dictd and
              integrate it with Silpa's Jabberbot.<br>

            </div>

            <div><br>

            </div>

            <div>BTW, just thinking about another project: can anybody
              extend Artha (<a moz-do-not-send="true"
                href="http://artha.sourceforge.net/wiki/index.php/Artha:About">http://artha.sourceforge.net/wiki/index.php/Artha:About</a>),
              the best GTK thesaurus application, to support the dictd
              format? As of now it only supports WordNet, and there is no
              WordNet for Malayalam.<br>

              <br>

            </div>

            <div><br>

            </div>

            <div> ~ Regards<br>

            </div>

            <div>Anivar<br>

            </div>

          </div>

        </div>

      </div>

      <br>

      <fieldset class="mimeAttachmentHeader"></fieldset>

      <br>

      <pre wrap="">_______________________________________________

Swathanthra Malayalam Computing discuss Mailing List

Project: <a class="moz-txt-link-freetext" href="https://savannah.nongnu.org/projects/smc">https://savannah.nongnu.org/projects/smc</a>

Web: <a class="moz-txt-link-freetext" href="http://smc.org.in">http://smc.org.in</a> | IRC : #smc-project @ freenode

<a class="moz-txt-link-abbreviated" href="mailto:discuss@lists.smc.org.in">discuss@lists.smc.org.in</a>

<a class="moz-txt-link-freetext" href="http://lists.smc.org.in/listinfo.cgi/discuss-smc.org.in">http://lists.smc.org.in/listinfo.cgi/discuss-smc.org.in</a>


</pre>

    </blockquote>

    <br>

  </body>

</html>