[smc-discuss] [task #9305] Convert English-Malayalam dictionary to TEI XML format

Sunil K INVALID.NOREPLY at gnu.org
Tue Jan 25 14:14:37 PST 2011


Follow-up Comment #2, task #9305 (project smc):

Dear Santhosh,

Thanks for the feedback. Thanks to it, I now have a better version of the TEI
file (the header info still needs a lot of modification).

1. I am not able to validate using the method described at
http://sourceforge.net/apps/mediawiki/freedict/index.php?title=FreeDict_HOWTO_-_12
	I am using Debian (the TEI DTDs are not available from the repository).

However, I was able to check the file with the command given at
http://wiki.tei-c.org/index.php/TEI-to-DICT_howto

xmllint --noout your-dictionary.tei
	
The other test I am not able to run because of this error:
		xmllint --noout --valid eng-mal.tei 
		eng-mal.tei:2: validity error : Validation failed: no DTD found !
		<TEI xmlns="http://www.tei-c.org/ns/1.0">


The conversion described at that link also works fine:

"
Transform the TEI file into something appropriate to be fed into dictfmt by
entering 

	xsltproc -o dictionary.c5 -novalid --stringparam current-date \
	"$(date)" ../CVS/tools/xsl/tei2c5.xsl dictionary.tei 

Transform the intermediate file into the database and index files by entering
one (or a combination) of the following: 

	dictfmt -t --utf8 my_dictionary < dictionary.c5 

	dictfmt -t --headword-separator %%% --utf8 my_dictionary < dictionary.c5 

	dictfmt -t -s <short_descriptive_name> --utf8 my_dictionary < dictionary.c5

"

The resulting dictionary file works fine in GoldenDict.


2. The header is not modified yet. For now it is just a copy of the one from
eng-fra. I will modify it soon, but I don't know what contents to put in it.

3. Whitespace removed.
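
On the validation point in item 1: a plain `xmllint --noout` run only checks
that the file is well-formed XML, while `--valid` needs a DTD, which is why it
fails when none is found. The following is a minimal Python sketch of the same
well-formedness check, using only the standard library. The function name
`check_well_formed` is made up for illustration, and this does not validate
against any TEI schema.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return (True, "") if the text parses as XML, else (False, message).

    This is a well-formedness check only (hypothetical helper); it says
    nothing about whether the document is valid TEI.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser message includes the line and column of the problem.
        return False, str(err)
    return True, ""


if __name__ == "__main__":
    sample = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text/></TEI>'
    print(check_well_formed(sample))
```

For a real file, the contents could be read in binary mode and passed to the
same function, since `ET.fromstring` also accepts bytes.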



One method described in http://wiki.tei-c.org/index.php/TEI-to-DICT_howto
(given above) works for converting from TEI to dict.
 Note: there are a few split lines in the dict. I removed quite a few of them,
but this may sometimes lose part of the information (the Malayalam meaning). I
will look into it once again.
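
Instead of deleting split lines, they could perhaps be joined back onto the
definition they belong to. The sketch below is only a guess at how that might
look. It assumes that definitions are numbered lines such as "   1. meaning",
that headwords are plain ASCII (English), and that any other line starting
with a non-ASCII character is the tail of the previous line. The names
`is_continuation` and `join_split_lines` are hypothetical, and the real dict
file may need different rules.

```python
import re

# Assumption: a numbered definition line looks like "   1. meaning".
NUMBERED = re.compile(r"^\s*\d+\.\s")


def is_continuation(line):
    """Guess whether a line continues the previous definition.

    Heuristic (an assumption, not a rule of the dict format): a non-empty
    line that is not numbered and does not start with an ASCII character
    is treated as the rest of a split Malayalam meaning.
    """
    stripped = line.strip()
    if not stripped or NUMBERED.match(line):
        return False
    return not stripped[0].isascii()


def join_split_lines(lines):
    """Append each continuation line to the line before it."""
    joined = []
    for line in lines:
        line = line.rstrip("\n")
        if joined and joined[-1].strip() and is_continuation(line):
            joined[-1] = joined[-1].rstrip() + " " + line.strip()
        else:
            joined.append(line)
    return joined
```

With this approach no text is dropped, so a wrong guess can still be
corrected by hand afterwards.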


I didn't use dict2tei.py because initially it didn't work for me. Even now
it does not give output that can be validated.

	The output of dict2tei.py is:

<form><orth>abbey</orth></form>
<def>       1. കന്യകാമഠം</def>
<def>       2. മഠത്തോട് ബന്ധപ്പെട്ട
ദേവാലയം</def>
<def>       3. സന്യാസിമഠം</def>
<form><orth>abbot</orth></form>
<def>       1. മഠാധിപതി</def>

whereas the entries in the eng-fra and eng-hin dictionaries have this form:
	 <entry>
            <form>
               <orth>ABC</orth>
               <pron>eibiːsiː</pron>
            </form>
            <sense n="1">
               <cit type="trans">
                  <quote>abc</quote>
               </cit>
               <cit type="trans">
                  <quote>alphabet</quote>
               </cit>
            </sense>
            <sense n="2">
               <cit type="trans">
                  <quote>abc</quote>
               </cit>
            </sense>
         </entry>
         <entry>
	
That is why I wrote the script (I know a little Perl, but not enough Python
even to debug dict2tei.py).

To run the script:
 perl tei2dict file.dict file.tei  (the file should not contain comments)
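
For comparison, the regrouping from the flat form/def output into nested
entries like those in eng-fra could be sketched in Python roughly as follows.
This is not the Perl script and not a fix for dict2tei.py, just an
illustration. The function name `flat_to_entries` is made up, the sketch
ignores the TEI namespace, and it assumes each numbered def becomes its own
sense with a single translation.

```python
import re
import xml.etree.ElementTree as ET


def flat_to_entries(flat_xml):
    """Group flat <form>/<def> siblings into nested <entry> elements.

    Assumption: every <form> starts a new entry and the <def> elements
    that follow it belong to that entry.
    """
    # Wrap the fragment in a dummy root so it parses as one document.
    root = ET.fromstring("<root>" + flat_xml + "</root>")
    entries = []
    entry = None
    for node in root:
        if node.tag == "form":
            entry = ET.Element("entry")
            form = ET.SubElement(entry, "form")
            ET.SubElement(form, "orth").text = node.findtext("orth", "").strip()
            entries.append(entry)
        elif node.tag == "def" and entry is not None:
            # Collapse split lines, then drop the leading "1. " numbering.
            text = " ".join((node.text or "").split())
            text = re.sub(r"^\d+\.\s*", "", text)
            n = len(entry.findall("sense")) + 1
            sense = ET.SubElement(entry, "sense", {"n": str(n)})
            cit = ET.SubElement(sense, "cit", {"type": "trans"})
            ET.SubElement(cit, "quote").text = text
    return entries


if __name__ == "__main__":
    flat = (
        "<form><orth>abbey</orth></form>\n"
        "<def>       1. first meaning</def>\n"
        "<def>       2. second\nmeaning</def>\n"
    )
    for e in flat_to_entries(flat):
        print(ET.tostring(e, encoding="unicode"))
```

Each resulting entry would still have to be placed inside the body of a
complete TEI file with a proper header before it can be validated.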



I still have to learn git. I will do it tomorrow.


Thanks & Best regards

Sunil


    _______________________________________________________

Reply to this item at:

  <http://savannah.nongnu.org/task/?9305>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.nongnu.org/



