[smc-discuss] [task #9305] Convert English-Malayalam dictionary to TEI XML format
Sunil K
INVALID.NOREPLY at gnu.org
Tue Jan 25 14:14:37 PST 2011
Follow-up Comment #2, task #9305 (project smc):
Dear Santhosh,
Thanks for the feedback. Because of it I now have a better version of the TEI file
(the header info still needs a lot of modification).
1. I am not able to validate using the method described at
http://sourceforge.net/apps/mediawiki/freedict/index.php?title=FreeDict_HOWTO_-_12
I am using Debian (the TEI DTDs are not available from the repository).
But I checked the file using the command from http://wiki.tei-c.org/index.php/TEI-to-DICT_howto
xmllint --noout your-dictionary.tei
The other test I am not able to do because of an error:
xmllint --noout --valid eng-mal.tei
eng-mal.tei:2: validity error : Validation failed: no DTD found !
<TEI xmlns="http://www.tei-c.org/ns/1.0">
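For context: `xmllint --noout` only checks that the file is well-formed XML, while `--valid` additionally needs a DTD, which is why the second command fails here. A minimal Python sketch of the same well-formedness check follows. The function name `is_well_formed` and the sample strings are assumptions for illustration, not part of the task's tooling, and it does not validate against the TEI schema.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return (True, None) if xml_text parses as XML, else (False, error message).

    This is only a well-formedness check; it does NOT validate against a DTD.
    """
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as err:
        return False, str(err)
    return True, None

# Hypothetical sample inputs for illustration.
good = '<TEI xmlns="http://www.tei-c.org/ns/1.0"><text><body/></text></TEI>'
bad = '<TEI><text></TEI>'  # <text> is never closed

print(is_well_formed(good))
print(is_well_formed(bad)[0])
```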
Also, the conversion given in that link is working fine:
"
Transform the TEI file into something appropriate to be fed into dictfmt by
entering
xsltproc -o dictionary.c5 -novalid --stringparam current-date $(date) ../CVS/tools/xsl/tei2c5.xsl dictionary.tei
Transform the intermediate file into the database and index files by entering
one (or a combination) of the following:
dictfmt -t --utf8 my_dictionary < dictionary.c5
dictfmt -t --headword-separator %%% --utf8 my_dictionary < dictionary.c5
dictfmt -t -s <short_descriptive_name> --utf8 my_dictionary < dictionary.c5
"
And the resulting dictionary file is working fine in GoldenDict.
2. The header is not modified yet. For now it is just a copy from eng-fra. I will
modify it soon, but I don't know what contents to put in it.
3. Whitespace removed.
One method described in http://wiki.tei-c.org/index.php/TEI-to-DICT_howto is
working (given above) for converting from TEI to dict.
Note: there are a few split lines in the dict. I removed quite a few of them,
but this may sometimes lead to a missing part of the info (the Malayalam meaning). I
will look into it once again.
I didn't use dict2tei.py because initially it didn't work for me. Even now
it is not giving output that can be validated.
The output of dict2tei.py is
<form><orth>abbey</orth></form>
<def> 1. കന്യകാമഠം</def>
<def> 2. മഠത്തോട് ബന്ധപ്പെട്ട
ദേവാലയം</def>
<def> 3. സന്യാസിമഠം</def>
<form><orth>abbot</orth></form>
<def> 1. മഠാധിപതി</def>
but the format used in the eng-fra and eng-hin dicts is of the form
<entry>
<form>
<orth>ABC</orth>
<pron>eibiːsiː</pron>
</form>
<sense n="1">
<cit type="trans">
<quote>abc</quote>
</cit>
<cit type="trans">
<quote>alphabet</quote>
</cit>
</sense>
<sense n="2">
<cit type="trans">
<quote>abc</quote>
</cit>
</sense>
</entry>
<entry>
So I wrote the script because of that (I know a little bit of Perl but not enough Python
even to debug).
To run the script:
perl tei2dict file.dict file.tei (the file should not have comments)
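The restructuring described above (flat form/def pairs turned into entry/sense/cit elements) could look roughly like the Python sketch below. This is not the Perl script itself and not dict2tei.py. The helper names `build_entry` and `strip_numbering` are hypothetical, and mapping each numbered meaning to its own sense with a single cit is an assumption; the eng-fra sample also groups several cit elements under one sense.

```python
import re
import xml.etree.ElementTree as ET

def strip_numbering(text):
    """Join a meaning that was split over lines and drop a leading '1. ' style number."""
    joined = " ".join(text.split())  # collapses newlines and repeated spaces
    return re.sub(r"^\d+\.\s*", "", joined)

def build_entry(headword, meanings):
    """Build an <entry> in the shape used by the eng-fra sample (without <pron>).

    Assumption: every meaning becomes one <sense n="..."> holding one <cit type="trans">.
    """
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword
    for n, meaning in enumerate(meanings, start=1):
        sense = ET.SubElement(entry, "sense", n=str(n))
        cit = ET.SubElement(sense, "cit", type="trans")
        ET.SubElement(cit, "quote").text = meaning
    return entry

# Meanings as they appear in the dict2tei.py output, including one split line.
raw = [" 1. കന്യകാമഠം", " 2. മഠത്തോട് ബന്ധപ്പെട്ട\nദേവാലയം", " 3. സന്യാസിമഠം"]
entry = build_entry("abbey", [strip_numbering(m) for m in raw])
print(ET.tostring(entry, encoding="unicode"))
```

The split-line handling here only covers a meaning whose pieces are already in one string; finding which lines of the source dict belong together is a separate problem.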
I still have to learn git. I will do it tomorrow.
Thanks & Best regards
Sunil
_______________________________________________________
Reply to this item at:
<http://savannah.nongnu.org/task/?9305>
_______________________________________________
Message sent via/by Savannah
http://savannah.nongnu.org/