[smc-discuss] Boxfile for tesseract and Malayalam letters with expanded spacing

Baiju M baiju.m.mail at gmail.com
Sun Oct 20 08:51:24 PDT 2013


It looks like Pango/HarfBuzz can do the basic character formation. The
other challenge is creating the sample character set.  Maybe I will
use Malayalam Wikipedia/Wikisource to build the basic set of characters
that need to be recognized.  I am not sure how large a character set or
TIF file tesseract supports.  It looks like the file is going to be
huge for Malayalam, which may affect the speed.  For the time being, I
will not worry about performance.
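
A rough sketch of how the character set could be collected from a text
dump (for example, text saved from Wikipedia/Wikisource). It is only an
approximation: it treats a consonant that directly follows a virama as
part of the same cluster, which may not match what a given font such as
Rachana actually renders as one glyph. The function names are made up
for illustration.

```python
import unicodedata
from collections import Counter

VIRAMA = "\u0d4d"               # MALAYALAM SIGN VIRAMA (chandrakkala)
JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def is_malayalam(ch):
    """True if ch lies in the Malayalam Unicode block."""
    return "\u0d00" <= ch <= "\u0d7f"


def clusters(text):
    """Split text into approximate Malayalam clusters.

    Assumption: a cluster is a base letter plus any following signs,
    and a letter that directly follows a virama continues the cluster.
    Everything outside the Malayalam block ends the current cluster.
    """
    out, current = [], ""
    for ch in text:
        if is_malayalam(ch):
            category = unicodedata.category(ch)
            is_mark = category in ("Mn", "Mc")
            joins = category == "Lo" and current.endswith(VIRAMA)
            if current and (is_mark or joins):
                current += ch
            else:
                if current:
                    out.append(current)
                current = ch
        elif ch in JOINERS and current:
            # Keep ZWJ/ZWNJ with the cluster; the next letter starts a new one.
            current += ch
        else:
            if current:
                out.append(current)
            current = ""
    if current:
        out.append(current)
    return out


def character_set(text):
    """Count how often each cluster occurs in text."""
    return Counter(clusters(text))
```

The keys of the resulting counter, sorted and joined with spaces, could
then serve as the text that gets rendered into the training image.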

I will commit my code here; the repository is empty for now :)
https://github.com/smc/mal-ocr
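
For the box file itself, a minimal sketch of writing one line per glyph,
assuming the Tesseract 3.x layout of "symbol left bottom right top page"
with coordinates measured from the bottom-left corner of the image. The
glyph rectangles are assumed to come from whatever renders the text
(that part is not shown), in the usual top-left image coordinates. The
helper name is hypothetical, and the format should be checked against
the tesseract version in use.

```python
def box_line(symbol, left, top, right, bottom, image_height, page=0):
    """Format one box-file line for a glyph.

    The input rectangle uses image coordinates (origin at top-left, y
    grows downward). The box file is assumed to use a bottom-left
    origin, so the y values are flipped against the image height.
    """
    box_bottom = image_height - bottom
    box_top = image_height - top
    return "%s %d %d %d %d %d" % (symbol, left, box_bottom, right, box_top, page)


def write_box_file(path, glyphs, image_height):
    """Write glyphs, given as (symbol, left, top, right, bottom), to path."""
    with open(path, "w", encoding="utf-8") as handle:
        for symbol, left, top, right, bottom in glyphs:
            handle.write(box_line(symbol, left, top, right, bottom, image_height))
            handle.write("\n")
```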


On Sun, Oct 20, 2013 at 8:55 PM, Anivar Aravind
<anivar.aravind at gmail.com> wrote:
> Dear Baiju,
>
> Have you checked this?
> https://code.google.com/p/tesseractindic/downloads/detail?name=tesseract_trainer.beta.tar.gz&can=2&q=
>
> This is a tool to automatically generate the files required by
> tesseract-ocr for adding support to a new script. This tool takes as
> input a file containing all characters of the alphabet, and a
> directory of all different fonts. It then generates several tif images
> and corresponding box files, and then proceeds to generate the 5
> training files:
>
> inttemp
> normproto
> unicharset
> Microfeat
> pffmtable
>
>
> I don't know whether all of them are needed for the current version.
> But I think it is worth going through Debayan's previous work at
> https://sites.google.com/site/debayanin/hackingtesseract
>
> On Sun, Oct 20, 2013 at 8:43 PM, Baiju M <baiju.m.mail at gmail.com> wrote:
>> Hi,
>>
>> I am trying to create a boxfile for tesseract.  My current target is
>> to recognize Rachana typeface. I am experimenting with LibreOffice to
>> create a sample TIF file using some Malayalam text.
>>
>> In LibreOffice, what happens when we use
>> Format->Character->Position->Spacing->Expanded for Malayalam
>> characters? What logic does it use to identify a character?
>>
>> Can I get something similar using Pango, or any other tool that I can
>> use as a library (C/Python) or from the command line, and that behaves
>> like LibreOffice here?
>>
>> So far I am fine with the result from LibreOffice, but I would like to
>> use something that I can automate.
>>
>> Regards,
>> Baiju M
>> _______________________________________________
>> Swathanthra Malayalam Computing discuss Mailing List
>> Project: https://savannah.nongnu.org/projects/smc
>> Web: http://smc.org.in | IRC : #smc-project @ freenode
>> discuss at lists.smc.org.in
>> http://lists.smc.org.in/listinfo.cgi/discuss-smc.org.in
>>