[smc-discuss] Boxfile for tesseract and Malayalam letters with expanded spacing

Sun Oct 20 08:25:54 PDT 2013

Dear Baiju,

Have you checked this?
https://code.google.com/p/tesseractindic/downloads/detail?name=tesseract_trainer.beta.tar.gz&can=2&q=

This is a tool that automatically generates the files required by
tesseract-ocr for adding support for a new script. It takes as input
a file containing all characters of the alphabet and a directory
containing the fonts. It then generates several TIF images with
corresponding box files, and proceeds to generate the 5 training
files:

inttemp
normproto
unicharset
Microfeat
pffmtable
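
To make the box file step concrete, here is a minimal sketch in Python of
how box file lines can be written once the glyph positions are known. It
is not taken from the tool above. The function name and the sample
coordinates are made up for illustration. It assumes the usual box file
layout of "symbol left bottom right top page", measured from the
bottom-left corner of the image, while most drawing libraries report
positions from the top-left corner.

```python
from typing import Iterable, List, Tuple

# (symbol, x, y, width, height), measured from the TOP-left corner of the
# image, which is how most drawing libraries report glyph positions.
Glyph = Tuple[str, int, int, int, int]


def to_box_lines(glyphs: Iterable[Glyph], image_height: int, page: int = 0) -> List[str]:
    """Format glyph rectangles as box file lines (hypothetical helper).

    Box files count y from the BOTTOM of the image, so the vertical
    coordinates are flipped using the image height.
    """
    lines = []
    for symbol, x, y, width, height in glyphs:
        left = x
        right = x + width
        top = image_height - y
        bottom = image_height - (y + height)
        lines.append(f"{symbol} {left} {bottom} {right} {top} {page}")
    return lines


if __name__ == "__main__":
    # Made-up positions for two glyphs on a 100 pixel high image.
    sample = [("\u0d15", 10, 20, 30, 40), ("\u0d2e", 50, 20, 28, 40)]
    print("\n".join(to_box_lines(sample, image_height=100)))
```

The glyph rectangles themselves would have to come from whatever renders
the text into the image.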

I don't know whether all five files are needed for the current version.
But I think it is worth going through Debayan's previous work at
https://sites.google.com/site/debayanin/hackingtesseract
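
On the question of what counts as one character for Malayalam, a rough
sketch of one possible rule is below. This is an assumption for
illustration only, not what LibreOffice or Pango actually do: vowel signs
and other combining marks stay with the preceding letter, and a letter
that follows a virama (chandrakkala) joins the same cluster. The function
name is made up. Real shaping engines use more complete rules and also
depend on the font.

```python
import unicodedata

VIRAMA = "\u0d4d"               # Malayalam sign virama (chandrakkala)
JOINERS = {"\u200c", "\u200d"}  # ZWNJ and ZWJ


def split_clusters(text):
    """Split text into approximate visual units (hypothetical helper)."""
    clusters = []
    for ch in text:
        attach = False
        if clusters:
            category = unicodedata.category(ch)
            if ch in JOINERS or category in ("Mn", "Mc"):
                # Combining marks and joiners stay with the previous unit.
                attach = True
            elif clusters[-1].endswith(VIRAMA) and category == "Lo":
                # A letter right after a virama forms a conjunct.
                attach = True
        if attach:
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    # One cluster per line for the word "Malayalam" in Malayalam script.
    for cluster in split_clusters("\u0d2e\u0d32\u0d2f\u0d3e\u0d33\u0d02"):
        print(cluster)
```

Each cluster could then be treated as one symbol when writing the box
file, but whether a given conjunct is drawn as a single glyph depends on
the font, so the result should be checked against the rendered image.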

On Sun, Oct 20, 2013 at 8:43 PM, Baiju M <baiju.m.mail at gmail.com> wrote:
> Hi,
>
> I am trying to create a boxfile for tesseract.  My current target is
> to recognize Rachana typeface. I am experimenting with LibreOffice to
> create a sample TIF file using some Malayalam text.
>
> In LibreOffice, what happens when we use
> Format->Character->Position->Spacing->Expanded for Malayalam
> characters? What is the logic used to identify a character?
>
> Can I get something similar using Pango, or any other tool that I can
> use as a library (C/Python) or from the command line and that behaves
> like LibreOffice?
>
> So far I am fine with the result from LibreOffice, but I would like
> to use something that I can automate.
>
> Regards,
> Baiju M