Dear all,

We've just open-sourced a tool that creates Tesseract training 
material from the PAGE XML files produced by Aletheia. The source code (Java) 
is available here: https://github.com/psnc-dl/page-generator 
and binaries can also be downloaded from GitHub. It is a command-line tool, so 
it should be easy to use as part of your scripts.

The tool "cuts" glyph images out of the page image based on the glyph data in 
the PAGE file and then creates a Tesseract training page with the 
corresponding box file. These can be used for Tesseract training. I tested 
this using the script 
https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh and 
it seems to produce a valid Tesseract profile.
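
For readers unfamiliar with box files, here is a minimal sketch (in Python, not 
taken from page-generator) of how glyph bounding boxes can be turned into 
Tesseract box file lines. The Glyph class and the function names are 
hypothetical and only for illustration. It assumes glyph coordinates are given 
in image space with the origin at the top-left corner, while a box file line 
has the form "symbol left bottom right top page" with the origin at the 
bottom-left corner.

```python
from dataclasses import dataclass


@dataclass
class Glyph:
    """Hypothetical glyph record: text plus a box in image coordinates
    (origin at the top-left corner, y grows downwards)."""
    text: str
    left: int
    top: int
    right: int
    bottom: int


def to_box_line(glyph, page_height, page=0):
    """Format one glyph as a Tesseract box file line.

    Box files use a bottom-left origin, so the y values are flipped
    against the page height.
    """
    box_bottom = page_height - glyph.bottom
    box_top = page_height - glyph.top
    return f"{glyph.text} {glyph.left} {box_bottom} {glyph.right} {box_top} {page}"


def to_box_file(glyphs, page_height, page=0):
    """Join the box lines for all glyphs on a page, one glyph per line."""
    return "\n".join(to_box_line(g, page_height, page) for g in glyphs) + "\n"


# Two made-up glyphs on a page that is 100 pixels high.
glyphs = [Glyph("a", 10, 20, 30, 50), Glyph("b", 35, 15, 55, 50)]
box_text = to_box_file(glyphs, page_height=100)
```

The resulting text would be saved next to the training image under the same 
base name with a .box extension, which is where Tesseract's training tools 
look for it.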

Page-generator also supports output from our own tool, Cutouts 
(http://wlt.synat.pcss.pl/cutouts, 
https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application), 
which helps with preparing training material.

Kind regards,
Adam Dudczak

--
Digital libraries team, PSNC
http://dl.psnc.pl

-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to
tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en
