Dear all,

We've just open-sourced a tool that creates Tesseract training material from the PAGE XML files produced by Aletheia. The source code (Java) is available here: https://github.com/psnc-dl/page-generator. Binaries can also be downloaded from GitHub. It is a command-line tool, so it should be easy to use as part of your scripts.
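For readers unfamiliar with the target format, here is a rough illustration (not taken from page-generator's source). A Tesseract box file has one line per glyph, "<char> <left> <bottom> <right> <top> <page>", with coordinates measured from the bottom-left corner of the image, whereas PAGE coordinates start at the top-left. The Python sketch below assumes the glyph outline is available as an "x1,y1 x2,y2 ..." string, as in newer PAGE schema versions; parse_points and box_line are hypothetical helper names used only for this example.

```python
def parse_points(points):
    """Parse an 'x1,y1 x2,y2 ...' string into a list of (x, y) integer pairs."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def box_line(char, polygon, image_height, page=0):
    """Format one Tesseract box-file line for a glyph polygon.

    PAGE coordinates have their origin at the top-left corner of the image,
    while box files measure from the bottom-left, so the y axis is flipped.
    """
    xs = [x for x, _ in polygon]
    ys = [y for _, y in polygon]
    left, right = min(xs), max(xs)
    bottom = image_height - max(ys)  # lowest edge of the glyph, flipped
    top = image_height - min(ys)     # highest edge of the glyph, flipped
    return f"{char} {left} {bottom} {right} {top} {page}"


# Hypothetical glyph "a" on an image that is 100 pixels high.
glyph = parse_points("10,20 30,20 30,50 10,50")
print(box_line("a", glyph, image_height=100))  # prints: a 10 50 30 80 0
```

Note that this only shows the coordinate conversion; the actual tool places the cut-out glyphs on a newly generated page, so its box coordinates refer to that page rather than to the original scan.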
The tool "cuts" glyph images out of the page image based on the glyph data in the PAGE file, and then creates a Tesseract training page with a matching box file. These can be used for Tesseract training. I tested this with the script https://github.com/psnc-dl/page-generator/blob/master/src/etc/train.sh and it seems to produce a valid Tesseract profile.

Page-generator also supports output from our tool Cutouts (http://wlt.synat.pcss.pl/cutouts, https://confluence.man.poznan.pl/community/display/WLT/Cutouts+application), which helps with the preparation of training material.

Kind regards,
Adam Dudczak
--
Digital libraries team, PSNC
http://dl.psnc.pl

--
You received this message because you are subscribed to the Google Groups "tesseract-ocr" group.
To post to this group, send email to tesseract-ocr@googlegroups.com
To unsubscribe from this group, send email to tesseract-ocr+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/tesseract-ocr?hl=en