Hi! I'm currently training with the same gibberish page used in Tesseract's default English training set, the one that begins with "THAN PHONE:" and ends with "BEING TO WEB". The pages I want to recognize all share the same format: several text regions of varying sizes distributed across the page, much like a newspaper.
I get good results if I apply a recognition rectangle to each region individually, but I expect that to be slower than recognizing a single page-wide region. The trouble is that Tesseract has not correctly recognized the text in most of those fields when the page is recognized as a single region. Automatically finding the proper position and orientation within the scanned images at which to apply the region rectangles is (as you might expect) also problematic.

My question is: should I train using known samples of the formatted pages (observing the other recommended training criteria, of course, e.g. not grouping repeated characters together in a bunch)? Or would I be better off sticking with THAN PHONE?

Thanks,
Ted

