Formatted Pages Training

TedJ Thu, 23 May 2013 02:55:22 -0700

Hi!  I'm currently training with the same gibberish page used in 
Tesseract's default English training set, the one that begins with 
"THAN PHONE:" and ends with "BEING TO WEB".  The pages I need to recognize 
all share the same format: several text regions of varying sizes 
distributed across the page, much like a newspaper.


I get good results if I apply a recognition rectangle to each region 
individually, but I expect that ends up slower than recognizing the whole 
page as a single region would be.  The trouble is that Tesseract has not 
correctly recognized the text within most of those regions when the page 
is processed as a single region.  Automatically finding the proper position 
and orientation at which to apply the region rectangles within the scanned 
images is (as you might expect) also problematic.
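
To make the positioning problem concrete, here is a rough sketch of how 
fixed template rectangles could be mapped onto a scan once an offset and 
skew angle are known.  It assumes those two values have already been 
estimated by some other means (not shown), and the function name, 
parameters, and coordinates are all hypothetical.  Each result is in 
(left, top, width, height) order, which is the argument order that 
TessBaseAPI::SetRectangle takes.

```python
import math
from typing import List, Tuple

Rect = Tuple[int, int, int, int]  # (left, top, width, height)


def place_regions(template: List[Rect], dx: float, dy: float,
                  angle_deg: float, page_w: int, page_h: int,
                  pad: int = 0) -> List[Rect]:
    """Map template rectangles onto a scanned page.

    Hypothetical helper: the page is assumed to be the template rotated by
    angle_deg about the template origin and then shifted by (dx, dy).
    Returns the axis-aligned bounding box of each rotated rectangle,
    grown by pad pixels and clipped to the page.
    """
    theta = math.radians(angle_deg)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    placed = []
    for left, top, width, height in template:
        corners = [(left, top), (left + width, top),
                   (left, top + height), (left + width, top + height)]
        # Rotate each corner, then translate it into page coordinates.
        xs = [x * cos_t - y * sin_t + dx for x, y in corners]
        ys = [x * sin_t + y * cos_t + dy for x, y in corners]
        # Bounding box of the rotated rectangle, padded and clipped.
        x0 = max(0, math.floor(min(xs)) - pad)
        y0 = max(0, math.floor(min(ys)) - pad)
        x1 = min(page_w, math.ceil(max(xs)) + pad)
        y1 = min(page_h, math.ceil(max(ys)) + pad)
        # A region lying entirely off the page collapses to zero size.
        placed.append((x0, y0, max(0, x1 - x0), max(0, y1 - y0)))
    return placed
```

A small pad gives some tolerance when the offset or skew estimate is 
slightly off, at the cost of possibly picking up the edge of a 
neighboring region.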

My question is:  should I train using known samples of the formatted pages 
(observing the other recommended training criteria, of course, e.g. not 
grouping repeated characters together in a bunch)?  Or would I be 
better off sticking with THAN PHONE?

Thanks, Ted.

-- 
-- 
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en

