I have a set of English single-page TIFF document images that come with ground truth (GT) files. Each TIFF has a single rectangular zone of text, and each GT file is a UTF-8 text file containing the correct text.
I built Tesseract 3.03 from source and ran it on this set using the English model that ships out of the box. Results were mixed, so the question I am trying to answer is this: can I incrementally train Tesseract on part of this corpus to get better accuracy? I've been reading https://code.google.com/p/tesseract-ocr/wiki/TrainingTesseract3 but it's unclear to me whether incremental training is possible. Is it? How would I have to modify the training procedure so that it starts from the previously trained data and extends it with what comes from the new data? Thx
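To make "mixed" and "better accuracy" concrete, each OCR output can be scored against its GT file. The following is only a minimal sketch, assuming character-level edit distance as the metric; the function names `edit_distance` and `char_accuracy` are hypothetical helpers and not part of Tesseract.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_accuracy(ocr_text: str, gt_text: str) -> float:
    """Assumed metric: 1 - (edit distance / GT length), clamped at 0."""
    if not gt_text:
        return 1.0 if not ocr_text else 0.0
    return max(0.0, 1.0 - edit_distance(ocr_text, gt_text) / len(gt_text))
```

Scoring a held-out part of the corpus this way before and after any retraining would show whether the extra training actually helped.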

