If possible, have a look at the regions reported by tesseract
(e.g. "((503,2112),(509,2121)): FAILURE!") on the binarized image (you can
use the tesseract config "tessedit_write_images T"). Sometimes you are able
to identify the problem easily (e.g. there is no space between symbols) -
see the screenshot in issue 698, comment 16 [1]. Maybe in such cases it
would make sense to train the combination "rt" (untested ;-) ).

If the error messages are at "random" places (and involve different
symbols), I would not worry about them.

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16

Zdenko


On Wed, Feb 20, 2013 at 7:53 AM, Dmitri Silaev <[email protected]> wrote:

> Having perfect training logs for the entire set of training images
> (especially real-world samples) definitely would be a headache. I suppose a
> reasonable number of APPLY_BOXES errors is okay. "Reasonable" can be based
> on the error to total ratio and finally depends on you. I personally allow
> for up to 10% error rate.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
>
> On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes <[email protected]> wrote:
>
>> Hello all,
>>
>> While generating the TR for a TIF/BOX pair using a large text, there are
>> some errors when the box cannot be made and hence some of the characters
>> will throw errors.
>>
>> The Wiki says the following:
>>
>> Don't make the mistake of grouping all the non-letters together. Make the
>> text more realistic. For example, *The quick brown fox jumps over the
>> lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The
>> (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog &
>> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the
>> textline finding code a much better chance of getting sensible baseline
>> metrics for the special characters.
>>
>> Now, using a realistic text, I have:
>>
>> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
>> Couldn't find a matching blob
>> APPLY_BOXES:
>> Boxes read from boxfile: 4963
>> Boxes failed resegmentation: 1157
>> Found 3806 good blobs.
>> Leaving 26 unlabelled blobs in 0 words.
>> TRAINING ... Font name = rageitalic
>> Generated training data for 550 words
>>
>> Now, redoing that with fewer characters and properly spaced will not yield
>> any errors.
>>
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>> APPLY_BOXES:
>> Boxes read from boxfile: 92
>> Found 92 good blobs.
>> TRAINING ... Font name = rageitalic
>> Generated training data for 8 words
>> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
>> eng.rageitalic.exp0 nobatch box.train.stderr
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>> APPLY_BOXES:
>> Boxes read from boxfile: 92
>> Found 92 good blobs.
>> TRAINING ... Font name = rageitalic
>> Generated training data for 8 words
>>
>> Is it better to train with a larger text regardless of the errors, or is
>> it better to train all the possible characters without errors?
>>
>> Looks like, by the tesseract code, that the first step is to identify
>> offline each character. The dictionaries then work to do some filtering.
>>
>> But it seems to me that it might not be bad at all to have say 100
>> characters possible and have a perfect TR generation other than a bigger
>> text with failures.
>>
>> Any thoughts?
>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
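
The two checks discussed in this thread (finding which boxes failed, and
comparing the failure count with the total) could be scripted roughly as
below. This is only a sketch under assumptions: the helper name
summarize_apply_boxes is made up for illustration, the log format is assumed
to match the lines quoted in this thread, and the four numbers in each
FAILURE line are assumed to be (left, bottom), (right, top) in box-file
coordinates. The 10% limit mentioned above is a personal rule of thumb, not
something tesseract enforces.

```python
import re

# Matches lines such as:
#   APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! ...
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)
READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")


def summarize_apply_boxes(log_text):
    """Hypothetical helper: list failed boxes and compute the failure ratio."""
    failures = [
        # (box file line number, symbol, (x1, y1, x2, y2) as printed in the log)
        (int(m.group(1)), m.group(2), tuple(int(m.group(i)) for i in range(3, 7)))
        for m in FAILURE_RE.finditer(log_text)
    ]
    read = READ_RE.search(log_text)
    failed = FAILED_RE.search(log_text)
    total = int(read.group(1)) if read else 0
    # A clean run (like the 92-box one in this thread) prints no "failed" line.
    n_failed = int(failed.group(1)) if failed else 0
    ratio = n_failed / total if total else 0.0
    return failures, total, n_failed, ratio


# Log lines copied from the first run quoted in this thread.
sample_log = """\
APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! Couldn't find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    4963
   Boxes failed resegmentation:    1157
   Found 3806 good blobs.
"""

failures, total, n_failed, ratio = summarize_apply_boxes(sample_log)
for line_no, symbol, box in failures:
    print(f"box file line {line_no}, symbol {symbol!r}, region {box}")
print(f"{n_failed} of {total} boxes failed ({ratio:.1%})")
```

For the numbers in the first run, 1157 failures out of 4963 boxes is roughly
23%, so well above a 10% limit. The listed regions are the ones worth
checking on the binarized image.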

