Having perfect training logs for the entire set of training images (especially real-world samples) would definitely be a headache. I suppose a reasonable number of APPLY_BOXES errors is okay. What counts as "reasonable" can be judged by the ratio of errors to total boxes, and ultimately it is up to you. I personally allow an error rate of up to 10%.
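A minimal sketch of that ratio check, in case it helps. It assumes the summary lines look like the ones in the log quoted below ("Boxes read from boxfile" and "Boxes failed resegmentation"). The function name and the 10% default are only illustrative, and any other failure counters Tesseract may print are not handled.

```python
import re

def apply_boxes_failure_rate(log_text):
    """Parse an APPLY_BOXES summary and return (failed, total, ratio).

    Assumption: only the two summary lines seen in this thread are parsed.
    A log without a 'Boxes failed resegmentation' line counts as 0 failures.
    """
    total = re.search(r"Boxes read from boxfile:\s*(\d+)", log_text)
    if total is None:
        raise ValueError("no 'Boxes read from boxfile' line found")
    failed = re.search(r"Boxes failed resegmentation:\s*(\d+)", log_text)
    n_total = int(total.group(1))
    n_failed = int(failed.group(1)) if failed else 0
    ratio = n_failed / n_total if n_total else 0.0
    return n_failed, n_total, ratio

def within_tolerance(log_text, max_ratio=0.10):
    """True if the failure ratio does not exceed max_ratio (my personal 10%)."""
    _, _, ratio = apply_boxes_failure_rate(log_text)
    return ratio <= max_ratio

# Summary taken from the large-text run quoted in this thread.
FAILING_LOG = """APPLY_BOXES:
Boxes read from boxfile: 4963
Boxes failed resegmentation: 1157
Found 3806 good blobs.
"""

failed, total, ratio = apply_boxes_failure_rate(FAILING_LOG)
print(f"{failed}/{total} boxes failed ({ratio:.1%})")
print("within tolerance:", within_tolerance(FAILING_LOG))
```

By this measure the large-text run quoted below (1157 failures out of 4963 boxes) is well above my 10% limit, while the 92-box run has no failures at all.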
Warm regards,
Dmitri Silaev
www.CustomOCR.com

On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes <[email protected]> wrote:
> Hello all,
>
> While generating the TR for a TIF/BOX pair using a large text, there are
> some errors when the box cannot be made and hence some of the characters
> will throw errors.
>
> The Wiki says the following:
>
> Don't make the mistake of grouping all the non-letters together. Make the
> text more realistic. For example, *The quick brown fox jumps over the
> lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The
> (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog &
> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the
> textline finding code a much better chance of getting sensible baseline
> metrics for the special characters.
>
> Now, doing via a realistic text, I have:
>
> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
> Couldn't find a matching blob
> APPLY_BOXES:
> Boxes read from boxfile: 4963
> Boxes failed resegmentation: 1157
> Found 3806 good blobs.
> Leaving 26 unlabelled blobs in 0 words.
> TRAINING ... Font name = rageitalic
> Generated training data for 550 words
>
> Now, redoing that with less characters and properly spaced will not yield
> any errors.
>
> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
> APPLY_BOXES:
> Boxes read from boxfile: 92
> Found 92 good blobs.
> TRAINING ... Font name = rageitalic
> Generated training data for 8 words
> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
> eng.rageitalic.exp0 nobatch box.train.stderr
> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
> APPLY_BOXES:
> Boxes read from boxfile: 92
> Found 92 good blobs.
> TRAINING ... Font name = rageitalic
> Generated training data for 8 words
>
> Is it better to train with a larger text regardless of the errors, or is
> it better to train all the possible characters without errors?
>
> Looks like, by the tesseract code, that the first step is to identify
> offline each character. The dictionaries then work to do some filtering.
>
> But it seems to me that it might not be bad at all to have say 100
> characters possible and have a perfect TR generation other than a bigger
> text with failures.
>
> Any thoughts?
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.

