Hello all,
While generating the TR file for a TIF/BOX pair from a large text, some
boxes cannot be matched to a blob, and Tesseract reports errors for the
affected characters.
The Wiki says the following:
Don't make the mistake of grouping all the non-letters together. Make the
text more realistic. For example, *The quick brown fox jumps over the lazy
dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The (quick)
brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & duck/goose, as 12.5%
of E-mail from aspammer is spam?* This gives the textline finding code a
much better chance of getting sensible baseline metrics for the special
characters.
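
One rough way to check a training text against that advice is to measure the longest run of consecutive symbols. The sketch below is only an illustration: the function name longest_symbol_run and the choice to treat every non-alphanumeric, non-space character as a symbol are my own assumptions, not anything Tesseract does.

```python
def longest_symbol_run(text):
    """Return the length of the longest run of consecutive symbol characters.

    Assumption: a "symbol" is any character that is neither alphanumeric
    nor whitespace. Digits therefore do not count as symbols here.
    """
    best = 0
    current = 0
    for ch in text:
        if not ch.isalnum() and not ch.isspace():
            current += 1
            best = max(best, current)
        else:
            current = 0
    return best


# The two example sentences quoted from the Wiki.
BAD = "The quick brown fox jumps over the lazy dog. 0123456789 !@#$%^&(),.{}<>/?"
GOOD = ("The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & "
        "duck/goose, as 12.5% of E-mail from aspammer is spam?")

print(longest_symbol_run(BAD), longest_symbol_run(GOOD))
```

A long run suggests the symbols are grouped together rather than mixed into words, which is what the Wiki warns against.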
Now, training with a realistic text, I get:
APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! Couldn't
find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 4963
Boxes failed resegmentation: 1157
Found 3806 good blobs.
Leaving 26 unlabelled blobs in 0 words.
TRAINING ... Font name = rageitalic
Generated training data for 550 words
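
To put numbers on a run like this, the stderr output can be summarized with a short script. This is a minimal sketch that assumes the log lines look exactly like the ones pasted above; the regular expressions and the function name summarize_apply_boxes are my own and are not part of Tesseract.

```python
import re
from collections import Counter

# Patterns modelled on the log lines shown in this message (an assumption:
# other Tesseract versions may word these lines differently).
FAILURE_RE = re.compile(r"boxfile line (\d+)/(\S+) \(\(\d+,\d+\),\(\d+,\d+\)\): FAILURE!")
READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")
GOOD_RE = re.compile(r"Found (\d+) good blobs")


def _first_int(pattern, text):
    """Return the first captured integer for pattern, or 0 if it is absent."""
    match = pattern.search(text)
    return int(match.group(1)) if match else 0


def summarize_apply_boxes(log_text):
    """Collect the box counts and the per-character failure lines from a log."""
    boxes_read = _first_int(READ_RE, log_text)
    boxes_failed = _first_int(FAILED_RE, log_text)
    return {
        "boxes_read": boxes_read,
        "boxes_failed": boxes_failed,
        "good_blobs": _first_int(GOOD_RE, log_text),
        # Fraction of boxes that failed resegmentation (0.0 if nothing was read).
        "failed_ratio": boxes_failed / boxes_read if boxes_read else 0.0,
        # How many FAILURE lines were logged for each character.
        "failures_by_char": Counter(m.group(2) for m in FAILURE_RE.finditer(log_text)),
    }


SAMPLE_LOG = """\
APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! Couldn't
find a matching blob
APPLY_BOXES:
Boxes read from boxfile: 4963
Boxes failed resegmentation: 1157
Found 3806 good blobs.
"""

summary = summarize_apply_boxes(SAMPLE_LOG)
print(summary)
```

For the run above, 1157 of the 4963 boxes failed resegmentation, which is roughly 23%. With the full stderr output, failures_by_char would show which characters lose the most samples.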
Redoing that with fewer, properly spaced characters yields no errors at
all:
antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
eng.rageitalic.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
APPLY_BOXES:
Boxes read from boxfile: 92
Found 92 good blobs.
TRAINING ... Font name = rageitalic
Generated training data for 8 words
Is it better to train with a larger text regardless of the errors, or is it
better to train all the possible characters without errors?
Judging by the Tesseract code, the first step is to identify each character
on its own, and the dictionaries then do some filtering. So it seems to me
that covering, say, 100 possible characters with a perfect TR generation
might not be bad at all compared with a bigger text with failures.
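
One way to compare the two options is to count how many samples each character has in the box file. The sketch below assumes the usual box file layout of one box per line, "symbol left bottom right top page", separated by spaces. The function names, the sample lines, and the minimum threshold are hypothetical and only for illustration.

```python
from collections import Counter


def count_box_samples(box_lines):
    """Count how many boxes each symbol has.

    Assumption: each line is "symbol left bottom right top [page]".
    Lines with fewer than five fields are skipped.
    """
    counts = Counter()
    for line in box_lines:
        parts = line.split()
        if len(parts) < 5:
            continue
        counts[parts[0]] += 1
    return counts


def underrepresented(counts, minimum):
    """Return the symbols that have fewer than `minimum` samples, sorted."""
    return sorted(symbol for symbol, n in counts.items() if n < minimum)


# Hypothetical box lines, not taken from the real eng.rageitalic.exp0.box file.
example_lines = [
    "T 10 20 30 40 0",
    "h 32 20 45 40 0",
    "e 47 20 58 32 0",
    "e 60 20 71 32 0",
    "",
]

counts = count_box_samples(example_lines)
print(counts)
print(underrepresented(counts, minimum=2))
```

Running this on both box files would show whether the small, error-free set still has enough samples of each character, or whether the larger set keeps more samples per character even after the failed boxes are subtracted.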
Any thoughts?
--
You received this message because you are subscribed to the Google
Groups "tesseract-ocr" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to
[email protected]
For more options, visit this group at
http://groups.google.com/group/tesseract-ocr?hl=en