Should TR errors be ignored for a large text sample on a pair of TIF/BOX? What is the best practice here?

Carlos Antunes Tue, 19 Feb 2013 18:37:14 -0800

Hello all,

While generating the TR file for a TIF/BOX pair from a large text, I get 
errors wherever a box cannot be matched to a blob, so some of the 
characters fail.


The Wiki says the following:

Don't make the mistake of grouping all the non-letters together. Make the 
text more realistic. For example, *The quick brown fox jumps over the lazy 
dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The (quick) 
brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & duck/goose, as 12.5% 
of E-mail from aspammer is spam?* This gives the textline finding code a 
much better chance of getting sensible baseline metrics for the special 
characters. 

Now, using a realistic text, I get:

APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! Couldn't 
find a matching blob
APPLY_BOXES:
   Boxes read from boxfile:    4963
   Boxes failed resegmentation:    1157
   Found 3806 good blobs.
   Leaving 26 unlabelled blobs in 0 words.
TRAINING ... Font name = rageitalic
Generated training data for 550 words
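
To put a number on that, here is a minimal sketch that pulls the counters out of the APPLY_BOXES summary above and computes the share of boxes that failed. The function name `apply_boxes_summary` is made up for illustration; the regular expressions only assume the exact wording printed in the log above.

```python
import re

def apply_boxes_summary(log_text):
    """Extract the APPLY_BOXES counters and compute the failure rate."""
    patterns = {
        "read": r"Boxes read from boxfile:\s*(\d+)",
        "failed": r"Boxes failed resegmentation:\s*(\d+)",
        "good": r"Found (\d+) good blobs",
    }
    counts = {}
    for name, pattern in patterns.items():
        match = re.search(pattern, log_text)
        # A counter that is not printed (e.g. no failures) is treated as 0.
        counts[name] = int(match.group(1)) if match else 0
    read = counts["read"]
    counts["failure_rate"] = counts["failed"] / read if read else 0.0
    return counts

# Summary lines copied from the run above.
log = """APPLY_BOXES:
   Boxes read from boxfile:    4963
   Boxes failed resegmentation:    1157
   Found 3806 good blobs.
"""
summary = apply_boxes_summary(log)
print(f"{summary['failed']} of {summary['read']} boxes failed "
      f"({summary['failure_rate']:.1%})")
```

So roughly 23% of the boxes in the large sample were dropped, while 3806 were still used for training.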

Redoing this with fewer, properly spaced characters does not yield any 
errors:

antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif 
eng.rageitalic.exp0 nobatch box.train.stderr
Tesseract Open Source OCR Engine v3.02.02 with Leptonica
APPLY_BOXES:
   Boxes read from boxfile:      92
   Found 92 good blobs.
TRAINING ... Font name = rageitalic
Generated training data for 8 words

Is it better to train with a larger text regardless of the errors, or is it 
better to train all the possible characters without errors?

Judging by the Tesseract code, the first step is to identify each 
character offline; the dictionaries then do some filtering.

But it seems to me that it might not be bad at all to have, say, 100 
possible characters and a perfect TR generation rather than a bigger 
text with failures.
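
One way to compare the two options would be to count how many samples of each character survive in the large text once the failed boxes are subtracted. The sketch below is only an illustration: `surviving_samples` is a hypothetical helper, it assumes each box file line has the form `<char> <left> <bottom> <right> <top> <page>`, and it assumes failures are reported as `boxfile line <n>/<char> ... FAILURE` like the line quoted earlier. The box entries in the example are made-up sample data, not taken from my files.

```python
import re
from collections import Counter

# Matches e.g. "boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!"
FAILURE_RE = re.compile(r"boxfile line (\d+)/(\S+) .*?FAILURE")

def surviving_samples(box_lines, log_text):
    """Return, per character, the number of boxes that did not fail."""
    # First whitespace-separated field of a box line is the character.
    total = Counter(line.split()[0] for line in box_lines if line.strip())
    failed = Counter(m.group(2) for m in FAILURE_RE.finditer(log_text))
    return {ch: total[ch] - failed.get(ch, 0) for ch in total}

# Made-up box entries for illustration only.
box_lines = [
    "a 10 10 20 20 0",
    "b 22 10 30 20 0",
    "b 503 2112 509 2121 0",
]
log = ("APPLY_BOXES: boxfile line 3/b ((503,2112),(509,2121)): "
       "FAILURE! Couldn't find a matching blob")

counts = surviving_samples(box_lines, log)
print(counts)
```

If every character still has a reasonable number of surviving samples, the failures in the larger text may matter less; characters that drop to zero would be the ones to worry about.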

Any thoughts?
