Re: Should TR errors be ignored for a large text sample on a pair of TIF/BOX? What is the best practice here?

Dmitri Silaev Tue, 19 Feb 2013 22:53:47 -0800

Getting perfect training logs for the entire set of training images
(especially real-world samples) would definitely be a headache. I suppose a
reasonable number of APPLY_BOXES errors is okay. "Reasonable" can be judged
by the ratio of failed boxes to total boxes, and in the end it is up to you.
I personally allow for an error rate of up to 10%.
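
As a rough sketch of that ratio check, the snippet below pulls the two
counts out of an APPLY_BOXES summary and divides them. The function name
apply_boxes_error_rate is made up for illustration, the sample text is
copied from the log quoted below, and the assumption that a missing
"Boxes failed resegmentation" line means zero failures comes only from the
clean run in that same log. The 10% figure is a personal rule of thumb,
not a Tesseract setting.

```python
import re


def apply_boxes_error_rate(log_text):
    """Return (failed, total, ratio) parsed from an APPLY_BOXES summary.

    Hypothetical helper: it only looks for the two summary lines that
    appear in the log quoted in this thread.
    """
    total_match = re.search(r"Boxes read from boxfile:\s*(\d+)", log_text)
    if total_match is None:
        raise ValueError("no 'Boxes read from boxfile' line found")
    total = int(total_match.group(1))

    # Assumption: a clean run prints no "failed resegmentation" line at all.
    failed_match = re.search(r"Boxes failed resegmentation:\s*(\d+)", log_text)
    failed = int(failed_match.group(1)) if failed_match else 0

    ratio = failed / total if total else 0.0
    return failed, total, ratio


sample_log = """APPLY_BOXES:
   Boxes read from boxfile:    4963
   Boxes failed resegmentation:    1157
   Found 3806 good blobs.
"""

failed, total, ratio = apply_boxes_error_rate(sample_log)
print(f"{failed}/{total} boxes failed ({ratio:.1%})")
```

For the large sample in the quoted log, 1157 of 4963 boxes failed, which is
roughly 23% and well above a 10% allowance.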


Warm regards,
Dmitri Silaev
www.CustomOCR.com


On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes <[email protected]> wrote:

> Hello all,
>
> While generating the TR file for a TIF/BOX pair from a large text, some
> boxes cannot be matched to a blob, so some of the characters produce
> errors.
>
> The Wiki says the following:
>
> Don't make the mistake of grouping all the non-letters together. Make the
> text more realistic. For example, *The quick brown fox jumps over the
> lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The
> (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog &
> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the
> textline finding code a much better chance of getting sensible baseline
> metrics for the special characters.
>
> Now, doing this with a realistic text, I get:
>
> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
> Couldn't find a matching blob
> APPLY_BOXES:
>    Boxes read from boxfile:    4963
>    Boxes failed resegmentation:    1157
>    Found 3806 good blobs.
>    Leaving 26 unlabelled blobs in 0 words.
> TRAINING ... Font name = rageitalic
> Generated training data for 550 words
>
> Redoing that with fewer, properly spaced characters does not yield any
> errors.
>
> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
> eng.rageitalic.exp0 nobatch box.train.stderr
> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
> APPLY_BOXES:
>    Boxes read from boxfile:      92
>    Found 92 good blobs.
> TRAINING ... Font name = rageitalic
> Generated training data for 8 words
>
> Is it better to train with a larger text regardless of the errors, or is
> it better to train all the possible characters without errors?
>
> Judging by the Tesseract code, the first step is to identify each
> character offline. The dictionaries then do some filtering.
>
> But it seems to me that it might not be bad at all to have, say, 100
> possible characters and a perfect TR generation rather than a bigger
> text with failures.
>
> Any thoughts?
>
> --
> --
> You received this message because you are subscribed to the Google
> Groups "tesseract-ocr" group.
> To post to this group, send email to [email protected]
> To unsubscribe from this group, send email to
> [email protected]
> For more options, visit this group at
> http://groups.google.com/group/tesseract-ocr?hl=en
>
> ---
> You received this message because you are subscribed to the Google Groups
> "tesseract-ocr" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/groups/opt_out.
>
>
>

