Re: Should TR errors be ignored for a large text sample on a pair of TIF/BOX? What is the best practice here?

zdenko podobny Wed, 20 Feb 2013 04:02:28 -0800

If possible, have a look at the regions pointed to by tesseract
("((503,2112),(509,2121)): FAILURE!") on the binarized image (you can use
the tesseract config "tessedit_write_images T"). Sometimes you can identify
the problem easily (e.g. there is no space between symbols) - see the
screenshot in issue 698, comment 16[1]. Maybe in such cases it would make
sense to train the combination "rt" (untested ;-) )
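
To find those regions without reading the log by hand, something like the
following could work. It is only a sketch: the regular expression is based
on the single FAILURE line quoted below, the helper names and the margin are
made up for illustration, and the conversion assumes the reported
coordinates follow the box-file convention (origin at the bottom-left of
the page), so the y values have to be flipped for most image viewers.

```python
import re

# Matches lines such as:
#   APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
# The pattern is inferred from that one example and may need adjusting.
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+) \(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def failed_boxes(log_text):
    """Return (line_no, symbol, left, bottom, right, top) for each failure."""
    boxes = []
    for match in FAILURE_RE.finditer(log_text):
        line_no, symbol, *coords = match.groups()
        boxes.append((int(line_no), symbol, *map(int, coords)))
    return boxes


def to_image_rect(left, bottom, right, top, image_height, margin=20):
    """Convert a bottom-left-origin box to a top-left-origin crop rectangle.

    Assumption: the log uses box-file coordinates (y grows upwards).
    The margin adds some context around the failing symbol.
    """
    return (
        max(left - margin, 0),
        max(image_height - top - margin, 0),
        right + margin,
        min(image_height - bottom + margin, image_height),
    )
```

The rectangle (left, upper, right, lower) can then be used to crop the
binarized image in whatever viewer or imaging tool you prefer.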


If the error messages are in "random" places (and involve different
symbols), I would not worry about them.
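
One rough way to check whether the failures are "random" is to count them
per symbol and to compare the failed boxes against the total, as in the
sketch below. The patterns are inferred from the log excerpt quoted further
down (including the fact that a clean run prints no "Boxes failed
resegmentation" line), and the function names are hypothetical. The 10%
figure mentioned in the quoted reply is a personal rule of thumb, not a
tesseract setting.

```python
import re
from collections import Counter

# Patterns inferred from the APPLY_BOXES output quoted in this thread.
FAILURE_RE = re.compile(r"boxfile line \d+/(\S+) \(\(.*?\)\): FAILURE!")
READ_RE = re.compile(r"Boxes read from boxfile:\s*(\d+)")
FAILED_RE = re.compile(r"Boxes failed resegmentation:\s*(\d+)")


def failures_by_symbol(log_text):
    """Count FAILURE lines per symbol; a few dominant symbols hint at a
    systematic problem rather than random noise."""
    return Counter(FAILURE_RE.findall(log_text))


def failure_ratio(log_text):
    """Return failed boxes / boxes read, using the APPLY_BOXES summary."""
    read = READ_RE.search(log_text)
    if read is None or int(read.group(1)) == 0:
        raise ValueError("no usable 'Boxes read from boxfile' count in log")
    failed = FAILED_RE.search(log_text)
    # A clean run has no 'Boxes failed resegmentation' line at all.
    n_failed = int(failed.group(1)) if failed else 0
    return n_failed / int(read.group(1))
```

For the log quoted below (1157 failed out of 4963 boxes read) the ratio is
roughly 23%, so in that case it is probably worth looking at which symbols
dominate the per-symbol counts.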

[1] http://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16

Zdenko


On Wed, Feb 20, 2013 at 7:53 AM, Dmitri Silaev <[email protected]> wrote:

> Having perfect training logs for the entire set of training images
> (especially real-world samples) would definitely be a headache. I suppose a
> reasonable number of APPLY_BOXES errors is okay. What counts as "reasonable"
> can be based on the ratio of errors to total boxes and ultimately depends
> on you. I personally allow for up to a 10% error rate.
>
> Warm regards,
> Dmitri Silaev
> www.CustomOCR.com
>
>
> On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes <[email protected]> wrote:
>
>> Hello all,
>>
>> While generating the TR file for a TIF/BOX pair from a large text, some
>> boxes cannot be matched to a blob, and the corresponding characters
>> throw errors.
>>
>> The Wiki says the following:
>>
>> Don't make the mistake of grouping all the non-letters together. Make the
>> text more realistic. For example, *The quick brown fox jumps over the
>> lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is *The
>> (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog &
>> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the
>> textline finding code a much better chance of getting sensible baseline
>> metrics for the special characters.
>>
>> Now, using a realistic text, I get:
>>
>> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
>> Couldn't find a matching blob
>> APPLY_BOXES:
>>    Boxes read from boxfile:    4963
>>    Boxes failed resegmentation:    1157
>>    Found 3806 good blobs.
>>    Leaving 26 unlabelled blobs in 0 words.
>> TRAINING ... Font name = rageitalic
>> Generated training data for 550 words
>>
>> Now, redoing that with fewer, properly spaced characters does not yield
>> any errors:
>>
>> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
>> eng.rageitalic.exp0 nobatch box.train.stderr
>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>> APPLY_BOXES:
>>    Boxes read from boxfile:      92
>>    Found 92 good blobs.
>> TRAINING ... Font name = rageitalic
>> Generated training data for 8 words
>>
>> Is it better to train with a larger text regardless of the errors, or is
>> it better to train all the possible characters without errors?
>>
>> From the tesseract code, it looks like the first step is to identify
>> each character offline. The dictionaries then do some filtering.
>>
>> But it seems to me that it might not be bad at all to have, say, 100
>> possible characters and a perfect TR generation rather than a bigger
>> text with failures.
>>
>> Any thoughts?
>>
>> --
>> --
>> You received this message because you are subscribed to the Google
>> Groups "tesseract-ocr" group.
>> To post to this group, send email to [email protected]
>> To unsubscribe from this group, send email to
>> [email protected]
>> For more options, visit this group at
>> http://groups.google.com/group/tesseract-ocr?hl=en
>>
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "tesseract-ocr" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> For more options, visit https://groups.google.com/groups/opt_out.
