Re: Should TR errors be ignored for a large text sample on a pair of TIF/BOX? What is the best practice here?

Carlos Antunes Wed, 20 Feb 2013 18:35:52 -0800

Zdenko & Dmitri,

In that case, could you suggest a sample text to train with? And how long 
should the text be? Will one page suffice, or do I need more? I ask about 
one page because it fits in a single TIF image.


Thanks.

On Wednesday, February 20, 2013 1:26:24 AM UTC-7, zdenop wrote:
>
> If possible, have a look at the regions reported by tesseract 
> ("((503,2112),(509,2121)): FAILURE!") on the binarized image (you can use 
> the tesseract config "tessedit_write_images T"). Sometimes you can identify 
> the problem easily (e.g. there is no space between symbols); see the 
> screenshot in issue 698, comment 16 [1]. In such cases it might make sense 
> to train the combination "rt" (untested ;-) ).
>
> If the error messages are in "random" places (and involve different 
> symbols), I would not worry about them.
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16
>
> Zdenko
>
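One way to check whether the failures are "random" or cluster on particular 
symbols is to group the FAILURE lines by character. The following is a 
minimal Python sketch, not part of tesseract. It assumes the log line shape 
seen in this thread, and it assumes that the token after the slash ("b" in 
"4962/b") is the character label of the failed box. The names 
parse_failures and failures_by_label are made up for illustration.

```python
import re
from collections import Counter

# Assumed line shape, taken from the log quoted in this thread:
#   APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! <reason>
# Assumption: the token after the slash is the box's character label.
FAILURE_RE = re.compile(
    r"APPLY_BOXES: boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE! (.*)"
)

def parse_failures(log_text):
    """Return one dict per FAILURE line found in the training log."""
    failures = []
    for line in log_text.splitlines():
        m = FAILURE_RE.search(line)
        if m:
            lineno, label, x0, y0, x1, y1, reason = m.groups()
            failures.append({
                "line": int(lineno),
                "label": label,
                "box": (int(x0), int(y0), int(x1), int(y1)),
                "reason": reason.strip(),
            })
    return failures

def failures_by_label(failures):
    """Count failures per character label, most frequent first."""
    return Counter(f["label"] for f in failures).most_common()

# The single FAILURE line from the original post, joined onto one line.
sample_log = (
    "APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! "
    "Couldn't find a matching blob\n"
)
failures = parse_failures(sample_log)
for label, count in failures_by_label(failures):
    print(label, count)
```

If a handful of labels account for most of the failures, those are the 
regions worth inspecting on the binarized image. An even spread across many 
symbols matches the "random" case above.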
>
> On Wed, Feb 20, 2013 at 7:53 AM, Dmitri Silaev wrote:
>
>> Getting perfect training logs for an entire set of training images 
>> (especially real-world samples) would definitely be a headache. I suppose 
>> a reasonable number of APPLY_BOXES errors is okay. What counts as 
>> "reasonable" can be judged by the ratio of errors to total boxes and is 
>> ultimately up to you. I personally allow for an error rate of up to 10%.
>>
>> Warm regards, 
>> Dmitri Silaev 
>> www.CustomOCR.com
>>
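The error-to-total ratio can be read straight off the APPLY_BOXES summary. 
Below is a small Python sketch that does so for the summary quoted from the 
original post. The function name apply_boxes_error_rate and the constant 
MAX_ERROR_RATE are hypothetical. The 10% limit is only the personal rule of 
thumb mentioned above, not a tesseract setting.

```python
import re

def apply_boxes_error_rate(summary_text):
    """Return (boxes_read, boxes_failed, failed / read) from an APPLY_BOXES summary."""
    read = re.search(r"Boxes read from boxfile:\s*(\d+)", summary_text)
    failed = re.search(r"Boxes failed resegmentation:\s*(\d+)", summary_text)
    if read is None:
        raise ValueError("no 'Boxes read from boxfile' line found")
    total = int(read.group(1))
    # The clean run in this thread has no 'failed' line, so treat that as zero.
    n_failed = int(failed.group(1)) if failed else 0
    rate = n_failed / total if total else 0.0
    return total, n_failed, rate

MAX_ERROR_RATE = 0.10  # personal rule of thumb from this thread, an assumption

# Summary copied from the original post.
large_run = """APPLY_BOXES:
   Boxes read from boxfile:    4963
   Boxes failed resegmentation:    1157
   Found 3806 good blobs.
"""
total, n_failed, rate = apply_boxes_error_rate(large_run)
print(f"{n_failed}/{total} boxes failed ({rate:.1%})")
print("within threshold" if rate <= MAX_ERROR_RATE else "above threshold")
```

For the large sample in the original post, 1157 of 4963 boxes failed, which 
is roughly 23% and well above a 10% limit.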
>>
>> On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes wrote:
>>
>>> Hello all,
>>>
>>> While generating the TR file for a TIF/BOX pair from a large text, I get 
>>> errors for some characters because no matching blob can be found for 
>>> their boxes.
>>>
>>> The Wiki says the following:
>>>
>>> Don't make the mistake of grouping all the non-letters together. Make 
>>> the text more realistic. For example, *The quick brown fox jumps over 
>>> the lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is 
>>> *The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog & 
>>> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the 
>>> textline finding code a much better chance of getting sensible baseline 
>>> metrics for the special characters. 
>>>
>>> Now, training with a realistic text, I get:
>>>
>>> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE! 
>>> Couldn't find a matching blob
>>> APPLY_BOXES:
>>>    Boxes read from boxfile:    4963
>>>    Boxes failed resegmentation:    1157
>>>    Found 3806 good blobs.
>>>    Leaving 26 unlabelled blobs in 0 words.
>>> TRAINING ... Font name = rageitalic
>>> Generated training data for 550 words
>>>
>>> Redoing that with fewer, properly spaced characters does not yield any 
>>> errors:
>>>
>>> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif 
>>> eng.rageitalic.exp0 nobatch box.train.stderr
>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>> APPLY_BOXES:
>>>    Boxes read from boxfile:      92
>>>    Found 92 good blobs.
>>> TRAINING ... Font name = rageitalic
>>> Generated training data for 8 words
>>>
>>> Is it better to train with a larger text regardless of the errors, or is 
>>> it better to train all the possible characters without errors?
>>>
>>> Judging by the tesseract code, the first step is to identify each 
>>> character individually; the dictionaries then do some filtering.
>>>
>>> So it seems to me that it might not be bad at all to have, say, 100 
>>> possible characters and a perfect TR generation rather than a bigger 
>>> text with failures.
>>>
>>> Any thoughts?
>>>
>>> -- 
>>> -- 
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>  
>>> --- 
>>> You received this message because you are subscribed to the Google 
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send 
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>  
>>>  
>>>
>>
>
>

