Zdenko & Dmitri,
In this case, could you suggest a sample text to train on? And how long
should the text be? Will one page suffice, or do I need more? I ask about
one page because it fits in a single TIF image.
Thanks.
On Wednesday, February 20, 2013 1:26:24 AM UTC-7, zdenop wrote:
>
> If possible, have a look at the regions reported by tesseract
> (e.g. "((503,2112),(509,2121)): FAILURE!") on the binarized image (you can
> use the tesseract config "tessedit_write_images T"). Sometimes you are able
> to identify the problem easily (e.g. there is no space between symbols) -
> see the screenshot in issue 698, comment 16 [1]. Maybe in such cases it
> would make sense to train the combination "rt" (untested ;-) ).
>
> If the error messages are at "random" places (and involve different
> symbols), I would not worry about them.
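
To tell "random" failures apart from failures concentrated on a few symbols, the failure lines could be tallied per symbol. A minimal Python sketch, assuming every failure line has the same shape as the one quoted further down in this thread ("boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!"); the names `FAILURE_RE` and `failures_by_symbol` are made up for illustration and are not part of tesseract:

```python
import re
from collections import Counter

# Pattern modelled on the single failure line quoted in this thread
# (tesseract 3.02.02); other versions may word the message differently.
FAILURE_RE = re.compile(
    r"boxfile line (\d+)/(\S+) "
    r"\(\((\d+),(\d+)\),\((\d+),(\d+)\)\): FAILURE!"
)


def failures_by_symbol(log_text):
    """Count failed boxes per symbol and collect their box coordinates."""
    counts = Counter()
    regions = {}
    for match in FAILURE_RE.finditer(log_text):
        symbol = match.group(2)
        # The four numbers exactly as printed in the log line.
        box = tuple(int(match.group(i)) for i in range(3, 7))
        counts[symbol] += 1
        regions.setdefault(symbol, []).append(box)
    return counts, regions


if __name__ == "__main__":
    # The only failure line actually shown in this thread.
    sample = (
        "APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): "
        "FAILURE! Couldn't find a matching blob"
    )
    counts, regions = failures_by_symbol(sample)
    for symbol, n in counts.most_common():
        print(symbol, n, regions[symbol][:3])
```

If one or two symbols dominate the tally, their coordinates are the regions worth inspecting on the binarized image; an even spread would match the "random places" case.
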
>
> [1] http://code.google.com/p/tesseract-ocr/issues/detail?id=698#c16
>
> Zdenko
>
>
> On Wed, Feb 20, 2013 at 7:53 AM, Dmitri Silaev <[email protected]> wrote:
>
>> Getting perfect training logs for the entire set of training images
>> (especially real-world samples) would definitely be a headache. I suppose a
>> reasonable number of APPLY_BOXES errors is okay. "Reasonable" can be based
>> on the ratio of errors to total boxes and ultimately depends on you. I
>> personally allow for up to a 10% error rate.
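
As a rough check against that 10% figure, the summary printed by APPLY_BOXES can be turned into a ratio. A minimal Python sketch, assuming the summary wording quoted later in this thread (tesseract 3.02.02); `failure_ratio` and `MAX_RATIO` are names made up for illustration, and the threshold is only the personal rule of thumb above, not a tesseract setting:

```python
import re

MAX_RATIO = 0.10  # personal rule of thumb from this thread, not a tesseract setting


def failure_ratio(log_text):
    """Return failed boxes / total boxes from an APPLY_BOXES summary."""
    read = re.search(r"Boxes read from boxfile:\s*(\d+)", log_text)
    failed = re.search(r"Boxes failed resegmentation:\s*(\d+)", log_text)
    if read is None:
        raise ValueError("no 'Boxes read from boxfile' line found")
    total = int(read.group(1))
    # The clean run quoted in this thread prints no failure line at all,
    # so a missing line is treated as zero failures.
    n_failed = int(failed.group(1)) if failed else 0
    return n_failed / total if total else 0.0


if __name__ == "__main__":
    # Summary of the large run quoted in this thread.
    summary = """APPLY_BOXES:
    Boxes read from boxfile: 4963
    Boxes failed resegmentation: 1157
    Found 3806 good blobs."""
    ratio = failure_ratio(summary)
    verdict = "within" if ratio <= MAX_RATIO else "above"
    print(f"{ratio:.1%} of boxes failed, {verdict} the {MAX_RATIO:.0%} threshold")
```

For the large run in this thread that is 1157 failures out of 4963 boxes, roughly 23%, so well above the 10% rule of thumb.
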
>>
>> Warm regards,
>> Dmitri Silaev
>> www.CustomOCR.com
>>
>>
>> On Tue, Feb 19, 2013 at 10:19 PM, Carlos Antunes <[email protected]> wrote:
>>
>>> Hello all,
>>>
>>> While generating the TR file for a TIF/BOX pair from a large text, some
>>> boxes cannot be matched to a blob, so some of the characters produce
>>> errors.
>>>
>>> The Wiki says the following:
>>>
>>> Don't make the mistake of grouping all the non-letters together. Make
>>> the text more realistic. For example, *The quick brown fox jumps over
>>> the lazy dog. 0123456789 !@#$%^&(),.{}<>/?* is terrible. Much better is
>>> *The (quick) brown {fox} jumps! over the $3,456.78 <lazy> #90 dog &
>>> duck/goose, as 12.5% of E-mail from aspammer is spam?* This gives the
>>> textline finding code a much better chance of getting sensible baseline
>>> metrics for the special characters.
>>>
>>> Now, using a realistic text, I get:
>>>
>>> APPLY_BOXES: boxfile line 4962/b ((503,2112),(509,2121)): FAILURE!
>>> Couldn't find a matching blob
>>> APPLY_BOXES:
>>> Boxes read from boxfile: 4963
>>> Boxes failed resegmentation: 1157
>>> Found 3806 good blobs.
>>> Leaving 26 unlabelled blobs in 0 words.
>>> TRAINING ... Font name = rageitalic
>>> Generated training data for 550 words
>>>
>>> Redoing that with fewer, properly spaced characters does not yield any
>>> errors:
>>>
>>> antunes@antunes-Inspiron-N7010:~$ tesseract eng.rageitalic.exp0.tif
>>> eng.rageitalic.exp0 nobatch box.train.stderr
>>> Tesseract Open Source OCR Engine v3.02.02 with Leptonica
>>> APPLY_BOXES:
>>> Boxes read from boxfile: 92
>>> Found 92 good blobs.
>>> TRAINING ... Font name = rageitalic
>>> Generated training data for 8 words
>>>
>>> Is it better to train with a larger text regardless of the errors, or is
>>> it better to train all the possible characters without errors?
>>>
>>> Judging by the tesseract code, the first step is to identify each
>>> character in isolation. The dictionaries then do some filtering.
>>>
>>> But it seems to me that it might not be bad at all to have, say, 100
>>> possible characters and a perfect TR generation rather than a bigger
>>> text with failures.
>>>
>>> Any thoughts?
>>>
>>> --
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To post to this group, send email to [email protected]
>>> To unsubscribe from this group, send email to
>>> [email protected]
>>> For more options, visit this group at
>>> http://groups.google.com/group/tesseract-ocr?hl=en
>>>
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>
>>>
>>>
>>
>
>