Re: [tesseract-ocr] Re: How to achieve very high fine-tuning accuracy on a particular font of english? (requirement: char error rate < 0.1%)

Shree Devi Kumar Thu, 11 Jul 2019 01:34:38 -0700

Search the forum for Cursive

On Thu, 11 Jul 2019, 13:00 sai sumanth Kalluri, <[email protected]>
wrote:


> Thanks for the reply but that link does not lead anywhere. Could you
> please correct it?
>
> On Thursday, 11 July 2019 12:34:38 UTC+5:30, shree wrote:
>>
>> See
>> https://groups.google.com/forum/m/?utm_medium=email&utm_source=footer#!searchin/tesseract-ocr/Cursive/tesseract-ocr/6naBkXZvTlI
>>
>>
>>
>> On Thu, 11 Jul 2019, 11:58 sai sumanth Kalluri, <[email protected]>
>> wrote:
>>
>>> Can somebody please give me some advice regarding this?
>>>
>>> On Tuesday, 9 July 2019 11:52:28 UTC+5:30, sai sumanth Kalluri wrote:
>>>>
>>>> Hi!
>>>>
>>>> I'm trying to teach tesseract to recognize a particularly tricky font
>>>> of the English language (I do not know the name of the font, and no online
>>>> tool could identify it either), and I have a very high accuracy
>>>> requirement. It is completely *okay if my model does not generalize to
>>>> other fonts* and works only on this font. Following are the details
>>>> about what I've done so far.
>>>>
>>>> - I'm using: tesseract 5.0.0-alpha-174-g60b4c
>>>>                 leptonica-1.78.0
>>>>                 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) :
>>>> libpng 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2
>>>> 2.3.0
>>>>                 Found AVX2
>>>>                 Found AVX
>>>>                 Found SSE
>>>> - I have approx. 6000 lines of training data, each line has around
>>>> 12-15 words. I'm guessing around 1 in 50 lines has a mislabelled character
>>>> (how much does that affect the result?).
>>>> - I'm *fine-tuning* the *'eng.traineddata' best* model using this
>>>> data.
>>>> - The training as well as the testing data are properly scanned
>>>> document images in jpg format, so I'm assuming no data preprocessing is
>>>> required.
>>>> - Also, when I apply the final trained model to a document with approx. 50
>>>> lines of text, I believe the error rate is definitely higher than what
>>>> lstmeval is telling me.
>>>> - I have trained tesseract on this data incrementally from 300
>>>> iterations to 6000 iterations and the best I could achieve was *after
>>>> 4200 iterations: Eval Char error rate=0.70714604, Word error rate=1.922281*
>>>> - After that it has more or less saturated, and I even suspect
>>>> overfitting from the kind of errors it's making.
>>>> - I need to achieve *~0.1% char error rate*. What can be my next
>>>> steps? (It is possible for me to create more training data if that's an
>>>> option, but I would prefer something simpler, changing network parameters
>>>> perhaps?)
>>>>
>>>> (NOTE: The font is indeed very tricky, sometimes even for the human eye,
>>>> and I have attached a small sample of it with this post.)
>>>> Thanks in advance!
>>>>
>>>> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very
>>>> frequently mislabelled in the training data, but I really don't care about
>>>> punctuation for my project; I only want accurate detection of the other
>>>> characters. Should I be worrying about this?)
>>>>
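
On the gap between the lstmeval figure and what is seen on the 50-line
document, and on whether the mislabelled full stops and commas matter: one
way to check both is to measure the character error rate directly on
ground-truth/OCR line pairs, once with and once without punctuation. The
following is only a minimal sketch under stated assumptions. It uses plain
Levenshtein distance divided by the ground-truth length, which is not
guaranteed to match lstmeval's own calculation, and the names
`edit_distance` and `char_error_rate` are made up for this illustration.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: insertions, deletions and substitutions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete r from ref
                            curr[j - 1] + 1,      # insert h into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


# Assumption: only '.' and ',' are dropped, as in the question.
PUNCT = str.maketrans("", "", ".,")


def char_error_rate(pairs, ignore_punct=False):
    """Percent CER over (ground_truth, ocr_output) line pairs."""
    errors = total = 0
    for ref, hyp in pairs:
        if ignore_punct:
            ref, hyp = ref.translate(PUNCT), hyp.translate(PUNCT)
        errors += edit_distance(ref, hyp)
        total += len(ref)
    return 100.0 * errors / total if total else 0.0


if __name__ == "__main__":
    # Illustrative lines only, not data from the thread.
    pairs = [("hello world", "hello world"),
             ("tricky font.", "tricky f0nt,")]
    print("CER %:", char_error_rate(pairs))
    print("CER % without . and ,:", char_error_rate(pairs, ignore_punct=True))
```

If the second number is much lower than the first, the punctuation labels
account for a good part of the measured error. As a rough guide for the
label-noise question: assuming about 80 characters per line (the post only
gives 12-15 words), one wrong character in every 50 lines is roughly
1/4000 = 0.025% of characters, which is a noticeable fraction of a 0.1%
target.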
>>> --
>>> You received this message because you are subscribed to the Google
>>> Groups "tesseract-ocr" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to [email protected].
>>> To post to this group, send email to [email protected].
>>> Visit this group at https://groups.google.com/group/tesseract-ocr.
>>> To view this discussion on the web visit
>>> https://groups.google.com/d/msgid/tesseract-ocr/35abc1cd-552b-405c-85be-9e0af720b04d%40googlegroups.com
>>> <https://groups.google.com/d/msgid/tesseract-ocr/35abc1cd-552b-405c-85be-9e0af720b04d%40googlegroups.com?utm_medium=email&utm_source=footer>
>>> .
>>> For more options, visit https://groups.google.com/d/optout.
>>>

