Sorry, I forgot to mention that I'm fine-tuning this with the "best" 
eng.traineddata file as the base model.
On Tuesday, 9 July 2019 11:43:28 UTC+5:30, sai sumanth Kalluri wrote:
>
> Hi!
>
> I'm trying to teach Tesseract to recognize a particularly tricky English 
> font (I do not know the name of the font, and no online tool could 
> identify it either), and I have a very high accuracy requirement. It is 
> completely okay if my model does not generalize to other fonts and works 
> only on this font. Here are the details of what I've done so far.
>
> - I'm using: tesseract 5.0.0-alpha-174-g60b4c
>                 leptonica-1.78.0
>                 libgif 5.1.4 : libjpeg 8d (libjpeg-turbo 1.5.2) : libpng 
> 1.6.34 : libtiff 4.0.9 : zlib 1.2.11 : libwebp 0.6.1 : libopenjp2 2.3.0
>                 Found AVX2
>                 Found AVX
>                 Found SSE
> - I have approx. 6000 lines of training data; each line has around 12-15 
> words. I estimate that around 1 in 50 lines has a mislabelled character 
> (how much does that affect the result?)
> - Also, when I apply the final trained model to a document with approx. 50 
> lines of text, I believe the error rate is definitely higher than what 
> lstmeval is telling me.
> - I have trained Tesseract on this data incrementally from 300 iterations 
> to 6000 iterations, and the best I could achieve was after 4200 iterations: 
> Eval Char error rate=0.70714604, Word error rate=1.922281
> - After that it has more or less saturated, and I even suspect overfitting 
> from the kind of errors it's making.
> - I need to achieve ~0.1 char error rate. What can be my next steps? (It 
> is possible for me to create more training data if that's an option, but I 
> would prefer something simpler, changing network parameters perhaps?)
> (NOTE: The font is indeed very tricky, sometimes even for the human eye, 
> and I have attached a small sample of it to this post.)
> Thanks in advance!
>
> (PROBABLY UNNECESSARY DETAIL: full stops (.) and commas (,) are very 
> frequently mislabelled in the training data, but I really don't care about 
> punctuation for my project; I only want accurate detection of the other 
> characters. Should I be worrying about this?)
>
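
For the gap between the lstmeval figure and what is seen on the 50-line 
document, one option is to measure the character error rate independently 
on that document, with punctuation stripped since it does not matter for 
this project. The following is only a minimal sketch under stated 
assumptions: the function names are hypothetical, the ground truth and OCR 
output are assumed to be available as aligned lists of lines, and the 
result is expressed as a percentage, which may not match exactly how 
lstmeval computes or scales its numbers.

```python
import string

# Translation table that deletes ASCII punctuation (assumption: only ASCII
# punctuation needs to be ignored for this English-only font).
_PUNCT = str.maketrans("", "", string.punctuation)


def strip_punct(text):
    """Remove punctuation and collapse runs of whitespace."""
    return " ".join(text.translate(_PUNCT).split())


def edit_distance(a, b):
    """Levenshtein distance between two strings (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def char_error_rate(truth_lines, ocr_lines, ignore_punct=True):
    """Character error rate in percent over aligned (truth, ocr) line pairs."""
    errors = total = 0
    for truth, ocr in zip(truth_lines, ocr_lines):
        if ignore_punct:
            truth, ocr = strip_punct(truth), strip_punct(ocr)
        errors += edit_distance(truth, ocr)
        total += len(truth)
    return 100.0 * errors / total if total else 0.0
```

Comparing the result with `ignore_punct=True` and `ignore_punct=False` on 
the same document would give a rough idea of how much of the measured error 
comes from the full stops and commas alone.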
