[tesseract-ocr] Random low confidence score, Is fine-tuning a good solution for my use-case ?

Tuan Ardouin Tue, 12 May 2020 11:54:22 -0700

Hello Everyone,

I'm trying to use Tesseract on a legal/accounting document that has a lot of
numbers placed in tables, with the rest of the text in French.

Example of a document :
https://imgur.com/a/hemeVdA

Right now I have some pretty good results, but I'm trying to improve them. I
already removed all the straight lines, which gave me much better results,
but as you can see in the next image some numbers still have a low confidence
score. Yet when I run Tesseract on just that isolated number, the confidence
score is excellent. The same thing happens with words.

[image: Screenshot 2020-05-12 at 20.17.47.png]

My config :
PSM 6
OEM 1
lang fra
model best
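
For readers who want to reproduce this setup, here is a minimal sketch of how the config above might look through the pytesseract wrapper. It assumes pytesseract and Pillow are installed and that "model best" means the tessdata_best files. The helper names and the tessdata path are hypothetical, not part of my actual pipeline.

```python
def build_config(psm=6, oem=1, tessdata_dir=None):
    """Build the Tesseract CLI option string for the settings listed above."""
    parts = [f"--oem {oem}", f"--psm {psm}"]
    if tessdata_dir:
        # Hypothetical path: point this at the folder holding the "best" fra model.
        parts.append(f'--tessdata-dir "{tessdata_dir}"')
    return " ".join(parts)


def word_confidences(image, lang="fra", config=None):
    """Return (word, confidence) pairs for every recognised word on the page."""
    # Imported lazily so build_config can be used without pytesseract installed.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image,
        lang=lang,
        config=config or build_config(),
        output_type=Output.DICT,
    )
    # Rows with conf == -1 are layout entries (blocks, lines), not words.
    return [
        (text, float(conf))
        for text, conf in zip(data["text"], data["conf"])
        if text.strip() and float(conf) >= 0
    ]
```

Calling `word_confidences(Image.open("page.png"))` with a Pillow image (the filename is just a placeholder) should give the per-word scores I am talking about.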

I have some ideas as to why I'm getting this result and how to fix it, but
your input would be greatly appreciated :

*- Fine-tune the model I'm using on the documents I have.*
Right now I don't think that's the best idea because of the results I'm
getting on the isolated images. The model seems to work fine, but something
else that I'm not seeing is giving me those low confidence scores.

*- Use different configs when running Tesseract.*
To be honest, apart from the layout type (PSM) and the engine (OEM) I
haven't tried any other parameters, because I don't really understand them
and there are a lot of them.
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
If you think some would help I can test them right away.
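
One cheap experiment along these lines would be to run the same page under a few page segmentation modes and compare the confidence scores. The sketch below assumes pytesseract. The choice of modes (4, 6, 11), the threshold of 60 and the function names are arbitrary assumptions, and mean confidence is only a rough proxy for accuracy.

```python
from statistics import mean


def summarize(confs, threshold=60.0):
    """Return (mean confidence, number of words below threshold)."""
    # Tesseract reports -1 for non-word rows, so drop those first.
    words = [c for c in confs if c >= 0]
    if not words:
        return 0.0, 0
    return mean(words), sum(1 for c in words if c < threshold)


def compare_psm(image, modes=(4, 6, 11), lang="fra"):
    """Run Tesseract once per page segmentation mode and summarise each run."""
    # Imported lazily so summarize can be used without pytesseract installed.
    import pytesseract
    from pytesseract import Output

    results = {}
    for psm in modes:
        data = pytesseract.image_to_data(
            image,
            lang=lang,
            config=f"--oem 1 --psm {psm}",
            output_type=Output.DICT,
        )
        results[psm] = summarize([float(c) for c in data["conf"]])
    return results
```

The result maps each mode to a (mean, low-confidence count) pair, which makes it easy to see whether the layout analysis, rather than the recognition model, is what changes the scores.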

*- Add a custom dictionary.*
I think this will improve the results for the text, but not for the numbers.

*- Use a custom model just for the numbers.*
I saw this discussion :
https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw
and was thinking of fine-tuning the French model myself to better detect
numbers, but once again the results I'm getting on the isolated images lead
me to think that the problem is elsewhere.

*- Re-run Tesseract on the low confidence zones.*
This is my last idea. Because I've never run Tesseract in a production
environment, I have difficulty seeing how it would affect the speed of the
whole process and what problems it might create later.
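
To make this last idea concrete, here is a rough sketch of what I have in mind, assuming pytesseract and a Pillow image. The threshold, the padding and the function names are assumptions for illustration. PSM 7 (treat the image as a single text line) is my guess at a suitable mode for an isolated number.

```python
def low_conf_boxes(data, threshold=60.0, pad=4):
    """Pick padded (left, top, right, bottom) boxes for low-confidence words.

    `data` is the dict returned by pytesseract.image_to_data(..., Output.DICT).
    """
    boxes = []
    rows = zip(data["text"], data["conf"], data["left"],
               data["top"], data["width"], data["height"])
    for text, conf, x, y, w, h in rows:
        # conf == -1 marks layout rows; skip them and empty strings.
        if text.strip() and 0 <= float(conf) < threshold:
            boxes.append((max(x - pad, 0), max(y - pad, 0),
                          x + w + pad, y + h + pad))
    return boxes


def reread(image, boxes, lang="fra"):
    """Re-run Tesseract on each cropped zone as a single text line."""
    # Imported lazily so low_conf_boxes can be used without pytesseract installed.
    import pytesseract

    out = []
    for left, top, right, bottom in boxes:
        # Clamp to the page so the crop is not padded with black pixels.
        box = (left, top, min(right, image.width), min(bottom, image.height))
        text = pytesseract.image_to_string(
            image.crop(box), lang=lang, config="--oem 1 --psm 7"
        )
        out.append((box, text.strip()))
    return out
```

Since pytesseract launches the tesseract executable for every call, the extra cost should grow with the number of low-confidence zones, which is the speed impact I'm unsure about.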

So my question is :
Do you think one of those paths would be more worthwhile to follow first,
or do you have some other ideas ?

Thank you,
Tuan
