Hello Everyone,

I'm trying to use Tesseract on legal/accounting documents that contain a lot 
of numbers in tables, with the rest of the text in French.

Example document:
https://imgur.com/a/hemeVdA

Right now I get pretty good results, but I'm trying to improve them. I 
already removed all the straight lines, which gave me much better results, 
but as you can see in the image below, some numbers still have a low 
confidence score. However, when I run Tesseract on just that number in 
isolation, the confidence score is excellent. The same thing happens with 
words.


[image: Screenshot 2020-05-12 at 20.17.47.png]

My config:
PSM    6
OEM    1
lang   fra
model  best
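
For anyone who wants to reproduce this, the settings above would map onto a 
pytesseract call roughly as follows. This is only a sketch: it assumes the 
pytesseract wrapper and Pillow are installed, and the helper names 
(build_config, ocr_with_confidence) and the tessdata path in the usage note 
are hypothetical, not part of my actual setup.

```python
import shlex


def build_config(psm=6, oem=1, tessdata_dir=None):
    """Build the Tesseract option string for the settings listed above."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if tessdata_dir is not None:
        # Assumption: the "best" models live in a separate tessdata directory.
        parts.append(f"--tessdata-dir {shlex.quote(tessdata_dir)}")
    return " ".join(parts)


def ocr_with_confidence(image_path, lang="fra", config=None):
    """Return per-word text, boxes and confidences as a dict of lists."""
    # Imported lazily so build_config() can be used without these packages.
    import pytesseract
    from PIL import Image

    return pytesseract.image_to_data(
        Image.open(image_path),
        lang=lang,
        config=config or build_config(),
        output_type=pytesseract.Output.DICT,
    )
```

Calling build_config(tessdata_dir="/opt/tessdata_best") would add a 
--tessdata-dir option pointing at that (made-up) directory.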

I have some ideas about why I'm getting this result and how to fix it, but 
your input would be greatly appreciated:

*- Fine-tune the model I'm using on the documents I have.*
Right now I don't think that's the best idea, given the results I'm 
getting on the isolated images. The model itself seems to work fine, but 
something else I'm not seeing is causing those low confidence scores.


*- Use different configs when running Tesseract.*
To be honest, apart from the page segmentation mode and the engine mode, I 
haven't tried any other parameters, because I don't really understand them 
and there are a lot of them:
http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version
If you think some of them would help, I can test them right away.
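
A minimal sketch of how such a test could be run, assuming pytesseract and 
Pillow are available. It tries a few page segmentation modes on the same 
page and compares the average word confidence. The function names and the 
choice of modes (4, 6, 11) are only illustrative, and mean confidence is a 
rough proxy for accuracy, not a measurement of it.

```python
def mean_word_confidence(data):
    """Average confidence over recognized words in an image_to_data dict."""
    confs = [
        float(conf)
        for conf, text in zip(data["conf"], data["text"])
        # conf == -1 marks non-word rows; blank text carries no reading.
        if float(conf) >= 0 and text.strip()
    ]
    return sum(confs) / len(confs) if confs else 0.0


def compare_psm(image_path, psm_values=(4, 6, 11), lang="fra"):
    """Run Tesseract once per PSM value and return {psm: mean confidence}."""
    # Imported lazily so mean_word_confidence() works without these packages.
    import pytesseract
    from PIL import Image

    image = Image.open(image_path)
    scores = {}
    for psm in psm_values:
        data = pytesseract.image_to_data(
            image,
            lang=lang,
            config=f"--psm {psm} --oem 1",
            output_type=pytesseract.Output.DICT,
        )
        scores[psm] = mean_word_confidence(data)
    return scores
```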


*- Add a custom dictionary.*
I think this will improve the results for the text, but not for the numbers.

*- Use a custom model just for the numbers.*
I saw this discussion: 
https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw
and was thinking of fine-tuning the French model myself so that it detects 
numbers better, but once again the results I'm getting on the isolated 
images lead me to think that the problem is elsewhere.

*- Re-run Tesseract on the low-confidence zones.*
This is my last idea. Because I've never run Tesseract in a production 
environment, I have trouble seeing how this would affect the speed of the 
whole process and what problems it could create later on.
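
To make this idea concrete, here is a rough sketch of what a second pass 
could look like, assuming pytesseract and a Pillow image. It takes the 
word boxes from the first pass, crops every word below a confidence 
threshold (with some padding), and runs Tesseract again on each crop in 
single-word mode (PSM 8). The threshold of 60, the 10-pixel padding and the 
function names are placeholders I made up, not tuned values.

```python
def low_confidence_boxes(data, image_size, threshold=60.0, pad=10):
    """Return (index, (left, top, right, bottom)) for each low-confidence word.

    `data` is a dict shaped like pytesseract's image_to_data(..., Output.DICT).
    """
    img_w, img_h = image_size
    boxes = []
    for i, raw_conf in enumerate(data["conf"]):
        conf = float(raw_conf)  # conf may arrive as str, int or float
        # conf == -1 marks non-word rows (blocks, paragraphs, lines).
        if conf < 0 or conf >= threshold or not data["text"][i].strip():
            continue
        left = max(0, data["left"][i] - pad)
        top = max(0, data["top"][i] - pad)
        right = min(img_w, data["left"][i] + data["width"][i] + pad)
        bottom = min(img_h, data["top"][i] + data["height"][i] + pad)
        boxes.append((i, (left, top, right, bottom)))
    return boxes


def reocr_low_confidence(image, data, lang="fra", threshold=60.0):
    """Re-run Tesseract on each low-confidence crop; keep better readings."""
    import pytesseract  # lazy import: the box logic above has no dependency

    improved = {}
    for i, box in low_confidence_boxes(data, image.size, threshold):
        sub = pytesseract.image_to_data(
            image.crop(box),
            lang=lang,
            config="--psm 8 --oem 1",  # PSM 8: treat the crop as one word
            output_type=pytesseract.Output.DICT,
        )
        words = [
            (text, float(conf))
            for text, conf in zip(sub["text"], sub["conf"])
            if text.strip() and float(conf) >= 0
        ]
        if not words:
            continue
        new_text = " ".join(text for text, _ in words)
        new_conf = min(conf for _, conf in words)
        # Only replace the first-pass reading if the second pass is more sure.
        if new_conf > float(data["conf"][i]):
            improved[i] = (new_text, new_conf)
    return improved
```

On the speed question: pytesseract starts a separate Tesseract process for 
every call, so the extra cost of this approach grows with the number of 
low-confidence words per page.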


So my question is: 
Do you think one of these paths is more worth following first, or do you 
have other ideas?

Thank you,
Tuan
