Hello everyone, I'm trying to use Tesseract on a legal/accounting document that has a lot of numbers placed in tables, with the rest of the text in French.
Example of a document: https://imgur.com/a/hemeVdA

Right now I have pretty good results, but I'm trying to improve them. I already removed all the straight lines, which gave me much better results, but as you can see in the next image, some numbers have a low confidence score. However, when I run Tesseract on just that isolated number, the confidence score is excellent. The same thing happens with words.

[image: Screenshot 2020-05-12 at 20.17.47.png]

My config: PSM 6, OEM 1, lang fra, model best.

I have some ideas as to why I'm getting this result and how to fix it, but your input would be greatly appreciated:

*- Fine-tune the model I'm using on the documents I have.* Right now I don't think that's the best idea, because of the results I'm getting on the isolated images. The model itself seems to work fine, so something else that I'm not seeing must be causing those low confidence scores.

*- Use different configs when running Tesseract.* To be honest, apart from the layout type (PSM) and the engine (OEM) I haven't tried any other parameters, because I don't really understand them and there are a lot of them: http://www.sk-spell.sk.cx/tesseract-ocr-parameters-in-302-version If you think some of them would help, I can test them right away.

*- Add a custom dictionary.* I think this would improve the results for the text, but not for the numbers.

*- Use a custom model just for the numbers.* I saw this discussion: https://groups.google.com/forum/#!topic/tesseract-ocr/-oeCTcojYfw and was thinking of fine-tuning the French model myself to better detect numbers, but once again the results I'm getting on the isolated images lead me to think that the problem is elsewhere.

*- Run Tesseract again on the low-confidence zones.* This is my last idea. Because I've never run Tesseract in a production environment, I have trouble seeing how it would impact the speed of the whole process and what future problems it could create.
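To make the last idea concrete, here is a rough sketch of what I have in mind, assuming the pytesseract wrapper and a PIL image. The function names, the threshold of 60, the 10-pixel padding and the choice of PSM 7 (single text line) for the second pass are all my own assumptions and untested on my documents; the "best" model would also need to be selected separately (for example through the tessdata directory), which is not shown here.

```python
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in pixels


def low_confidence_boxes(data: Dict[str, list], threshold: float = 60.0,
                         pad: int = 10,
                         image_size: Optional[Tuple[int, int]] = None
                         ) -> List[Tuple[int, Box]]:
    """Return (row index, padded box) for each word below the threshold.

    `data` is the dict returned by pytesseract.image_to_data(...,
    output_type=Output.DICT). Threshold and padding are assumed values.
    """
    boxes = []
    for i, text in enumerate(data["text"]):
        conf = float(data["conf"][i])
        # conf == -1 marks structural rows (block, paragraph, line), not words
        if conf < 0 or not str(text).strip() or conf >= threshold:
            continue
        left = max(0, data["left"][i] - pad)
        top = max(0, data["top"][i] - pad)
        right = data["left"][i] + data["width"][i] + pad
        bottom = data["top"][i] + data["height"][i] + pad
        if image_size is not None:
            # keep the padded box inside the page
            right = min(image_size[0], right)
            bottom = min(image_size[1], bottom)
        boxes.append((i, (left, top, right, bottom)))
    return boxes


def reocr_low_confidence(image, lang: str = "fra",
                         threshold: float = 60.0, pad: int = 10):
    """First pass on the full page, second pass on each weak word."""
    # Imported here so the helper above can be used without pytesseract.
    import pytesseract
    from pytesseract import Output

    data = pytesseract.image_to_data(
        image, lang=lang, config="--psm 6 --oem 1", output_type=Output.DICT)

    for i, box in low_confidence_boxes(data, threshold, pad, image.size):
        retry = pytesseract.image_to_data(
            image.crop(box), lang=lang, config="--psm 7 --oem 1",
            output_type=Output.DICT)
        words = [(float(c), str(t)) for c, t in zip(retry["conf"], retry["text"])
                 if float(c) >= 0 and str(t).strip()]
        if not words:
            continue
        retry_conf = min(c for c, _ in words)
        # Only overwrite the first-pass result when the second pass is better.
        if retry_conf > float(data["conf"][i]):
            data["text"][i] = " ".join(t for _, t in words)
            data["conf"][i] = retry_conf
    return data
```

As far as I understand, pytesseract starts a new Tesseract process for every call, so the cost of the second pass should grow with the number of low-confidence words per page, which is exactly the speed impact I am unsure about.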
So my question is: do you think one of those paths would be more worthwhile to follow first, or do you have any other ideas?

Thank you,
Tuan

