Re: Long time with OCR

Nick Burch Tue, 20 Feb 2018 05:00:06 -0800

On Mon, 19 Feb 2018, Mark Kerzner wrote:

Is that a good approach? Is the 10 seconds time normal? I am using the latest most powerful Mac and I get similar results on an i7 processor in Ubuntu.

Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease of contributions and ease of implementing new approaches, rather than for performance, because as an (ex?-)academic project that's more what they think's important.

There's some advice on the Tesseract GitHub issues + wiki on ways to speed it up, eg https://github.com/tesseract-ocr/tesseract/issues/263 and
https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
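
To check whether the 10 seconds per page comes from Tesseract itself rather than from Tika, one option is to time the tesseract command-line tool directly on a single page image. What follows is only a rough Python sketch, not something taken from the issues above: the helper names (build_command, ocr_env, time_ocr) are made up for illustration, and it assumes the tesseract binary is on your PATH. Capping OpenMP threads with OMP_THREAD_LIMIT is one thing that may be worth trying with Tesseract 4; whether it helps will depend on your machine and workload.

```python
import os
import subprocess
import time


def build_command(image_path, lang="eng", psm=None):
    """Build a tesseract CLI call that writes the recognised text to stdout.

    Hypothetical helper for illustration. "--psm" is the page segmentation
    mode option in recent Tesseract releases; it is only added when requested.
    """
    cmd = ["tesseract", image_path, "stdout", "-l", lang]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def ocr_env(thread_limit=None):
    """Copy the current environment, optionally capping OpenMP threads.

    Assumption: the installed tesseract was built with OpenMP, so the
    standard OMP_THREAD_LIMIT variable has an effect.
    """
    env = dict(os.environ)
    if thread_limit is not None:
        env["OMP_THREAD_LIMIT"] = str(thread_limit)
    return env


def time_ocr(image_path, thread_limit=None, **kwargs):
    """Run tesseract once on image_path and return wall-clock seconds."""
    start = time.perf_counter()
    subprocess.run(
        build_command(image_path, **kwargs),
        env=ocr_env(thread_limit),
        check=True,                    # raise if tesseract exits with an error
        stdout=subprocess.DEVNULL,     # only the timing matters here
        stderr=subprocess.DEVNULL,
    )
    return time.perf_counter() - start
```

For example, comparing time_ocr("page.png") with time_ocr("page.png", thread_limit=1) on one of your own pages (page.png is a placeholder name) would show whether the thread setting makes any difference before you change anything in Tika.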

Otherwise you'd need to switch to a proprietary OCR tool. I understand that the Google Cloud OCR is pretty good, if you don't mind pushing all your files up to Google and paying per file.


Nick
