On Mon, 19 Feb 2018, Mark Kerzner wrote:
> Is that a good approach? Is the 10 seconds time normal? I am using the
> latest most powerful Mac and I get similar results on an i7 processor in
> Ubuntu.
Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
ease of contribution and ease of implementing new approaches, rather than
for performance, because as an (ex?-)academic project that's more what
its developers consider important.
There's some advice in the Tesseract GitHub issues and wiki on ways to
speed it up, e.g. https://github.com/tesseract-ocr/tesseract/issues/263 and
https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
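Before tuning anything, it may be worth timing the tesseract command on
its own, to see how much of the 10 seconds is Tesseract and how much is
Tika overhead. A rough sketch in Python, not a definitive recipe: the
helper names (build_tesseract_cmd, time_call, ocr_once) are made up for
illustration, it assumes a Tesseract 4.x "tesseract" binary on the PATH
(3.x spells the page segmentation flag "-psm"), and it assumes your build
uses OpenMP and honours the OMP_THREAD_LIMIT environment variable.

```python
import os
import subprocess
import sys
import time


def build_tesseract_cmd(image_path, output_base="stdout", lang="eng", psm=None):
    """Build the argument list for one `tesseract` CLI call.

    An output base of "stdout" asks tesseract to print the text
    instead of writing <output_base>.txt.
    """
    cmd = ["tesseract", image_path, output_base, "-l", lang]
    if psm is not None:
        # Page segmentation mode; Tesseract 4.x syntax (assumption).
        cmd += ["--psm", str(psm)]
    return cmd


def time_call(func, *args, **kwargs):
    """Call func and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    return result, time.perf_counter() - start


def ocr_once(image_path, thread_limit=None, **cmd_options):
    """Run tesseract on one image and return the recognised text."""
    env = dict(os.environ)
    if thread_limit is not None:
        # Assumption: an OpenMP-enabled build that reads OMP_THREAD_LIMIT.
        env["OMP_THREAD_LIMIT"] = str(thread_limit)
    completed = subprocess.run(
        build_tesseract_cmd(image_path, **cmd_options),
        env=env,
        capture_output=True,
        text=True,
        check=True,  # raise if tesseract exits with an error
    )
    return completed.stdout


if __name__ == "__main__" and len(sys.argv) > 1:
    text, seconds = time_call(ocr_once, sys.argv[1])
    print(f"{len(text)} characters in {seconds:.2f}s")
```

Run it as "python time_ocr.py page.png" (the script name is arbitrary),
then compare against ocr_once(path, thread_limit=1) or a different psm
value to see whether either changes the timing on your files.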
Otherwise you'd need to switch to a proprietary OCR tool. I understand
that Google Cloud OCR is pretty good, if you don't mind pushing all
your files up to Google and paying per file.
Nick