Updated the wiki page with this info, thanks Nick!
From: Mark Kerzner <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 20, 2018 at 6:36 AM
To: Tika User <[email protected]>
Subject: Re: Long time with OCR

Hi, Nick,

Thank you very much.

Mark

Mark Kerzner, SHMsoft
Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <[email protected]> wrote:

On Mon, 19 Feb 2018, Mark Kerzner wrote:

Is that a good approach? Is the 10 seconds time normal? I am using the latest most powerful Mac and I get similar results on an i7 processor in Ubuntu.

Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease of contributions and ease of implementing new approaches, rather than for performance, because as an (ex?-)academic project, that's more what they think is important.

There's some advice on the Tesseract GitHub issues + wiki on ways to speed it up, e.g.
https://github.com/tesseract-ocr/tesseract/issues/263
and
https://github.com/tesseract-ocr/tesseract/issues/1171
and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance

Otherwise you'd need to switch to a proprietary OCR tool. I understand that the Google Cloud OCR is pretty good, if you don't mind pushing all your files up to Google and paying per file.

Nick
