Hi, Nick,

Thank you very much.

Mark

Mark Kerzner, SHMsoft <http://shmsoft.com/>
Book a call with me here <http://www.meetme.so/markkerzner>
Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <[email protected]> wrote:

> On Mon, 19 Feb 2018, Mark Kerzner wrote:
>
>> Is that a good approach? Is the 10 seconds time normal? I am using the
>> latest most powerful Mac and I get similar results on an i7 processor in
>> Ubuntu.
>
> Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
> ease of contributions and ease of implementing new approaches, rather than
> for performance, because as an (ex?-) academic project that's more what
> they think's important
>
> There's some advice on the Tesseract GitHub issues + wiki on ways to speed
> it up, eg https://github.com/tesseract-ocr/tesseract/issues/263 and
> https://github.com/tesseract-ocr/tesseract/issues/1171 and
> https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
>
> Otherwise you'd need to switch to a proprietary OCR tool. I understand
> that the Google Cloud OCR is pretty good, if you don't mind pushing all
> your files up to Google and paying per file
>
> Nick
