Thanks again

Mark Kerzner, SHMsoft <http://shmsoft.com/>,
Book a call with me here <http://www.meetme.so/markkerzner>
Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 1:24 PM, Allison, Timothy B. <[email protected]> wrote:

> > These pages are hard because they have different fonts and maybe other
> > complications.
>
> +1 … As a side note, a colleague and I did an image degradation study,
> and we noticed that tesseract took far longer on the degraded images than
> on the originals. Your intuition is correct. This won’t help improve
> your speed, but I thought I’d share.
>
> *From:* Chris Mattmann [mailto:[email protected]]
> *Sent:* Tuesday, February 20, 2018 12:31 PM
> *To:* [email protected]
> *Subject:* Re: Long time with OCR
>
> Updated the wiki page with this info, thanks Nick!
>
> *From:* Mark Kerzner <[email protected]>
> *Reply-To:* "[email protected]" <[email protected]>
> *Date:* Tuesday, February 20, 2018 at 6:36 AM
> *To:* Tika User <[email protected]>
> *Subject:* Re: Long time with OCR
>
> Hi, Nick,
>
> Thank you very much.
>
> Mark
>
> Mark Kerzner, SHMsoft <http://shmsoft.com/>,
> Book a call with me here <http://www.meetme.so/markkerzner>
> Mobile: 713-724-2534
> Skype: mark.kerzner1
>
> On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <[email protected]> wrote:
>
> On Mon, 19 Feb 2018, Mark Kerzner wrote:
>
> Is that a good approach? Is the 10 seconds time normal? I am using the
> latest most powerful Mac and I get similar results on an i7 processor in
> Ubuntu.
>
> Tika uses the open source Tesseract OCR engine.
> Tesseract is optimised for ease of contributions and ease of implementing
> new approaches, rather than for performance, because as an (ex?-) academic
> project that's more what they think's important.
>
> There's some advice on the Tesseract GitHub issues + wiki on ways to speed
> it up, e.g. https://github.com/tesseract-ocr/tesseract/issues/263 and
> https://github.com/tesseract-ocr/tesseract/issues/1171 and
> https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
>
> Otherwise you'd need to switch to a proprietary OCR tool. I understand
> that the Google Cloud OCR is pretty good, if you don't mind pushing all
> your files up to Google and paying per file.
>
> Nick
