> These pages are hard because they have different fonts and maybe other 
> complications.

+1 … As a side note, a colleague and I did an image degradation study, and we 
noticed that Tesseract took far longer on the degraded images than on the 
originals.  Your intuition is correct.  This won’t help improve your speed, but 
I thought I’d share.

From: Chris Mattmann [mailto:[email protected]]
Sent: Tuesday, February 20, 2018 12:31 PM
To: [email protected]
Subject: Re: Long time with OCR

Updated the wiki page with this info, thanks Nick!



From: Mark Kerzner <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, February 20, 2018 at 6:36 AM
To: Tika User <[email protected]>
Subject: Re: Long time with OCR

Hi, Nick,

Thank you very much.

Mark

Mark Kerzner, SHMsoft<http://shmsoft.com/>,
Book a call with me here<http://www.meetme.so/markkerzner>

Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <[email protected]> wrote:
On Mon, 19 Feb 2018, Mark Kerzner wrote:
> Is that a good approach? Is the 10 seconds time normal? I am using the latest 
> most powerful Mac and I get similar results on an i7 processor in Ubuntu.

Tika uses the open source Tesseract OCR engine. Tesseract is optimised for ease 
of contributions and ease of implementing new approaches, rather than for 
performance, because as an (ex?-) academic project that's more what they 
think is important.

There's some advice on the Tesseract GitHub issues + wiki on ways to speed it 
up, e.g. https://github.com/tesseract-ocr/tesseract/issues/263 and
https://github.com/tesseract-ocr/tesseract/issues/1171 and
https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
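
To check whether any of that advice helps on a given set of pages, one option 
is to time Tesseract directly, outside Tika. The following is only a rough 
sketch, not something from this thread: it assumes a Tesseract 4.x binary named 
"tesseract" on the PATH, and that limiting OpenMP threads through the 
OMP_THREAD_LIMIT environment variable is worth trying on your build. The 
function names, the default language and the page segmentation mode are 
illustrative choices.

```python
import os
import subprocess
import time


def build_tesseract_command(image_path, lang="eng", psm=3):
    """Build a Tesseract CLI invocation that prints recognised text to stdout.

    Assumes Tesseract 4.x option spelling (--psm); the defaults are
    illustrative, not recommendations.
    """
    # Using "stdout" as the output base makes Tesseract print the text
    # instead of writing <outputbase>.txt.
    return ["tesseract", image_path, "stdout", "-l", lang, "--psm", str(psm)]


def time_ocr(image_path, thread_limit=1, **kwargs):
    """Run Tesseract once and return (text, elapsed_seconds).

    thread_limit is passed as OMP_THREAD_LIMIT; whether this changes the
    timing depends on how the local Tesseract was built (assumption).
    """
    env = dict(os.environ)
    env["OMP_THREAD_LIMIT"] = str(thread_limit)
    start = time.perf_counter()
    result = subprocess.run(
        build_tesseract_command(image_path, **kwargs),
        env=env,
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout, time.perf_counter() - start
```

Calling time_ocr("page.png") and then time_ocr("page.png", thread_limit=4) on 
the same (hypothetical) page image gives a first impression of whether the 
thread setting matters for your documents.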

Otherwise you'd need to switch to a proprietary OCR tool. I understand that the 
Google Cloud OCR is pretty good, if you don't mind pushing all your files up to 
Google and paying per file.

Nick
