Re: Long time with OCR

Mark Kerzner Tue, 20 Feb 2018 11:30:53 -0800

Thanks again

Mark Kerzner, SHMsoft <http://shmsoft.com/>,
Book a call with me here <http://www.meetme.so/markkerzner>


Mobile: 713-724-2534
Skype: mark.kerzner1

On Tue, Feb 20, 2018 at 1:24 PM, Allison, Timothy B. <[email protected]>
wrote:

> > These pages are hard because they have different fonts and maybe other
> complications.
>
>
>
> +1 … As a side note, a colleague and I did an image degradation study,
> and we noticed that tesseract took far longer on the degraded images than
> on the originals.  Your intuition is correct.  This won’t help improve
> your speed, but I thought I’d share.
>
>
>
> *From:* Chris Mattmann [mailto:[email protected]]
> *Sent:* Tuesday, February 20, 2018 12:31 PM
> *To:* [email protected]
> *Subject:* Re: Long time with OCR
>
>
>
> Updated the wiki page with this info, thanks Nick!
>
>
>
>
>
>
>
> *From: *Mark Kerzner <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Tuesday, February 20, 2018 at 6:36 AM
> *To: *Tika User <[email protected]>
> *Subject: *Re: Long time with OCR
>
>
>
> Hi, Nick,
>
>
>
> Thank you very much.
>
>
>
> Mark
>
>
> Mark Kerzner, SHMsoft <http://shmsoft.com/>,
>
> Book a call with me here <http://www.meetme.so/markkerzner>
>
>
> Mobile: 713-724-2534
> Skype: mark.kerzner1
>
>
>
> On Tue, Feb 20, 2018 at 6:59 AM, Nick Burch <[email protected]> wrote:
>
> On Mon, 19 Feb 2018, Mark Kerzner wrote:
>
> Is that a good approach? Is the 10 seconds time normal? I am using the
> latest most powerful Mac and I get similar results on an i7 processor in
> Ubuntu.
>
>
> Tika uses the open source Tesseract OCR engine. Tesseract is optimised for
> ease of contribution and ease of implementing new approaches, rather than
> for performance, because as an (ex-?) academic project that's what its
> developers consider more important.
>
> There's some advice in the Tesseract GitHub issues and wiki on ways to
> speed it up, e.g.
> https://github.com/tesseract-ocr/tesseract/issues/263 and
> https://github.com/tesseract-ocr/tesseract/issues/1171 and
> https://github.com/tesseract-ocr/tesseract/wiki/4.0-Accuracy-and-Performance
>
> Otherwise you'd need to switch to a proprietary OCR tool. I understand
> that the Google Cloud OCR is pretty good, if you don't mind pushing all
> your files up to Google and paying per file.
>
> Nick
>
>
>
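
For anyone who wants to check whether a per-page time like the 10 seconds
mentioned above is typical on their own hardware, here is a minimal sketch
that times the Tesseract command line directly, outside Tika. It assumes a
`tesseract` binary is on the PATH and uses the `tesseract <image> stdout`
form of the command. The function names (`time_command`,
`tesseract_command`, `report`) are made up for this illustration and are not
part of Tika or Tesseract.

```python
import subprocess
import sys
import time


def time_command(cmd):
    """Run cmd (a list of arguments) and return (elapsed_seconds, stdout_text)."""
    start = time.perf_counter()
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return time.perf_counter() - start, result.stdout


def tesseract_command(image_path, binary="tesseract"):
    """Build a Tesseract CLI call that writes the recognised text to stdout.

    Assumption: `binary` names a Tesseract executable reachable on the PATH.
    """
    return [binary, image_path, "stdout"]


def report(image_paths):
    """Print the wall-clock OCR time and output size for each image."""
    for path in image_paths:
        seconds, text = time_command(tesseract_command(path))
        print(f"{path}: {seconds:.2f}s, {len(text)} characters")
```

Calling `report(["page1.png", "page2.png"])` on a clean scan and on a
degraded copy of the same page is one way to see the slowdown on harder
images described earlier in the thread. The file names here are
placeholders.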
