Re: want to disable tesseract ocr parser

Sergey Tsalkov Thu, 20 Aug 2015 22:52:35 -0700

It's great functionality to have, Chris, and tesseract is certainly my
choice for OCR, too! I'm certainly not suggesting that it be removed
-- maybe just that the user be made aware of it more deliberately,
with the official documentation mentioning that this happens and how
to disable it. In my case, it triggered on an image embedded within an
office doc, so it caught me by surprise more so than if I'd been
throwing jpegs at Tika directly.


But then again, maybe my use case is the oddball here -- most people
aren't cranking servers around the clock parsing countless millions of
documents, and therefore wouldn't notice some increase in CPU use!
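
For anyone who lands on this thread later: the missing opening tag in the
first config suggestion further down is the kind of slip a quick
well-formedness check catches before Tika ever reads the file. Here is a
minimal sketch that uses only the JDK's built-in XML parser and no Tika
classes. The class name ConfigCheck and the helper isWellFormed are made up
for illustration, and the check only covers XML well-formedness, not whether
Tika would accept the config.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, not part of Tika.
class ConfigCheck {

    // The config as first suggested: <parser> is closed but never opened.
    static final String BROKEN =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // The corrected config from this thread.
    static final String FIXED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // Returns true if the text parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("broken config well-formed: " + isWellFormed(BROKEN));
        System.out.println("fixed config well-formed:  " + isWellFormed(FIXED));
    }
}
```

Once the file passes a check like this, it still has to be handed to Tika;
if I remember right, tika-app takes it through its --config= option.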



On Thu, Aug 20, 2015 at 10:41 PM, Chris Mattmann <[email protected]> wrote:
> Thanks Sergey. It’s certainly something that adds overhead;
> I’ve seen it too. But with all the capability that tesseract
> adds (the OCR), it’s something that we’re willing to trade, since
> we can disable it pretty easily via configuration, etc.
>
> Speaking from a biased perspective of helping to implement it ;)
>
> Cheers,
> Chris
>
> —
> Chris Mattmann
> [email protected]
>
>
> -----Original Message-----
> From: Sergey Tsalkov <[email protected]>
> Reply-To: <[email protected]>
> Date: Thursday, August 20, 2015 at 10:40 PM
> To: <[email protected]>
> Subject: Re: want to disable tesseract ocr parser
>
>>I'll mention my situation again on the wiki, but if any Tika
>>developers are reading this, I'd consider this a bug report! I've been
>>using Tika for quite a while. I use very expensive hardware to churn
>>through tens of millions of documents very rapidly, pulling out
>>plaintext and metadata. Tika has generally performed extremely well
>>under this stress -- never a crash or screwup!
>>
>>But then one day it got about 50% slower, and I couldn't figure out
>>why for a while. I just happened to run 'ps xf' and noticed that Tika
>>was spawning all these tesseract processes. Turned out that I'd never
>>had tesseract installed before. I had installed it just recently for a
>>separate project, and Tika's behavior silently changed because of
>>that.
>>
>>Not sure if that fits your definition of a bug, but it's certainly
>>unexpected behavior as far as I'm concerned!
>>
>>Thanks again everyone!
>>
>>On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <[email protected]> wrote:
>>> Happy to do that, Chris! I've created my account; username is
>>> SergeyTsalkov.
>>>
>>> On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
>>> <[email protected]> wrote:
>>>> Thanks Sergey!
>>>>
>>>> Please feel free to add a page on the wiki:
>>>>
>>>> http://wiki.apache.org/tika/
>>>>
>>>> Describing your use case. I would appreciate it!
>>>> If you remember to sign up, tell me your username, or tell anyone
>>>> on this list (dev@tika), we’ll get you permissions and you can
>>>> create the page.
>>>>
>>>> Cheers,
>>>> Chris
>>>>
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Chris Mattmann, Ph.D.
>>>> Chief Architect
>>>> Instrument Software and Science Data Systems Section (398)
>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>> Office: 168-519, Mailstop: 168-527
>>>> Email: [email protected]
>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> Adjunct Associate Professor, Computer Science Department
>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>
>>>> -----Original Message-----
>>>> From: Sergey Tsalkov <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Thursday, August 20, 2015 at 10:22 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: want to disable tesseract ocr parser
>>>>
>>>>>Thanks guys! Nick, your config file was exactly what I was looking
>>>>>for, though it took a minor tweak because you forgot to open the
>>>>>parser tag. I'm posting the corrected config below for anyone who
>>>>>refers to this thread in the future:
>>>>>
>>>>><?xml version="1.0" encoding="UTF-8"?>
>>>>><properties>
>>>>>  <parsers>
>>>>>    <parser class="org.apache.tika.parser.DefaultParser">
>>>>>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>    </parser>
>>>>>  </parsers>
>>>>></properties>
>>>>>
>>>>>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>>>>>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>>>>>
>>>>>>> Then I thought I could pass a custom config.xml to disable it, but I
>>>>>>> can't figure out how to write the config file.
>>>>>>
>>>>>>
>>>>>> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers
>>>>>> for details of the parser configuration
>>>>>>
>>>>>> You should be fine with a config file like:
>>>>>>
>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>> <properties>
>>>>>>   <parsers>
>>>>>>     <!-- Default Parser except no OCR -->
>>>>>>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>>     </parser>
>>>>>>   </parsers>
>>>>>> </properties>
>>>>>>
>>>>>> Thanks
>>>>>> Nick
>>>>
>
>
