Re: want to disable tesseract ocr parser

Chris Mattmann Thu, 20 Aug 2015 22:58:39 -0700

You are totally right and your use case is as
valid as any other one! :-)

Thanks for clarifying, and yes, we can and should make this
clearer in the documentation.


Cheers,
Chris

—
Chris Mattmann
[email protected]






-----Original Message-----
From: Sergey Tsalkov <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, August 20, 2015 at 10:51 PM
To: <[email protected]>
Subject: Re: want to disable tesseract ocr parser

>It's great functionality to have, Chris, and tesseract is certainly my
>choice for OCR, too! I'm certainly not suggesting that it be removed
>-- maybe just that the user be made aware of it more deliberately,
>with the official documentation mentioning that this happens and how
>to disable it. In my case, it triggered on an image embedded within an
>office doc, so it caught me by surprise more so than if I'd been
>throwing jpegs at Tika directly.
>
>But then again, maybe my use case is the oddball here -- most people
>aren't cranking servers around the clock parsing countless millions of
>documents, and therefore wouldn't notice some increase in CPU use!
>
>
>
>On Thu, Aug 20, 2015 at 10:41 PM, Chris Mattmann <[email protected]>
>wrote:
>> Thanks Sergey. It’s certainly something that adds overhead;
>> I’ve seen it too. But with all the capability that tesseract
>> adds (the OCR), it’s a trade-off we’re willing to make, since
>> we can disable it pretty easily via configuration, etc.
>>
>> Speaking from the biased perspective of someone who helped implement it ;)
>>
>> Cheers,
>> Chris
>>
>> —
>> Chris Mattmann
>> [email protected]
>>
>> -----Original Message-----
>> From: Sergey Tsalkov <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Thursday, August 20, 2015 at 10:40 PM
>> To: <[email protected]>
>> Subject: Re: want to disable tesseract ocr parser
>>
>>>I'll mention my situation again on the wiki, but if any Tika
>>>developers are reading this, I'd consider this a bug report! I've been
>>>using Tika for quite a while. I use very expensive hardware to churn
>>>through tens of millions of documents very rapidly, pulling out
>>>plaintext and metadata. Tika has generally performed extremely well
>>>under this stress -- never a crash or screwup!
>>>
>>>But then one day it got about 50% slower, and I couldn't figure out
>>>why for a while. I just happened to run 'ps xf' and noticed that Tika
>>>was spawning all these tesseract processes. Turned out that I'd never
>>>had tesseract installed before. I had installed it just recently for a
>>>separate project, and Tika's behavior silently changed because of
>>>that.
>>>
>>>Not sure if that fits your definition of a bug, but it's certainly
>>>unexpected behavior as far as I'm concerned!
>>>
>>>Thanks again everyone!
>>>
>>>On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <[email protected]>
>>>wrote:
>>>> Happy to do that, Chris! I've created my account; my username is
>>>> SergeyTsalkov.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
>>>> <[email protected]> wrote:
>>>>> Thanks Sergey!
>>>>>
>>>>> Please feel free to add a page on the wiki:
>>>>>
>>>>> http://wiki.apache.org/tika/
>>>>>
>>>>> describing your use case. I would appreciate it!
>>>>> Once you have signed up, tell me or anyone on this list (dev@tika)
>>>>> your username, and we’ll get you permissions so you can
>>>>> create the page.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: [email protected]
>>>>> WWW:  http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sergey Tsalkov <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Thursday, August 20, 2015 at 10:22 PM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: want to disable tesseract ocr parser
>>>>>
>>>>>>Thanks guys! Nick, your config file was exactly what I was looking
>>>>>>for, though it took a minor tweak because you forgot to open the
>>>>>>parser tag. I'm posting the corrected config below for anyone who
>>>>>>refers to this thread in the future:
>>>>>>
>>>>>><?xml version="1.0" encoding="UTF-8"?>
>>>>>><properties>
>>>>>>  <parsers>
>>>>>>    <parser class="org.apache.tika.parser.DefaultParser">
>>>>>>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>>    </parser>
>>>>>>  </parsers>
>>>>>></properties>
>>>>>>
>>>>>>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>>>>>>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>>>>>>
>>>>>>>> Then I thought I could pass a custom config.xml to disable it, but I
>>>>>>>> can't figure out how to write the config file.
>>>>>>>
>>>>>>>
>>>>>>> See
>>>>>>> http://tika.apache.org/1.10/configuring.html#Configuring_Parsers
>>>>>>> for details of the parser configuration.
>>>>>>>
>>>>>>> You should be fine with a config file like:
>>>>>>>
>>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>>> <properties>
>>>>>>>   <parsers>
>>>>>>>     <!-- Default Parser except no OCR -->
>>>>>>>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>>>     </parser>
>>>>>>>   </parsers>
>>>>>>> </properties>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Nick
>>>>>
>>
>>
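
For anyone finding this thread later: the only difference between the two
configs quoted above is the opening parser tag, which is easy to miss by eye.
Below is a minimal sketch that uses only the JDK's built-in XML parser (no
Tika classes) to check that a config string is well-formed and that it has a
parser-exclude entry for the OCR parser nested directly inside a parser
element. The class name TikaConfigCheck and the method excludesParser are
made up for illustration and are not part of Tika. Passing this check does
not guarantee that Tika itself will accept the file.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika.
public class TikaConfigCheck {

    public static final String OCR_PARSER =
        "org.apache.tika.parser.ocr.TesseractOCRParser";

    // The corrected config posted in this thread.
    public static final String CORRECTED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "      <parser-exclude class=\"" + OCR_PARSER + "\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // The earlier version, which closes a <parser> element it never opened.
    public static final String MISSING_OPEN_TAG =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "      <parser-exclude class=\"" + OCR_PARSER + "\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    /**
     * Returns true only if the XML is well-formed and contains a
     * <parser-exclude class="className"/> directly inside a <parser> element.
     */
    public static boolean excludesParser(String xml, String className) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList excludes = doc.getElementsByTagName("parser-exclude");
            for (int i = 0; i < excludes.getLength(); i++) {
                Element exclude = (Element) excludes.item(i);
                Node parent = exclude.getParentNode();
                if (parent != null
                        && "parser".equals(parent.getNodeName())
                        && className.equals(exclude.getAttribute("class"))) {
                    return true;
                }
            }
            return false;
        } catch (Exception e) {
            // Malformed XML (or a parser setup problem) counts as a failed check.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("corrected config:  "
            + excludesParser(CORRECTED, OCR_PARSER));
        System.out.println("missing open tag:  "
            + excludesParser(MISSING_OPEN_TAG, OCR_PARSER));
    }
}
```

The corrected config should pass and the version without the opening tag
should fail, since the parser rejects the unmatched closing tag. To actually
apply the file, the configuration page linked earlier in the thread is the
place to look. From Java, one route is the TikaConfig(File) constructor,
whose result can be passed to AutoDetectParser. Check the docs for your Tika
version.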
