Re: want to disable tesseract ocr parser

Sergey Tsalkov Thu, 20 Aug 2015 22:40:28 -0700

I'll mention my situation again on the wiki, but if any Tika
developers are reading this, I'd consider this a bug report! I've been
using Tika for quite a while. I use very expensive hardware to churn
through tens of millions of documents very rapidly, pulling out
plaintext and metadata. Tika has generally performed extremely well
under this stress -- never a crash or screwup!

But then one day it got about 50% slower, and I couldn't figure out
why for a while. I just happened to run 'ps xf' and noticed that Tika
was spawning all these tesseract processes. Turned out that I'd never
had tesseract installed before. I had installed it just recently for a
separate project, and Tika's behavior silently changed because of
that.

Not sure if that fits your definition of a bug, but it's certainly
unexpected behavior as far as I'm concerned!

Thanks again everyone!

On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <[email protected]> wrote:
> Happy to do that, Chris! I've created my account, username is SergeyTsalkov.
>
> On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>> Thanks Sergey!
>>
>> Please feel free to add a page on the wiki describing your use case:
>>
>> http://wiki.apache.org/tika/
>>
>> I would appreciate it! Once you've signed up, tell me or anyone
>> on this list (dev@tika) your username, and we’ll get you permissions
>> so you can create the page.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW:  http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Sergey Tsalkov <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Thursday, August 20, 2015 at 10:22 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: want to disable tesseract ocr parser
>>
>>>Thanks guys! Nick, your config file was exactly what I was looking
>>>for, though it needed a minor tweak: the opening <parser> tag for
>>>DefaultParser was missing. I'm posting the corrected config below for
>>>anyone who refers to this thread in the future:
>>>
>>><?xml version="1.0" encoding="UTF-8"?>
>>><properties>
>>>  <parsers>
>>>    <parser class="org.apache.tika.parser.DefaultParser">
>>>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>    </parser>
>>>  </parsers>
>>></properties>
>>>
>>>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>>>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>>>
>>>>> Then I thought I could pass a custom config.xml to disable it, but I
>>>>> can't figure out how to write the config file.
>>>>
>>>>
>>>> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for
>>>> details of the parser configuration
>>>>
>>>> You should be fine with a config file like:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <properties>
>>>>   <parsers>
>>>>     <!-- Default Parser except no OCR -->
>>>>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>     </parser>
>>>>   </parsers>
>>>> </properties>
>>>>
>>>> Thanks
>>>> Nick
>>
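
The first config in this thread needed a tweak only because of an unmatched </parser> tag, so a quick pre-flight check on the file may be worth having. What follows is a minimal sketch that uses only the JDK's XML parser and no Tika classes. The class name TikaConfigCheck and the method excludesOcr are hypothetical names chosen for illustration. It checks that the XML is well-formed and that it contains a parser-exclude entry for the Tesseract parser; it does not prove that Tika itself will accept or honor the file.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika.
class TikaConfigCheck {
    static final String OCR_CLASS = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // Returns true only if the XML is well-formed AND contains a
    // <parser-exclude> element whose class attribute names the OCR parser.
    static boolean excludesOcr(String xml) throws Exception {
        DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // DefaultHandler rethrows fatal errors without printing to stderr.
        builder.setErrorHandler(new DefaultHandler());
        Document doc;
        try {
            doc = builder.parse(
                    new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        } catch (SAXException e) {
            return false; // not well-formed, e.g. an unmatched closing tag
        }
        NodeList excludes = doc.getElementsByTagName("parser-exclude");
        for (int i = 0; i < excludes.getLength(); i++) {
            Element exclude = (Element) excludes.item(i);
            if (OCR_CLASS.equals(exclude.getAttribute("class"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        // The corrected config posted in this thread.
        String corrected =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
              + "<properties>\n"
              + "  <parsers>\n"
              + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
              + "      <parser-exclude class=\"" + OCR_CLASS + "\"/>\n"
              + "    </parser>\n"
              + "  </parsers>\n"
              + "</properties>\n";
        System.out.println("excludes OCR: " + excludesOcr(corrected));
    }
}
```

Running the same check against the earlier config, the one without the opening <parser> tag, returns false because the XML parser stops at the unmatched </parser>.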
