I'll mention my situation again on the wiki, but if any Tika developers are reading this, I'd consider this a bug report!

I've been using Tika for quite a while. I use very expensive hardware to churn through tens of millions of documents very rapidly, pulling out plaintext and metadata. Tika has generally performed extremely well under this stress -- never a crash or screwup!
But then one day it got about 50% slower, and I couldn't figure out why for a while. I just happened to run 'ps xf' and noticed that Tika was spawning all these tesseract processes. Turned out that I'd never had tesseract installed before. I had installed it just recently for a separate project, and Tika's behavior silently changed because of that.

Not sure if that fits your definition of a bug, but it's certainly unexpected behavior as far as I'm concerned!

Thanks again everyone!

On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <[email protected]> wrote:
> Happy to do that, Chris! I've created my account, username is SergeyTsalkov.
>
> On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
> <[email protected]> wrote:
>> Thanks Sergey!
>>
>> Please feel free to add a page on the wiki:
>>
>> http://wiki.apache.org/tika/
>>
>> Describing your use case. I would appreciate it!
>> If you remember to sign up, tell me your username, or tell anyone
>> on this list (dev@tika), we’ll get you permissions and you can
>> create the page.
>>
>> Cheers,
>> Chris
>>
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Chris Mattmann, Ph.D.
>> Chief Architect
>> Instrument Software and Science Data Systems Section (398)
>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>> Office: 168-519, Mailstop: 168-527
>> Email: [email protected]
>> WWW: http://sunset.usc.edu/~mattmann/
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> Adjunct Associate Professor, Computer Science Department
>> University of Southern California, Los Angeles, CA 90089 USA
>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>
>> -----Original Message-----
>> From: Sergey Tsalkov <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Thursday, August 20, 2015 at 10:22 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: want to disable tesseract ocr parser
>>
>>>Thanks guys! Nick, your config file was exactly what I was looking
>>>for, though it took a minor tweak because you forgot to open the
>>>parser tag. I'm posting the corrected config below for anyone who
>>>refers to this thread in the future:
>>>
>>><?xml version="1.0" encoding="UTF-8"?>
>>><properties>
>>>  <parsers>
>>>    <parser class="org.apache.tika.parser.DefaultParser">
>>>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>    </parser>
>>>  </parsers>
>>></properties>
>>>
>>>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>>>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>>>
>>>>> Then I thought I could pass a custom config.xml to disable it, but I
>>>>> can't figure out how to write the config file.
>>>>
>>>> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for
>>>> details of the parser configuration
>>>>
>>>> You should be fine with a config file like:
>>>>
>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>> <properties>
>>>>   <parsers>
>>>>     <!-- Default Parser except no OCR -->
>>>>     <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>     </parser>
>>>>   </parsers>
>>>> </properties>
>>>>
>>>> Thanks
>>>> Nick
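
For anyone who wants to sanity-check a config like the ones quoted above before pointing Tika at it, here is a minimal sketch using only the JDK's XML parser (no Tika dependency). It checks that the file is well-formed and lists which parser classes it excludes. The class name TikaConfigCheck and its helper methods are made up for illustration and are not part of Tika; the two XML strings are the corrected config and the original suggestion from this thread. It only checks XML well-formedness, not whether Tika accepts the config.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper (not a Tika class): checks a Tika config for
// well-formedness and reports the parser classes it excludes.
final class TikaConfigCheck {

    // The corrected config from this thread.
    static final String CORRECTED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "    <parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "      <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // The original suggestion, which closes <parser> without opening it.
    static final String BROKEN =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "  <parsers>\n"
        + "    <parser-exclude class=\"org.apache.tika.parser.ocr.TesseractOCRParser\"/>\n"
        + "    </parser>\n"
        + "  </parsers>\n"
        + "</properties>\n";

    // Returns the class attribute of every <parser-exclude> element.
    // Throws IllegalArgumentException if the XML is not well-formed.
    static List<String> excludedParsers(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            NodeList nodes = doc.getElementsByTagName("parser-exclude");
            List<String> excluded = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                excluded.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return excluded;
        } catch (Exception e) {
            throw new IllegalArgumentException("config is not well-formed XML", e);
        }
    }

    static boolean isWellFormed(String xml) {
        try {
            excludedParsers(xml);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(excludedParsers(CORRECTED));
        System.out.println(isWellFormed(BROKEN));
    }
}
```

Running main on the corrected config should list the Tesseract parser class, while the original suggestion is rejected because of the unmatched closing parser tag.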
