You are totally right, and your use case is as valid as any other! :-) Thanks for clarifying, and yes, we can and should make this clearer in the documentation.
Cheers,
Chris

--
Chris Mattmann
[email protected]

-----Original Message-----
From: Sergey Tsalkov <[email protected]>
Reply-To: <[email protected]>
Date: Thursday, August 20, 2015 at 10:51 PM
To: <[email protected]>
Subject: Re: want to disable tesseract ocr parser

>It's great functionality to have, Chris, and tesseract is certainly my
>choice for OCR, too! I'm certainly not suggesting that it be removed
>-- maybe just that the user be made aware of it more deliberately,
>with the official documentation mentioning that this happens and how
>to disable it. In my case, it triggered on an image embedded within an
>office doc, so it caught me by surprise more so than if I'd been
>throwing jpegs at Tika directly.
>
>But then again, maybe my use case is the oddball here -- most people
>aren't cranking servers around the clock parsing countless millions of
>documents, and therefore wouldn't notice some increase in CPU use!
>
>On Thu, Aug 20, 2015 at 10:41 PM, Chris Mattmann <[email protected]>
>wrote:
>> Thanks Sergey. It’s certainly something that adds overhead
>> I’ve seen it too, but with all the capability that tesseract
>> adds (the OCR) it’s something that we’re willing to trade since
>> we can disable it pretty easily via configuration, etc.
>>
>> Speaking from a biased perspective of helping to implement it ;)
>>
>> Cheers,
>> Chris
>>
>> --
>> Chris Mattmann
>> [email protected]
>>
>> -----Original Message-----
>> From: Sergey Tsalkov <[email protected]>
>> Reply-To: <[email protected]>
>> Date: Thursday, August 20, 2015 at 10:40 PM
>> To: <[email protected]>
>> Subject: Re: want to disable tesseract ocr parser
>>
>>>I'll mention my situation again on the wiki, but if any Tika
>>>developers are reading this, I'd consider this a bug report! I've been
>>>using Tika for quite a while. I use very expensive hardware to churn
>>>through tens of millions of documents very rapidly, pulling out
>>>plaintext and metadata. Tika has generally performed extremely well
>>>under this stress -- never a crash or screwup!
>>>
>>>But then one day it got about 50% slower, and I couldn't figure out
>>>why for a while. I just happened to run 'ps xf' and noticed that Tika
>>>was spawning all these tesseract processes. Turned out that I'd never
>>>had tesseract installed before. I had installed it just recently for a
>>>separate project, and Tika's behavior silently changed because of
>>>that.
>>>
>>>Not sure if that fits your definition of a bug, but it's certainly
>>>unexpected behavior as far as I'm concerned!
>>>
>>>Thanks again everyone!
>>>
>>>On Thu, Aug 20, 2015 at 10:31 PM, Sergey Tsalkov <[email protected]>
>>>wrote:
>>>> Happy to do that, Chris! I've created my account, username is
>>>> SergeyTsalkov.
>>>>
>>>> On Thu, Aug 20, 2015 at 10:24 PM, Mattmann, Chris A (3980)
>>>> <[email protected]> wrote:
>>>>> Thanks Sergey!
>>>>>
>>>>> Please feel free to add a page on the wiki:
>>>>>
>>>>> http://wiki.apache.org/tika/
>>>>>
>>>>> Describing your use case. I would appreciate it!
>>>>> If you remember to sign up, tell me your username, or tell anyone
>>>>> on this list (dev@tika), we’ll get you permissions and you can
>>>>> create the page.
>>>>>
>>>>> Cheers,
>>>>> Chris
>>>>>
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Chris Mattmann, Ph.D.
>>>>> Chief Architect
>>>>> Instrument Software and Science Data Systems Section (398)
>>>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>>>> Office: 168-519, Mailstop: 168-527
>>>>> Email: [email protected]
>>>>> WWW: http://sunset.usc.edu/~mattmann/
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>> Adjunct Associate Professor, Computer Science Department
>>>>> University of Southern California, Los Angeles, CA 90089 USA
>>>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>>>
>>>>> -----Original Message-----
>>>>> From: Sergey Tsalkov <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Thursday, August 20, 2015 at 10:22 PM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: want to disable tesseract ocr parser
>>>>>
>>>>>>Thanks guys! Nick, your config file was exactly what I was looking
>>>>>>for, though it took a minor tweak because you forgot to open the
>>>>>>parser tag. I'm posting the corrected config below for anyone who
>>>>>>refers to this thread in the future:
>>>>>>
>>>>>><?xml version="1.0" encoding="UTF-8"?>
>>>>>><properties>
>>>>>>  <parsers>
>>>>>>    <parser class="org.apache.tika.parser.DefaultParser">
>>>>>>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>>    </parser>
>>>>>>  </parsers>
>>>>>></properties>
>>>>>>
>>>>>>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>>>>>>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>>>>>>
>>>>>>>> Then I thought I could pass a custom config.xml to disable it, but I
>>>>>>>> can't figure out how to write the config file.
>>>>>>>
>>>>>>> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers
>>>>>>> for details of the parser configuration
>>>>>>>
>>>>>>> You should be fine with a config file like:
>>>>>>>
>>>>>>> <?xml version="1.0" encoding="UTF-8"?>
>>>>>>> <properties>
>>>>>>>   <parsers>
>>>>>>>     <!-- Default Parser except no OCR -->
>>>>>>>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>>>>>>     </parser>
>>>>>>>   </parsers>
>>>>>>> </properties>
>>>>>>>
>>>>>>> Thanks
>>>>>>> Nick
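
For anyone adapting the config from this thread: the difference between the two versions is only the opening <parser class="org.apache.tika.parser.DefaultParser"> tag, and without it the file is not well-formed XML. The following is a minimal sketch, not part of Tika, that uses only the JDK's built-in XML parser to check that a config string is well-formed and that a parser-exclude for a given class sits directly inside a parser element. The class name ConfigCheck and the method excludesParser are hypothetical names chosen for illustration; the check says nothing about how Tika itself interprets the file.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; it does not call any Tika API.
public class ConfigCheck {

    /**
     * Returns true if xml is well-formed and contains a parser-exclude element
     * for className whose direct parent is a parser element.
     * Returns false if the XML cannot be parsed or no such exclusion exists.
     */
    public static boolean excludesParser(String xml, String className) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // Replace the default error handler, which prints to stderr;
            // fatal (well-formedness) errors still raise a SAXException.
            builder.setErrorHandler(new DefaultHandler());
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            NodeList excludes = doc.getElementsByTagName("parser-exclude");
            for (int i = 0; i < excludes.getLength(); i++) {
                Element exclude = (Element) excludes.item(i);
                Node parent = exclude.getParentNode();
                if (className.equals(exclude.getAttribute("class"))
                        && parent != null
                        && "parser".equals(parent.getNodeName())) {
                    return true;
                }
            }
            return false;
        } catch (SAXException e) {
            // Not well-formed, e.g. a closing </parser> with no opening tag.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String ocr = "org.apache.tika.parser.ocr.TesseractOCRParser";

        // The corrected config posted in the thread.
        String corrected =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\">"
                + "<parser-exclude class=\"" + ocr + "\"/>"
                + "</parser>"
                + "</parsers></properties>";

        // The earlier version, with the opening parser tag missing.
        String unopened =
                "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser-exclude class=\"" + ocr + "\"/>"
                + "</parser>"
                + "</parsers></properties>";

        System.out.println("corrected config ok: " + excludesParser(corrected, ocr));
        System.out.println("unopened config ok:  " + excludesParser(unopened, ocr));
    }
}
```

Once the file passes a check like this, it still has to be handed to Tika explicitly. As far as I know the Java API accepts a config file through the TikaConfig constructor and the tika-app command line accepts a --config= option, but check the configuration page linked above for the version you are running.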
