Thanks Sergey! Please feel free to add a page on the wiki describing
your use case:

http://wiki.apache.org/tika/

I would appreciate it! If you remember to sign up, tell me your
username, or tell anyone on this list (dev@tika), and we’ll get you
permissions so you can create the page.

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Sergey Tsalkov <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, August 20, 2015 at 10:22 PM
To: "[email protected]" <[email protected]>
Subject: Re: want to disable tesseract ocr parser

>Thanks guys! Nick, your config file was exactly what I was looking
>for, though it took a minor tweak because you forgot to open the
>parser tag. I'm posting the corrected config below for anyone who
>refers to this thread in the future:
>
><?xml version="1.0" encoding="UTF-8"?>
><properties>
>  <parsers>
>    <parser class="org.apache.tika.parser.DefaultParser">
>      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>    </parser>
>  </parsers>
></properties>
>
>On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
>> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>>
>>> Then I thought I could pass a custom config.xml to disable it, but I
>>> can't figure out how to write the config file.
>>
>>
>> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for
>> details of the parser configuration
>>
>> You should be fine with a config file like:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <properties>
>>   <parsers>
>>     <!-- Default Parser except no OCR -->
>>       <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
>>     </parser>
>>   </parsers>
>> </properties>
>>
>> Thanks
>> Nick
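
For anyone copying the config from this thread: the unbalanced parser tag discussed above is the kind of mistake a quick well-formedness check can catch before the file is handed to Tika. The following is only a minimal sketch using the JDK's built-in XML parser. It does not call Tika at all, and the class name TikaConfigCheck and the method excludesParser are hypothetical names made up for illustration. It assumes the exclusion should sit directly inside a <parser> element, as in the corrected config.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: sanity-checks a config file's XML.
final class TikaConfigCheck {

    /**
     * Returns true if the XML contains a parser-exclude element whose class
     * attribute equals className and whose direct parent is a parser element.
     * Throws org.xml.sax.SAXException if the XML is not well-formed
     * (for example, a closing parser tag with no matching opening tag).
     */
    static boolean excludesParser(String xml, String className) throws Exception {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(
                new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));

        NodeList excludes = doc.getElementsByTagName("parser-exclude");
        for (int i = 0; i < excludes.getLength(); i++) {
            Element exclude = (Element) excludes.item(i);
            Node parent = exclude.getParentNode();
            boolean insideParser = parent instanceof Element
                    && "parser".equals(((Element) parent).getTagName());
            if (insideParser && className.equals(exclude.getAttribute("class"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java TikaConfigCheck <tika-config.xml>");
            return;
        }
        // Read the config file given on the command line and check it.
        String xml = new String(Files.readAllBytes(Paths.get(args[0])), StandardCharsets.UTF_8);
        boolean ok = excludesParser(xml, "org.apache.tika.parser.ocr.TesseractOCRParser");
        System.out.println(ok ? "OCR parser is excluded" : "no matching parser-exclude found");
    }
}
```

With the corrected config this check should report the exclusion, while the earlier version without the opening parser tag should fail to parse. Passing the check only shows the XML is well-formed and nested as expected; whether Tika actually skips OCR still needs to be confirmed by running Tika with the file.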
