Thanks guys! Nick, your config file was exactly what I was looking
for, though it needed one minor tweak: the opening <parser> tag for
DefaultParser was missing. I'm posting the corrected config below for
anyone who refers to this thread in the future:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
    </parser>
  </parsers>
</properties>

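For anyone who wants to sanity-check a config like this before handing it to Tika, here is a rough sketch using only the JDK's built-in XML parser. It does not touch Tika at all; it only checks that the file is well-formed and that the parser-exclude sits inside a parser element. The class name TikaConfigCheck and the helper names are made up for illustration, and the two strings are the corrected config above and the version from Nick's mail quoted below.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: checks the shape of a config file.
class TikaConfigCheck {
    static final String OCR = "org.apache.tika.parser.ocr.TesseractOCRParser";

    // The corrected config from this mail.
    static final String CORRECTED =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "<parsers>\n"
        + "<parser class=\"org.apache.tika.parser.DefaultParser\">\n"
        + "<parser-exclude class=\"" + OCR + "\"/>\n"
        + "</parser>\n"
        + "</parsers>\n"
        + "</properties>\n";

    // The config as originally posted, without the opening <parser> tag.
    static final String ORIGINAL =
        "<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
        + "<properties>\n"
        + "<parsers>\n"
        + "<!-- Default Parser except no OCR -->\n"
        + "<parser-exclude class=\"" + OCR + "\"/>\n"
        + "</parser>\n"
        + "</parsers>\n"
        + "</properties>\n";

    // Parses the XML; returns null if it is not well-formed.
    static Document parse(String xml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler keeps parse errors off stderr; fatal errors still throw.
            builder.setErrorHandler(new DefaultHandler());
            return builder.parse(new InputSource(new StringReader(xml)));
        } catch (SAXException e) {
            return null;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    static boolean isWellFormed(String xml) {
        return parse(xml) != null;
    }

    // True if some <parser> element directly contains a <parser-exclude>
    // whose class attribute equals className.
    static boolean excludesParser(String xml, String className) {
        Document doc = parse(xml);
        if (doc == null) {
            return false;
        }
        NodeList excludes = doc.getElementsByTagName("parser-exclude");
        for (int i = 0; i < excludes.getLength(); i++) {
            Node exclude = excludes.item(i);
            Node parent = exclude.getParentNode();
            if (parent instanceof Element
                    && "parser".equals(parent.getNodeName())
                    && className.equals(((Element) exclude).getAttribute("class"))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println("corrected well-formed: " + isWellFormed(CORRECTED));
        System.out.println("original well-formed:  " + isWellFormed(ORIGINAL));
        System.out.println("corrected excludes OCR: " + excludesParser(CORRECTED, OCR));
    }
}
```

The original version fails the well-formedness check because its closing </parser> has no matching opening tag, which is the tweak mentioned above. Passing this check says nothing about whether Tika accepts the class names, only that the XML itself is structured as expected.
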
On Thu, Aug 20, 2015 at 1:26 AM, Nick Burch <[email protected]> wrote:
> On 20/08/15 07:19, Sergey Tsalkov wrote:
>>
>> Then I thought I could pass a custom config.xml to disable it, but I
>> can't figure out how to write the config file.
>
>
> See http://tika.apache.org/1.10/configuring.html#Configuring_Parsers for
> details of the parser configuration
>
> You should be fine with a config file like:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
> <parsers>
> <!-- Default Parser except no OCR -->
> <parser-exclude
> class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
> </parser>
> </parsers>
> </properties>
>
> Thanks
> Nick