Got it working. In case someone else hits the same issue, here is my config
file. Well, that was obvious :D
<?xml version="1.0" encoding="UTF-8"?>
<properties>
<parsers>
<parser class="org.apache.tika.parser.DefaultParser"/>
<parser class="org.apache.tika.parser.pdf.PDFParser">
<params>
<param name="ocrStrategy" type="string">ocr_and_text</param>
</params>
</parser>
</parsers>
</properties>
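
In case it helps anyone debugging the same thing: a quick way to double-check
which params the config file actually declares for the PDF parser, using only
the JDK XML parser. This is just a sketch. TikaConfigCheck and paramsFor are
names I made up for the example, and it only reads the XML; it does not tell
you what Tika itself does with the values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists the <param> entries of one parser.
class TikaConfigCheck {

    // Returns name -> value for every <param> under <parser class="parserClass">.
    static Map<String, String> paramsFor(String xml, String parserClass) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = doc.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            if (!parserClass.equals(parser.getAttribute("class"))) {
                continue; // not the parser we are interested in
            }
            NodeList declared = parser.getElementsByTagName("param");
            for (int j = 0; j < declared.getLength(); j++) {
                Element param = (Element) declared.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the config file above, inlined to keep the example standalone.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<params><param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param></params>"
                + "</parser>"
                + "</parsers></properties>";
        // prints {ocrStrategy=ocr_and_text}
        System.out.println(paramsFor(xml, "org.apache.tika.parser.pdf.PDFParser"));
    }
}
```

If the map comes back empty, the param is sitting under a different parser
element (or the class attribute has a typo), which is worth ruling out before
stepping through Tika in the debugger.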
David
> On 19 May 2017, at 10:59, David Pilato <[email protected]> wrote:
>
> In debug mode I can see that config.getExtractInlineImages() is indeed
> false, so I'm going to check my config.
>
> :D
>
> David
>
>> On 18 May 2017, at 22:18, David Pilato <[email protected]> wrote:
>>
>> Hey guys
>>
>>
>> First post here ;)
>>
>> I'm trying to play with OCR with Tika. I installed Tesseract and I can
>> extract text from a PNG image.
>> I created a PDF document with this image embedded and I'm trying now to
>> extract the text out of it.
>>
>> I added this configuration but I guess I'm doing it wrong:
>>
>> <?xml version="1.0" encoding="UTF-8"?>
>> <properties>
>> <parsers>
>> <parser class="org.apache.tika.parser.DefaultParser">
>> </parser>
>> <parser class="org.apache.tika.parser.pdf.PDFParser">
>> <params>
>> <param name="extractInlineImages" type="bool">true</param>
>> </params>
>> </parser>
>> </parsers>
>> </properties>
>>
>> I'm creating my Tika instance with something like:
>>
>> TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
>> detector = config.getDetector();
>> parser = new AutoDetectParser(config);
>> tika = new Tika(detector, parser);
>>
>> Any idea? I have a feeling my XML config is wrong, but I can't find what
>> the right syntax should be.
>>
>> Thanks for your help guys!
>> David
>>
>