Re: Extracting Text from embedded images in PDF docs

David Pilato Fri, 19 May 2017 02:55:58 -0700

Got it working. In case someone else hits the same issue, here is my config 
file... Well... That was obvious :D


<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>
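
A possible refinement, offered as a sketch and not something tested in this thread: with the config above, DefaultParser still carries its own PDFParser, so two parsers end up registered for PDF. The Tika configuration docs describe a parser-exclude element for that situation. Assuming a Tika version that accepts it, the config might look like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Assumption: excluding PDFParser here leaves only the
             explicitly configured instance below handling PDFs. -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- Same ocrStrategy setting as in the working config above. -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="ocrStrategy" type="string">ocr_and_text</param>
            </params>
        </parser>
    </parsers>
</properties>
```

Whether the exclusion changes anything for a given document depends on the Tika version, so it is worth checking the effective PDF settings in debug mode, as was done earlier in the thread.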


David

> On 19 May 2017 at 10:59, David Pilato <[email protected]> wrote:
> 
> So I saw in debug mode that indeed config.getExtractInlineImages() is false 
> so I'm going to check my config.
> 
> :D
> 
> David
> 
>> On 18 May 2017 at 22:18, David Pilato <[email protected]> wrote:
>> 
>> Hey guys
>> 
>> 
>> First post here ;)
>> 
>> I'm trying to play with OCR with Tika. I installed Tesseract and I can 
>> extract text from a PNG image.
>> I created a PDF document with this image embedded and I'm trying now to 
>> extract the text out of it.
>> 
>> I added this configuration but I guess I'm doing it wrong:
>> 
>> <?xml version="1.0" encoding="UTF-8"?>
>> <properties>
>>     <parsers>
>>         <parser class="org.apache.tika.parser.DefaultParser">
>>         </parser>
>>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>>             <params>
>>                 <param name="extractInlineImages" type="bool">true</param>
>>             </params>
>>         </parser>
>>     </parsers>
>> </properties>
>> 
>> I'm creating my Tika instance with something like:
>> 
>> TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
>> detector = config.getDetector();
>> parser = new AutoDetectParser(config);
>> tika = new Tika(detector, parser);
>> 
>> Any idea? I have a feeling that my XML config is wrong, but I can't find 
>> the right syntax.
>> 
>> Thanks for your help guys!
>> David
>> 
>
