Extracting Text from embedded images in PDF docs

David Pilato Thu, 18 May 2017 13:18:57 -0700

Hey guys


First post here ;)

I'm experimenting with OCR in Tika. I installed Tesseract and I can extract 
text from a PNG image.
I then created a PDF document with that image embedded, and I'm now trying to 
extract the text from it.

I added this configuration but I guess I'm doing it wrong:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>

I'm creating my Tika instance with something like:

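// Load the custom config from the classpath
TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
// Reuse the detector and parsers defined by that config
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);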

Any ideas? I suspect my XML config is wrong, but I can't find the right 
syntax.
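
One variation I have not tested yet: my guess is that DefaultParser already 
loads its own PDFParser, so the second entry in my config may never be used. 
A sketch of what I might try next, assuming `parser-exclude` is the right 
element for removing the PDFParser from DefaultParser:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Assumption: exclude the stock PDFParser so it is not registered twice -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- Same PDFParser entry as in my config above, unchanged -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```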

Thanks for your help guys!
David
