Hey guys

First post here ;)

I'm experimenting with OCR in Tika. I installed Tesseract and I can extract 
text from a PNG image.
I then created a PDF document with this image embedded, and I'm now trying to 
extract the text from it.

I added this configuration but I guess I'm doing it wrong:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>

I'm creating my Tika instance with something like:

TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);

Any idea? I have a feeling that my XML config is wrong, but I can't find what 
the right syntax should be.
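
In case it helps narrow things down, here is a minimal sketch of how the config could be checked for plain XML problems before Tika ever sees it. It uses only the JDK's XML parser, so it says nothing about how Tika interprets the file; it only confirms that the file is well-formed and lists the parser classes it declares. The class name TikaConfigCheck and its methods are made up for illustration and are not part of Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: relies on the JDK XML parser only.
class TikaConfigCheck {

    // Returns the "class" attribute of every <parser> element, in document order.
    static List<String> parserClasses(InputStream in) {
        if (in == null) {
            // getResourceAsStream returns null when the resource is missing.
            throw new IllegalArgumentException("config resource not found on classpath");
        }
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(in);
            NodeList nodes = doc.getElementsByTagName("parser");
            List<String> classes = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                classes.add(((Element) nodes.item(i)).getAttribute("class"));
            }
            return classes;
        } catch (Exception e) {
            // Covers malformed XML as well as I/O and parser setup failures.
            throw new IllegalArgumentException("config could not be parsed as XML", e);
        }
    }

    // Convenience overload for checking a config held in a string.
    static List<String> parserClasses(String xml) {
        return parserClasses(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) {
        // Inline copy of the config shown above, so the sketch runs on its own.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"></parser>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\"><params>"
                + "<param name=\"extractInlineImages\" type=\"bool\">true</param>"
                + "</params></parser>"
                + "</parsers></properties>";
        System.out.println(parserClasses(xml));
    }
}
```

In a real project the same check could be pointed at the classpath resource instead, e.g. parserClasses(TikaInstance.class.getResourceAsStream("/tika-config.xml")), which would also catch the case where the file is simply not found.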

Thanks for your help guys!
David
