Hey guys
First post here ;)
I'm experimenting with OCR in Tika. I installed Tesseract and I can extract
text from a PNG image.
I then created a PDF document with this image embedded, and I'm now trying to
extract the text from it.
I added this configuration, but I guess I'm doing something wrong:
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser">
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
I'm creating my Tika instance with something like:
TikaConfig config = new TikaConfig(
    TikaInstance.class.getResourceAsStream("/tika-config.xml"));
detector = config.getDetector();
parser = new AutoDetectParser(config);
tika = new Tika(detector, parser);
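To rule out a missing or malformed file, here is a minimal sketch of a sanity check. It only uses the JDK's XML parser, not Tika, so it says nothing about whether Tika accepts the settings. It just confirms that the resource is found on the classpath and lists the class of every <parser> element in it. ConfigCheck and listParserClasses are hypothetical names made up for this example.

```java
import java.io.InputStream;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika.
class ConfigCheck {

    // Returns the "class" attribute of every <parser> element, in document order.
    static List<String> listParserClasses(InputStream in) throws Exception {
        if (in == null) {
            // getResourceAsStream returns null when the resource is not on the classpath
            throw new IllegalArgumentException("config resource not found on the classpath");
        }
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().parse(in);
        NodeList nodes = doc.getElementsByTagName("parser");
        List<String> classes = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            classes.add(((Element) nodes.item(i)).getAttribute("class"));
        }
        return classes;
    }

    public static void main(String[] args) throws Exception {
        // Same resource path as in the snippet above
        try (InputStream in = ConfigCheck.class.getResourceAsStream("/tika-config.xml")) {
            System.out.println(listParserClasses(in));
        }
    }
}
```

If this throws, the problem is with the file or the classpath rather than with the Tika settings themselves.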
Any ideas? I have a feeling that my XML config is wrong, but I can't find what
the right syntax should be.
Thanks for your help guys!
David