So I saw in debug mode that config.getExtractInlineImages() is indeed false, so I'm going to check my config.
:D

David

> On 18 May 2017 at 22:18, David Pilato <[email protected]> wrote:
>
> Hey guys
>
> First post here ;)
>
> I'm trying to play with OCR with Tika. I installed Tesseract and I can
> extract text from a PNG image.
> I created a PDF document with this image embedded and I'm trying now to
> extract the text out of it.
>
> I added this configuration but I guess I'm doing it wrong:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>     </parser>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <params>
>         <param name="extractInlineImages" type="bool">true</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
>
> I'm creating my Tika instance with something like:
>
> TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
> detector = config.getDetector();
> parser = new AutoDetectParser(config);
> tika = new Tika(detector, parser);
>
> Any idea? I'm feeling that my xml config is wrong but can't find what should
> be the right syntax.
>
> Thanks for your help guys!
> David
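
One thing I might try next (untested, so treat it as an assumption rather than a confirmed fix): DefaultParser already loads its own PDFParser, so the separately configured PDFParser entry may never be the one that handles the document. Excluding PDFParser from DefaultParser with a parser-exclude element should leave only the configured instance. A sketch of the config under that assumption:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <!-- Assumption: exclude the built-in PDFParser so it does not shadow the configured one below -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Same PDFParser settings as in my original config -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="extractInlineImages" type="bool">true</param>
      </params>
    </parser>
  </parsers>
</properties>
```

If config.getExtractInlineImages() still reports false in the debugger with this file, then the problem is probably somewhere else, for example in how the config is loaded.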
