Re: Extracting Text from embedded images in PDF docs

David Pilato Fri, 19 May 2017 02:00:57 -0700

So I saw in debug mode that config.getExtractInlineImages() is indeed false, 
so I'm going to check my config.
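
One thing worth ruling out, offered as a sketch rather than a confirmed fix: 
Tika's configuration documentation shows a parser that is configured on its 
own being excluded from DefaultParser with a parser-exclude element, so that 
only the explicitly configured instance is registered. Whether that is what 
makes getExtractInlineImages() return false here is an assumption, not 
something verified in this thread.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Load every default parser except the PDF one -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- Register the PDF parser separately with its own parameters -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

If the flag is still false with this layout, the Tika version in use may be 
the next thing to check, since parser parameters in the XML config are not 
supported by older releases.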


:D

David

> On 18 May 2017, at 22:18, David Pilato <[email protected]> wrote:
> 
> Hey guys
> 
> 
> First post here ;)
> 
> I'm trying to play with OCR with Tika. I installed Tesseract and I can 
> extract text from a PNG image.
> I created a PDF document with this image embedded and I'm trying now to 
> extract the text out of it.
> 
> I added this configuration but I guess I'm doing it wrong:
> 
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser">
>         </parser>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">true</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
> 
> I'm creating my Tika instance with something like:
> 
> TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
> detector = config.getDetector();
> parser = new AutoDetectParser(config);
> tika = new Tika(detector, parser);
> 
> Any ideas? I have a feeling that my XML config is wrong, but I can't find 
> the right syntax.
> 
> Thanks for your help guys!
> David
>
