Re: Help with tika-app 1.13 to extract text from pdf with image

Tim Allison Wed, 15 May 2019 12:59:47 -0700

Hi Miguel,
  I just fixed our wiki...sorry...  Try excluding the PDFParser from
the default parser:


        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
...
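
For reference, a minimal sketch of how the complete config file might look with that exclusion in place. The <properties>/<parsers> wrapper is taken from the config quoted below. The TesseractOCRParser entry is left out on the assumption that the DefaultParser already loads it. This is untested against 1.13.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <!-- Load every default parser except the PDFParser, so that the
             explicitly configured PDFParser below is the only one in play. -->
        <parser class="org.apache.tika.parser.DefaultParser">
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
        </parser>
        <!-- PDFParser configured to hand inline images on to the
             embedded-document parsers (assumed here to include OCR). -->
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
    </parsers>
</properties>
```

If the same class is declared twice without the exclude, which instance handles a PDF may depend on load order, so the exclude is what makes the custom params take effect.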

  What _may_ be happening is that the PDFParser from within the
DefaultParser is being called, and your configured parser is being
ignored.  The other critical thing is that you include the "optional"
dependencies, which aren't bundled because their licenses aren't
compatible with Apache 2.0:

            <dependency>
                <groupId>com.levigo.jbig2</groupId>
                <artifactId>levigo-jbig2-imageio</artifactId>
                <version>2.0</version>
            </dependency>
            <dependency>
                <groupId>com.github.jai-imageio</groupId>
                <artifactId>jai-imageio-core</artifactId>
                <version>1.3.1</version>
            </dependency>
            <dependency>
                <groupId>com.github.jai-imageio</groupId>
                <artifactId>jai-imageio-jpeg2000</artifactId>
                <version>1.3.0</version>
            </dependency>

Let me know how/if this works...

On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
<[email protected]> wrote:
>
> Hi,
>
> I would like to use tika-app version 1.13 from the command line to parse a
> PDF file with images and extract the text from those with Tesseract. I'm able
> to do this with version 1.20 very easily, but due to other dependencies I need
> to use 1.13, which is quite old.
>
> I've tried several approaches, and my latest config XML looks like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser"/>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">true</param>
>             </params>
>         </parser>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>             <params>
>                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
> But this doesn't work. What I get for the example file
> pdf-with-image.pdf is:
>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
>
> From the TesseractOCRParser documentation I get the following:
> "TesseractOCRParser powered by tesseract-ocr engine. To enable this parser, 
> create a TesseractOCRConfig object and pass it through a ParseContext."
>
> But I don't know how to enable it in tika-app. Can anyone help with getting
> this to work?
>
> Thank you
> Miguel Fernandes
