Re: Help with tika-app 1.13 to extract text from pdf with image

Tim Allison Wed, 15 May 2019 13:04:42 -0700

>I'm able to do this with version 1.20 very easily but due to other 
>dependencies I need to use 1.13


I'm so sorry.  Please see: https://tika.apache.org/security.html for
reasons to upgrade...and/or consider using tika-server so that you
don't have jar hell/version conflicts.

On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
<[email protected]> wrote:
>
> Hi,
>
> I would like to use tika-app version 1.13 from the command line to parse a 
> PDF file with images and extract the text from those with Tesseract. I'm able 
> to do this with version 1.20 very easily, but due to other dependencies I need 
> to use 1.13, which is quite old.
>
> I've tried several approaches, and my latest config XML looks like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser"/>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">true</param>
>             </params>
>         </parser>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>             <params>
>                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
> But this doesn't work, and what I get from the example file 
> pdf-with-image.pdf is:
>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
>
> From the TesseractOCRParser documentation I get the following:
> "TesseractOCRParser powered by tesseract-ocr engine. To enable this parser, 
> create a TesseractOCRConfig object and pass it through a ParseContext."
>
> But I don't know how to enable it in tika-app. Can anyone help with getting 
> this to work?
>
> Thank you
> Miguel Fernandes
