Re: Help with tika-app 1.13 to extract text from pdf with image

Tim Allison Wed, 15 May 2019 13:03:02 -0700

>but I don't know how to enable it in tika-app. Can anyone help with getting 
>this to work?
In 1.13, you couldn't configure these settings via the config file. So, if
tesseract is on your path (e.g., you type tesseract and it runs), you
should be good to go.
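
If it helps, here is a minimal sketch of that "is tesseract on your path" check, written in Python purely for illustration. It is not part of Tika. The helper names find_tesseract and tesseract_runs are made up for this example, and it assumes your tesseract build accepts the --version flag.

```python
import shutil
import subprocess
from typing import Optional


def find_tesseract(executable: str = "tesseract") -> Optional[str]:
    """Return the absolute path of the executable if it is on PATH, else None."""
    return shutil.which(executable)


def tesseract_runs(path: str) -> bool:
    """Return True if invoking the binary with --version exits with status 0."""
    try:
        # Assumption: this tesseract build understands --version.
        result = subprocess.run([path, "--version"], capture_output=True, timeout=10)
    except (OSError, subprocess.TimeoutExpired):
        # Missing binary, permission problem, or a hang all count as "does not run".
        return False
    return result.returncode == 0


if __name__ == "__main__":
    path = find_tesseract()
    if path is None:
        print("tesseract is not on PATH")
    elif tesseract_runs(path):
        print(f"tesseract found at {path} and runs")
    else:
        print(f"tesseract found at {path} but failed to run")
```

Run it from the same shell and as the same user that launches tika-app, since the path can differ between environments.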


On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
<[email protected]> wrote:
>
> Hi,
>
> I would like to use tika-app version 1.13 from the command line to parse a 
> pdf file with images and extract the text from those with tesseract. I'm able 
> to do this with version 1.20 very easily but due to other dependencies I need 
> to use 1.13 which is quite old.
>
> I've tried several approaches, and my latest config XML looks like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser"/>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">true</param>
>             </params>
>         </parser>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>             <params>
>                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
>             </params>
>         </parser>
>     </parsers>
> </properties>
>
> but this doesn't work, and what I get from the example file 
> pdf-with-image.pdf is
>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
>
> From the TesseractOCRParser documentation I get the following:
> "TesseractOCRParser powered by tesseract-ocr engine. To enable this parser, 
> create a TesseractOCRConfig object and pass it through a ParseContext."
>
> but I don't know how to enable it in tika-app. Can anyone help with getting 
> this to work?
>
> Thank you
> Miguel Fernandes
