> I'm able to do this with version 1.20 very easily but due to other
> dependencies I need to use 1.13

I'm so sorry. Please see https://tika.apache.org/security.html for reasons to upgrade, and/or consider using tika-server so that you don't have jar hell/version conflicts.

On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes <[email protected]> wrote:
>
> Hi,
>
> I would like to use tika-app version 1.13 from the command line to parse a
> PDF file with images and extract the text from those with Tesseract. I'm able
> to do this with version 1.20 very easily, but due to other dependencies I need
> to use 1.13, which is quite old.
>
> I've tried several approaches, and my latest config XML looks like this:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <params>
>         <param name="extractInlineImages" type="bool">true</param>
>       </params>
>     </parser>
>     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>       <params>
>         <param name="setTesseractPath" type="string">/bin/tesseract</param>
>       </params>
>     </parser>
>   </parsers>
> </properties>
>
> but this doesn't work, and what I get on the example file
> pdf-with-image.pdf is
>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
>
> From the TesseractOCRParser documentation I get the following:
> "TesseractOCRParser powered by tesseract-ocr engine. To enable this parser,
> create a TesseractOCRConfig object and pass it through a ParseContext."
>
> but I don't know how to enable it in tika-app. Can anyone help with getting
> this to work?
>
> Thank you
> Miguel Fernandes
