Hi Miguel,

I downloaded 1.13, and I couldn't get it to work either. I looked at the 1.13 source code, and it turns out that configuring PDFParserConfig via tika-config.xml was not yet implemented in that release. So in 1.13 the only way to do it is programmatically :(
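To make "programmatically" concrete, here is a minimal sketch, not tested against your file. It assumes tika-app-1.13.jar is on the classpath and tesseract is on your PATH; the class name PdfOcrSketch and the helper buildContext are just names I made up for illustration. It sets the same two options as your tika-config.xml (extractInlineImages and sortByPosition) on a PDFParserConfig and passes it, together with a TesseractOCRConfig, through the ParseContext:

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch of per-parse configuration in Tika 1.13.
public class PdfOcrSketch {

    // Builds a ParseContext carrying the PDF and OCR settings.
    static ParseContext buildContext(Parser parser) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true); // hand inline images to the embedded-document parser
        pdfConfig.setSortByPosition(true);

        // Default OCR settings; assumes the tesseract binary is found on PATH.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded images be parsed recursively.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, buildContext(parser));
        }
        System.out.println(handler.toString());
    }
}
```

Something like `javac -cp tika-app-1.13.jar PdfOcrSketch.java` followed by `java -cp tika-app-1.13.jar:. PdfOcrSketch pdf-with-image.pdf` should run it. Note that `java -jar` ignores the CLASSPATH variable, so the extra image jars would need to go on `-cp` as well.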

On Thu, May 16, 2019 at 10:10 AM Miguel Fernandes wrote:
>
> Hi Tim,
>
> Thanks for the prompt reply. I haven't been able to make it work. It seems
> the default parser is not being called; now it's the composite parser and the
> PDF parser. Also, I've placed the dependency library jars
> (jai-imageio-core-1.3.1.jar, jai-imageio-jpeg2000-1.3.0.jar,
> levigo-jbig2-imageio-2.0.jar) in the working directory and exported its path
> in the variable CLASSPATH. I have tesseract 3.04.00 in my path:
>
> tesseract 3.04.00
>  leptonica-1.72
>   libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 :
>   libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
>
> Currently my tika-config.xml looks like:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser">
>       <mime-exclude>application/pdf</mime-exclude>
>       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
>     </parser>
>     <parser class="org.apache.tika.parser.pdf.PDFParser">
>       <params>
>         <param name="extractInlineImages" type="bool">true</param>
>         <param name="sortByPosition" type="bool">true</param>
>       </params>
>     </parser>
>     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
>   </parsers>
> </properties>
>
> I'm running the following commands:
>
> export CLASSPATH=/home/web/apache-tika/1.13
> java -Djava.awt.headless=true -jar tika-app-1.13.jar --verbose --config=tika-config.xml pdf-with-image.pdf
>
> And I get the following answer:
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="pdf:PDFVersion" content="1.4"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> <meta name="access_permission:modify_annotations" content="true"/>
> <meta name="access_permission:can_print_degraded" content="true"/>
> <meta name="access_permission:extract_for_accessibility" content="true"/>
> <meta name="access_permission:assemble_document" content="true"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="pdf-with-image.pdf"/>
> <meta name="dc:format" content="application/pdf; version=1.4"/>
> <meta name="access_permission:extract_content" content="true"/>
> <meta name="access_permission:can_print" content="true"/>
> <meta name="access_permission:fill_in_form" content="true"/>
> <meta name="pdf:encrypted" content="false"/>
> <meta name="Content-Length" content="197791"/>
> <meta name="access_permission:can_modify" content="true"/>
> <meta name="Content-Type" content="application/pdf"/>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
> </body></html>
>
> Miguel
>
> On Wed, May 15, 2019 at 9:03 PM Tim Allison wrote:
>>
>> >I'm able to do this with version 1.20 very easily but due to other
>> >dependencies i need to use 1.13
>>
>> I'm so sorry. Please see: https://tika.apache.org/security.html for
>> reasons to upgrade...and/or consider using tika-server so that you
>> don't have jar hell/version conflicts.
>>
>> On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes wrote:
>> >
>> > Hi,
>> >
>> > I would like to use tika-app version 1.13 from the command line to parse a
>> > PDF file with images and extract the text from those with tesseract. I'm
>> > able to do this with version 1.20 very easily, but due to other
>> > dependencies I need to use 1.13, which is quite old.
>> >
>> > I've tried several approaches, and my latest config XML looks like this:
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <properties>
>> >   <parsers>
>> >     <parser class="org.apache.tika.parser.DefaultParser"/>
>> >     <parser class="org.apache.tika.parser.pdf.PDFParser">
>> >       <params>
>> >         <param name="extractInlineImages" type="bool">true</param>
>> >       </params>
>> >     </parser>
>> >     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>> >       <params>
>> >         <param name="setTesseractPath" type="string">/bin/tesseract</param>
>> >       </params>
>> >     </parser>
>> >   </parsers>
>> > </properties>
>> >
>> > but this doesn't work, and what I get on the example file
>> > pdf-with-image.pdf is
>> >
>> > <title/>
>> > </head>
>> > <body><div class="page"><p/>
>> > </div>
>> >
>> > From the TesseractOCRParser documentation I get the following:
>> > "TesseractOCRParser powered by tesseract-ocr engine. To enable this
>> > parser, create a TesseractOCRConfig object and pass it through a
>> > ParseContext."
>> >
>> > but I don't know how to enable it in tika-app. Can anyone help with getting
>> > this to work?
>> >
>> > Thank you
>> > Miguel Fernandes
