Hi Tim,

Thanks a lot for your efforts and help in clarifying this. I will look at another approach.

Miguel

On Thu, May 16, 2019 at 9:00 PM Tim Allison wrote:

> Hi Miguel,
>
> I downloaded 1.13, and I couldn't get it to work either. I looked at
> the source code from back then, and it turns out that configuring
> PDFParserConfig via tika-config.xml was not yet implemented. So, the
> only way to do it in 1.13 is programmatically :(
>
> On Thu, May 16, 2019 at 10:10 AM Miguel Fernandes wrote:
> >
> > Hi Tim,
> >
> > Thanks for the prompt reply. I haven't been able to make it work. It
> > seems the default parser is not being called; now it's the composite
> > parser and the PDF parser. Also, I've placed the dependency jars
> > (jai-imageio-core-1.3.1.jar, jai-imageio-jpeg2000-1.3.0.jar,
> > levigo-jbig2-imageio-2.0.jar) in the working directory and exported
> > its path in the CLASSPATH variable. I have tesseract 3.04.00 in my path:
> >
> > tesseract 3.04.00
> >  leptonica-1.72
> >   libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
> >
> > Currently my tika-config.xml looks like:
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <properties>
> >   <parsers>
> >     <parser class="org.apache.tika.parser.DefaultParser">
> >       <mime-exclude>application/pdf</mime-exclude>
> >       <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
> >     </parser>
> >     <parser class="org.apache.tika.parser.pdf.PDFParser">
> >       <params>
> >         <param name="extractInlineImages" type="bool">true</param>
> >         <param name="sortByPosition" type="bool">true</param>
> >       </params>
> >     </parser>
> >     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser"/>
> >   </parsers>
> > </properties>
> >
> > I'm running the following commands:
> >
> > export CLASSPATH=/home/web/apache-tika/1.13
> > java -Djava.awt.headless=true -jar tika-app-1.13.jar --verbose --config=tika-config.xml pdf-with-image.pdf
> >
> > And I get the following answer:
> >
> > <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="pdf:PDFVersion" content="1.4"/>
> > <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
> > <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> > <meta name="access_permission:modify_annotations" content="true"/>
> > <meta name="access_permission:can_print_degraded" content="true"/>
> > <meta name="access_permission:extract_for_accessibility" content="true"/>
> > <meta name="access_permission:assemble_document" content="true"/>
> > <meta name="xmpTPg:NPages" content="1"/>
> > <meta name="resourceName" content="pdf-with-image.pdf"/>
> > <meta name="dc:format" content="application/pdf; version=1.4"/>
> > <meta name="access_permission:extract_content" content="true"/>
> > <meta name="access_permission:can_print" content="true"/>
> > <meta name="access_permission:fill_in_form" content="true"/>
> > <meta name="pdf:encrypted" content="false"/>
> > <meta name="Content-Length" content="197791"/>
> > <meta name="access_permission:can_modify" content="true"/>
> > <meta name="Content-Type" content="application/pdf"/>
> > <title/>
> > </head>
> > <body><div class="page"><p/>
> > </div>
> > </body></html>
> >
> > Miguel
> >
> > On Wed, May 15, 2019 at 9:03 PM Tim Allison wrote:
> >>
> >> > I'm able to do this with version 1.20 very easily, but due to
> >> > other dependencies I need to use 1.13
> >>
> >> I'm so sorry. Please see: https://tika.apache.org/security.html for
> >> reasons to upgrade...and/or consider using tika-server so that you
> >> don't have jar hell/version conflicts.
> >>
> >> On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes wrote:
> >> >
> >> > Hi,
> >> >
> >> > I would like to use tika-app version 1.13 from the command line to
> >> > parse a PDF file with images and extract the text from those with
> >> > tesseract. I'm able to do this with version 1.20 very easily, but
> >> > due to other dependencies I need to use 1.13, which is quite old.
> >> >
> >> > I've tried several approaches, and my latest config XML looks like this:
> >> >
> >> > <?xml version="1.0" encoding="UTF-8"?>
> >> > <properties>
> >> >   <parsers>
> >> >     <parser class="org.apache.tika.parser.DefaultParser"/>
> >> >     <parser class="org.apache.tika.parser.pdf.PDFParser">
> >> >       <params>
> >> >         <param name="extractInlineImages" type="bool">true</param>
> >> >       </params>
> >> >     </parser>
> >> >     <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> >> >       <params>
> >> >         <param name="setTesseractPath" type="string">/bin/tesseract</param>
> >> >       </params>
> >> >     </parser>
> >> >   </parsers>
> >> > </properties>
> >> >
> >> > but this doesn't work, and what I get from the example file
> >> > pdf-with-image.pdf is:
> >> >
> >> > <title/>
> >> > </head>
> >> > <body><div class="page"><p/>
> >> > </div>
> >> >
> >> > From the TesseractOCRParser documentation I get the following:
> >> > "TesseractOCRParser powered by tesseract-ocr engine. To enable this
> >> > parser, create a TesseractOCRConfig object and pass it through a
> >> > ParseContext."
> >> >
> >> > but I don't know how to enable it in tika-app. Can anyone help with
> >> > getting this to work?
> >> >
> >> > Thank you
> >> > Miguel Fernandes
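
For anyone finding this thread later: the programmatic route Tim mentions might look roughly like the sketch below. It has not been tested against 1.13. It assumes tika-app-1.13.jar is on the classpath and that tesseract is on the PATH, and the class name PdfOcrSketch is made up for illustration. It sets the same two PDF options as the tika-config.xml in the thread, then passes a TesseractOCRConfig through the ParseContext, as the TesseractOCRParser documentation quoted above suggests.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not a tested 1.13 recipe.
public class PdfOcrSketch {
    public static void main(String[] args) throws Exception {
        // Same two PDF options as in the tika-config.xml from the thread.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setSortByPosition(true);

        // Default settings; assumes the tesseract binary is on the PATH.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Register the parser itself so embedded images are parsed too.
        context.set(Parser.class, parser);

        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Something like `javac -cp tika-app-1.13.jar PdfOcrSketch.java` followed by `java -Djava.awt.headless=true -cp tika-app-1.13.jar:. PdfOcrSketch pdf-with-image.pdf` should compile and run it. The jai-imageio and jbig2 jars listed in the thread would presumably need to be added to the same classpath.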
