Re: Help with tika-app 1.13 to extract text from pdf with image

Miguel Fernandes Fri, 17 May 2019 04:16:21 -0700

Hi Tim,

Thanks a lot for your efforts and help in clarifying this. I will look at
another approach.


Miguel

On Thu, May 16, 2019 at 9:00 PM Tim Allison <[email protected]> wrote:

> Hi Miguel,
>   I downloaded 1.13, and I couldn't get it to work either.  I looked
> at the source code back then, and it turns out that the
> PDFParserConfig configuration via tika-config.xml was not yet
> implemented.  So, the only way to do it is programmatically in 1.13 :(
>
>
> On Thu, May 16, 2019 at 10:10 AM Miguel Fernandes
> <[email protected]> wrote:
> >
> > Hi Tim,
> >
> > Thanks for the prompt reply. I haven't been able to make it work. It seems
> > the default parser is not being called; now it's the composite parser
> > and the PDF parser. Also, I've placed the dependency library jars
> > (jai-imageio-core-1.3.1.jar, jai-imageio-jpeg2000-1.3.0.jar,
> > levigo-jbig2-imageio-2.0.jar) in the working directory and exported its
> > path in the CLASSPATH variable. I have tesseract 3.04.00 in my path:
> >
> > tesseract 3.04.00
> >  leptonica-1.72
> >   libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
> >
> > Currently my tika-config.xml looks like:
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <properties>
> >     <parsers>
> >         <parser class="org.apache.tika.parser.DefaultParser">
> >             <mime-exclude>application/pdf</mime-exclude>
> >             <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
> >         </parser>
> >         <parser class="org.apache.tika.parser.pdf.PDFParser">
> >             <params>
> >                 <param name="extractInlineImages" type="bool">true</param>
> >                 <param name="sortByPosition" type="bool">true</param>
> >             </params>
> >         </parser>
> >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
> >     </parsers>
> > </properties>
> >
> > I'm running the following commands:
> >
> > export CLASSPATH=/home/web/apache-tika/1.13
> > java -Djava.awt.headless=true -jar tika-app-1.13.jar --verbose --config=tika-config.xml pdf-with-image.pdf
> >
> > And I get the following output:
> >
> > <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> > <head>
> > <meta name="pdf:PDFVersion" content="1.4"/>
> > <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
> > <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> > <meta name="access_permission:modify_annotations" content="true"/>
> > <meta name="access_permission:can_print_degraded" content="true"/>
> > <meta name="access_permission:extract_for_accessibility" content="true"/>
> > <meta name="access_permission:assemble_document" content="true"/>
> > <meta name="xmpTPg:NPages" content="1"/>
> > <meta name="resourceName" content="pdf-with-image.pdf"/>
> > <meta name="dc:format" content="application/pdf; version=1.4"/>
> > <meta name="access_permission:extract_content" content="true"/>
> > <meta name="access_permission:can_print" content="true"/>
> > <meta name="access_permission:fill_in_form" content="true"/>
> > <meta name="pdf:encrypted" content="false"/>
> > <meta name="Content-Length" content="197791"/>
> > <meta name="access_permission:can_modify" content="true"/>
> > <meta name="Content-Type" content="application/pdf"/>
> > <title/>
> > </head>
> > <body><div class="page"><p/>
> > </div>
> > </body></html>
> >
> > Miguel
> >
> > On Wed, May 15, 2019 at 9:03 PM Tim Allison <[email protected]> wrote:
> >>
> >> > I'm able to do this with version 1.20 very easily but due to other dependencies i need to use 1.13
> >>
> >> I'm so sorry.  Please see: https://tika.apache.org/security.html for
> >> reasons to upgrade...and/or consider using tika-server so that you
> >> don't have jar hell/version conflicts.
> >>
> >> On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
> >> <[email protected]> wrote:
> >> >
> >> > Hi,
> >> >
> >> > I would like to use tika-app version 1.13 from the command line to
> >> > parse a PDF file with images and extract the text from those images with
> >> > tesseract. I'm able to do this with version 1.20 very easily, but due to
> >> > other dependencies I need to use 1.13, which is quite old.
> >> >
> >> > I've tried several approaches, and my latest config XML looks like this:
> >> >
> >> > <?xml version="1.0" encoding="UTF-8"?>
> >> > <properties>
> >> >     <parsers>
> >> >         <parser class="org.apache.tika.parser.DefaultParser"/>
> >> >         <parser class="org.apache.tika.parser.pdf.PDFParser">
> >> >             <params>
> >> >                 <param name="extractInlineImages" type="bool">true</param>
> >> >             </params>
> >> >         </parser>
> >> >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> >> >             <params>
> >> >                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
> >> >             </params>
> >> >         </parser>
> >> >     </parsers>
> >> > </properties>
> >> >
> >> > But this doesn't work, and what I get for the example file
> >> > pdf-with-image.pdf is:
> >> >
> >> > <title/>
> >> > </head>
> >> > <body><div class="page"><p/>
> >> > </div>
> >> >
> >> > From the TesseractOCRParser documentation I get the following:
> >> > "TesseractOCRParser powered by tesseract-ocr engine. To enable this
> >> > parser, create a TesseractOCRConfig object and pass it through a
> >> > ParseContext."
> >> >
> >> > But I don't know how to enable it in tika-app. Can anyone help with
> >> > getting this to work?
> >> >
> >> > Thank you
> >> > Miguel Fernandes
>
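
For reference, the programmatic route mentioned in the thread (a PDFParserConfig plus a TesseractOCRConfig passed through a ParseContext) might look roughly like the sketch below. It is only an illustration and was not posted or tested by anyone in the thread. The class name PdfOcrSketch and the helper buildContext are hypothetical, the optional second argument is assumed to be the directory containing the tesseract executable, and the tika-app-1.13.jar is assumed to be on the classpath. Whether OCR of inline images then works in 1.13 is not confirmed by the thread.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not code from the thread.
public class PdfOcrSketch {

    // Hypothetical helper: builds a ParseContext carrying the PDF and OCR settings
    // that the tika-config.xml in the thread was trying to express.
    public static ParseContext buildContext(Parser parser, String tesseractDir) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true); // hand embedded images to the embedded-document parser
        pdfConfig.setSortByPosition(true);

        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();
        if (tesseractDir != null) {
            // Assumption: a directory such as "/bin", not the path of the binary itself.
            ocrConfig.setTesseractPath(tesseractDir);
        }

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded documents (the extracted images) be parsed too.
        context.set(Parser.class, parser);
        return context;
    }

    public static void main(String[] args) throws Exception {
        String pdfPath = args.length > 0 ? args[0] : "pdf-with-image.pdf";
        String tesseractDir = args.length > 1 ? args[1] : null; // null: rely on PATH

        Parser parser = new AutoDetectParser();
        ParseContext context = buildContext(parser, tesseractDir);
        BodyContentHandler handler = new BodyContentHandler(-1); // -1 disables the write limit
        Metadata metadata = new Metadata();

        try (InputStream stream = Files.newInputStream(Paths.get(pdfPath))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

Assuming the jar sits in the working directory, this could be compiled with "javac -cp tika-app-1.13.jar PdfOcrSketch.java" and run with "java -cp tika-app-1.13.jar:. PdfOcrSketch pdf-with-image.pdf".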
