Re: Help with tika-app 1.13 to extract text from pdf with image

Tim Allison Thu, 16 May 2019 13:08:38 -0700

Hi Miguel,
  I downloaded 1.13, and I couldn't get it to work either.  I looked
at the source code for that release, and it turns out that configuring
PDFParserConfig via tika-config.xml was not yet implemented.  So, the
only way to do it in 1.13 is programmatically :(
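
A minimal sketch of the programmatic route, assuming tika-app-1.13.jar is
on the classpath and tesseract is on the PATH. The class name
PdfOcrExtract is hypothetical, and the sketch has not been run against
1.13; it mirrors the two PDFParser params from the tika-config.xml below
and passes the configs through a ParseContext, as the TesseractOCRParser
documentation describes.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.ocr.TesseractOCRConfig;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical helper class; only the Tika types are real.
public class PdfOcrExtract {
    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: PdfOcrExtract <file.pdf>");
            System.exit(1);
        }

        // Same two switches as the <param> entries in tika-config.xml.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setExtractInlineImages(true);
        pdfConfig.setSortByPosition(true);

        // Default OCR settings; assumes the tesseract binary is on the PATH.
        TesseractOCRConfig ocrConfig = new TesseractOCRConfig();

        Parser parser = new AutoDetectParser();
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        context.set(TesseractOCRConfig.class, ocrConfig);
        // Registering the parser lets embedded images be parsed recursively,
        // which is how they reach the OCR parser.
        context.set(Parser.class, parser);

        // -1 disables the default write limit on extracted text.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(in, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

To try it: javac -cp tika-app-1.13.jar PdfOcrExtract.java, then
java -cp tika-app-1.13.jar:. PdfOcrExtract pdf-with-image.pdf. Any extra
image codec jars would need to be added to the same -cp list.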



On Thu, May 16, 2019 at 10:10 AM Miguel Fernandes
<[email protected]> wrote:
>
> Hi Tim,
>
> Thanks for the prompt reply. I haven't been able to make it work. It seems
> the default parser is not being called; now it's the composite parser and
> the PDF parser. Also, I've placed the dependency jars
> (jai-imageio-core-1.3.1.jar, jai-imageio-jpeg2000-1.3.0.jar,
> levigo-jbig2-imageio-2.0.jar) in the working directory and exported that
> path in the CLASSPATH variable. I have tesseract 3.04.00 on my path:
>
> tesseract 3.04.00
>  leptonica-1.72
>   libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0
>
> Currently my tika-config.xml looks like:
>
> <?xml version="1.0" encoding="UTF-8"?>
> <properties>
>     <parsers>
>         <parser class="org.apache.tika.parser.DefaultParser">
>             <mime-exclude>application/pdf</mime-exclude>
>             <parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
>         </parser>
>         <parser class="org.apache.tika.parser.pdf.PDFParser">
>             <params>
>                 <param name="extractInlineImages" type="bool">true</param>
>                 <param name="sortByPosition" type="bool">true</param>
>             </params>
>         </parser>
>         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
>     </parsers>
> </properties>
>
> I'm running the following commands:
>
> export CLASSPATH=/home/web/apache-tika/1.13
> java -Djava.awt.headless=true -jar tika-app-1.13.jar --verbose --config=tika-config.xml pdf-with-image.pdf
>
> And I get the following output:
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="pdf:PDFVersion" content="1.4"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
> <meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
> <meta name="access_permission:modify_annotations" content="true"/>
> <meta name="access_permission:can_print_degraded" content="true"/>
> <meta name="access_permission:extract_for_accessibility" content="true"/>
> <meta name="access_permission:assemble_document" content="true"/>
> <meta name="xmpTPg:NPages" content="1"/>
> <meta name="resourceName" content="pdf-with-image.pdf"/>
> <meta name="dc:format" content="application/pdf; version=1.4"/>
> <meta name="access_permission:extract_content" content="true"/>
> <meta name="access_permission:can_print" content="true"/>
> <meta name="access_permission:fill_in_form" content="true"/>
> <meta name="pdf:encrypted" content="false"/>
> <meta name="Content-Length" content="197791"/>
> <meta name="access_permission:can_modify" content="true"/>
> <meta name="Content-Type" content="application/pdf"/>
> <title/>
> </head>
> <body><div class="page"><p/>
> </div>
> </body></html>
>
> Miguel
>
> On Wed, May 15, 2019 at 9:03 PM Tim Allison <[email protected]> wrote:
>>
>> >I'm able to do this with version 1.20 very easily but due to other
>> >dependencies I need to use 1.13
>>
>> I'm so sorry.  Please see: https://tika.apache.org/security.html for
>> reasons to upgrade...and/or consider using tika-server so that you
>> don't have jar hell/version conflicts.
>>
>> On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
>> <[email protected]> wrote:
>> >
>> > Hi,
>> >
>> > I would like to use tika-app version 1.13 from the command line to parse a
>> > PDF file with images and extract the text from those images with
>> > tesseract. I'm able to do this with version 1.20 very easily, but due to
>> > other dependencies I need to use 1.13, which is quite old.
>> >
>> > I've tried several approaches, and my latest config XML looks like this:
>> >
>> > <?xml version="1.0" encoding="UTF-8"?>
>> > <properties>
>> >     <parsers>
>> >         <parser class="org.apache.tika.parser.DefaultParser"/>
>> >         <parser class="org.apache.tika.parser.pdf.PDFParser">
>> >             <params>
>> >                 <param name="extractInlineImages" type="bool">true</param>
>> >             </params>
>> >         </parser>
>> >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
>> >             <params>
>> >                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
>> >             </params>
>> >         </parser>
>> >     </parsers>
>> > </properties>
>> >
>> > but this doesn't work, and what I get for the example file
>> > pdf-with-image.pdf is:
>> >
>> > <title/>
>> > </head>
>> > <body><div class="page"><p/>
>> > </div>
>> >
>> > From the TesseractOCRParser documentation I get the following:
>> > "TesseractOCRParser powered by tesseract-ocr engine. To enable this 
>> > parser, create a TesseractOCRConfig object and pass it through a 
>> > ParseContext."
>> >
>> > but I don't know how to enable it in tika-app. Can anyone help with getting
>> > this to work?
>> >
>> > Thank you
>> > Miguel Fernandes
