Re: Help with tika-app 1.13 to extract text from pdf with image

Miguel Fernandes Thu, 16 May 2019 07:10:30 -0700

Hi Tim,

Thanks for the prompt reply. I haven't been able to make it work. It seems the
default parser is not being called: the output now reports only the composite
parser and the PDF parser. I've also placed the dependency jars
(jai-imageio-core-1.3.1.jar, jai-imageio-jpeg2000-1.3.0.jar,
levigo-jbig2-imageio-2.0.jar) in the working directory and exported that
path as the CLASSPATH variable. Tesseract 3.04.00 is on my PATH:


tesseract 3.04.00
 leptonica-1.72
  libgif 4.1.6(?) : libjpeg 6b (libjpeg-turbo 1.2.90) : libpng 1.5.13 : libtiff 4.0.3 : zlib 1.2.7 : libwebp 0.3.0

Currently my tika-config.xml looks like:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser">
            <mime-exclude>application/pdf</mime-exclude>
            <parser-exclude class="org.apache.tika.parser.pdf.PDFParser" />
        </parser>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
                <param name="sortByPosition" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser" />
    </parsers>
</properties>

I'm running the following commands:

export CLASSPATH=/home/web/apache-tika/1.13
java -Djava.awt.headless=true -jar tika-app-1.13.jar --verbose \
  --config=tika-config.xml pdf-with-image.pdf

And I get the following output:

<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="pdf:PDFVersion" content="1.4"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.CompositeParser"/>
<meta name="X-Parsed-By" content="org.apache.tika.parser.pdf.PDFParser"/>
<meta name="access_permission:modify_annotations" content="true"/>
<meta name="access_permission:can_print_degraded" content="true"/>
<meta name="access_permission:extract_for_accessibility" content="true"/>
<meta name="access_permission:assemble_document" content="true"/>
<meta name="xmpTPg:NPages" content="1"/>
<meta name="resourceName" content="pdf-with-image.pdf"/>
<meta name="dc:format" content="application/pdf; version=1.4"/>
<meta name="access_permission:extract_content" content="true"/>
<meta name="access_permission:can_print" content="true"/>
<meta name="access_permission:fill_in_form" content="true"/>
<meta name="pdf:encrypted" content="false"/>
<meta name="Content-Length" content="197791"/>
<meta name="access_permission:can_modify" content="true"/>
<meta name="Content-Type" content="application/pdf"/>
<title/>
</head>
<body><div class="page"><p/>
</div>
</body></html>
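
One thing I have not ruled out yet: as far as I know, `java -jar` ignores the CLASSPATH variable, so the ImageIO jars may never be loaded by the command above. Below is a minimal sketch of the alternative, assuming the jars sit in /home/web/apache-tika/1.13 and that the tika-app main class is org.apache.tika.cli.TikaCLI. The names TIKA_CP and run_tika are made up for illustration, and I have not confirmed that this alone makes OCR work on 1.13.

```shell
# Directory holding the extra ImageIO jars (same path as the CLASSPATH export above).
LIB_DIR=/home/web/apache-tika/1.13

# Explicit classpath: the app jar plus every jar in LIB_DIR.
# The quotes keep the shell from expanding the *; the JVM expands it itself.
TIKA_CP="tika-app-1.13.jar:${LIB_DIR}/*"

# Hypothetical helper: launch the Tika CLI main class via -cp instead of -jar,
# passing any further arguments (such as the input file) straight through.
run_tika() {
    java -Djava.awt.headless=true -cp "$TIKA_CP" \
        org.apache.tika.cli.TikaCLI --verbose --config=tika-config.xml "$@"
}
```

It would be called from the directory containing tika-app-1.13.jar as `run_tika pdf-with-image.pdf`.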

Miguel

On Wed, May 15, 2019 at 9:03 PM Tim Allison <[email protected]> wrote:

> > I'm able to do this with version 1.20 very easily but due to other
> > dependencies i need to use 1.13
>
> I'm so sorry.  Please see: https://tika.apache.org/security.html for
> reasons to upgrade...and/or consider using tika-server so that you
> don't have jar hell/version conflicts.
>
> On Wed, May 15, 2019 at 12:44 PM Miguel Fernandes
> <[email protected]> wrote:
> >
> > Hi,
> >
> > I would like to use tika-app version 1.13 from the command line to parse
> > a pdf file with images and extract the text from those with tesseract. I'm
> > able to do this with version 1.20 very easily but due to other dependencies
> > i need to use 1.13 which is quite old.
> >
> > I've tried several approaches, and my latest config xml looks like
> > this:
> >
> > <?xml version="1.0" encoding="UTF-8"?>
> > <properties>
> >     <parsers>
> >         <parser class="org.apache.tika.parser.DefaultParser"/>
> >         <parser class="org.apache.tika.parser.pdf.PDFParser">
> >             <params>
> >                 <param name="extractInlineImages" type="bool">true</param>
> >             </params>
> >         </parser>
> >         <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
> >             <params>
> >                 <param name="setTesseractPath" type="string">/bin/tesseract</param>
> >             </params>
> >         </parser>
> >     </parsers>
> > </properties>
> >
> > but this doesn't work, and what I get from the example file
> > pdf-with-image.pdf is
> >
> > <title/>
> > </head>
> > <body><div class="page"><p/>
> > </div>
> >
> > from the TesseractOCRParser documentation i get the following
> > "TesseractOCRParser powered by tesseract-ocr engine. To enable this
> > parser, create a TesseractOCRConfig object and pass it through a
> > ParseContext."
> >
> > but i dont know how to enable it in tika-app. Can anyone help with
> > getting this to work?
> >
> > Thank you
> > Miguel Fernandes
>
