Help with tika-app 1.13 to extract text from pdf with image

Miguel Fernandes Wed, 15 May 2019 09:44:50 -0700

Hi,

I would like to use tika-app version 1.13 from the command line to parse a
PDF file that contains images and extract the text from those images with
Tesseract. I can do this very easily with version 1.20, but because of other
dependencies I need to use 1.13, which is quite old.


I've tried several approaches, and my latest config XML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<properties>
    <parsers>
        <parser class="org.apache.tika.parser.DefaultParser"/>
        <parser class="org.apache.tika.parser.pdf.PDFParser">
            <params>
                <param name="extractInlineImages" type="bool">true</param>
            </params>
        </parser>
        <parser class="org.apache.tika.parser.ocr.TesseractOCRParser">
            <params>
                <param name="setTesseractPath" type="string">/bin/tesseract</param>
            </params>
        </parser>
    </parsers>
</properties>

But this doesn't work. What I get for the example file
pdf-with-image.pdf is:

<title/>
</head>
<body><div class="page"><p/>
</div>

The TesseractOCRParser documentation says the following:
"*TesseractOCRParser powered by tesseract-ocr engine. To enable this
parser, create a TesseractOCRConfig
<https://tika.apache.org/1.13/api/org/apache/tika/parser/ocr/TesseractOCRConfig.html>
object and pass it through a ParseContext.*"

But I don't know how to enable it in tika-app. Can anyone help with getting
this to work?

Thank you
Miguel Fernandes
