Re: Extracting Text from embedded images in PDF docs

Sergey Beryozkin Fri, 19 May 2017 08:20:57 -0700

Hi Tim

And when is "extractInlineImages" actually effective?


Thanks, Sergey
On 19/05/17 16:16, Allison, Timothy B. wrote:

Y, well, sorry.  I’m thrilled someone is using it!

I tried to document that here:

https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29

See the OCR section.

And there’s a link to that page from https://wiki.apache.org/tika/TikaOCR (See OCR on PDFs)


How can we improve the documentation so that you don’t waste an hour?

*From:* David Pilato [mailto:[email protected]]
*Sent:* Friday, May 19, 2017 5:55 AM
*To:* [email protected]
*Subject:* Re: Extracting Text from embedded images in PDF docs

Got it working. In case someone else hits the same issue, here is my config file... Well... That was obvious :D


<?xml version="1.0" encoding="UTF-8"?>
<properties>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
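
For anyone who wants to double-check which PDFParser params a tika-config.xml actually declares, one option is to read the file with the JDK's own XML parser before handing it to Tika. This is only a rough sketch: the class name TikaConfigCheck and the method pdfParserParams are made up for illustration, and the code reads the XML directly, so it says nothing about how Tika itself interprets the values.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Tika: lists the <param> entries
// declared for org.apache.tika.parser.pdf.PDFParser in a config string.
class TikaConfigCheck {

    public static Map<String, String> pdfParserParams(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)))
                .getDocumentElement();

        Map<String, String> params = new LinkedHashMap<>();
        NodeList parsers = root.getElementsByTagName("parser");
        for (int i = 0; i < parsers.getLength(); i++) {
            Element parser = (Element) parsers.item(i);
            // Only look at the PDFParser entry, skip DefaultParser and others.
            if (!"org.apache.tika.parser.pdf.PDFParser".equals(parser.getAttribute("class"))) {
                continue;
            }
            NodeList declared = parser.getElementsByTagName("param");
            for (int j = 0; j < declared.getLength(); j++) {
                Element param = (Element) declared.item(j);
                params.put(param.getAttribute("name"), param.getTextContent().trim());
            }
        }
        return params;
    }

    public static void main(String[] args) throws Exception {
        // Same content as the config file above, inlined for the example.
        String xml = "<?xml version=\"1.0\" encoding=\"UTF-8\"?>"
                + "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "<parser class=\"org.apache.tika.parser.pdf.PDFParser\">"
                + "<params><param name=\"ocrStrategy\" type=\"string\">ocr_and_text</param></params>"
                + "</parser>"
                + "</parsers></properties>";
        System.out.println(pdfParserParams(xml)); // prints {ocrStrategy=ocr_and_text}
    }
}
```

If the printed map does not contain the param you expect, the problem is in the file (or in which file ends up on the classpath) rather than in the parser.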


David

    On 19 May 2017, at 10:59, David Pilato <[email protected]> wrote:

    So I saw in debug mode that indeed config.getExtractInlineImages()
    is false so I'm going to check my config.

    :D


    David

        On 18 May 2017, at 22:18, David Pilato <[email protected]> wrote:

        Hey guys

        First post here ;)

        I'm trying to play with OCR with Tika. I installed Tesseract and
        I can extract text from a PNG image.

        I created a PDF document with this image embedded and I'm trying
        now to extract the text out of it.

        I added this configuration but I guess I'm doing it wrong:

        <?xml version="1.0" encoding="UTF-8"?>
        <properties>
          <parsers>
            <parser class="org.apache.tika.parser.DefaultParser">
            </parser>
            <parser class="org.apache.tika.parser.pdf.PDFParser">
              <params>
                <param name="extractInlineImages" type="bool">true</param>
              </params>
            </parser>
          </parsers>
        </properties>

        I'm creating my Tika instance with something like:

        TikaConfig config = new TikaConfig(TikaInstance.class.getResourceAsStream("/tika-config.xml"));
        detector = config.getDetector();
        parser = new AutoDetectParser(config);

        tika = new Tika(detector, parser);

        Any idea? I'm feeling that my xml config is wrong but can't find
        what should be the right syntax.

        Thanks for your help guys!
        David
