Re: OCR Strategy ocr_only extracts also text

Tim Allison Wed, 13 Mar 2019 11:22:23 -0700

Sorry for my delay.  I'm not able to replicate this behavior. :(

When I parse this file:
https://github.com/apache/tika/blob/master/tika-parsers/src/test/resources/test-documents/testPDFVarious.pdf


This way:
        PDFParser pdfParser = new PDFParser();
        pdfParser.setOcrStrategy("ocr_only");
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();
        ParseContext parseContext = new ParseContext();
        try (InputStream is = getResourceAsStream("/test-documents/testPDFVarious.pdf")) {
            pdfParser.parse(is, handler, metadata, parseContext);
        }

Or better:
        AutoDetectParser parser = new AutoDetectParser();

        PDFParserConfig pdfParserConfig  = new PDFParserConfig();
        pdfParserConfig.setOcrStrategy("ocr_only");
        ParseContext parseContext = new ParseContext();
        parseContext.set(PDFParserConfig.class, pdfParserConfig);
        ContentHandler handler = new ToXMLContentHandler();
        Metadata metadata = new Metadata();

        try (InputStream is = getResourceAsStream("/test-documents/testPDFVarious.pdf")) {
            parser.parse(is, handler, metadata, parseContext);
        }

I'm only seeing a <div class="ocr"/>...

When I run this with "ocr_and_text", I get the extracted text and the
<div class="ocr">... too...
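
For reference, here is how I read the three ocrStrategy values from the wiki
text quoted below, as a minimal plain-Java sketch. This is not Tika code:
OcrStrategySketch, fromString and expectedOutput are hypothetical names made
up for illustration, and the order of the returned strings is arbitrary.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical model of the documented ocrStrategy semantics; NOT part of Tika.
enum OcrStrategySketch {
    NO_OCR(true, false),        // regular text extraction only
    OCR_ONLY(false, true),      // skip text extraction, OCR each page
    OCR_AND_TEXT(true, true);   // extract text and run OCR

    private final boolean extractsText;
    private final boolean runsOcr;

    OcrStrategySketch(boolean extractsText, boolean runsOcr) {
        this.extractsText = extractsText;
        this.runsOcr = runsOcr;
    }

    // Maps the strings used in this thread ("no_ocr", "ocr_only",
    // "ocr_and_text") to the matching constant.
    static OcrStrategySketch fromString(String value) {
        return valueOf(value.toUpperCase(Locale.ROOT));
    }

    // Returns the pieces of content one would expect in the output, given
    // the PDF's embedded text and the text recognized inside its images.
    List<String> expectedOutput(String embeddedText, String ocrText) {
        List<String> out = new ArrayList<>();
        if (extractsText) {
            out.add(embeddedText);
        }
        if (runsOcr) {
            out.add(ocrText);
        }
        return out;
    }
}
```

With David's sample strings, fromString("ocr_only").expectedOutput("This file
also contains text.", "This file contains some words.") holds only the OCR
string, which is consistent with what I see on my side.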

Help!

On Sat, Mar 9, 2019 at 7:44 AM David Pilato <[email protected]> wrote:

> So I tried with
>
> Parser parser = new AutoDetectParser(pdfParser);
>
> And with:
>
> Parser parser = pdfParser;
>
> I'm still seeing the same behavior.
> Does it look like an issue? Or something wrong on my side (well this is
> often the case :) ).
>
>
> On 7 March 2019 at 01:30 +0100, David Pilato <[email protected]> wrote:
>
> Sadly not yet. I added this on my todo but what you said makes sense to
> me.
>
> I'll check this later.
>
>
> Thanks for answering! 🤗
> On 6 March 2019 at 23:11 +0100, Tim Allison <[email protected]> wrote:
>
> David,
>  Are you all set with this, or are there still surprises?
>
> On Sat, Mar 2, 2019 at 3:04 PM Tim Allison <[email protected]> wrote:
>
>> Hi David,
>>  I’m AFK, so take the following with a grain of salt. If you aren’t
>> excluding the PDFParser from your DefaultParser, there’s a chance that the
>> default one is being called rather than the one you’re adding.
>>   Try creating a PDFParserConfig, setting it as you want, and adding it to
>> the ParseContext that you pass to parse() on the regular DefaultParser.
>>   If you’re still finding surprises, please let us know.
>>
>>     Best,
>>
>>       Tim
>>
>> On Sat, Mar 2, 2019 at 9:04 AM David Pilato <[email protected]> wrote:
>>
>>> Hey team,
>>>
>>>
>>> I'm wondering if I'm misunderstanding the purpose of ocr_only in
>>> the PDFParser.
>>>
>>> I have a PDF that contains text inside an image block as well as regular text.
>>>
>>> <D64DD4D0-2F44-4C21-A3D0-79D8CFAA00CA.png>
>>> When I run Tika with a PDFParser configured with:
>>>
>>> PDFParser pdfParser = new PDFParser();
>>> pdfParser.setOcrStrategy("ocr_only");
>>> Parser PARSERS[] = new Parser[2];
>>> PARSERS[0] = new DefaultParser();
>>> PARSERS[1] = pdfParser;
>>> Parser parser = new AutoDetectParser(PARSERS);
>>>
>>>
>>> Both texts are extracted from the PDF file.
>>> I'd have expected that:
>>>
>>>
>>>    - *no_ocr* does not do any OCR (this is working fine: "This file
>>>    contains some words." text is not extracted but "This file also
>>>    contains text." is)
>>>    - *ocr_and_text* extracts both (this is working: "This file contains
>>>    some words." and "This file also contains text." texts are extracted)
>>>    - *ocr_only* extracts only OCR based text (this is not working as
>>>    both "This file contains some words." and "This file also contains
>>>    text." texts are extracted where I'd expect to have only "This file
>>>    contains some words.").
>>>
>>> Is my understanding of the *ocr_only* value incorrect? This page (
>>> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29)
>>> says:
>>>
>>> For ocrStrategy, we currently have: *no_ocr* (rely on regular text
>>> extraction only), *ocr_only* (don't bother extracting text, just run
>>> OCR on each page), *ocr_and_text* (both extract text and run OCR).
>>>
>>>
>>> Thanks!
>>>
>>>
