Re: OCR Strategy ocr_only extracts also text

Tim Allison Wed, 06 Mar 2019 14:11:48 -0800

David,
 Are you all set with this, or are there still surprises?
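
 In case it helps, here is a minimal sketch of the ParseContext approach
from my earlier note (quoted below). It assumes Tika 1.x on the classpath
and a working Tesseract install for the OCR step. The class name
OcrOnlySketch and the helpers buildContext() and extract() are made up for
illustration, and I haven't run this against your file.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Hypothetical class name; a sketch, not a tested implementation.
public class OcrOnlySketch {

    // Build a ParseContext that carries a PDFParserConfig with the
    // requested OCR strategy ("no_ocr", "ocr_only" or "ocr_and_text").
    static ParseContext buildContext(String ocrStrategy) {
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(ocrStrategy);

        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);
        return context;
    }

    // Parse a file with the stock AutoDetectParser; no extra PDFParser
    // instance is registered, the config travels in the ParseContext.
    static String extract(Path file, String ocrStrategy) throws Exception {
        Parser parser = new AutoDetectParser();
        BodyContentHandler handler = new BodyContentHandler(-1); // no write limit
        try (InputStream in = Files.newInputStream(file)) {
            parser.parse(in, handler, new Metadata(), buildContext(ocrStrategy));
        }
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: OcrOnlySketch <file.pdf>");
            return;
        }
        System.out.println(extract(Paths.get(args[0]), "ocr_only"));
    }
}
```

 Because the config is passed through the ParseContext, the PDFParser that
the default AutoDetectParser already selects should pick it up, so there
should be no need to add a second PDFParser alongside the DefaultParser.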

On Sat, Mar 2, 2019 at 3:04 PM Tim Allison <[email protected]> wrote:


> Hi David,
>  I’m away from my keyboard, so take the following with a grain of salt.
> If you aren’t excluding the PDFParser from your DefaultParser, there’s a
> chance that the DefaultParser's own PDFParser is being called rather than
> the one you’re adding.
>   Try creating a PDFParserConfig, setting it as you want, and adding it to
> the ParseContext that you pass to parse() on the regular DefaultParser.
>   If you’re still finding surprises, please let us know.
>
>     Best,
>
>       Tim
>
> On Sat, Mar 2, 2019 at 9:04 AM David Pilato <[email protected]> wrote:
>
>> Hey team,
>>
>>
>> I'm wondering if I'm misunderstanding the purpose of ocr_only in
>> the PDFParser.
>>
>> I have a PDF that contains text inside an image block as well as regular
>> text.
>>
>>
>> When I run Tika with a PDFParser configured with:
>>
>> PDFParser pdfParser = new PDFParser();
>> pdfParser.setOcrStrategy("ocr_only");
>> Parser PARSERS[] = new Parser[2];
>> PARSERS[0] = new DefaultParser();
>> PARSERS[1] = pdfParser;
>> Parser parser = new AutoDetectParser(PARSERS);
>>
>>
>> Both texts are extracted from the PDF file.
>> I'd have expected that:
>>
>>
>>    - *no_ocr* does not do any OCR (this is working fine: "This file
>>    contains some words." text is not extracted but "This file also
>>    contains text." is)
>>    - *ocr_and_text* extracts both (this is working: "This file contains
>>    some words." and "This file also contains text." texts are extracted)
>>    - *ocr_only* extracts only OCR-based text (this is not working:
>>    both "This file contains some words." and "This file also contains
>>    text." are extracted, where I'd expect only "This file contains
>>    some words.").
>>
>> Is my understanding of the *ocr_only* value incorrect? This page (
>> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29)
>> says:
>>
>> For ocrStrategy, we currently have: *no_ocr* (rely on regular text
>> extraction only), *ocr_only* (don't bother extracting text, just run OCR
>> on each page), *ocr_and_text* (both extract text and run OCR).
>>
>>
>> Thanks!
>>
>>
