Hi David,
I’m away from my keyboard, so take the following with a grain of salt. If
you aren’t excluding the PDFParser from your DefaultParser, there’s a chance
that the DefaultParser’s own PDFParser is being called rather than the one
you’re adding.
Try creating a PDFParserConfig, setting it as you want, and adding it to the
ParseContext that you pass to parse() on the regular DefaultParser.
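
Roughly something like the sketch below. It is untested and written from memory, so treat it as a starting point rather than a verified fix. It assumes a Tika 1.x classpath with tika-parsers, a working Tesseract install for the OCR step, and that the PDF path arrives as the first command-line argument. The class name OcrOnlyExample is just a placeholder, and I use AutoDetectParser with its default configuration here, which delegates to the DefaultParser.

```java
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.parser.pdf.PDFParserConfig;
import org.apache.tika.sax.BodyContentHandler;

// Placeholder class name; assumes Tika 1.x and Tesseract are available.
public class OcrOnlyExample {
    public static void main(String[] args) throws Exception {
        // Configure the PDF parser through a config object instead of
        // registering a second PDFParser instance.
        PDFParserConfig pdfConfig = new PDFParserConfig();
        pdfConfig.setOcrStrategy(PDFParserConfig.OCR_STRATEGY.OCR_ONLY);

        // The PDFParser looks for a PDFParserConfig in the ParseContext
        // at parse time.
        ParseContext context = new ParseContext();
        context.set(PDFParserConfig.class, pdfConfig);

        // Default AutoDetectParser, no custom parser array.
        Parser parser = new AutoDetectParser();
        // -1 disables the default write limit on extracted characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        Metadata metadata = new Metadata();

        // Assumption: args[0] is the path to the PDF under test.
        try (InputStream stream = Files.newInputStream(Paths.get(args[0]))) {
            parser.parse(stream, handler, metadata, context);
        }
        System.out.println(handler.toString());
    }
}
```

With this approach there is only one PDFParser in play, so there is no question of which instance handles the file.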
If you’re still finding surprises, please let us know.
Best,
Tim
On Sat, Mar 2, 2019 at 9:04 AM David Pilato <[email protected]> wrote:
> Hey team,
>
>
> I'm wondering if I'm misunderstanding the purpose of ocr_only in
> the PDFParser.
>
> I have a PDF which contains some text within an image block and some
> regular text.
>
>
> When I run Tika with a PDFParser configured with:
>
> PDFParser pdfParser = new PDFParser();
> pdfParser.setOcrStrategy("ocr_only");
> Parser PARSERS[] = new Parser[2];
> PARSERS[0] = new DefaultParser();
> PARSERS[1] = pdfParser;
> Parser parser = new AutoDetectParser(PARSERS);
>
>
> Both texts are extracted from the PDF file.
> I'd have expected that:
>
>
> - *no_ocr* does not do any OCR (this is working fine: "This file
> contains some words." text is not extracted but "This file also
> contains text." is)
> - *ocr_and_text* extracts both (this is working: "This file contains
> some words." and "This file also contains text." texts are extracted)
> - *ocr_only* extracts only OCR based text (this is not working as both
> "This file contains some words." and "This file also contains text."
> texts are extracted where I'd expect to have only "This file contains
> some words.").
>
> Is my understanding of the *ocr_only* value incorrect? This page (
> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29) says:
>
> For ocrStrategy, we currently have: *no_ocr* (rely on regular text
> extraction only), *ocr_only* (don't bother extracting text, just run OCR
> on each page), *ocr_and_text* (both extract text and run OCR).
>
>
> Thanks!
>
>