Hey team,
I'm wondering if I'm misunderstanding the purpose of ocr_only in the PDFParser.
I have a PDF that contains text inside an image block as well as regular text.
When I run Tika with a PDFParser configured with:
> PDFParser pdfParser = new PDFParser();
> pdfParser.setOcrStrategy("ocr_only");
> // Pass the default parsers plus the configured PDFParser to AutoDetectParser.
> Parser[] parsers = new Parser[] { new DefaultParser(), pdfParser };
> Parser parser = new AutoDetectParser(parsers);
Both texts are extracted from the PDF file.
I'd have expected that:
• no_ocr does not do any OCR (this is working fine: "This file contains some
words." text is not extracted but "This file also contains text." is)
• ocr_and_text extracts both (this is working: "This file contains some words."
and "This file also contains text." texts are extracted)
• ocr_only extracts only OCR based text (this is not working as both "This file
contains some words." and "This file also contains text." texts are extracted
where I'd expect to have only "This file contains some words.").
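To make that expectation concrete, here is a small sketch of the behaviour I assumed. It is not Tika code: the class ExpectedOcrStrategy and its expected() method are hypothetical names I made up for illustration, and the order of the two strings for ocr_and_text is arbitrary. It only encodes my reading of the three values, which may be exactly where my understanding goes wrong.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the output I expected per strategy; this is NOT Tika code.
public class ExpectedOcrStrategy {

    // Returns the strings I would expect to see extracted for a given strategy name.
    static List<String> expected(String strategy, List<String> textLayer, List<String> ocrText) {
        List<String> out = new ArrayList<>();
        switch (strategy) {
            case "no_ocr":
                // Regular text extraction only.
                out.addAll(textLayer);
                break;
            case "ocr_only":
                // My assumption: only text that lives in images.
                out.addAll(ocrText);
                break;
            case "ocr_and_text":
                // Both sources; the order here is arbitrary.
                out.addAll(ocrText);
                out.addAll(textLayer);
                break;
            default:
                throw new IllegalArgumentException("Unknown strategy: " + strategy);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> textLayer = List.of("This file also contains text.");
        List<String> ocrText = List.of("This file contains some words.");
        for (String s : new String[] {"no_ocr", "ocr_and_text", "ocr_only"}) {
            System.out.println(s + " -> " + expected(s, textLayer, ocrText));
        }
    }
}
```

What I actually observe matches this model for no_ocr and ocr_and_text, but not for ocr_only.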
Is my understanding of the ocr_only value incorrect? This page
(https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29) says:
> For ocrStrategy, we currently have: no_ocr (rely on regular text extraction
> only), ocr_only (don't bother extracting text, just run OCR on each page),
> ocr_and_text (both extract text and run OCR).
Thanks!