David,

Are you all set with this, or are there still surprises?

On Sat, Mar 2, 2019 at 3:04 PM Tim Allison <[email protected]> wrote:
> Hi David,
>
> I’m afk...take the following with a grain of salt. If you aren’t excluding the
> PDFParser from your DefaultParser, there’s a chance that one is being
> called rather than the one you’re adding.
>
> Try creating a PDFParserConfig, setting it as you want, and adding it to the
> ParseContext that you send into parse() on the regular DefaultParser.
>
> If you’re still finding surprises, please let us know.
>
> Best,
>
> Tim
>
> On Sat, Mar 2, 2019 at 9:04 AM David Pilato <[email protected]> wrote:
>
>> Hey team,
>>
>> I'm wondering if I'm misunderstanding the purpose of ocr_only in
>> the PDFParser.
>>
>> I have a PDF that contains text inside an image block as well as plain text.
>>
>> When I run Tika with a PDFParser configured with:
>>
>> PDFParser pdfParser = new PDFParser();
>> pdfParser.setOcrStrategy("ocr_only");
>> Parser PARSERS[] = new Parser[2];
>> PARSERS[0] = new DefaultParser();
>> PARSERS[1] = pdfParser;
>> Parser parser = new AutoDetectParser(PARSERS);
>>
>> both texts are extracted from the PDF file.
>> I'd have expected that:
>>
>> - *no_ocr* does not do any OCR (this is working fine: "This file
>>   contains some words." text is not extracted but "This file also
>>   contains text." is)
>> - *ocr_and_text* extracts both (this is working: "This file contains
>>   some words." and "This file also contains text." texts are extracted)
>> - *ocr_only* extracts only OCR-based text (this is not working, as
>>   both "This file contains some words." and "This file also contains
>>   text." texts are extracted where I'd expect to have only "This file
>>   contains some words.").
>>
>> Is my understanding of the *ocr_only* value incorrect? This page (
>> https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29) says:
>>
>> For ocrStrategy, we currently have: *no_ocr* (rely on regular text
>> extraction only), *ocr_only* (don't bother extracting text, just run OCR
>> on each page), *ocr_and_text* (both extract text and run OCR).
>>
>> Thanks!
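
For reference, the expectations David lists for the three ocrStrategy values can be written down as a small, self-contained Java model. This is only a sketch of the semantics described on the wiki page quoted above, not Tika code: the class `OcrStrategyModel`, the enum `Strategy`, and the method `expectedOutput` are hypothetical names introduced for illustration, and the two strings are the ones from David's sample PDF.

```java
import java.util.ArrayList;
import java.util.List;

public class OcrStrategyModel {

    // Hypothetical enum mirroring the three documented ocrStrategy values.
    // It is NOT a Tika type; it only records what each value is described as doing.
    enum Strategy {
        NO_OCR(true, false),        // rely on regular text extraction only
        OCR_ONLY(false, true),      // skip text extraction, just run OCR
        OCR_AND_TEXT(true, true);   // extract text and run OCR

        final boolean extractsText;
        final boolean runsOcr;

        Strategy(boolean extractsText, boolean runsOcr) {
            this.extractsText = extractsText;
            this.runsOcr = runsOcr;
        }
    }

    // Returns the pieces of text one would expect to see for a given strategy,
    // given the embedded (regular) text and the text that only exists in an image.
    static List<String> expectedOutput(Strategy strategy, String embeddedText, String imageText) {
        List<String> out = new ArrayList<>();
        if (strategy.extractsText) {
            out.add(embeddedText);
        }
        if (strategy.runsOcr) {
            out.add(imageText);
        }
        return out;
    }

    public static void main(String[] args) {
        String embedded = "This file also contains text.";
        String image = "This file contains some words.";
        // Print the expected extraction for each strategy.
        for (Strategy s : Strategy.values()) {
            System.out.println(s + " -> " + expectedOutput(s, embedded, image));
        }
    }
}
```

Under this model, ocr_only should yield only the image text, which is the behaviour David expected and did not observe. Comparing real output against such a table may help tell whether the configured PDFParser is actually the one being invoked.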
