Re: OCR Strategy ocr_only extracts also text

David Pilato Sat, 09 Mar 2019 04:45:03 -0800

So I tried with

Parser parser = new AutoDetectParser(pdfParser);


And with:

Parser parser = pdfParser;

I'm still seeing the same behavior.
Does this look like a bug, or is something wrong on my side (which is often 
the case :) )?


On 7 March 2019 at 01:30 +0100, David Pilato <[email protected]> wrote:
> Sadly not yet. I added this to my to-do list, but what you said makes sense to me.
>
> I'll check this later.
>
>
> Thanks for answering ! 🤗
> On 6 March 2019 at 23:11 +0100, Tim Allison <[email protected]> wrote:
> > David,
> >  Are you all set w this or are there still surprises?
> >
> > > On Sat, Mar 2, 2019 at 3:04 PM Tim Allison <[email protected]> wrote:
> > > > Hi David,
> > > >  I’m afk...take following w grain of salt. If you aren’t excluding the 
> > > > PDFParser from your DefaultParser, there’s a chance that one is being 
> > > > called rather than the one you’re adding.
> > > >   Try creating a PDFParserConfig, setting it as you want, add it to the 
> > > > ParseContext that you send into the parse() on the regular 
> > > > DefaultParser.
> > > >   If you’re still finding surprises, please let us know.
> > > >
> > > >     Best,
> > > >
> > > >       Tim
> > > >
> > > > > On Sat, Mar 2, 2019 at 9:04 AM David Pilato <[email protected]> wrote:
> > > > > > Hey team,
> > > > > >
> > > > > >
> > > > > > I'm wondering if I'm misunderstanding the purpose of ocr_only in 
> > > > > > the PDFParser.
> > > > > >
> > > > > > > I have a PDF that contains some text inside an image block and 
> > > > > > > some regular text.
> > > > > >
> > > > > > <D64DD4D0-2F44-4C21-A3D0-79D8CFAA00CA.png>
> > > > > > When I run Tika with a PDFParser configured with:
> > > > > >
> > > > > > > PDFParser pdfParser = new PDFParser();
> > > > > > > pdfParser.setOcrStrategy("ocr_only");
> > > > > > > Parser PARSERS[] = new Parser[2];
> > > > > > > PARSERS[0] = new DefaultParser();
> > > > > > > PARSERS[1] = pdfParser;
> > > > > > > Parser parser = new AutoDetectParser(PARSERS);
> > > > > >
> > > > > > > Both texts are extracted from the PDF file.
> > > > > > I'd have expected that:
> > > > > >
> > > > > >
> > > > > > • no_ocr does not do any OCR (this is working fine: "This file 
> > > > > > contains some words." text is not extracted but "This file also 
> > > > > > contains text." is)
> > > > > > • ocr_and_text extracts both (this is working: "This file contains 
> > > > > > some words." and "This file also contains text." texts are 
> > > > > > extracted)
> > > > > > • ocr_only extracts only OCR based text (this is not working as 
> > > > > > both "This file contains some words." and "This file also contains 
> > > > > > text." texts are extracted where I'd expect to have only "This file 
> > > > > > contains some words.").
> > > > > >
> > > > > > Is my understanding of the ocr_only value incorrect? This page 
> > > > > > (https://wiki.apache.org/tika/PDFParser%20%28Apache%20PDFBox%29) is 
> > > > > > saying:
> > > > > >
> > > > > > > For ocrStrategy, we currently have: no_ocr (rely on regular text 
> > > > > > > extraction only), ocr_only (don't bother extracting text, just 
> > > > > > > run OCR on each page), ocr_and_text (both extract text and run 
> > > > > > > OCR).
> > > > > >
> > > > > > Thanks!
> > > > > >
