Re: OCR and Raw text

David Pilato Mon, 04 Mar 2019 01:53:07 -0800

Old thread, but I never replied to this one, so let's bring it to a close :)



The latest tests I ran actually work as expected. A mistake in my 
code was causing a misconfiguration of the parsers.
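
For anyone landing here later: the two approaches Tim lists in his reply below 
correspond to settings on the PDF parser. The following is only a minimal 
sketch of a tika-config.xml, assuming the Tika 1.x parameter names 
(extractInlineImages, ocrStrategy); it is not the exact configuration used in 
these tests.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: parameter names assume the Tika 1.x PDFParserConfig setters. -->
<properties>
  <parsers>
    <!-- Keep the default parsers, but leave PDFs to the block below. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Option 1: hand embedded images to the OCR parser. -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Option 2: render each page and OCR it.
             One of no_ocr, ocr_only, ocr_and_text. -->
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
```

Both options are shown together purely for illustration. Enabling them at the 
same time may produce the image text twice, so picking one is probably the 
better choice.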

Thanks Tim!


--
David Pilato, elastic.co
Developer | Evangelist,
On 21 Dec 2018 at 16:38 +0100, Tim Allison <[email protected]> wrote:
> Hi David,
> I'm sorry for my slow response!
>
> That behavior isn't expected. How have you configured Tika to run
> OCR on pdfs?
> 1) extractInlineImages
> 2) render the page and then run OCR
>    a) no_ocr
>    b) ocr_only
>    c) ocr_and_text
>
> Is there any chance that "foo bar" is in the title of the PDF for the
> image-only pdf? We do write title info into the body.
>
> On Fri, Dec 21, 2018 at 8:04 AM David Pilato <[email protected]> wrote:
> >
> > Does anyone know?
> > If not, I guess I'll need to look at the code or enable debug logging. :)
> >
> >
> >
> > David
> >
> > --
> > David Pilato, elastic.co
> > Developer | Evangelist,
> > On 18 Dec 2018 at 21:43 +0100, David Pilato <[email protected]> wrote:
> >
> > Heya
> >
> >
> > When OCR is available, what should happen when I have a document containing 
> > both text and images with text?
> >
> > For example, I have a PDF with the text "hello world" and an image containing 
> > "foo bar".
> > When I run Tika with Tesseract to extract text, I can see that only the 
> > text part is extracted, "hello world" that is.
> >
> > If I run the same configuration on a PDF which contains only an image with 
> > "foo bar" then "foo bar" is extracted.
> >
> > Is that expected?
> > If so, does this mean that as soon as some text is extracted from a 
> > document we don't run OCR at all?
> >
> > Thanks for your insights.
> >
> >
> > David
> >
> > --
> > David Pilato, elastic.co
> > Developer | Evangelist,
