Hi David,
  I'm sorry for my slow response!

  That behavior isn't expected.  How have you configured Tika to run
OCR on PDFs?
1) extractInlineImages
2) render the page and then run OCR, with the OCR strategy set to one of:
    a) no_ocr
    b) ocr_only
    c) ocr_and_text
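
For reference, here is a rough sketch of what combining 1) and 2c) might
look like in a tika-config.xml.  I'm assuming a Tika 1.x style config
here, and the parameter names and types should be checked against the
version you're running; treat it as an illustration, not a drop-in file.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Sketch only: assumes Tika 1.x parameter names; verify against your version. -->
<properties>
  <parsers>
    <!-- Use the default parsers for everything except PDF. -->
    <parser class="org.apache.tika.parser.DefaultParser">
      <parser-exclude class="org.apache.tika.parser.pdf.PDFParser"/>
    </parser>
    <!-- Configure the PDF parser explicitly. -->
    <parser class="org.apache.tika.parser.pdf.PDFParser">
      <params>
        <!-- Option 1: run OCR on images embedded in the PDF. -->
        <param name="extractInlineImages" type="bool">true</param>
        <!-- Option 2c: render each page, OCR it, and also extract the stored text. -->
        <param name="ocrStrategy" type="string">ocr_and_text</param>
      </params>
    </parser>
  </parsers>
</properties>
```

With both enabled, the same image text may show up twice in the output,
so in practice you'd probably pick one or the other.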

Is there any chance that "foo bar" is in the title of the image-only
PDF?  We do write title info into the body.





On Fri, Dec 21, 2018 at 8:04 AM David Pilato <[email protected]> wrote:
>
> Does anyone know?
> I guess if no one does, I'll need to look at the code or turn on debug logging. :)
>
>
>
> David
>
> --
> David Pilato, elastic.co
> Developer | Evangelist,
> On Dec 18, 2018 at 21:43 +0100, David Pilato <[email protected]> wrote:
>
> Heya
>
>
> When OCR is available, what should happen when I have a document containing
> both text and images with text?
>
> For example, I have a PDF with the text "hello world" and an image containing
> "foo bar".
> When I run Tika with Tesseract to extract text, I can see that only the text
> part, "hello world", is extracted.
>
> If I run the same configuration on a PDF which contains only an image with 
> "foo bar" then "foo bar" is extracted.
>
> Is that expected?
> If so, does this mean that as soon as some text is extracted from a document 
> we don't run OCR at all?
>
> Thanks for your insights.
>
>
> David
>
> --
> David Pilato, elastic.co
> Developer | Evangelist,
