Heya
When OCR is available, what should happen when a document contains both text and images that contain text?

For example, I have a PDF with the text "hello world" and an image containing "foo bar". When I run Tika with Tesseract to extract text, only the text part, "hello world", is extracted. If I run the same configuration on a PDF that contains only an image with "foo bar", then "foo bar" is extracted.

Is that expected? If so, does this mean that as soon as some text is extracted from a document, OCR is not run at all?

Thanks for your insights.

David

--
David Pilato, elastic.co
Developer | Evangelist,
