OCR and Raw text

David Pilato Tue, 18 Dec 2018 12:44:05 -0800

Heya


When OCR is available, what should happen when I have a document containing 
both text and images with text?

For example, I have a PDF with the text "hello world" and an image containing 
"foo bar".
When I run Tika with Tesseract to extract text, I can see that only the text 
part is extracted, that is, "hello world".

If I run the same configuration on a PDF which contains only an image with "foo 
bar", then "foo bar" is extracted.

Is that expected?
If so, does this mean that as soon as some text is extracted from a document we 
don't run OCR at all?

Thanks for your insights.


David

--
David Pilato, elastic.co
Developer | Evangelist,
