Re: OCR and Raw text

David Pilato Fri, 21 Dec 2018 05:04:40 -0800

Does anyone know?
I guess if no one does, I'll need to look at the code or turn on debug logging. :)




David

--
David Pilato, elastic.co
Developer | Evangelist,
On 18 Dec 2018 at 21:43 +0100, David Pilato <[email protected]> wrote:
> Heya
>
>
> When OCR is available, what should happen when I have a document containing
> both text and images with text?
>
> For example, I have a PDF with the text "hello world" and an image containing
> "foo bar".
> When I run Tika with Tesseract to extract text, I can see that only the text
> part is extracted, that is, "hello world".
>
> If I run the same configuration on a PDF that contains only an image with
> "foo bar", then "foo bar" is extracted.
>
> Is that expected?
> If so, does this mean that as soon as some text is extracted from a document 
> we don't run OCR at all?
>
> Thanks for your insights.
>
>
> David
>
> --
> David Pilato, elastic.co
> Developer | Evangelist,
