Old thread, but I never replied to this one, so let's bring it to a close here :)
The latest tests I ran actually work as expected. A mistake in my code was causing the parsers to be misconfigured.

Thanks Tim!

--
David Pilato, elastic.co
Developer | Evangelist

On 21 Dec 2018 at 16:38 +0100, Tim Allison <[email protected]> wrote:
> Hi David,
> I'm sorry for my slow response!
>
> That behavior isn't expected. How have you configured Tika to run
> OCR on pdfs?
> 1) extractInlineImages
> 2) render the page and then run OCR
>    a) no_ocr
>    b) ocr_only
>    c) ocr_and_text
>
> Is there any chance that "foo bar" is in the title of the PDF for the
> image-only pdf? We do write title info into the body.
>
> On Fri, Dec 21, 2018 at 8:04 AM David Pilato <[email protected]> wrote:
> >
> > Anyone knows?
> > I guess if no one I need to look at the code or use log debug. :)
> >
> > David
> >
> > --
> > David Pilato, elastic.co
> > Developer | Evangelist
> >
> > On 18 Dec 2018 at 21:43 +0100, David Pilato <[email protected]> wrote:
> >
> > Heya
> >
> > When OCR is available, what should happen when I have a document containing
> > both text and images with text.
> >
> > For example I have a PDF with a text "hello world" and an image containing
> > "foo bar".
> > When I run Tika with Tesseract to extract text, I can see that only the
> > text part is extracted, "hello world" that is.
> >
> > If I run the same configuration on a PDF which contains only an image with
> > "foo bar" then "foo bar" is extracted.
> >
> > Is that expected?
> > If so, does this mean that as soon as some text is extracted from a
> > document we don't run OCR at all?
> >
> > Thanks for your insights.
> >
> > David
> >
> > --
> > David Pilato, elastic.co
> > Developer | Evangelist
