Re: disable extraction of images

2016-04-13 Thread Jukka Zitting
Hi, Some scanning software includes OCR features and embeds hidden text behind the scanned images to make the resulting PDF searchable. I suspect this may be happening in your case. It would be technically possible to detect such hidden text and have an option for excluding it from the output, bu

Re: disable extraction of images

2016-04-13 Thread ron.vandenbranden
Hi again, On 13/04/2016 13:18, ron.vandenbranden wrote: I wasn't aware of tesseract; I definitely don't have it on my classpath. I'm just testing with the stand-alone tika jar file. My Java skills are close to zero (apart from copy/paste and recompiling things). Could you tell me how to conf

Re: disable extraction of images

2016-04-13 Thread ron.vandenbranden
Thanks, I wasn't aware of tesseract; I definitely don't have it on my classpath. I'm just testing with the stand-alone tika jar file. My Java skills are close to zero (apart from copy/paste and recompiling things). Could you tell me how to configure this for the standalone jar file, please?

Re: disable extraction of images

2016-04-13 Thread Nick Burch
On Wed, 13 Apr 2016, ron.vandenbranden wrote: Is it possible to disable text extraction from images inside a PDF file? I'm testing with the CLI tika app, which has "extractInlineImages" set to false by default, if I'm not mistaken. Yet, the text of the images is still present in the generated H

disable extraction of images

2016-04-13 Thread ron.vandenbranden
Hi, I've just happily discovered Tika and am sorting out how well it fits our needs. I'm trying to create a searchable index for PDF files that contain typed pages and pages with scanned text facsimiles. Some of those facsimiles are scans from print source materials, in which case Tika see