Hi, there has been a similar question on the Tika mailing list recently:
http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E

If you get Tika to OCR the embedded images, the parse-tika plugin will
probably do so as well if its Tika jar is replaced.

Sebastian

On 10/06/2015 03:55 PM, [email protected] wrote:
> Hello,
>
> I use Nutch v1.10, i just want to know if Nutch with Tika parser v1.8 can
> natively OCR images from PDF files? I can OCR JPEG or PNG files but Tika do
> not convert images from PDF. I use Elastic to index.
>
> Thank you
>

