Hi, sorry, I haven't tried this myself; I only remembered that there had been a thread on the Tika mailing list.
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar ?

parse-tika.jar contains the classes of Nutch's parse-tika plugin, which
depends on the library tika-parsers-1.x.jar.

Sebastian

On 10/09/2015 02:54 PM, [email protected] wrote:
> Hello,
>
> I tried to edit the JAR file and changed
> 'org/apache/tika/parser/pdf/PDFParser.properties':
>
> enableAutospace true
> extractAnnotationText true
> sortByPosition false
> suppressDuplicateOverlappingText false
> useNonSequentialParser false
> extractAcroFormContent true
> extractInlineImages true
> extractUniqueInlineImagesOnly false
> checkExtractAccessPermission false
> allowExtractionForAccessibility true
>
> but I get the same result. Tesseract has also been installed.
>
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar ?
>
> Thanks for your help!
>
> 8. Oct 2015 20:43 by [email protected]:
>
>> Hi,
>>
>> there has been a similar question on the Tika mailing list recently:
>>
>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>>
>> If you get Tika to OCR the embedded images, the parse-tika
>> plugin will probably also do so if the Tika jar is replaced.
>>
>> Sebastian
>>
>> On 10/06/2015 03:55 PM, [email protected] wrote:
>>> Hello,
>>>
>>> I use Nutch v1.10. I just want to know if Nutch with Tika parser v1.8 can
>>> natively OCR images from PDF files? I can OCR JPEG or PNG files, but Tika
>>> does not convert images from PDF. I use Elastic to index.
>>>
>>> Thank you
>>>
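PS: To check which of the two jars actually contains PDFParser.properties, and which values it holds after editing, something like the following might help. It is only a rough sketch that I have not run against a real Nutch installation. It relies on jar files being ordinary zip archives; the function names are made up for illustration, and the parser is a simplification that handles plain "key value", "key=value" and "key: value" lines without escapes or line continuations.

```python
import re
import sys
import zipfile

# Entry path as quoted in the thread above.
PDF_PROPS = "org/apache/tika/parser/pdf/PDFParser.properties"


def parse_properties(text):
    """Parse simple properties lines into a dict (hypothetical helper).

    Simplification: escape sequences and line continuations are not handled.
    """
    props = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and comment lines ('#' or '!').
        if not line or line[0] in "#!":
            continue
        # Key runs up to the first '=', ':' or whitespace; the rest is the value.
        match = re.match(r"([^=:\s]+)\s*[=:]?\s*(.*)", line)
        if match:
            props[match.group(1)] = match.group(2)
    return props


def read_pdf_parser_properties(jar_path):
    """Return the properties found in jar_path, or None if the entry is absent."""
    with zipfile.ZipFile(jar_path) as jar:
        if PDF_PROPS not in jar.namelist():
            return None
        return parse_properties(jar.read(PDF_PROPS).decode("iso-8859-1"))


if __name__ == "__main__":
    # Pass one or more jar paths on the command line.
    for path in sys.argv[1:]:
        props = read_pdf_parser_properties(path)
        if props is None:
            print(f"{path}: no {PDF_PROPS}")
        else:
            print(f"{path}: extractInlineImages = {props.get('extractInlineImages')}")
```

Saved under a name of your choice and called with both ./plugins/parse-tika/parse-tika.jar and ./plugins/parse-tika/tika-parsers-1.8.jar as arguments, it reports for each jar whether the file is present and what extractInlineImages is set to there.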

