Hi,

I've just verified with Nutch trunk (upcoming 1.11):
- Tika 1.10 is able to OCR embedded images if PDFParser.properties
  is modified accordingly in tika-app-1.10.jar
- but parse-tika doesn't if the same modifications are made in
  runtime/local/plugins/parse-tika/tika-parsers-1.10.jar

This needs some debugging to find out what is wrong.

Please feel free to file a bug report on
https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 10/09/2015 06:21 PM, Sebastian Nagel wrote:
> Hi,
>
> sorry, but I didn't try this myself, I just had
> in mind that there has been a thread on the Tika
> mailing list.
>
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar ?
>
> parse-tika.jar contains the classes of Nutch's parse-tika plugin
> which depends on the library tika-parsers-1.x.jar.
>
> Sebastian
>
> On 10/09/2015 02:54 PM, [email protected] wrote:
>> Hello,
>>
>> I tried to edit the JAR file and changed
>> 'org/apache/tika/parser/pdf/PDFParser.properties' :
>>
>> enableAutospace true
>> extractAnnotationText true
>> sortByPosition false
>> suppressDuplicateOverlappingText false
>> useNonSequentialParser false
>> extractAcroFormContent true
>> extractInlineImages true
>> extractUniqueInlineImagesOnly false
>> checkExtractAccessPermission false
>> allowExtractionForAccessibility true
>>
>> but I get the same result. Tesseract has also been installed.
>>
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar ?
>>
>> Thanks for your help!
>>
>> 8. Oct 2015 20:43 by [email protected]:
>>
>>
>>> Hi,
>>>
>>> there has been a similar question on the Tika mailing list recently:
>>>
>>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>>>
>>> If you get Tika to OCR the embedded images, the parse-tika
>>> plugin will probably also do so if the Tika jar is replaced.
>>>
>>> Sebastian
>>>
>>> On 10/06/2015 03:55 PM, [email protected] wrote:
>>>> Hello,
>>>>
>>>> I use Nutch v1.10. I just want to know if Nutch with Tika parser v1.8 can
>>>> natively OCR images from PDF files? I can OCR JPEG or PNG files but Tika
>>>> does not convert images from PDF. I use Elastic to index.
>>>>
>>>> Thank you
>>>>
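
For anyone repeating the jar edit discussed in this thread, here is a minimal sketch of how it could be scripted instead of done by hand. The entry path and the extractInlineImages key are taken from the messages above; the helper names patch_properties and patch_jar are made up for illustration, and the whitespace-separated "key value" layout is assumed from the listing quoted in the thread. It only automates the edit: as noted at the top, the same change was seen to work in tika-app-1.10.jar but not in the parse-tika plugin, so it is not a fix for that.

```python
import os
import re
import tempfile
import zipfile
from pathlib import Path

# Entry path as quoted in the thread.
PDF_PROPS = "org/apache/tika/parser/pdf/PDFParser.properties"


def patch_properties(text, overrides):
    """Return the properties text with the given keys set to new values.

    Assumes one "key value" pair per line (also tolerates '=' or ':').
    Lines for other keys are kept unchanged; missing keys are appended.
    """
    remaining = dict(overrides)
    out = []
    for line in text.splitlines():
        key = re.split(r"[\s=:]+", line.strip(), maxsplit=1)[0]
        if key in remaining:
            out.append(f"{key} {remaining.pop(key)}")
        else:
            out.append(line)
    out.extend(f"{k} {v}" for k, v in remaining.items())
    return "\n".join(out) + "\n"


def patch_jar(jar_path, overrides, entry=PDF_PROPS):
    """Rewrite the jar in place with the properties entry patched.

    Hypothetical helper: a zip archive cannot be edited in place, so every
    entry is copied to a temporary jar, which then replaces the original.
    """
    jar_path = Path(jar_path)
    with tempfile.NamedTemporaryFile(
        dir=jar_path.parent, suffix=".jar", delete=False
    ) as tmp:
        tmp_path = Path(tmp.name)
    found = False
    with zipfile.ZipFile(jar_path) as src, zipfile.ZipFile(
        tmp_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for info in src.infolist():
            data = src.read(info.filename)
            if info.filename == entry:
                text = data.decode("iso-8859-1")
                data = patch_properties(text, overrides).encode("iso-8859-1")
                found = True
            dst.writestr(info, data)
    if not found:
        tmp_path.unlink()
        raise KeyError(f"{entry} not found in {jar_path}")
    os.replace(tmp_path, jar_path)
```

Usage would look something like patch_jar("runtime/local/plugins/parse-tika/tika-parsers-1.10.jar", {"extractInlineImages": "true"}), run against a backup copy first. Rewriting a jar this way would invalidate any signature it carries, so that is worth checking before replacing the original.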

