Hello,

I tried editing the JAR file and changed 'org/apache/tika/parser/pdf/PDFParser.properties' to:

enableAutospace true
extractAnnotationText true
sortByPosition false
suppressDuplicateOverlappingText false
useNonSequentialParser false
extractAcroFormContent true
extractInlineImages true
extractUniqueInlineImagesOnly false
checkExtractAccessPermission false
allowExtractionForAccessibility true

but I get the same result. Tesseract is also installed.

What is the difference between ./plugins/parse-tika/parse-tika.jar and ./plugins/parse-tika/tika-parsers-1.8.jar?

Thanks for your help!

8. Oct 2015 20:43 by [email protected]:
> Hi,
>
> there has been a similar question on the Tika mailing list recently:
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>
> If you get Tika to OCR the embedded images, the parse-tika
> plugin will probably also do so if the Tika jar is replaced.
>
> Sebastian
>
> On 10/06/2015 03:55 PM, [email protected] wrote:
>> Hello,
>>
>> I use Nutch v1.10. I just want to know whether Nutch with the Tika parser v1.8 can
>> natively OCR images from PDF files. I can OCR JPEG or PNG files, but Tika
>> does not convert images from PDFs. I use Elastic to index.
>>
>> Thank you

