Re: OCR images from PDF with Tika

jeanblue Fri, 09 Oct 2015 05:56:03 -0700

Hello,

I tried editing the JAR file and changed
'org/apache/tika/parser/pdf/PDFParser.properties' to:


  enableAutospace true
  extractAnnotationText true
  sortByPosition  false
  suppressDuplicateOverlappingText  false
  useNonSequentialParser  false
  extractAcroFormContent  true
  extractInlineImages true
  extractUniqueInlineImagesOnly false
  checkExtractAccessPermission false
  allowExtractionForAccessibility true

but I get the same result: images inside PDFs are still not OCRed.
Tesseract is installed as well.
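
One way to check that the edit really ended up in a given jar is to read the
properties entry straight out of it. This is only a minimal sketch using the
plain JDK: the class name PdfParserPropsCheck and its methods are made up for
illustration, and it assumes the file keeps the "key value" layout shown
above. It reports what is stored in each jar, not which jar Nutch actually
loads at runtime.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.StringReader;
import java.util.Properties;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper, not part of Nutch or Tika.
public class PdfParserPropsCheck {
    // Entry path as quoted in this thread.
    static final String ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties";

    // java.util.Properties accepts whitespace as the key/value separator,
    // so "extractInlineImages true" parses as key=extractInlineImages, value=true.
    static Properties parse(String text) throws IOException {
        Properties props = new Properties();
        props.load(new StringReader(text));
        return props;
    }

    // Returns the properties stored in the jar, or null if the entry is absent.
    static Properties readFromJar(String jarPath) throws IOException {
        try (ZipFile zip = new ZipFile(jarPath)) {
            ZipEntry entry = zip.getEntry(ENTRY);
            if (entry == null) {
                return null;
            }
            try (InputStream in = zip.getInputStream(entry)) {
                Properties props = new Properties();
                props.load(in);
                return props;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Pass one or more jar paths on the command line.
        for (String jar : args) {
            Properties props = readFromJar(jar);
            String report = (props == null)
                    ? "no PDFParser.properties entry"
                    : "extractInlineImages=" + props.getProperty("extractInlineImages");
            System.out.println(jar + ": " + report);
        }
    }
}
```

Run against both jars, for example:
java PdfParserPropsCheck ./plugins/parse-tika/parse-tika.jar ./plugins/parse-tika/tika-parsers-1.8.jar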

What is the difference between ./plugins/parse-tika/parse-tika.jar and
./plugins/parse-tika/tika-parsers-1.8.jar?

Thanks for your help!

8. Oct 2015 20:43 by [email protected]:


> Hi,
>
> there has been a similar question on the Tika mailing list recently:
>
> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>
> If you get Tika to OCR the embedded images, the parse-tika
> plugin will probably do so as well once its Tika jar is replaced.
>
> Sebastian
>
> On 10/06/2015 03:55 PM, [email protected] wrote:
>> Hello,
>>
>> I use Nutch v1.10. I just want to know whether Nutch with the Tika parser
>> v1.8 can natively OCR images from PDF files. I can OCR JPEG or PNG files,
>> but Tika does not convert images from PDFs. I use Elastic to index.
>>
>> Thank you
>>
