Re: OCR images from PDF with Tika

Sebastian Nagel Fri, 09 Oct 2015 13:23:53 -0700

Hi,

I've just verified with Nutch trunk (upcoming 1.11):
- Tika 1.10 is able to OCR embedded images if
  PDFParser.properties is modified accordingly
  in tika-app-1.10.jar
- but parse-tika doesn't if the same modifications
  are made in runtime/local/plugins/parse-tika/tika-parsers-1.10.jar
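
For reference, the jar modification mentioned above can be scripted
instead of done by hand. The following is only a minimal sketch, not
part of Nutch or Tika: the helper names set_property and patch_jar are
made up for illustration, and it assumes the properties file uses the
"key value" layout quoted further down in this thread. It changes the
jar only; it does not address why parse-tika ignores the setting.

```python
import re
import shutil
import tempfile
import zipfile
from pathlib import Path

# Path of the properties file inside the Tika jar, as quoted in this thread.
PROPS_ENTRY = "org/apache/tika/parser/pdf/PDFParser.properties"


def set_property(text, key, value):
    """Return the properties text with `key` set to `value`.

    An existing line for `key` is replaced; otherwise a new line is appended.
    """
    lines = text.splitlines()
    for i, line in enumerate(lines):
        # Property names end at the first whitespace, '=' or ':'.
        name = re.split(r"[\s=:]", line.strip(), maxsplit=1)[0]
        if name == key:
            lines[i] = f"{key} {value}"
            break
    else:
        lines.append(f"{key} {value}")
    return "\n".join(lines) + "\n"


def patch_jar(jar_path, key, value, entry=PROPS_ENTRY):
    """Rewrite the jar at `jar_path` with `key` updated inside `entry`.

    Zip archives cannot be edited in place, so every entry is copied
    into a temporary archive that then replaces the original file.
    """
    jar_path = Path(jar_path)
    with tempfile.NamedTemporaryFile(
        delete=False, suffix=".jar", dir=jar_path.parent
    ) as tmp:
        tmp_path = Path(tmp.name)
    with zipfile.ZipFile(jar_path) as src, zipfile.ZipFile(
        tmp_path, "w", zipfile.ZIP_DEFLATED
    ) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            if item.filename == entry:
                # Java properties files are ISO-8859-1 by default.
                patched = set_property(data.decode("iso-8859-1"), key, value)
                data = patched.encode("iso-8859-1")
            dst.writestr(item, data)
    shutil.move(str(tmp_path), str(jar_path))
```

A call such as patch_jar("tika-app-1.10.jar", "extractInlineImages", "true")
would then update that one setting; work on a copy of the jar, since
the file is replaced.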


Needs some debugging to find out what is wrong.

Please, feel free to file a bug report on
https://issues.apache.org/jira/browse/NUTCH

Thanks,
Sebastian

On 10/09/2015 06:21 PM, Sebastian Nagel wrote:
> Hi,
> 
> sorry, but I didn't try this myself, I just
> remembered that there had been a thread on the Tika
> mailing list.
> 
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar ?
> 
> parse-tika.jar contains the classes of Nutch's parse-tika plugin
> which depends on the library tika-parsers-1.x.jar.
> 
> Sebastian
> 
> On 10/09/2015 02:54 PM, [email protected] wrote:
>> Hello,
>>
>> I tried to edit the JAR file and modify
>> 'org/apache/tika/parser/pdf/PDFParser.properties' :
>>
>>   enableAutospace true
>>   extractAnnotationText true
>>   sortByPosition  false
>>   suppressDuplicateOverlappingText  false
>>   useNonSequentialParser  false
>>   extractAcroFormContent  true
>>   extractInlineImages true
>>   extractUniqueInlineImagesOnly false
>>   checkExtractAccessPermission false
>>   allowExtractionForAccessibility true
>>
>> but I get the same result. Tesseract is also installed.
>>
>> What is the difference between ./plugins/parse-tika/parse-tika.jar and
>> ./plugins/parse-tika/tika-parsers-1.8.jar ?
>>
>> Thanks for your help!
>>
>> 8. Oct 2015 20:43 by [email protected]:
>>
>>
>>> Hi,
>>>
>>> there has been a similar question on the Tika mailing list recently:
>>>
>>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>>>
>>> If you get Tika to OCR the embedded images, the parse-tika
>>> plugin will probably also do so if the Tika jar is replaced.
>>>
>>> Sebastian
>>>
>>> On 10/06/2015 03:55 PM, [email protected] wrote:
>>>> Hello,
>>>>
>>>> I use Nutch v1.10. I just want to know whether Nutch with the Tika parser
>>>> v1.8 can natively OCR images from PDF files. I can OCR JPEG or PNG files,
>>>> but Tika does not convert images from PDF. I use Elastic to index.
>>>>
>>>> Thank you
>>>>
>
