Re: OCR images from PDF with Tika

Sebastian Nagel Fri, 09 Oct 2015 09:21:57 -0700

Hi,

sorry, but I haven't tried this myself. I just
remembered that there had been a thread on the Tika
mailing list.


> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar ?

parse-tika.jar contains the classes of Nutch's parse-tika plugin,
which in turn depends on the library tika-parsers-1.x.jar
(tika-parsers-1.8.jar in your installation).
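
To check which of the two jars actually ships a given file, and what the
edited properties look like after repacking, something along these lines
may help. It is only a rough sketch in Python: the function names
jars_containing and read_properties are made up for illustration, and the
properties parsing is deliberately simplified (no line continuations or
escape sequences).

```python
import re
import zipfile
from pathlib import Path


def jars_containing(plugin_dir, entry):
    """Return the names of the JARs in plugin_dir that contain `entry`."""
    hits = []
    for jar in sorted(Path(plugin_dir).glob("*.jar")):
        with zipfile.ZipFile(jar) as zf:  # a JAR is an ordinary ZIP archive
            if entry in zf.namelist():
                hits.append(jar.name)
    return hits


def read_properties(jar_path, entry):
    """Read a properties entry from a JAR into a dict (simplified parser)."""
    with zipfile.ZipFile(jar_path) as zf:
        text = zf.read(entry).decode("iso-8859-1")
    props = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", "!")):
            continue  # skip blank lines and comments
        # Java properties accept '=', ':' or whitespace as the separator.
        parts = re.split(r"\s*[=:]\s*|\s+", line, maxsplit=1)
        props[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return props
```

For example, jars_containing("./plugins/parse-tika",
"org/apache/tika/parser/pdf/PDFParser.properties") lists the jar(s) that
carry the file, and read_properties(...) on that jar shows whether an
edit such as extractInlineImages survived the repacking.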

Sebastian

On 10/09/2015 02:54 PM, [email protected] wrote:
> Hello,
> 
> I tried to edit the JAR file and changed
> 'org/apache/tika/parser/pdf/PDFParser.properties' to:
> 
>   enableAutospace true
>   extractAnnotationText true
>   sortByPosition  false
>   suppressDuplicateOverlappingText  false
>   useNonSequentialParser  false
>   extractAcroFormContent  true
>   extractInlineImages true
>   extractUniqueInlineImagesOnly false
>   checkExtractAccessPermission false
>   allowExtractionForAccessibility true
> 
> but I get the same result. Tesseract is also installed.
> 
> What is the difference between ./plugins/parse-tika/parse-tika.jar and
> ./plugins/parse-tika/tika-parsers-1.8.jar ?
> 
> Thanks for your help!
> 
> 8. Oct 2015 20:43 by [email protected]:
> 
> 
>> Hi,
>>
>> there has been a similar question on the Tika mailing list recently:
>>
>> http://mail-archives.apache.org/mod_mbox/tika-user/201505.mbox/%3cdm2pr09mb071346d01729fc9367308e94c7...@dm2pr09mb0713.namprd09.prod.outlook.com%3E
>>
>> If you get Tika to OCR the embedded images, the parse-tika
>> plugin will probably do so as well if the Tika jar is replaced.
>>
>> Sebastian
>>
>> On 10/06/2015 03:55 PM, [email protected] wrote:
>>> Hello,
>>>
>>> I use Nutch v1.10. I just want to know whether Nutch with the Tika parser
>>> v1.8 can natively OCR images from PDF files. I can OCR JPEG or PNG files,
>>> but Tika does not convert images from PDFs. I use Elastic for indexing.
>>>
>>> Thank you
>>>
