[ 
https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651448#action_12651448
 ] 

Jukka Zitting commented on TIKA-93:
-----------------------------------

OCRopus (http://code.google.com/p/ocropus/) seems like a nice tool for this. 
It's a command line tool, so we'd need to use something like the ExternalParser 
class to use it, but the annotated HTML output it generates is already very 
close to what Tika uses, so the integration should be easy.
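
As a rough illustration of the "invoke an external tool" approach, here is a
minimal sketch in plain Java. It is not Tika's ExternalParser API: the class
name ExternalOcrSketch, the ${INPUT} placeholder convention, and the command
template are all assumptions made up for this example, and the actual
arguments for OCRopus or Tesseract would need to be checked against the
installed version of the tool.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Tika: runs an external command on a file
// and returns whatever the command writes to standard output.
class ExternalOcrSketch {

    // Placeholder in the command template (an assumption of this sketch).
    static final String INPUT_TOKEN = "${INPUT}";

    // Replace the placeholder in every template argument with the input path.
    static List<String> buildCommand(List<String> template, String inputPath) {
        List<String> command = new ArrayList<>();
        for (String arg : template) {
            command.add(arg.replace(INPUT_TOKEN, inputPath));
        }
        return command;
    }

    // Start the process, read its standard output as UTF-8, and fail on a
    // non-zero exit code. Standard error is passed through to the parent.
    static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();
        String output;
        try (InputStream in = process.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command exited with code " + exitCode);
        }
        return output;
    }

    // Usage: java ExternalOcrSketch <input-file> <tool> [tool-args...]
    // where one of the tool arguments is the literal ${INPUT}.
    public static void main(String[] args) throws Exception {
        if (args.length < 2) {
            System.err.println("usage: <input-file> <tool> [tool-args...]");
            return;
        }
        List<String> template = Arrays.asList(args).subList(1, args.length);
        System.out.print(run(buildCommand(template, args[0])));
    }
}
```

A real integration would also need to handle tools that write to an output
file instead of standard output, and to turn the tool's output into the
XHTML events that Tika parsers emit.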

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
