[ https://issues.apache.org/jira/browse/TIKA-93?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12746586#action_12746586 ]

Jukka Zitting commented on TIKA-93:
-----------------------------------

> are there any updates regarding this issue?

Not really. I've done some simple tests with ExternalParser invoking Tesseract
and OCRopus, but neither is really suited for simple out-of-the-box
integration.

I also tried the commercial Asprise OCR SDK
(http://asprise.com/product/ocr/index.php?lang=java), which was much easier to
set up and get reasonable results from, but as a commercial product it's
obviously something that we can't use in an Apache project.

If someone wants to help with this, the first step would be to come up with 
reasonably simple steps to get a liberally licensed OCR engine like OCRopus 
installed and configured so that you can invoke it using a simple command line 
like "ocr image.gif" and get the extracted text on the standard output. It 
should work for at least a few simple test cases. Note that this work should be 
contributed back to the upstream project.

Once we have something like that, we can move forward with integrating it into
Tika.
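
To make the "ocr image.gif" contract above concrete, here is a minimal Java
sketch of what the calling side could look like. It is only an illustration
under stated assumptions: the "ocr" command is the hypothetical wrapper
described above (it does not exist yet), the OcrCommand class and its run()
method are made-up names, and the sketch uses plain java.lang.ProcessBuilder
rather than Tika's ExternalParser.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class OcrCommand {

    /**
     * Runs the given command and returns everything it wrote to standard
     * output, decoded as UTF-8. Throws IOException on a non-zero exit code.
     */
    public static String run(List<String> command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Pass the tool's diagnostics through instead of mixing them
        // into the extracted text.
        builder.redirectError(ProcessBuilder.Redirect.INHERIT);
        Process process = builder.start();

        String output;
        try (InputStream in = process.getInputStream()) {
            output = new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }

        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException(
                    "Command failed with exit code " + exitCode + ": " + command);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        // "ocr" is the hypothetical command line wrapper from the comment
        // above; it is an assumption, not an existing tool.
        String image = args.length > 0 ? args[0] : "image.gif";
        System.out.print(run(List.of("ocr", image)));
    }
}
```

The point of keeping the contract this small (one argument in, plain text on
standard output, non-zero exit code on failure) is that the Java side would
not need to know which OCR engine sits behind the command.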

> OCR support
> -----------
>
>                 Key: TIKA-93
>                 URL: https://issues.apache.org/jira/browse/TIKA-93
>             Project: Tika
>          Issue Type: New Feature
>          Components: parser
>            Reporter: Jukka Zitting
>            Priority: Minor
>
> I don't know of any decent open source pure Java OCR libraries, but there are 
> command line OCR tools like Tesseract 
> (http://code.google.com/p/tesseract-ocr/) that could be invoked by Tika to 
> extract text content (where available) from image files.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
