RE: OCRing extracted inline images vs. fully rendered pages?

Allison, Timothy B. Tue, 17 May 2016 10:58:29 -0700

>We have an experimental integration with Tesseract which was created a while 
>ago by a GSoC student. Because it requires building C++ we’ve not integrated 
>it into trunk, but do have it on the todo list for 2.1.


Ah, very cool.  Yes, I'd trust you all to do a better job of integrating OCR for 
PDFs than we'd do. :)

>The advantage of this approach is that we can keep any embedded text in the 
>PDF and embellish it with the output.

It would be neat to have an OCR-only option for documents where the text 
extraction yields complete garbage (a garbage detector is on our todo list as 
TIKA-1443).
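
To make the idea concrete, here is a rough sketch of how a garbage check could gate an OCR fallback. It is not Tika or PDFBox code and not what TIKA-1443 proposes; the names looks_like_garbage, choose_text, run_ocr and the 0.5 threshold are all hypothetical, and the heuristic (the share of whitespace-separated tokens that look like words) is just one simple assumption about what "garbage" means.

```python
import re

# Hypothetical threshold, picked for illustration only; a real detector
# would need tuning against actual documents.
MIN_WORDLIKE_RATIO = 0.5

# A token counts as "word-like" if it is two or more letters,
# optionally followed by one punctuation mark.
_WORDLIKE = re.compile(r"^[^\W\d_]{2,}[.,;:!?]?$")


def looks_like_garbage(text, min_ratio=MIN_WORDLIKE_RATIO):
    """Return True if too few tokens in the extracted text look like words."""
    tokens = text.split()
    if not tokens:
        # Assumption: no extracted text at all is treated like garbage,
        # so the caller falls back to OCR.
        return True
    wordlike = sum(1 for token in tokens if _WORDLIKE.match(token))
    return wordlike / len(tokens) < min_ratio


def choose_text(extracted, run_ocr):
    """Keep the embedded text unless it looks like garbage.

    run_ocr is a hypothetical zero-argument callable that returns OCR
    output for the same document; it is only called when needed.
    """
    if looks_like_garbage(extracted):
        return run_ocr()
    return extracted
```

Passing the OCR step as a callable keeps the expensive work lazy: documents with usable embedded text never trigger it.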

I'll hold off then on doing anything on our end.  Thank you!

Best,

         Tim
