>We have an experimental integration with Tesseract which was created a while
>ago by a GSoC student. Because it requires building C++ we’ve not integrated
>it into trunk, but do have it on the todo list for 2.1.
Ah, very cool. Yes, I'd trust you all to do a better job of integrating OCR for
PDFs than we'd do. :)
>The advantage of this approach is that we can keep any embedded text in the
>PDF and embellish it with the output.
It would be neat to have an OCR-only option for documents where the text
extraction yields complete garbage (a garbage detector is on our todo list:
TIKA-1443).
I'll hold off then on doing anything on our end. Thank you!
Best,
Tim