Behemoth [https://github.com/jnioche/behemoth] has a Tika module which lets you run Tika over Hadoop.

Re Tesseract: if you can call it from the command line, then Tika's external parser could help. I'm not sure how this would work on Hadoop; possibly by installing Tesseract on all the slaves at the same location.
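
As a rough sketch of the command-line route: this is not Tika's external parser itself, just a plain Java wrapper that a map task could call. It assumes the usual `tesseract <image> <outputbase>` invocation, which writes its text to `<outputbase>.txt`. The class name, method names and the `/usr/bin/tesseract` path are made up for illustration, and the path assumes the binary is installed at the same location on every node.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

class TesseractRunner {

    // Hypothetical install location; assumed identical on every Hadoop node.
    static final String TESSERACT_PATH = "/usr/bin/tesseract";

    // Builds the argument list for: tesseract <image> <outputBase>
    static List<String> buildCommand(String tesseractPath, Path image, Path outputBase) {
        return Arrays.asList(tesseractPath, image.toString(), outputBase.toString());
    }

    // Runs Tesseract on a local image file and returns the recognised text.
    static String ocr(String tesseractPath, Path image)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("out");

        Process process = new ProcessBuilder(buildCommand(tesseractPath, image, outputBase))
                .inheritIO() // let Tesseract's messages show up in the task logs
                .start();
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("tesseract exited with code " + exitCode);
        }

        // Tesseract appends ".txt" to the output base name.
        Path result = Paths.get(outputBase.toString() + ".txt");
        return new String(Files.readAllBytes(result), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(ocr(TESSERACT_PATH, Paths.get(args[0])));
    }
}
```

Inside a map task the image would first have to be copied from HDFS to the local disk of the node, since Tesseract only reads local files.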

HTH

Julien

On 25 November 2011 07:29, chethan <[email protected]> wrote:
> hi,
>
> as i am new to tika, i want to know following things.
>
> 1. how to integrate tika within hadoop, so that tika will use map
> reduce to implement the parsing.
> 2. we wanted tika to parse ocr files too...but as tika is not
> supporting ocr parsing and also recommending to use tesseract, i want
> to know how to call tesseract ( command line operation ) through tika
> ( which in-turn uses map reduce to parse ocr files ).
>
> thanks and regards
> chethan

-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
