Should I manually configure Tika to run on Hadoop (the way Pig runs in two modes, local and MapReduce), or will it work with Hadoop MapReduce jobs by default?
On Mon, Nov 28, 2011 at 10:53 AM, Mark Kerzner <[email protected]> wrote:

> I used Tesseract and lots of other Linux utilities on the cluster. You
> need to create your image with everything installed -- which is possible
> even on EC2 and certainly on your own hardware -- and then everything
> works beautifully.
>
> Mark
>
> On Sun, Nov 27, 2011 at 11:20 PM, chethan <[email protected]> wrote:
>
>> Can anyone give me an example of how to bind an external parser with
>> Tika? I couldn't find any blog or article that illustrates binding an
>> external parser (Tesseract) with Tika.
>>
>> On Fri, Nov 25, 2011 at 10:50 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> Behemoth [https://github.com/jnioche/behemoth] has a module for Tika
>>> which allows you to use it over Hadoop. Re Tesseract: if you can call
>>> it on the command line then the external parser could help; I'm not
>>> sure how this would work on Hadoop, possibly by installing Tesseract
>>> on all the slaves at the same location.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>> On 25 November 2011 07:29, chethan <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> As I am new to Tika, I want to know the following:
>>>>
>>>> 1. How to integrate Tika with Hadoop, so that Tika uses MapReduce
>>>> to do the parsing.
>>>> 2. We also want Tika to parse OCR files, but since Tika does not
>>>> support OCR parsing and recommends Tesseract, I want to know how to
>>>> call Tesseract (a command-line tool) through Tika (which in turn
>>>> uses MapReduce to parse the image files).
>>>>
>>>> Thanks and regards,
>>>> chethan
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
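The external-parser route Julien mentions can be sketched with Tika's `ExternalParser` (in the tika-parsers module), which shells out to a command-line tool and reads the text it produces. This is only a sketch under assumptions: Tesseract must be installed at the same path on every slave node, the TIFF MIME type and command arguments here are illustrative, and since `tesseract` appends ".txt" to its output base name, a small wrapper script is usually needed around the raw binary:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Collections;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.external.ExternalParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class TesseractViaTika {
    public static void main(String[] args) throws Exception {
        // Wire Tesseract in as an external command-line parser.
        ExternalParser parser = new ExternalParser();

        // Claim the image type(s) Tesseract should handle (illustrative).
        parser.setSupportedTypes(Collections.singleton(MediaType.image("tiff")));

        // Tika substitutes temp-file paths for the ${INPUT}/${OUTPUT} tokens.
        // In practice, point this at a wrapper script that accounts for
        // tesseract writing to "<output base>.txt".
        parser.setCommand("tesseract",
                ExternalParser.INPUT_FILE_TOKEN,
                ExternalParser.OUTPUT_FILE_TOKEN);

        // Parse one image and print the extracted text.
        ContentHandler handler = new BodyContentHandler();
        InputStream in = new FileInputStream(args[0]);
        try {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        } finally {
            in.close();
        }
        System.out.println(handler.toString());
    }
}
```

On Hadoop, the same parser instance would be invoked inside a map task, which is why Tesseract has to be present at an identical location on every node (e.g. baked into the EC2 image, as Mark describes).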
