I have used Tesseract and many other Linux utilities on the cluster. You need to build your machine image with everything installed, which is possible even on EC2 and certainly on your own hardware, and then everything works beautifully.
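As a rough illustration of calling Tesseract from Java code running on a worker node, here is a minimal sketch. It assumes the binary is installed at the same path on every node (the path /usr/bin/tesseract is only a guess for illustration), and the class and method names are hypothetical, not part of Tika or Hadoop. It relies only on the basic command-line form "tesseract imagename outputbase", which writes the recognized text to outputbase.txt.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or Hadoop.
final class TesseractCommand {

    // Assumed install location; it must be identical on every node of the cluster.
    static final String DEFAULT_BINARY = "/usr/bin/tesseract";

    // Builds the argument list for: tesseract <image> <outputBase>
    // Tesseract itself appends ".txt" to outputBase when writing the result.
    static List<String> build(String binary, String imagePath, String outputBase) {
        List<String> cmd = new ArrayList<>();
        cmd.add(binary);
        cmd.add(imagePath);
        cmd.add(outputBase);
        return cmd;
    }

    // Runs the command and returns its exit code (0 normally means success).
    static int run(List<String> cmd) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(cmd)
                .inheritIO() // forward Tesseract's stdout/stderr to the task logs
                .start();
        return process.waitFor();
    }

    public static void main(String[] args) {
        // Only prints the command line; call run(...) on a machine with Tesseract installed.
        List<String> cmd = build(DEFAULT_BINARY, "page1.png", "page1");
        System.out.println(String.join(" ", cmd));
    }
}
```

Inside a map task you would copy the image to local disk, call run(...), and then read the resulting .txt file back. Local files are needed because Tesseract reads from and writes to the local filesystem, not HDFS.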
Mark

On Sun, Nov 27, 2011 at 11:20 PM, chethan <[email protected]> wrote:

> can anyone give me an example of how to bind external parser with tika. as
> i couldn't find any of the blog or article which illustrates binding
> external parser( tesseract ) along with tika.
>
> On Fri, Nov 25, 2011 at 10:50 PM, Julien Nioche <[email protected]> wrote:
>
>> Behemoth [https://github.com/jnioche/behemoth] has a module for Tika
>> which allows you to use it over Hadoop. Re-Tesseract : if you can call it
>> on the command line then the external parser could help; not sure how this
>> would work on Hadoop, possibly by installing Tesseract on all the slaves at
>> the same location
>>
>> HTH
>>
>> Julien
>>
>> On 25 November 2011 07:29, chethan <[email protected]> wrote:
>>
>>> hi,
>>>
>>> as i am new to tika, i want to know following things.
>>>
>>>    1. how to integrate tika within hadoop, so that tika will use map
>>>       reduce to implement the parsing.
>>>    2. we wanted tika to parse ocr files too...but as tika is not
>>>       supporting ocr parsing and also recommending to use tesseract, i want
>>>       to know how to call tesseract ( command line operation ) through tika
>>>       ( which in-turn uses map reduce to parse ocr files ).
>>>
>>> thanks and regards
>>> chethan
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
