Can anyone give me an example of how to bind an external parser with Tika? I couldn't find any blog post or article that illustrates binding an external parser (Tesseract) with Tika.

On Fri, Nov 25, 2011 at 10:50 PM, Julien Nioche <[email protected]> wrote:

> Behemoth [https://github.com/jnioche/behemoth] has a module for Tika
> which allows you to use it over Hadoop. Re-Tesseract : if you can call it
> on the command line then the external parser could help; not sure how this
> would work on Hadoop, possibly by installing Tesseract on all the slaves at
> the same location
>
> HTH
>
> Julien
>
>
> On 25 November 2011 07:29, chethan <[email protected]> wrote:
>
>> hi,
>>
>> as i am new to tika, i want to know following things.
>>
>> 1. how to integrate tika within hadoop, so that tika will use map
>>    reduce to implement the parsing.
>> 2. we wanted tika to parse ocr files too...but as tika is not
>>    supporting ocr parsing and also recommending to use tesseract, i want
>>    to know how to call tesseract ( command line operation ) through tika
>>    ( which in-turn uses map reduce to parse ocr files ).
>>
>> thanks and regards
>> chethan
>>
>
> --
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
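
For reference, the "call it on the command line" approach mentioned in the reply could be sketched roughly as below. This is only an illustration of the mechanism, not Tika's own external parser API: `TesseractRunner`, `buildCommand` and `ocr` are hypothetical names made up for this example. It assumes Java 11 or later, that Tesseract is installed and reachable under the executable name you pass in, and that it is invoked in the usual `tesseract <image> <outputbase>` form, which writes the recognised text to `<outputbase>.txt`.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;

// Hypothetical helper for illustration only; it is not part of Tika.
class TesseractRunner {

    // Builds the argument list for: <executable> <image> <outputBase>
    // Tesseract itself appends ".txt" to the output base name.
    static List<String> buildCommand(String executable, Path image, Path outputBase) {
        return List.of(executable, image.toString(), outputBase.toString());
    }

    // Runs Tesseract on one image and returns the recognised text.
    // Assumes the executable exists on the machine running this code
    // (on Hadoop, that means on every worker node).
    static String ocr(String executable, Path image)
            throws IOException, InterruptedException {
        Path workDir = Files.createTempDirectory("ocr");
        Path outputBase = workDir.resolve("result");
        Path textFile = workDir.resolve("result.txt");
        try {
            Process process = new ProcessBuilder(buildCommand(executable, image, outputBase))
                    .redirectErrorStream(true)                       // merge stderr into stdout
                    .redirectOutput(ProcessBuilder.Redirect.DISCARD) // ignore console output
                    .start();
            int exitCode = process.waitFor();
            if (exitCode != 0) {
                throw new IOException("tesseract exited with code " + exitCode);
            }
            return Files.readString(textFile, StandardCharsets.UTF_8);
        } finally {
            // Clean up the temporary output file and directory.
            Files.deleteIfExists(textFile);
            Files.deleteIfExists(workDir);
        }
    }
}
```

A call such as `TesseractRunner.ocr("tesseract", Path.of("scan.png"))` would then return the text for a single image (`scan.png` is a placeholder file name). Inside a map task the same call could be made per input record, provided Tesseract is installed at the same location on all the nodes, as suggested above.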
