Behemoth [https://github.com/jnioche/behemoth] has a Tika module which
lets you run Tika over Hadoop. Re Tesseract: if you can call it on the
command line then Tika's external parser could help. I'm not sure how
this would work on Hadoop; possibly by installing Tesseract on all the
slaves at the same location.
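
To make the command-line route concrete, here is a rough sketch in Python of a wrapper that shells out to Tesseract for one image. It is only an illustration, not Tika's external parser or anything from Behemoth: the install path in TESSERACT_BIN and the helper names build_command and ocr_image are assumptions of mine, and it relies on Tesseract writing its result to <output base>.txt.

```python
import subprocess
import tempfile
from pathlib import Path

# Assumption: Tesseract is installed at this same path on every node.
TESSERACT_BIN = "/usr/bin/tesseract"


def build_command(image_path, output_base, lang="eng", binary=TESSERACT_BIN):
    """Return the argv list for: tesseract <image> <output_base> -l <lang>."""
    return [binary, str(image_path), str(output_base), "-l", lang]


def ocr_image(image_path, lang="eng", binary=TESSERACT_BIN):
    """Run Tesseract on one image file and return the recognised text."""
    with tempfile.TemporaryDirectory() as tmp:
        output_base = Path(tmp) / "ocr"
        # check=True raises CalledProcessError if Tesseract exits non-zero.
        subprocess.run(
            build_command(image_path, output_base, lang, binary),
            check=True,
            capture_output=True,
        )
        # Tesseract appends ".txt" to the output base name.
        return (Path(tmp) / "ocr.txt").read_text(encoding="utf-8")
```

A task running on a slave would call ocr_image on each local image file. If the binary is missing at that path on a given node the call raises FileNotFoundError, which is why installing it at the same location everywhere matters.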

HTH

Julien

On 25 November 2011 07:29, chethan <[email protected]> wrote:

> hi,
>
> as i am new to tika, i want to know following things.
>
> 1. how to integrate tika within hadoop, so that tika will use map
> reduce to implement the parsing.
> 2. we wanted tika to parse ocr files too...but as tika is not
> supporting ocr parsing and also recommending to use tesseract, i want
> to know how to call tesseract ( command line operation ) through tika
> ( which in-turn uses map reduce to parse ocr files ).
>
> thanks and regards
> chethan
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
