I have used tesseract and lots of other Linux utilities on the cluster. You
need to build your machine image with everything installed, which is possible
even on EC2 and certainly on your own hardware, and then everything works
beautifully.
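
A minimal sketch of what each task could run once tesseract is installed in
the image. It assumes the image file has already been copied to the node's
local disk. The function names, the `binary` parameter and the tab-separated
record format are made up for illustration. The only tesseract behavior relied
on is the `tesseract <image> <outputbase> -l <lang>` invocation, which writes
its result to `<outputbase>.txt`.

```python
import shutil
import subprocess
import tempfile
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng", binary="tesseract"):
    """Return the argv for `tesseract <image> <outputbase> -l <lang>`."""
    return [binary, str(image_path), str(output_base), "-l", lang]


def run_ocr(image_path, lang="eng", binary="tesseract"):
    """Run tesseract on one local image file and return the recognized text."""
    if shutil.which(binary) is None:
        # Fail fast on a node whose image was built without tesseract.
        raise FileNotFoundError(f"{binary} is not installed on this node")
    with tempfile.TemporaryDirectory() as workdir:
        output_base = Path(workdir) / "ocr"
        subprocess.run(
            build_tesseract_command(image_path, output_base, lang, binary),
            check=True,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        # tesseract appends ".txt" to the output base it is given.
        return Path(f"{output_base}.txt").read_text(encoding="utf-8")


def ocr_records(lines, ocr=run_ocr):
    """Yield one "path<TAB>text" record per non-empty input line.

    Hypothetical record format; whitespace in the OCR text is collapsed so
    that each record stays on a single line.
    """
    for line in lines:
        path = line.strip()
        if path:
            text = " ".join(ocr(path).split())
            yield f"{path}\t{text}"
```

Used as a Hadoop Streaming style mapper, this would amount to printing each
record from `ocr_records(sys.stdin)`. The `shutil.which` check is there so
that a node missing the binary fails with a clear message instead of a
half-finished job.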

Mark

On Sun, Nov 27, 2011 at 11:20 PM, chethan <[email protected]> wrote:

> Can anyone give me an example of how to bind an external parser with Tika?
> I couldn't find any blog or article that illustrates binding an external
> parser (Tesseract) with Tika.
>
> On Fri, Nov 25, 2011 at 10:50 PM, Julien Nioche <
> [email protected]> wrote:
>
>> Behemoth [https://github.com/jnioche/behemoth] has a module for Tika
>> which allows you to use it over Hadoop. Re Tesseract: if you can call it
>> on the command line then the external parser could help. I am not sure how
>> this would work on Hadoop, possibly by installing Tesseract on all the
>> slaves at the same location.
>>
>> HTH
>>
>> Julien
>>
>>
>> On 25 November 2011 07:29, chethan <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> As I am new to Tika, I want to know the following things.
>>>
>>> 1. How to integrate Tika with Hadoop, so that Tika uses MapReduce to
>>>    do the parsing.
>>> 2. We want Tika to parse OCR files too. As Tika does not support OCR
>>>    parsing and recommends using Tesseract, I want to know how to call
>>>    Tesseract (a command-line operation) through Tika (which in turn
>>>    uses MapReduce to parse the OCR files).
>>>
>>> Thanks and regards,
>>> chethan
>>>
>>
>>
>>
>> --
>> Open Source Solutions for Text Engineering
>>
>> http://digitalpebble.blogspot.com/
>> http://www.digitalpebble.com
>>
>
>
