Should I manually configure Tika to run on Hadoop (the way Pig runs in two modes, local and MapReduce), or will it work with Hadoop MapReduce jobs by default?
On Mon, Nov 28, 2011 at 10:53 AM, Mark Kerzner <[email protected]> wrote:

> I used Tesseract and lots of other Linux utilities on the cluster. You
> need to create your image with everything installed -- which is possible
> even on EC2 and certainly on your own hardware -- and then everything
> works beautifully.
>
> Mark
>
> On Sun, Nov 27, 2011 at 11:20 PM, chethan <[email protected]> wrote:
>
>> Can anyone give me an example of how to bind an external parser with
>> Tika? I couldn't find any blog or article that illustrates binding an
>> external parser (Tesseract) with Tika.
>>
>> On Fri, Nov 25, 2011 at 10:50 PM, Julien Nioche <
>> [email protected]> wrote:
>>
>>> Behemoth [https://github.com/jnioche/behemoth] has a module for Tika
>>> which allows you to use it over Hadoop. Re Tesseract: if you can call
>>> it on the command line then the external parser could help; I'm not
>>> sure how this would work on Hadoop, possibly by installing Tesseract
>>> on all the slaves at the same location.
>>>
>>> HTH
>>>
>>> Julien
>>>
>>> On 25 November 2011 07:29, chethan <[email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> As I am new to Tika, I want to know the following:
>>>>
>>>> 1. How to integrate Tika with Hadoop, so that Tika uses MapReduce
>>>> to do the parsing.
>>>> 2. We also want Tika to parse OCR files, but since Tika does not
>>>> support OCR parsing and recommends Tesseract, I want to know how to
>>>> call Tesseract (a command-line tool) through Tika (which in turn
>>>> uses MapReduce to parse the image files).
>>>>
>>>> Thanks and regards,
>>>> chethan
>>>
>>> --
>>> Open Source Solutions for Text Engineering
>>>
>>> http://digitalpebble.blogspot.com/
>>> http://www.digitalpebble.com
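The external-parser route Julien mentions can be sketched with Tika's `ExternalParser` (in the tika-parsers module), which shells out to a command-line tool and reads the text it produces. This is only a sketch under assumptions: Tesseract must be installed at the same path on every slave node, the TIFF MIME type and command arguments here are illustrative, and since `tesseract` appends ".txt" to its output base name, a small wrapper script is usually needed around the raw binary:

```java
import java.io.FileInputStream;
import java.io.InputStream;
import java.util.Collections;

import org.apache.tika.metadata.Metadata;
import org.apache.tika.mime.MediaType;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.external.ExternalParser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.ContentHandler;

public class TesseractViaTika {
    public static void main(String[] args) throws Exception {
        // Wire Tesseract in as an external command-line parser.
        ExternalParser parser = new ExternalParser();

        // Claim the image type(s) Tesseract should handle (illustrative).
        parser.setSupportedTypes(Collections.singleton(MediaType.image("tiff")));

        // Tika substitutes temp-file paths for the ${INPUT}/${OUTPUT} tokens.
        // In practice, point this at a wrapper script that accounts for
        // tesseract writing to "<output base>.txt".
        parser.setCommand("tesseract",
                ExternalParser.INPUT_FILE_TOKEN,
                ExternalParser.OUTPUT_FILE_TOKEN);

        // Parse one image and print the extracted text.
        ContentHandler handler = new BodyContentHandler();
        InputStream in = new FileInputStream(args[0]);
        try {
            parser.parse(in, handler, new Metadata(), new ParseContext());
        } finally {
            in.close();
        }
        System.out.println(handler.toString());
    }
}
```

On Hadoop, the same parser instance would be invoked inside a map task, which is why Tesseract has to be present at an identical location on every node (e.g. baked into the EC2 image, as Mark describes).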
