Well, it was a long story.

First of all, you need to make an AMI with Hadoop installed and configured.

Then I had to develop a script that would spawn the required number of
AMI server instances.

Then I copied the Nutch job executable to the master node.

After that you are able to crawl. Nutch stores the retrieved data on the
Hadoop file system, so to consume it I had to copy the index to the local FS.


These are the general steps. So, in short:

1. Launch the necessary Hadoop AMIs.
2. Wait till all servers boot.
3. Copy nutch.job to the master node.
4. Launch the crawler.
5. Copy the index to the local FS.
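
A rough sketch of how these five steps could be strung together, written
as a dry run that only builds the command lines and executes nothing. It is
not the scripts mentioned in this mail. The assumptions are: the AWS CLI is
used to start instances, the crawl is started through the Nutch 1.x class
org.apache.nutch.crawl.Crawl, the seed directory is already in HDFS, the
index ends up under <crawl_dir>/index, and the AMI ID, host name and key
file are placeholders.

```python
import shlex


def build_plan(ami_id, node_count, master, key_file,
               job_file="nutch.job", seed_dir="urls",
               crawl_dir="crawl", depth=3, local_dir="index-local"):
    """Return the five steps as (description, argv) pairs. Nothing is executed."""
    ssh = ["ssh", "-i", key_file, master]
    # Assumption: Nutch 1.x one-shot crawl class; seed_dir must already be in HDFS.
    crawl = shlex.join(["hadoop", "jar", job_file,
                        "org.apache.nutch.crawl.Crawl", seed_dir,
                        "-dir", crawl_dir, "-depth", str(depth)])
    # Assumption: the finished index ends up under <crawl_dir>/index in HDFS.
    fetch = shlex.join(["hadoop", "fs", "-copyToLocal",
                        f"{crawl_dir}/index", local_dir])
    return [
        ("launch Hadoop AMIs",
         ["aws", "ec2", "run-instances", "--image-id", ami_id,
          "--count", str(node_count)]),
        ("wait till all servers boot",
         ["aws", "ec2", "wait", "instance-running",
          "--filters", f"Name=image-id,Values={ami_id}"]),
        ("copy nutch.job", ["scp", "-i", key_file, job_file, f"{master}:"]),
        ("launch crawler", ssh + [crawl]),
        ("copy index to local FS", ssh + [fetch]),
    ]


if __name__ == "__main__":
    # Placeholder values; replace with a real AMI ID, host and key file.
    for number, (what, argv) in enumerate(
            build_plan("ami-00000000", 5, "hadoop@master.example", "key.pem"), 1):
        print(f"{number}. {what}: {shlex.join(argv)}")
```

Note that an instance in the "running" state has not necessarily finished
starting the Hadoop daemons, so a real script would probably also poll the
cluster before step 4. The last step copies the index to the master node's
local disk; one more scp would be needed to bring it to another machine.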


For this purpose I developed a few scripts. The main issue is to make a
correct Hadoop AMI (so that hadoop commands can execute) and to launch the
Hadoop cluster.
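
One possible way to check that an AMI is usable in this sense is to run a
couple of hadoop commands on a freshly booted node and look at the exit
status. This is only a sketch: the helper name command_runs and the choice
of the two commands are assumptions, not part of the scripts described here.

```python
import subprocess


def command_runs(argv, timeout=30):
    """Return True if argv can be started and exits with status 0."""
    try:
        result = subprocess.run(argv, stdout=subprocess.DEVNULL,
                                stderr=subprocess.DEVNULL, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired):
        # Command not installed, not executable, or it hung.
        return False
    return result.returncode == 0


if __name__ == "__main__":
    # "hadoop version" checks the installation; "hadoop fs -ls /" also needs
    # a reachable file system, so it only passes once the cluster is up.
    for argv in (["hadoop", "version"], ["hadoop", "fs", "-ls", "/"]):
        print(" ".join(argv), "->", "ok" if command_runs(argv) else "FAILED")
```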

This part doesn't depend on whether you run Nutch or anything else.

Try searching the public AMIs; there are a few that already have Hadoop
installed and configured.

Also, you may want to look at the Amazon Elastic MapReduce service. I believe
it uses a similar approach.


Best Regards
Alexander Aristov


On 23 February 2011 17:04, Paul Tomblin <[email protected]> wrote:

> Everything.  How did you set it up, how did you configure it, did you need
> to modify the nutch code to run on ElasticMapReduce, etc.
>
> On Tue, Feb 22, 2011 at 3:29 AM, Alexander Aristov <
> [email protected]> wrote:
>
>> Hi
>>
>> I did this. I ran the crawler on 5 EC2 nodes. What are you interested in?
>>
>>
>> Best Regards
>> Alexander Aristov
>>
>>
>>
>> On 21 February 2011 22:44, Paul Tomblin <[email protected]> wrote:
>>
>>> This is something I'm also interested in.  Please let me know if you get
>>> any responses.
>>>
>>> On Sun, Feb 6, 2011 at 9:34 PM, Amin Bandeali
>>> <[email protected]> wrote:
>>>
>>> > Has anybody installed Nutch on EC2 using AWS Elastic MapReduce
>>> > underneath?
>>> >
>>> > --
>>> > Amin Bandeali
>>> > Cell: 714.757.9544
>>> >
>>> > Follow me on twitter
>>> > http://twitter.com/aminbandeali
>>> >
>>>
>>>
>>>
>>> --
>>> http://www.linkedin.com/in/paultomblin
>>> http://careers.stackoverflow.com/ptomblin
>>>
>>
>>
>
>
