Hello Sachin,

Nutch can run on Amazon AWS without trouble, and probably on any other 
Hadoop-based provider. It is also the most expensive option you have.
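
For what it's worth, running on Hadoop does not change the Nutch commands 
much: you build the job artifact from source and submit it through the 
deploy scripts. A rough sketch for a Nutch 1.x source checkout, assuming a 
hadoop client is on the PATH (directory names are just examples):

    # build runtime/local and runtime/deploy from a source checkout
    ant runtime

    # the deploy wrapper submits the apache-nutch-*.job file via
    # "hadoop jar", so this runs on the cluster instead of locally
    runtime/deploy/bin/nutch inject crawl/crawldb urls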

A cheaper option is to rent some servers and install Hadoop yourself; 
getting it up and running by hand will take the better part of a day.
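
Most of that day goes into the config files and daemons. Very roughly, for 
a recent Hadoop 3.x tarball (hostnames and values here are placeholders, 
not a complete setup):

    # on the master, after unpacking Hadoop and setting JAVA_HOME:
    #   etc/hadoop/core-site.xml -> fs.defaultFS = hdfs://master:9000
    #   etc/hadoop/workers       -> one worker hostname per line
    bin/hdfs namenode -format    # format HDFS once
    sbin/start-dfs.sh            # start NameNode and DataNodes
    sbin/start-yarn.sh           # start ResourceManager and NodeManagers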

The cheapest and easiest option, and in almost all cases the best one, is 
not to run Nutch on Hadoop at all but to stay local. A local Nutch can 
easily handle a couple of million URLs. So unless you want to crawl many 
different domains and expect 10M+ URLs, stay local.
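
In local mode a crawl cycle is just the standard commands against a 
directory on disk. A minimal sketch, assuming your seed list sits in 
urls/ and output goes under crawl/ (both names are arbitrary):

    bin/nutch inject crawl/crawldb urls               # seed the CrawlDb
    bin/nutch generate crawl/crawldb crawl/segments   # select URLs to fetch
    s=`ls -d crawl/segments/* | tail -1`              # newest segment
    bin/nutch fetch $s                                # fetch the pages
    bin/nutch parse $s                                # parse fetched content
    bin/nutch updatedb crawl/crawldb $s               # fold results back in

Repeat generate/fetch/parse/updatedb for each round, or let the bin/crawl 
script drive the loop for you.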

When we started our business almost a decade ago, we first rented VPSs and 
later physical machines. That ran fine for some years, but once we were 
able to make some good investments, we bought our own hardware and have 
been scaling up the cluster ever since. With the previous and most recent 
AMD-based servers, processing power has become increasingly cheap.

If you need to scale up for the long term, getting your own hardware is 
indeed the best option.

Regards,
Markus
 
 
-----Original message-----
> From: Sachin Mittal <sjmit...@gmail.com>
> Sent: Tuesday 22nd October 2019 15:59
> To: user@nutch.apache.org
> Subject: Best and economical way of setting hadoop cluster for distributed 
> crawling
> 
> Hi,
> I have been running Nutch in local mode and so far I have gained a good
> understanding of how it all works.
> 
> I wanted to start with distributed crawling using some public cloud
> provider.
> 
> I just wanted to know whether fellow users have any experience in setting
> up Nutch for distributed crawling.
> 
> From the Nutch wiki I have some idea of what the hardware requirements
> should be.
> 
> I just wanted to know which of the public cloud providers (IaaS or PaaS)
> are good for setting up Hadoop clusters on. Basically, ones on which it is
> easy to set up and manage the cluster and which are easy on the budget.
> 
> Please let me know if you folks have any insights based on your experiences.
> 
> Thanks and Regards
> Sachin
> 
