Hi, I understood the point. I would also like to run nutch on my local machine.
So far I am running in standalone mode with the default crawl script, where the fetch time limit is 180 minutes. What I have observed is that it usually fetches, parses and indexes 1800 web pages. I am fetching the entire page, and the fetch phase is the one that takes the most time. I have an i7 processor with 16GB of RAM.

How can I increase the throughput here? My understanding is that in local mode there is only one thread doing the fetch, so I guess I would need multiple threads running in parallel. Would running nutch in pseudo-distributed mode be an answer here? It would then run multiple fetchers and I could increase my throughput.

Please let me know.

Thanks
Sachin

On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <markus.jel...@openindex.io> wrote:

> Hello Sachin,
>
> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop-based
> provider. This is the most expensive option you have.
>
> Cheaper would be to rent some servers and install Hadoop yourself; getting
> it up and running by hand on some servers will take the better part of a day.
>
> The cheapest and easiest, and in almost all cases the best option, is not
> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
> couple of million URLs. So unless you want to crawl many different domains
> and expect 10M+ URLs, stay local.
>
> When we first started our business almost a decade ago we rented VPSs
> first and then physical machines. This ran fine for some years, but when we
> had the option to make some good investments, we bought our own hardware
> and have been scaling up the cluster ever since. And with the previous and
> most recent AMD-based servers, processing power became increasingly cheaper.
>
> If you need to scale up for the long term, getting your own hardware is
> indeed the best option.
>
> Regards,
> Markus
>
>
> -----Original message-----
> > From: Sachin Mittal <sjmit...@gmail.com>
> > Sent: Tuesday 22nd October 2019 15:59
> > To: user@nutch.apache.org
> > Subject: Best and economical way of setting hadoop cluster for distributed crawling
> >
> > Hi,
> > I have been running nutch in local mode and so far I am able to have a
> > good understanding on how it all works.
> >
> > I wanted to start with distributed crawling using some public cloud
> > provider.
> >
> > I just wanted to know if fellow users have any experience in setting up
> > nutch for distributed crawling.
> >
> > From the nutch wiki I have some idea on what the hardware requirements
> > should be.
> >
> > I just wanted to know which of the public cloud providers (IaaS or PaaS)
> > are good to set up hadoop clusters on. Basically, ones on which it is
> > easy to setup/manage the cluster and ones which are easy on budget.
> >
> > Please let me know if you folks have any insights based on your
> > experiences.
> >
> > Thanks and Regards
> > Sachin
> >
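On the fetch-thread question above: in a standard Nutch 1.x setup, local mode does run multiple fetch threads; the count is controlled by the `fetcher.threads.fetch` property (default 10 in nutch-default.xml), which can be overridden in conf/nutch-site.xml. A minimal sketch, assuming Nutch 1.x defaults (the chosen values are illustrative, not recommendations):

```xml
<!-- conf/nutch-site.xml (sketch, assuming Nutch 1.x defaults) -->
<configuration>
  <!-- Total fetch threads used by the Fetcher, also honoured in local mode
       (default is 10 in nutch-default.xml) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>50</value>
  </property>
  <!-- Threads allowed per host queue; raising this increases the load on
       each individual site, so keep it polite when crawling external hosts -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>2</value>
  </property>
</configuration>
```

The 180-minute limit mentioned above corresponds to the `fetcher.timelimit.mins` property, which the default crawl script sets. Note that extra threads only help if the crawl spans enough distinct hosts, since per-host politeness limits still apply; pseudo-distributed mode on a single machine mostly adds Hadoop overhead rather than fetch parallelism.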