Hello Sachin,

You might want to check out the fetcher.* settings in your configuration. They
control how many fetcher threads run in total, how URLs are queued, the delay
between successive fetches to the same host, how many threads work on each
queue, and so on.
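
For instance, a minimal sketch of what such an override in conf/nutch-site.xml
could look like (the property names are the ones from nutch-default.xml; the
values below are only placeholders, not recommendations):

  <!-- conf/nutch-site.xml: example values only, tune with care -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>     <!-- total number of fetcher threads -->
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>      <!-- threads allowed on the same queue (host) at once -->
  </property>
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>    <!-- seconds to wait between requests to the same host -->
  </property>
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value> <!-- group URLs into queues by host, domain or IP -->
  </property>

Raising fetcher.threads.fetch mostly helps when you crawl many different hosts;
the per-host politeness settings (delay and threads per queue) still limit how
fast any single site is fetched.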

Keep in mind that if you do not own the server or have no explicit permission,
it is wise not to overdo it (the default settings are recommended); you can
easily bring down a website even with Nutch in local mode.

Regards,
Markus
 
 
-----Original message-----
> From:Sachin Mittal <sjmit...@gmail.com>
> Sent: Friday 1st November 2019 6:53
> To: user@nutch.apache.org
> Subject: Re: Best and economical way of setting hadoop cluster for 
> distributed crawling
> 
> Hi,
> I understood the point.
> I would also like to run Nutch on my local machine.
> 
> So far I am running in standalone mode with the default crawl script, where
> the fetch time limit is 180 minutes.
> What I have observed is that it usually fetches, parses and indexes 1800
> web pages.
> I am basically fetching entire pages, and the fetch step is the one that
> takes the most time.
> 
> I have an i7 processor with 16GB of RAM.
> 
> How can I increase the throughput here?
> Is my understanding correct that in local mode there is only one thread
> doing the fetching?
> 
> I guess I would need multiple threads running in parallel.
> Would running Nutch in pseudo-distributed mode be an answer here?
> It would then run multiple fetchers and I could increase my throughput.
> 
> Please let me know.
> 
> Thanks
> Sachin
> 
> On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <markus.jel...@openindex.io>
> wrote:
> 
> > Hello Sachin,
> >
> > Nutch can run on Amazon AWS without trouble, and probably on any
> > Hadoop-based provider. This is the most expensive option you have.
> >
> > Cheaper would be to rent some servers and install Hadoop yourself; getting
> > it up and running by hand on a few servers will take the better part of a
> > day.
> >
> > The cheapest and easiest option, and in almost all cases the best one, is
> > not to run Nutch on Hadoop at all and to stay local. A local Nutch can
> > easily handle a couple of million URLs. So unless you want to crawl many
> > different domains and expect 10M+ URLs, stay local.
> >
> > When we first started our business almost a decade ago we rented VPSs
> > first and then physical machines. This ran fine for some years, but when we
> > had the option to make some good investments, we bought our own hardware
> > and have been scaling up the cluster ever since. With the previous and most
> > recent AMD-based servers, processing power has become increasingly cheap.
> >
> > If you need to scale up for the long term, getting your own hardware is
> > indeed the best option.
> >
> > Regards,
> > Markus
> >
> >
> > -----Original message-----
> > > From:Sachin Mittal <sjmit...@gmail.com>
> > > Sent: Tuesday 22nd October 2019 15:59
> > > To: user@nutch.apache.org
> > > Subject: Best and economical way of setting hadoop cluster for
> > > distributed crawling
> > >
> > > Hi,
> > > I have been running Nutch in local mode and so far I have gained a good
> > > understanding of how it all works.
> > >
> > > I wanted to start with distributed crawling using some public cloud
> > > provider.
> > >
> > > I just wanted to know if fellow users have any experience in setting up
> > > Nutch for distributed crawling.
> > >
> > > From the Nutch wiki I have some idea of what the hardware requirements
> > > should be.
> > >
> > > I just wanted to know which of the public cloud providers (IaaS or PaaS)
> > > are good for setting up Hadoop clusters on, basically ones on which it is
> > > easy to set up and manage the cluster and which are easy on the budget.
> > >
> > > Please let me know if you folks have any insights based on your
> > > experiences.
> > >
> > > Thanks and Regards
> > > Sachin
> > >
> >
> 
