OK understood. I am using the Nutch defaults and they are set optimally, especially for polite crawling. I am indeed crawling just one host right now, and given the defaults the throughput is what it should be.
Yes, one need not be aggressive here, just patient. I don't expect to have over 10M URLs across thousands of hosts anywhere in the near future, and local crawling is just fine in my case. So I will continue the way it is right now.

Thanks
Sachin

On Fri, Nov 1, 2019 at 7:36 PM Sebastian Nagel <wastl.na...@googlemail.com.invalid> wrote:

> Hi Sachin,
>
> > What I have observed is that it usually fetches, parses and indexes
> > 1800 web pages.
>
> This means 10 pages per minute.
>
> How are the 1800 pages distributed over hosts?
>
> The default delay between successive fetches to the same host is
> 5 seconds. If all pages belong to the same host, the crawler is
> waiting 50 sec. every minute and the fetching is done in the remaining
> 10 sec.
>
> If you have the explicit permission to access the host(s) aggressively,
> you can decrease the delay (fetcher.server.delay) or even fetch in
> parallel from the same host (fetcher.threads.per.queue).
> Otherwise, please keep the delay as is and be patient and polite! You
> also risk getting blocked by the web admin.
>
> > What I have understood here is that in local mode there is only one
> > thread doing the fetch?
>
> No. The number of parallel threads used in bin/crawl is 50.
>   --num-threads <num_threads>  Number of threads for fetching / sitemap
>                                processing [default: 50]
>
> I can only second Markus: local mode is sufficient unless you're crawling
> - significantly more than 10M+ URLs
> - from 1000+ domains
>
> With fewer domains/hosts there's nothing to distribute, because all
> URLs of one domain/host are processed in one fetcher task to ensure
> politeness.
>
> Best,
> Sebastian
>
> On 11/1/19 6:53 AM, Sachin Mittal wrote:
> > Hi,
> > I understood the point.
> > I would also like to run Nutch on my local machine.
> >
> > So far I am running in standalone mode with the default crawl script,
> > where the fetch time limit is 180 minutes.
> > What I have observed is that it usually fetches, parses and indexes
> > 1800 web pages.
> > I am basically fetching the entire page, and the fetch process is the
> > one that takes maximum time.
> >
> > I have an i7 processor with 16GB of RAM.
> >
> > How can I increase the throughput here?
> > What I have understood here is that in local mode there is only one
> > thread doing the fetch?
> >
> > I guess I would need multiple threads running in parallel.
> > Would running Nutch in pseudo-distributed mode be an answer here?
> > It would then run multiple fetchers and I could increase my throughput.
> >
> > Please let me know.
> >
> > Thanks
> > Sachin
> >
> > On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <markus.jel...@openindex.io>
> > wrote:
> >
> >> Hello Sachin,
> >>
> >> Nutch can run on Amazon AWS without trouble, and probably on any
> >> Hadoop-based provider. This is the most expensive option you have.
> >>
> >> Cheaper would be to rent some servers and install Hadoop yourself;
> >> getting it up and running by hand on some servers will take the better
> >> part of a day.
> >>
> >> The cheapest and easiest, and in almost all cases the best option, is
> >> not to run Nutch on Hadoop and to stay local. A local Nutch can easily
> >> handle a couple of million URLs. So unless you want to crawl many
> >> different domains and expect 10M+ URLs, stay local.
> >>
> >> When we first started our business almost a decade ago we rented VPSs
> >> first and then physical machines. This ran fine for some years, but
> >> when we had the option to make some good investments, we bought our own
> >> hardware and have been scaling up the cluster ever since. And with the
> >> previous and most recent AMD-based servers, processing power became
> >> increasingly cheaper.
> >>
> >> If you need to scale up for the long term, getting your own hardware is
> >> indeed the best option.
> >>
> >> Regards,
> >> Markus
> >>
> >>
> >> -----Original message-----
> >>> From: Sachin Mittal <sjmit...@gmail.com>
> >>> Sent: Tuesday 22nd October 2019 15:59
> >>> To: user@nutch.apache.org
> >>> Subject: Best and economical way of setting hadoop cluster for
> >>> distributed crawling
> >>>
> >>> Hi,
> >>> I have been running Nutch in local mode, and so far I have gained a
> >>> good understanding of how it all works.
> >>>
> >>> I wanted to start with distributed crawling using some public cloud
> >>> provider.
> >>>
> >>> I just wanted to know if fellow users have any experience in setting
> >>> up Nutch for distributed crawling.
> >>>
> >>> From the Nutch wiki I have some idea of what the hardware requirements
> >>> should be.
> >>>
> >>> I just wanted to know which of the public cloud providers (IaaS or
> >>> PaaS) are good for setting up Hadoop clusters on: basically, ones on
> >>> which it is easy to set up and manage the cluster, and which are easy
> >>> on the budget.
> >>>
> >>> Please let me know if you folks have any insights based on your
> >>> experiences.
> >>>
> >>> Thanks and Regards
> >>> Sachin
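[Editor's note: for readers who, unlike Sachin, do have explicit permission to crawl a host aggressively, the two properties Sebastian mentions are set in conf/nutch-site.xml. Below is a minimal sketch; the property names come from Sebastian's message, but the values (2-second delay, 2 threads per queue) are purely illustrative assumptions, not recommendations -- the defaults are 5.0 and 1.]

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: overrides for nutch-default.xml.
     Only lower these values with the host owner's permission. -->
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <!-- Seconds to wait between successive fetches to the same host.
         Default is 5.0; value below is an illustrative assumption. -->
    <value>2.0</value>
  </property>
  <property>
    <name>fetcher.threads.per.queue</name>
    <!-- Threads fetching from the same host/queue in parallel.
         Default is 1 (strictly polite); 2 is an illustrative assumption. -->
    <value>2</value>
  </property>
</configuration>
```

With the defaults (5.0 and 1), a single-host crawl is bounded at roughly 12 pages per minute, which matches the ~10 pages/minute Sebastian computes above.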