Re: Best and economical way of setting hadoop cluster for distributed crawling

Sebastian Nagel Fri, 01 Nov 2019 07:07:24 -0700

Hi Sachin,

> What I have observed is that it usually fetches, parses and indexes
> 1800 web pages.


With the fetch time limit of 180 minutes, this means 10 pages per minute.

How are the 1800 pages distributed over hosts?

The default delay between successive fetches to the same host is
5 seconds. If all pages belong to the same host, the crawler spends
about 50 seconds of every minute waiting and does the actual fetching
in the remaining 10 seconds.
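
As a rough sketch of that arithmetic in Python: the 1-second fetch time
per page is an assumption inferred from the 10 seconds of actual fetching
per minute, not a measured value, and the function name is made up for
this example.

```python
def pages_per_minute(delay_s: float, fetch_s: float) -> float:
    """Upper bound on pages per minute from a single host when one
    request is in flight at a time and the crawler waits delay_s
    seconds between successive fetches."""
    return 60.0 / (delay_s + fetch_s)


# Default 5 s delay plus an assumed 1 s per fetch.
rate = pages_per_minute(delay_s=5.0, fetch_s=1.0)

# Pages fetched within the 180-minute fetch time limit.
total = rate * 180

print(rate, total)
```

With these assumptions the estimate matches the observed 1800 pages, so
the per-host delay alone is enough to explain the throughput.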

If you have explicit permission to access the host(s) aggressively, you can
decrease the delay (fetcher.server.delay) or even fetch in parallel from the
same host (fetcher.threads.per.queue). Otherwise, please keep the delay as is
and be patient and polite! Crawling too aggressively also risks getting you
blocked by the web admin.

> What I have understood here is that in local mode there is only one
> thread doing the fetch?

No. The number of parallel threads used in bin/crawl is 50.
 --num-threads <num_threads>
    Number of threads for fetching / sitemap processing [default: 50]

I can only second Markus: local mode is sufficient unless you're crawling
- significantly more than 10M URLs
- from 1000+ domains

With fewer domains/hosts there's nothing to distribute because all
URLs of one domain/host are processed in one fetcher task to ensure
politeness.
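
To illustrate why few hosts leave nothing to distribute, here is a minimal
Python sketch of host-based partitioning. It is not Nutch's actual
partitioner: the function name, the SHA-256 hash and the example URLs are
assumptions made for illustration only.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def partition_by_host(urls, num_tasks):
    """Assign every URL to a task chosen from a hash of its host,
    so that all URLs of one host end up in the same task."""
    tasks = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        digest = hashlib.sha256(host.encode("utf-8")).digest()
        task = int.from_bytes(digest[:4], "big") % num_tasks
        tasks[task].append(url)
    return dict(tasks)


# 100 URLs from a single (hypothetical) host, 4 fetcher tasks available.
single_host = [f"https://example.com/page{i}" for i in range(100)]
assignment = partition_by_host(single_host, num_tasks=4)

# Number of tasks that actually received work.
print(len(assignment))
```

However many tasks are available, a single host keeps only one of them
busy; the others have nothing to fetch.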

Best,
Sebastian

On 11/1/19 6:53 AM, Sachin Mittal wrote:
> Hi,
> I understood the point.
> I would also like to run nutch on my local machine.
> 
> So far I am running in standalone mode with default crawl script where
> fetch time limit is 180 minutes.
> What I have observed is that it usually fetches, parses and indexes 1800
> web pages.
> I am basically fetching the entire page and fetch process is one that takes
> maximum time.
> 
> I have an i7 processor with 16GB of RAM.
> 
> How can I increase the throughput here?
> What I have understood here is that in local mode there is only one thread
> doing the fetch?
> 
> I guess I would need multiple threads running in parallel.
> Would running nutch in pseudo distributed mode be an answer here?
> It will then run multiple fetchers and I can increase my throughput.
> 
> Please let me know.
> 
> Thanks
> Sachin
> 
> 
> 
> 
> 
> 
> On Thu, Oct 31, 2019 at 2:40 AM Markus Jelsma <[email protected]>
> wrote:
> 
>> Hello Sachin,
>>
>> Nutch can run on Amazon AWS without trouble, and probably on any Hadoop
>> based provider. This is the most expensive option you have.
>>
>> Cheaper would be to rent some servers and install Hadoop yourself. Getting
>> it up and running by hand on some servers will take the better part of a
>> day.
>>
>> The cheapest and easiest, and in almost all cases the best option, is not
>> to run Nutch on Hadoop and stay local. A local Nutch can easily handle a
>> couple of million URLs. So unless you want to crawl many different domains
>> and expect 10M+ URLs, stay local.
>>
>> When we first started our business almost a decade ago we rented VPSs
>> first and then physical machines. This ran fine for some years but when we
>> had the option to make some good investments, we bought our own hardware
>> and have been scaling up the cluster ever since. And with the previous and
>> most recent AMD-based servers, processing power has become increasingly cheap.
>>
>> If you need to scale up for long term, getting your own hardware is indeed
>> the best option.
>>
>> Regards,
>> Markus
>>
>>
>> -----Original message-----
>>> From:Sachin Mittal <[email protected]>
>>> Sent: Tuesday 22nd October 2019 15:59
>>> To: [email protected]
>>> Subject: Best and economical way of setting hadoop cluster for
>>> distributed crawling
>>>
>>> Hi,
>>> I have been running nutch in local mode and so far I am able to have a
>>> good understanding on how it all works.
>>>
>>> I wanted to start with distributed crawling using some public cloud
>>> provider.
>>>
>>> I just wanted to know if fellow users have any experience in setting up
>>> nutch for distributed crawling.
>>>
>>> From nutch wiki I have some idea on what hardware requirements should be.
>>>
>>> I just wanted to know which of the public cloud providers (IaaS or PaaS)
>>> are good to setup hadoop clusters on. Basically ones on which it is easy
>>> to setup/manage the cluster and ones which are easy on budget.
>>>
>>> Please let me know if you folks have any insights based on your
>>> experiences.
>>>
>>> Thanks and Regards
>>> Sachin
>>>
>>
>
