Hello Joseph, see inline.

Regards,
Markus
 
-----Original message-----
> From:Joseph Naegele <[email protected]>
> Sent: Monday 16th May 2016 20:40
> To: [email protected]
> Subject: pros/cons of many nodes
> 
> Hi folks,
> 
>  
> 
> Would anyone be willing to share a few pros/cons of using many nodes vs. 1
> very powerful machine for large-scale crawling? Of course many advantages
> and disadvantages overlap with Hadoop and distributed computing in general,
> but what I'm actually looking for are good reasons not to use a single
> machine for Nutch.

You want or need multiple machines for two reasons:
1. the sheer volume of your CrawlDb demands it, or
2. you want replication of data and high availability.

The issue of high availability with one machine is clear: it is not highly 
available. I certainly prefer at least three nodes, with two YARN and HDFS 
masters, although that high-availability setup doesn't always work yet. If you 
have a few million URLs to crawl, three small and cheap nodes are enough.

> 
>  
> 
> One example could be that more machines give you more IP addresses for
> fetching, and therefore less opportunity for being blocked by web admins,
> correct?

Correct, but if a web admin really doesn't like your crawler, you can be sure 
that your other IPs are going to be blocked eventually. Besides, you cannot 
control which fetcher a given queue is going to run on.
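What you can control is how fetch queues are formed and throttled, which matters more for politeness than which node they land on. A minimal nutch-site.xml sketch (property names as found in Nutch's default configuration; the values are just illustrative):

```xml
<configuration>
  <!-- Group URLs into fetch queues by host (alternatives: byDomain, byIP) -->
  <property>
    <name>fetcher.queue.mode</name>
    <value>byHost</value>
  </property>
  <!-- Seconds to wait between successive requests to the same queue -->
  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>
  <!-- One fetch thread per queue keeps the crawler polite per host -->
  <property>
    <name>fetcher.threads.per.queue</name>
    <value>1</value>
  </property>
</configuration>
```

With settings like these, the per-host request rate stays bounded no matter which fetcher task the queue ends up on.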

> 
>  
> 
> Joe
> 
> 
