Hello - if you want to crawl at large scale, performance is not really a problem, especially with Hadoop output file compression enabled. We chose 1.x simply because it is more stable and feature-rich.
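For reference, output compression is switched on through standard Hadoop job properties (e.g. in mapred-site.xml, or per-job in nutch-site.xml). A minimal sketch, assuming Hadoop 2.x property names - older Hadoop 1.x releases use mapred.output.compress and mapred.compress.map.output instead:

```xml
<!-- Sketch: compress final job output and intermediate map output.
     Property names assume Hadoop 2.x; adjust for your Hadoop version. -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```

The codec is a choice: DefaultCodec (zlib) is a safe baseline, while SnappyCodec trades some compression ratio for speed if the native libraries are installed.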
Using 1.x, it is quite easy to crawl a billion records. Also, do not run on many small machines: the overhead will kill your cluster-wide performance. It is a complete waste of resources.

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Sunday 30th October 2016 18:22
> To: [email protected]
> Subject: Re: Nutch 1.x or 2.x
>
> Newbie question: I am trying to decide between Nutch 1.x and 2.x. The
> application is to crawl a large portion of the www using a massive number
> (thousands) of small machines (<= 2GB RAM each). I like the idea of the
> simpler architecture and pluggable storage backend of 2.x. However, I am
> concerned about things I've read about 2.x being less stable and possibly
> less efficient than 1.x. Are these concerns valid at this time?

