Hello - if you want to crawl at large scale, performance is not really a problem, especially with Hadoop output file compression enabled. We chose 1.x simply because it is more stable and feature-rich.
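For reference, output compression is switched on through standard Hadoop job properties (e.g. in mapred-site.xml, or per-job in nutch-site.xml). A minimal sketch, assuming Hadoop 2.x property names - older Hadoop 1.x releases use mapred.output.compress and mapred.compress.map.output instead:

```xml
<!-- Sketch: compress final job output and intermediate map output.
     Property names assume Hadoop 2.x; adjust for your Hadoop version. -->
<property>
  <name>mapreduce.output.fileoutputformat.compress</name>
  <value>true</value>
</property>
<property>
  <name>mapreduce.output.fileoutputformat.compress.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
<property>
  <name>mapreduce.map.output.compress</name>
  <value>true</value>
</property>
```

The codec is a choice: DefaultCodec (zlib) is a safe baseline, while SnappyCodec trades some compression ratio for speed if the native libraries are installed.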
Using 1.x, it is quite easy to crawl a billion records. Also, do not run on many small machines: the overhead will kill your cluster-wide performance. It is a complete waste of resources.

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Sunday 30th October 2016 18:22
> To: [email protected]
> Subject: Re: Nutch 1.x or 2.x
>
> Newbie question: I am trying to decide between Nutch 1.x and 2.x. The
> application is to crawl a large portion of the www using a massive number
> (thousands) of small machines (<= 2GB RAM each). I like the idea of the
> simpler architecture and pluggable storage backend of 2.x. However, I am
> concerned about things I've read about 2.x being less stable and possibly
> less efficient than 1.x. Are these concerns valid at this time?

