1.x is stable in the sense that it relies on old, proven technology. The underlying principle of 1.x has not changed much over the years. 2.x, with Gora, had trouble years ago, although much less these days.
The point with Gora is that Gora itself and the chosen storage backend can each introduce problems. There are simply more points of failure. One example is choosing Mongo as the backend, with its 512-byte limit on the key field. This causes problems for long URLs, especially URLs made of 4-byte CJK characters, which are then limited to 128 characters. The list is almost endless: Cassandra is not very stable out of the box, and HBase sometimes throws peculiar errors out of nowhere and recently led to data loss. Does Gora have support for Solr? SolrCloud has finally been very stable for a few years now.

This just illustrates the point that 2.x introduces new pieces for the developer or your system administrator to worry about. It will hurt you if you do not have experience with and knowledge of these systems. 1.x does not have that problem to the same degree, and it provides more features, which you will probably end up porting to 2.x if you want them.

I would also like to take the opportunity again to advise against using many low-powered machines instead of fewer high-octane machines. It is a very bad idea and extremely cost-ineffective. Such a setup will also certainly break with default Hadoop settings. Settings must change in large-scale clusters, including settings you might not know about yet. The number of file descriptors needed alone requires reconfiguring certain settings.

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Monday 31st October 2016 19:24
> To: [email protected]
> Subject: Re: Nutch 1.x or 2.x
>
> When you say that 1.x is more stable, what does that mean?
>
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Monday, October 31, 2016 9:39 AM
> Subject: RE: Nutch 1.x or 2.x
>
> Hello - if you want to crawl big, performance is not really a problem,
> especially using Hadoop output file compression. We chose 1.x, simply because
> it is more stable and feature rich.
>
> Using 1.x, it is quite easy to crawl a billion records.
>
> Also, do not run on many small machines, your overhead will kill your cluster
> wide performance. It is a complete waste of resources.
>
> -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Sunday 30th October 2016 18:22
> > To: [email protected]
> > Subject: Re: Nutch 1.x or 2.x
> >
> > Newbie question: I am trying to decide between Nutch 1.x or 2.x. The
> > application is to crawl a large portion of the www using a massive number
> > (thousands) of small machines (<= 2GB RAM each). I like the idea of the
> > simpler architecture and pluggable storage backend of 2.x. However, I am
> > concerned about things I've read about 2.x being less stable and possibly
> > less efficient than 1.x. Are these concerns valid at this time?
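
To make the key-length arithmetic from the Mongo example in this thread concrete, here is a minimal Python sketch. It assumes the 512-byte key limit quoted above (not verified here) and assumes the key is simply the UTF-8 encoded URL. The names `MAX_KEY_BYTES`, `key_size` and `fits_key_limit` are hypothetical and only for illustration; they are not part of Nutch, Gora, or any Mongo driver.

```python
# Assumption: the 512-byte key limit is the figure quoted in this thread.
MAX_KEY_BYTES = 512


def key_size(url: str) -> int:
    """Return the size in bytes of the URL when encoded as UTF-8."""
    return len(url.encode("utf-8"))


def fits_key_limit(url: str, limit: int = MAX_KEY_BYTES) -> bool:
    """Return True if the UTF-8 encoded URL fits within the byte limit."""
    return key_size(url) <= limit


# An ASCII URL uses one byte per character.
ascii_url = "http://example.com/" + "a" * 400

# U+20000 is a CJK Extension B ideograph that takes 4 bytes in UTF-8.
four_byte_char = "\U00020000"

for label, url in [
    ("ascii, 419 chars", ascii_url),
    ("4-byte CJK, 128 chars", four_byte_char * 128),
    ("4-byte CJK, 129 chars", four_byte_char * 129),
]:
    print(label, key_size(url), "bytes, fits:", fits_key_limit(url))
```

With 4-byte characters the limit is reached at 128 characters, which is the worst case mentioned above. Most common CJK characters take 3 bytes in UTF-8, so such URLs would fit somewhat more characters, but still far fewer than an ASCII URL.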

