RE: Nutch 1.x or 2.x

Markus Jelsma Mon, 31 Oct 2016 12:25:08 -0700

It is stable in the sense that it relies on old proven technology. The 
underlying principle of 1.x has not changed much over the years. 2.x, with 
Gora, had trouble years ago, although much less these days.


The point with Gora is that Gora itself, and the chosen storage backend, could 
introduce problems. There are simply more points of failure. One example is 
choosing Mongo as the backend, with a 512 byte limit in the key field. This will 
cause problems for long URLs, especially URLs made of 4 byte CJK characters, 
which are limited to 128 characters. The list is almost endless: Cassandra is 
not very stable out-of-the-box, and HBase sometimes has peculiar errors coming 
from nowhere, which recently led to data loss. Does Gora have support for Solr? 
SolrCloud has finally been very stable for a few years.
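
As a rough illustration of the key-size arithmetic above, here is a minimal 
Python sketch. It assumes the backend stores the raw URL as a UTF-8 encoded key 
and uses the 512 byte figure from this message. The function names are made up 
for the example and are not part of Nutch, Gora, or any Mongo driver.

```python
# Assumption: the storage key is the raw URL encoded as UTF-8.
# 512 is the key limit quoted in this message; check your own backend.
KEY_LIMIT_BYTES = 512


def key_size(url: str) -> int:
    """Return the number of bytes the URL occupies when encoded as UTF-8."""
    return len(url.encode("utf-8"))


def fits_key_limit(url: str, limit: int = KEY_LIMIT_BYTES) -> bool:
    """Return True if the encoded URL is no larger than the key limit."""
    return key_size(url) <= limit


def max_chars(bytes_per_char: int, limit: int = KEY_LIMIT_BYTES) -> int:
    """Return how many characters of a fixed byte width fit in the limit."""
    return limit // bytes_per_char
```

With these assumptions, max_chars(4) gives the 128 character ceiling mentioned 
above, while ASCII-only URLs can use the full 512 characters.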

This just illustrates the point that 2.x introduces new pieces that the 
developer or your system administrator has to worry about. It will hurt you if 
you haven't got the experience and knowledge of these systems. 1.x does not add 
that burden, and it provides more features, which you will probably end up 
porting to 2.x if you want them.

I would also like to take the opportunity again to advise against using many 
low-powered machines instead of fewer high-powered machines. It is a very bad 
idea and extremely cost-ineffective. Such a setup will also certainly break the 
default Hadoop settings. Settings must change in large-scale clusters, 
including settings you might not yet know about. The number of file descriptors 
needed alone requires reconfiguring certain settings.
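
To see where a node stands on file descriptors before changing anything, a 
minimal Python sketch along these lines may help. It only reads the per-process 
limit through the standard resource module (Unix only). The helper names and 
any value you pass as "needed" are assumptions for illustration, not numbers 
taken from Hadoop or Nutch.

```python
import resource


def fd_limits():
    """Return the (soft, hard) open-file limits for the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def has_enough_fds(needed, soft=None):
    """Return True if the soft limit allows at least `needed` descriptors.

    `needed` is a hypothetical target chosen by you for your cluster.
    """
    if soft is None:
        soft, _ = fd_limits()
    # RLIM_INFINITY means no limit is enforced at all.
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= needed
```

If has_enough_fds() returns False for the number you expect to need, the limit 
has to be raised at the operating system level for the user running Hadoop.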
 
 
-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Monday 31st October 2016 19:24
> To: [email protected]
> Subject: Re: Nutch 1.x or 2.x
> 
> When you say that 1.x is more stable, what does that mean?
> 
> 
>       From: Markus Jelsma <[email protected]>
>  To: "[email protected]" <[email protected]> 
>  Sent: Monday, October 31, 2016 9:39 AM
>  Subject: RE: Nutch 1.x or 2.x
>    
> Hello - if you want to crawl big, performance is not really a problem, 
> especially using Hadoop output file compression. We chose 1.x, simply because 
> it is more stable and feature rich.
> 
> Using 1.x, it is quite easy to crawl a billion records.
> 
> Also, do not run on many small machines; the overhead will kill your 
> cluster-wide performance. It is a complete waste of resources.
>  
> -----Original message-----
> > From:Michael Coffey <[email protected]>
> > Sent: Sunday 30th October 2016 18:22
> > To: [email protected]
> > Subject: Re: Nutch 1.x or 2.x
> > 
> > Newbie question: I am trying to decide between Nutch 1.x or 2.x. The 
> > application is to crawl a large portion of the www using a massive number 
> > (thousands) of small machines (<= 2GB RAM each). I like the idea of the 
> > simpler architecture and pluggable storage backend of 2.x. However, I am 
> > concerned about things I've read about 2.x being less stable and possibly 
> > less efficient than 1.x. Are these concerns valid at this time?
> > 
> > 
> > 
> > 
> >    
> 
>
