Hi Guy,

> I'm confused as to what are the significant differences between 1.x and
> 2.x.
> Is there a bit of history that I could read about why the development of
> the two parallel to each other happened?

See for instance https://www.youtube.com/watch?v=KyHPBtRlo80 (in particular around 28:00). There are other resources in http://wiki.apache.org/nutch/Presentations which explain the differences.

> As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
> path would be best for me to follow. So far, 1.x has appeared to be the
> best choice for me, but is that going to change in the next iteration?
> Confused. And a little scared.

Don't worry, Nutch 1.x (i.e. HDFS-based) will definitely stay. As explained in the discussion with Lewis, naming Nutch-GORA '2.x' was probably a bit of a mistake. Both flavours of Nutch will keep living parallel existences.

Julien

PS: all this and a lot more will be explained at the Nutch workshop at ApacheCon EU http://sched.co/1pbE15n <http://wiki.apache.org/nutch/Presentations> as well as Sebastian's talk http://sched.co/1nyYa7b

> Guy McDowell
> [email protected]
> http://www.GuyMcDowell.com
>
> On Fri, Aug 29, 2014 at 11:29 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>
> > +1, great.
> >
> > I'd like to have a conversation about versioning.
> >
> > Since we're at 1.9, my suggestion would be to have the
> > next in the trunk series (1.x) move to version 3.x post
> > 1.9 for the release.
> >
> > Nutch2 remains Nutch and can be worked on there. That
> > would give us a nice split in the diversionary branch
> > paths for Nutch.
> >
> > Cheers,
> > Chris
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW: http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> > -----Original Message-----
> > From: Julien Nioche <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Friday, August 29, 2014 1:35 AM
> > To: "[email protected]" <[email protected]>
> > Subject: Re: [RELEASE] Apache Nutch 1.9
> >
> > >Hi Lewis,
> > >
> > >A few comments below.
> > >
> > >> I use Nutch 2.x as it enables me to do analytics over the data I am
> > >> crawling. This is my justification for trying to maintain and further the
> > >> development on that branch over the last while.
> > >
> > >Just out of interest, what sort of analytics do you do and why is it better
> > >to do it in 2.x than 1.x?
> > >
> > >> I am also extremely interested in the technologies supported within the
> > >> Nutch 2.X stack and I like keeping up with their development and using them
> > >> to fix my problems if and when the problems arise.
> > >> I like having fine grained control over my storage architecture. This is
> > >> also a pro for me.
> > >
> > >Another way to look at it is that having to maintain 2 versions in Nutch is
> > >an absolute pain, especially given that there aren't very many active
> > >committers.
> > >IMHO the mistake we made a few years ago was to name the GORA-based branch
> > >'2.x' as it leads people to think that it is an improvement over 1.x.
> > >We should have called it something like Nutch-GORA or something along these
> > >lines (the original version was called NutchBase) to underline that it is a
> > >different beast, not necessarily a better one.
> > >
> > >Most users are probably not bothered about the underlying technologies so much
> > >and just want the stuff to work, not fix problems. In my view 2.x is not
> > >production ready, but an experimental branch.
> > >
> > >> The performance Julien talks about (and please correct me if I am wrong
> > >> Julien) is not so much Nutch related as it is Gora. Different Gora backends
> > >> perform differently, this is itself driven by who wishes to maintain them.
> > >
> > >Not really. The overall performance has improved a bit with the latest
> > >version of GORA but is not that different from what we reported in
> > >http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
> > >Some backends are probably better than others indeed but all of them are
> > >atrocious compared to 1.x. I think the reason for that is that these NoSQL
> > >tools are optimized to provide random reads/writes to the data and in Nutch
> > >we use them mostly in a sequential manner. Whether the functionalities we
> > >gain are worth the effort depends on everyone's use case.
> > >
> > >> On another note, we've identified that for users, Nutch 2.X is a bloody
> > >> pain to provision and get running. This is a problem for this branch and
> > >> for the people that invest and possibly waste time trying to determine
> > >> revisions, etc.
> > >
> > >Could not agree more. That and the fact that it puts additional constraints
> > >on the hardware and means servers with bigger specs (££££)
> > >
> > >> It is my intention to build different Vagrant flavours for each Nutch 2.X
> > >> stack.
> > >> https://issues.apache.org/jira/browse/NUTCH-1812
> > >>
> > >> If ANYONE on this list is interested in helping with this effort then I
> > >> would dedicate some time to document the process on the wiki so that it can
> > >> be reproduced for everyone's benefit. I feel that this would be a huge move
> > >> forward for the 2.X branch.
> > >
> > >Thanks for your enthusiasm and efforts Lewis!
> > >
> > >For anyone interested in 2.x - there are quite a few issues you can help
> > >with if you feel so inclined, see
> > >
> > >https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> > >
> > >Julien
> > >
> > >--
> > >
> > >Open Source Solutions for Text Engineering
> > >
> > >http://digitalpebble.blogspot.com/
> > >http://www.digitalpebble.com
> > >http://twitter.com/digitalpebble

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

