Hi Julien,

On Fri, Aug 29, 2014 at 6:01 AM, <[email protected]> wrote:
>
> Just out of interest, what sort of analytics do you do and why is it better
> to do it in 2.x than 1.x?
>

Nowhere did I say it was better or worse than in 1.X. Let me be clear here. I use Nutch 2.X, as I indicated, because it provides me with fine-grained control over my storage layer. For some time, since presenting some of our work on Gora at Cassandra Summit in 2013, I have been working with DataStax products and Cassandra in general. I like the technology and I am interested in it.

http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-hadoop

The scope and nature of the analytics I can do once my data is in C* is wide and varied. Some of this is for my own projects, some of it is more closely aligned with work, and some of it is for the ongoing work I do with Gora.

To bring this home, there is certainly a wide and varied set of tools out there to do analytics on data once it is stored in HDFS or somewhere else by Nutch 1.X. No one is disputing this so far, AFAICT.

>
> Another way to look at it is that having to maintain 2 versions in Nutch is
> an absolute pain, especially given that there aren't very many active
> committers.
>

My immediate answer to this one is to agree with you completely. However, I suppose it depends on which way you look at it: user vs. developer vs. committer. Users who already have an HBase cluster may wish to use Nutch 2.X over, say, 1.X; this is an entirely reasonable assumption. That assumption alone justifies continued development of 2.X by anyone who is willing to do it.

> IMHO the mistake we made a few years ago was to name the GORA-based branch
> '2.x' as it leads people to think that it is an improvement over 1.x. We
> should have called it something like Nutch-GORA or something along these
> lines (the original version was called NutchBase) to underline that it is a
> different beast, not necessarily a better one.
>

Yeah, I must admit I jumped on the train when it was leaving the station; however, I was not the driver. In hindsight I think you guys did an absolutely sterling job, and I really take my hat off to you, not only for the engineering that you put into pre-2.0 development but for the time and effort that it must have taken.

This is in the past now, however, and since then we've made 4 releases, including one bug fix release; 2.3 is on the horizon. I've recently finished GSoC, and this will be an excellent addition to the 2.3 release if we can get it in there. If not, then that is OK. It is not all bad from my point of view.

>
> Most users are probably not bothered in the underlying technologies so much
> and just want the stuff to work, not fix problems. In my view 2.x is not
> production ready, but an experimental branch.
>

And as a long-time user, developer and current PMC Chair I absolutely respect this opinion. Let's however acknowledge that people have used and continue to use 2.X in a wide variety of ways (as do users of 1.X) because it solves a particular problem for them. When do we make the assertion that the 2.X branch has moved from experimental to production ready?

The following is merely an observation. Take it as you will. From Jira, Nutch 2.X has 112 open issues, Nutch 1.10 has 149. As you've pointed out before, I think there is a lot of work to be done on all the Nutch software that our appetites allow us to work on :)

>
> Not really. The overall performance has improved a bit with the latest
> version of GORA but not that different from what we reported in
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
>

The positive point I take from this is that it has improved.

> Some backends are probably better than others indeed but all of them are
> atrocious compared to 1.x, I think the reason for that is that these NoSQL
> tools are optimized to provide random reads/writes to the data and in Nutch
> we use them mostly in a sequential manner.
> Whether the functionalities we gain are worth the effort depends on
> everyone's use case.
>

+1

> Could not agree more. That and the fact that it puts additional constraints
> on the hardware and means servers with bigger specs (££££)
>

You said it.

>
> Thanks for your enthusiasm and efforts Lewis!
>
> For anyone interested in 2.x - there are quite a few issues you can help
> with if you feel so inclined, see
>
> https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>

Thanks Julien. We all have our own interests, itches and requirements. Mine personally have been advanced (interests, esp.) by being part of the Nutch community. I've learned a lot from the people here.

Lewis

