Hi Julien,

On Fri, Aug 29, 2014 at 6:01 AM, <[email protected]> wrote:

>
> Just out of interest, what sort of analytics do you do and why is it better
> to do it in 2.x than 1.x?
>

Nowhere did I say it was better or worse than in 1.X. Let me be clear here.
I use Nutch 2.X, as I indicated, because it provides me with fine-grained
control over my storage layer.
I have for some time, since presenting some of our work on Gora at
Cassandra Summit in 2013, been working with DataStax products and Cassandra
in general. I like the technology and I am interested in it.
http://www.datastax.com/what-we-offer/products-services/datastax-enterprise/apache-hadoop
The scope and nature of analytics I can do once my data is in C* is wide
and varied. Some of this is for my own projects, some of it is more closely
aligned with work and then some of it is for the ongoing work I do with
Gora. To bring this home, there is certainly a wide and varied set of tools
out there to do analytics on data once stored in HDFS or somewhere else by
Nutch 1.X. No one is disputing this so far AFAICT.



>
> Another way to look at it is that having to maintain 2 versions in Nutch is
> an absolute pain, especially given that there aren't very many active
> committers.
>

My immediate answer to this one is to agree with you completely. However, I
suppose it depends on which way you look at this one: User vs. Developer vs.
Committer. Users who already have an HBase cluster may wish to use Nutch
2.X over, say, 1.X; this is an entirely reasonable assumption. This
assumption alone justifies continued development of 2.X by anyone
who is willing to do so.


> IMHO the mistake we made a few years ago was to name the GORA-based branch
> '2.x' as it leads people to think that it is an improvement over 1.x. We
> should have called it something like Nutch-GORA or something along these
> lines (the original version was called NutchBase) to underline that it is a
> different beast, not necessarily a better one.
>

Yeah, I must admit I jumped on the train when it was leaving the station;
however, I was not the driver. In hindsight I think you guys did an
absolutely sterling job and I really take my hat off not only for the
engineering that you put into pre-2.0 development but for the time and
effort that it must have taken. This is in the past now however and since
then we've made 4 releases including one bug fix, and 2.3 is on the horizon.
I've recently finished GSoC and this will be an excellent addition to the
2.3 release if we can get it in there. If not then that is OK. It is not
all bad from my point of view.


>
> Most users are probably not bothered about the underlying technologies so much
> and just want the stuff to work, not fix problems. In my view 2.x is not
> production ready, but an experimental branch.
>

And as a long-time user, developer and current PMC Chair I absolutely
respect this opinion. Let's however acknowledge that people have used and
continue to use 2.X in a wide variety of ways (as do users of 1.X) because it
solves a particular problem for them. When do we make the assertion that
the 2.X branch has moved from experimental to production ready?
The following is merely an observation. Take it as you will.
From Jira, Nutch 2.X has 112 open issues, Nutch 1.10 has 149.
As you've pointed out before, I think there is a lot of work to be done on
all Nutch software that our appetites allow us to work on :)


>
> Not really. The overall performance has improved a bit with the latest
> version of GORA but not that different from what we reported in
> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
>

The positive point I take from this is that it has improved.


> Some backends are probably better than others indeed but all of them are
> atrocious compared to 1.x. I think the reason for that is that these NoSQL
> tools are optimized to provide random reads/writes to the data and in Nutch
> we use them mostly in a sequential manner. Whether the functionalities we
> gain are worth the effort depends on everyone's use case.
>

+1


> Could not agree more. That and the fact that it puts additional constraints
> on the hardware and means servers with bigger specs (££££)
>

You said it.


>
>
>  Thanks for your enthusiasm and efforts Lewis!
>
> For anyone interested in 2.x - there are quite a few issues you can help
> with if you feel so inclined, see
>
> https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>
>
Thanks Julien. We all have our own interests, itches and requirements. Mine
personally have been advanced (interests, esp.) by being part of the Nutch
community. I've learned a lot from the people here.
Lewis
