Hi Guy,

> I'm confused as to what are the significant differences between 1.x and
> 2.x.
> Is there a bit of history that I could read about why the development of
> the two parallel to each other happened?
>

See for instance https://www.youtube.com/watch?v=KyHPBtRlo80 (in particular
around 28:00). There are other resources in
http://wiki.apache.org/nutch/Presentations which explain the differences.

> As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
> path would be best for me to follow. So far, 1.x has appeared to be the
> best choice for me, but is that going to change in the next iteration?
> Confused. And a little scared.
>

Don't worry, Nutch 1.x (i.e. HDFS-based) will definitely stay. As explained
in the discussion with Lewis, naming Nutch-GORA '2.x' was probably a bit
of a mistake. Both flavours of Nutch will keep living parallel existences.

Julien

PS: all this and a lot more will be explained at the Nutch workshop at
ApacheCon EU (http://sched.co/1pbE15n) as well as in Sebastian's talk
(http://sched.co/1nyYa7b).


>
> Guy McDowell
> [email protected]
> http://www.GuyMcDowell.com
>
>
>
>
>
> On Fri, Aug 29, 2014 at 11:29 AM, Mattmann, Chris A (3980) <
> [email protected]> wrote:
>
> > +1, great.
> >
> > I'd like to have a conversation about versioning.
> >
> > Since we're at 1.9, my suggestion would be to have the
> > next in the trunk series (1.x) move to version 3.x post
> > 1.9 for the release.
> >
> > Nutch2 remains Nutch and can be worked on there. That
> > would give us a nice split in the diversionary branch
> > paths for Nutch.
> >
> > Cheers,
> > Chris
> >
> >
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Chris Mattmann, Ph.D.
> > Chief Architect
> > Instrument Software and Science Data Systems Section (398)
> > NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> > Office: 168-519, Mailstop: 168-527
> > Email: [email protected]
> > WWW:  http://sunset.usc.edu/~mattmann/
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> > Adjunct Associate Professor, Computer Science Department
> > University of Southern California, Los Angeles, CA 90089 USA
> > ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >
> >
> >
> >
> >
> >
> > -----Original Message-----
> > From: Julien Nioche <[email protected]>
> > Reply-To: "[email protected]" <[email protected]>
> > Date: Friday, August 29, 2014 1:35 AM
> > To: "[email protected]" <[email protected]>
> > Subject: Re: [RELEASE] Apache Nutch 1.9
> >
> > >Hi Lewis,
> > >
> > >A few comments below.
> > >
> > >> I use Nutch 2.x as it enables me to do analytics over the data I am
> > >> crawling. This is my justification for trying to maintain and further
> > >> the development on that branch over the last while.
> > >>
> > >
> > >Just out of interest, what sort of analytics do you do and why is it
> > >better to do it in 2.x than 1.x?
> > >
> > >
> > >> I am also extremely interested in the technologies supported within the
> > >> Nutch 2.X stack and I like keeping up with their development and using
> > >> them to fix my problems if and when the problems arise.
> > >> I like having fine-grained control over my storage architecture. This is
> > >> also a pro for me.
> > >>
> > >
> > >Another way to look at it is that having to maintain 2 versions in Nutch
> > >is an absolute pain, especially given that there aren't very many active
> > >committers.
> > >IMHO the mistake we made a few years ago was to name the GORA-based branch
> > >'2.x' as it leads people to think that it is an improvement over 1.x. We
> > >should have called it something like Nutch-GORA or something along these
> > >lines (the original version was called NutchBase) to underline that it is
> > >a different beast, not necessarily a better one.
> > >
> > >Most users are probably not bothered about the underlying technologies so
> > >much and just want the stuff to work, not fix problems. In my view 2.x is
> > >not production ready, but an experimental branch.
> > >
> > >
> > >
> > >> The performance Julien talks about (and please correct me if I am wrong
> > >> Julien) is not so much Nutch related as it is Gora. Different Gora
> > >> backends perform differently, this is itself driven by who wishes to
> > >> maintain them.
> > >>
> > >
> > >Not really. The overall performance has improved a bit with the latest
> > >version of GORA but is not that different from what we reported in
> > >http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
> > >Some backends are probably better than others indeed but all of them are
> > >atrocious compared to 1.x. I think the reason for that is that these NoSQL
> > >tools are optimized to provide random reads/writes to the data and in Nutch
> > >we use them mostly in a sequential manner. Whether the functionalities we
> > >gain are worth the effort depends on everyone's use case.
> > >
> > >
> > >> On another note, we've identified that for users, Nutch 2.X is a bloody
> > >> pain to provision and get running. This is a problem for this branch and
> > >> for the people that invest and possibly waste time trying to determine
> > >> revisions, etc.
> > >>
> > >
> > >Could not agree more. That and the fact that it puts additional
> > >constraints on the hardware and means servers with bigger specs (££££)
> > >
> > >
> > >>
> > >> It is my intention to build different Vagrant flavours for each Nutch
> > >> 2.X stack.
> > >> https://issues.apache.org/jira/browse/NUTCH-1812
> > >>
> > >> If ANYONE on this list is interested in helping with this effort then I
> > >> would dedicate some time to document the process on the wiki so that it
> > >> can be reproduced for everyone's benefit. I feel that this would be a huge
> > >> move forward for the 2.X branch.
> > >>
> > >
> > > Thanks for your enthusiasm and efforts Lewis!
> > >
> > >For anyone interested in 2.x - there are quite a few issues you can help
> > >with if you feel so inclined, see
> > >
> > >https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> > >
> > >Julien
> > >
> > >--
> > >
> > >Open Source Solutions for Text Engineering
> > >
> > >http://digitalpebble.blogspot.com/
> > >http://www.digitalpebble.com
> > >http://twitter.com/digitalpebble
> >
> >
>



-- 

Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
