All, terribly sorry for MY late replies too! I have a Docker container set up for Nutch 2.2.1; if anyone is interested, I can make an open one available. It's configured with Maestro NG so that you can start it up knowing where your Solr instance is (i.e. just by passing an env variable).
I'll see if I can carve out some time to help on the issues, but I'm pretty swamped at the moment with work, meetup groups, etc. Let me know if I can help out in any way that's not super time-critical.

Mo

This message was drafted on a tiny touch screen; please forgive brevity & tpyos

> On Sep 1, 2014, at 4:11 AM, Julien Nioche <[email protected]> wrote:
>
> Hi Guy,
>
>> I'm confused as to what are the significant differences between 1.x and 2.x.
>> Is there a bit of history that I could read about why the development of
>> the two parallel to each other happened?
>
> See for instance https://www.youtube.com/watch?v=KyHPBtRlo80 (in particular
> around 28:00). There are other resources in
> http://wiki.apache.org/nutch/Presentations which explain the differences.
>
>> As I'm just starting out with Nutch/Solr/Hadoop, I'd like to know which
>> path would be best for me to follow. So far, 1.x has appeared to be the
>> best choice for me, but is that going to change in the next iteration?
>> Confused. And a little scared.
>
> Don't worry, Nutch 1.x (i.e. HDFS-based) will definitely stay. As explained
> in the discussion with Lewis, naming Nutch-GORA as '2.x' was probably a bit
> of a mistake. Both flavours of Nutch will keep living parallel existences.
>
> Julien
>
> PS: all this and a lot more will be explained at the Nutch workshop at
> ApacheCon EU http://sched.co/1pbE15n
> <http://wiki.apache.org/nutch/Presentations> as well as Sebastian's talk
> http://sched.co/1nyYa7b
>
>> Guy McDowell
>> [email protected]
>> http://www.GuyMcDowell.com
>>
>> On Fri, Aug 29, 2014 at 11:29 AM, Mattmann, Chris A (3980) <[email protected]> wrote:
>>
>>> +1, great.
>>>
>>> I'd like to have a conversation about versioning.
>>>
>>> Since we're at 1.9, my suggestion would be to have the
>>> next in the trunk series (1.x) move to version 3.x post
>>> 1.9 for the release.
>>>
>>> Nutch2 remains Nutch and can be worked on there.
>>> That would give us a nice split in the diversionary branch
>>> paths for Nutch.
>>>
>>> Cheers,
>>> Chris
>>>
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Chris Mattmann, Ph.D.
>>> Chief Architect
>>> Instrument Software and Science Data Systems Section (398)
>>> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
>>> Office: 168-519, Mailstop: 168-527
>>> Email: [email protected]
>>> WWW: http://sunset.usc.edu/~mattmann/
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> Adjunct Associate Professor, Computer Science Department
>>> University of Southern California, Los Angeles, CA 90089 USA
>>> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>
>>> -----Original Message-----
>>> From: Julien Nioche <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, August 29, 2014 1:35 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: [RELEASE] Apache Nutch 1.9
>>>
>>>> Hi Lewis,
>>>>
>>>> A few comments below.
>>>>
>>>>> I use Nutch 2.x as it enables me to do analytics over the data I am
>>>>> crawling. This is my justification for trying to maintain and further the
>>>>> development on that branch over the last while.
>>>>
>>>> Just out of interest, what sort of analytics do you do and why is it better
>>>> to do it in 2.x than 1.x?
>>>>
>>>>> I am also extremely interested in the technologies supported within the
>>>>> Nutch 2.X stack and I like keeping up with their development and using them
>>>>> to fix my problems if and when the problems arise.
>>>>> I like having fine-grained control over my storage architecture. This is
>>>>> also a pro for me.
>>>>
>>>> Another way to look at it is that having to maintain 2 versions in Nutch is
>>>> an absolute pain, especially given that there aren't very many active
>>>> committers.
>>>>
>>>> IMHO the mistake we made a few years ago was to name the GORA-based branch
>>>> '2.x' as it leads people to think that it is an improvement over 1.x. We
>>>> should have called it something like Nutch-GORA or something along these
>>>> lines (the original version was called NutchBase) to underline that it is a
>>>> different beast, not necessarily a better one.
>>>>
>>>> Most users are probably not bothered about the underlying technologies so much
>>>> and just want the stuff to work, not fix problems. In my view 2.x is not
>>>> production ready, but an experimental branch.
>>>>
>>>>> The performance Julien talks about (and please correct me if I am wrong
>>>>> Julien) is not so much Nutch related as it is Gora. Different Gora backends
>>>>> perform differently, this is itself driven by who wishes to maintain them.
>>>>
>>>> Not really. The overall performance has improved a bit with the latest
>>>> version of GORA but is not that different from what we reported in
>>>> http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
>>>> Some backends are probably better than others indeed but all of them are
>>>> atrocious compared to 1.x. I think the reason for that is that these NoSQL
>>>> tools are optimized to provide random reads/writes to the data and in Nutch
>>>> we use them mostly in a sequential manner. Whether the functionalities we
>>>> gain are worth the effort depends on everyone's use case.
>>>>
>>>>> On another note, we've identified that for users, Nutch 2.X is a bloody
>>>>> pain to provision and get running. This is a problem for this branch and
>>>>> for the people that invest and possibly waste time trying to determine
>>>>> revisions, etc.
>>>>
>>>> Could not agree more.
>>>> That and the fact that it puts additional constraints
>>>> on the hardware and means servers with bigger specs (££££)
>>>>
>>>>> It is my intention to build different Vagrant flavours for each Nutch 2.X
>>>>> stack.
>>>>> https://issues.apache.org/jira/browse/NUTCH-1812
>>>>>
>>>>> If ANYONE on this list is interested in helping with this effort then I
>>>>> would dedicate some time to document the process on the wiki so that it can
>>>>> be reproduced for everyone's benefit. I feel that this would be a huge move
>>>>> forward for the 2.X branch.
>>>>
>>>> Thanks for your enthusiasm and efforts Lewis!
>>>>
>>>> For anyone interested in 2.x - there are quite a few issues you can help
>>>> with if you feel so inclined, see
>>>> https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
>>>>
>>>> Julien
>>>>
>>>> --
>>>>
>>>> Open Source Solutions for Text Engineering
>>>>
>>>> http://digitalpebble.blogspot.com/
>>>> http://www.digitalpebble.com
>>>> http://twitter.com/digitalpebble
>
> --
>
> Open Source Solutions for Text Engineering
>
> http://digitalpebble.blogspot.com/
> http://www.digitalpebble.com
> http://twitter.com/digitalpebble

