I think it is a great idea to build images with the environment correctly set up. Two types of images would be helpful:
1. Development (VirtualBox)
Here we have Eclipse, plugins, pseudo-distributed Hadoop, etc. correctly installed, maybe on an Ubuntu box with 3D acceleration enabled. People can then download that image and start VirtualBox if they want to develop and debug the code, like what Tejas has done in his RunNutchInEclipse guide <http://wiki.apache.org/nutch/RunNutchInEclipse>.

2. Application (Docker)
If people just want to use Nutch 2.X and don't care what is behind the scenes, Docker containers are a good way to distribute it, and you don't need VirtualBox installed. For example, the Selenium Grid Docker containers <https://github.com/momer/nutch-selenium-grid-plugin> contributed by momer are super handy and saved me tons of time when integrating Nutch with Selenium.

I personally have a VirtualBox image installed on my gaming desktop :) and I have some experience with Docker, which I am willing to spend more time on.

Bin

On Fri, Aug 29, 2014 at 8:29 AM, Mattmann, Chris A (3980) <[email protected]> wrote:

> +1, great.
>
> I'd like to have a conversation about versioning.
>
> Since we're at 1.9, my suggestion would be to have the
> next in the trunk series (1.x) move to version 3.x post
> 1.9 for the release.
>
> Nutch2 remains Nutch and can be worked on there. That
> would give us a nice split in the diversionary branch
> paths for Nutch.
>
> Cheers,
> Chris
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
> -----Original Message-----
> From: Julien Nioche <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, August 29, 2014 1:35 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: [RELEASE] Apache Nutch 1.9
>
> >Hi Lewis,
> >
> >A few comments below.
> >
> >> I use Nutch 2.x as it enables me to do analytics over the data I am
> >> crawling. This is my justification for trying to maintain and further the
> >> development on that branch over the last while.
> >>
> >
> >Just out of interest, what sort of analytics do you do and why is it better
> >to do it in 2.x than 1.x?
> >
> >> I am also extremely interested in the technologies supported within the
> >> Nutch 2.X stack and I like keeping up with their development and using them
> >> to fix my problems if and when the problems arise.
> >> I like having fine grained control over my storage architecture. This is
> >> also a pro for me.
> >>
> >
> >Another way to look at it is that having to maintain 2 versions in Nutch is
> >an absolute pain, especially given that there aren't very many active
> >committers.
> >IMHO the mistake we made a few years ago was to name the GORA-based branch
> >'2.x' as it leads people to think that it is an improvement over 1.x.
> >We should have called it something like Nutch-GORA or something along these
> >lines (the original version was called NutchBase) to underline that it is a
> >different beast, not necessarily a better one.
> >
> >Most users are probably not bothered about the underlying technologies so much
> >and just want the stuff to work, not fix problems. In my view 2.x is not
> >production ready, but an experimental branch.
> >
> >> The performance Julien talks about (and please correct me if I am wrong
> >> Julien) is not so much Nutch related as it is Gora. Different Gora backends
> >> perform differently, this is itself driven by who wishes to maintain them.
> >>
> >
> >Not really. The overall performance has improved a bit with the latest
> >version of GORA but not that different from what we reported in
> >http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
> >Some backends are probably better than others indeed but all of them are
> >atrocious compared to 1.x. I think the reason for that is that these NoSQL
> >tools are optimized to provide random reads/writes to the data and in Nutch
> >we use them mostly in a sequential manner. Whether the functionalities we
> >gain are worth the effort depends on everyone's use case.
> >
> >> On another note, we've identified that for users, Nutch 2.X is a bloody
> >> pain to provision and get running. This is a problem for this branch and
> >> for the people that invest and possibly waste time trying to determine
> >> revisions, etc.
> >>
> >
> >Could not agree more. That and the fact that it puts additional constraints
> >on the hardware and means servers with bigger specs (££££)
> >
> >> It is my intention to build different Vagrant flavours for each Nutch 2.X
> >> stack.
> >> https://issues.apache.org/jira/browse/NUTCH-1812
> >>
> >> If ANYONE on this list is interested in helping with this effort then I
> >> would dedicate some time to document the process on the wiki so that it can
> >> be reproduced for everyone's benefit. I feel that this would be a huge move
> >> forward for the 2.X branch.
> >>
> >
> >Thanks for your enthusiasm and efforts Lewis!
> >
> >For anyone interested in 2.x - there are quite a few issues you can help
> >with if you feel so inclined, see
> >https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> >
> >Julien
> >
> >--
> >
> >Open Source Solutions for Text Engineering
> >
> >http://digitalpebble.blogspot.com/
> >http://www.digitalpebble.com
> >http://twitter.com/digitalpebble

