I think it is a great idea to build images with the environment correctly
set up. Two types of images would be helpful.

1. Development (VirtualBox)
Here we would have Eclipse, the plugins, pseudo-distributed Hadoop, etc.
correctly installed, maybe on an Ubuntu box with 3D acceleration enabled.
People can then download that image and start it in VirtualBox if they want
to develop and debug the code, similar to what Tejas has done in his
RunNutchInEclipse page
<http://wiki.apache.org/nutch/RunNutchInEclipse>.

2. Application (Docker)
If people just want to use Nutch 2.x and don't care what happens behind the
scenes, Docker containers are a good way to distribute it, and you don't
need VirtualBox installed.
For example, the Selenium grid Docker containers
<https://github.com/momer/nutch-selenium-grid-plugin> contributed by momer
are super handy to use and saved me tons of time when integrating Nutch with
Selenium.

I personally have a VirtualBox image installed on my gaming desktop :) and I
have some experience with Docker, which I am willing to spend more time
on.

Bin


On Fri, Aug 29, 2014 at 8:29 AM, Mattmann, Chris A (3980) <
[email protected]> wrote:

> +1, great.
>
> I'd like to have a conversation about versioning.
>
> Since we're at 1.9, my suggestion would be to have the
> next in the trunk series (1.x) move to version 3.x post
> 1.9 for the release.
>
> Nutch2 remains Nutch and can be worked on there. That
> would give us a nice split in the diversionary branch
> paths for Nutch.
>
> Cheers,
> Chris
>
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Chief Architect
> Instrument Software and Science Data Systems Section (398)
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 168-519, Mailstop: 168-527
> Email: [email protected]
> WWW:  http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Associate Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>
>
>
>
>
>
> -----Original Message-----
> From: Julien Nioche <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Friday, August 29, 2014 1:35 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: [RELEASE] Apache Nutch 1.9
>
> >Hi Lewis,
> >
> >A few comments below.
> >
> >> I use Nutch 2.x as it enables me to do analytics over the data I am
> >> crawling. This is my justification for trying to maintain and further
> >> the development on that branch over the last while.
> >>
> >
> >Just out of interest, what sort of analytics do you do and why is it
> >better to do it in 2.x than 1.x?
> >
> >
> >> I am also extremely interested in the technologies supported within the
> >> Nutch 2.X stack and I like keeping up with their development and using
> >> them to fix my problems if and when the problems arise.
> >> I like having fine grained control over my storage architecture. This is
> >> also a pro for me.
> >>
> >
> >Another way to look at it is that having to maintain 2 versions in Nutch
> >is an absolute pain, especially given that there aren't very many active
> >committers.
> >IMHO the mistake we made a few years ago was to name the GORA-based branch
> >'2.x' as it leads people to think that it is an improvement over 1.x. We
> >should have called it something like Nutch-GORA or something along these
> >lines (the original version was called NutchBase) to underline that it is
> >a different beast, not necessarily a better one.
> >
> >Most users are probably not bothered about the underlying technologies so
> >much and just want the stuff to work, not fix problems. In my view 2.x is
> >not production ready, but an experimental branch.
> >
> >
> >
> >> The performance Julien talks about (and please correct me if I am wrong
> >> Julien) is not so much Nutch related as it is Gora. Different Gora
> >> backends perform differently, this is itself driven by who wishes to
> >> maintain them.
> >>
> >
> >Not really. The overall performance has improved a bit with the latest
> >version of GORA but is not that different from what we reported in
> >http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
> >Some backends are probably better than others indeed, but all of them are
> >atrocious compared to 1.x. I think the reason is that these NoSQL tools
> >are optimized to provide random reads/writes to the data, whereas in Nutch
> >we use them mostly in a sequential manner. Whether the functionalities we
> >gain are worth the effort depends on everyone's use case.
> >
> >
> >> On another note, we've identified that for users, Nutch 2.X is a bloody
> >> pain to provision and get running. This is a problem for this branch and
> >> for the people that invest and possibly waste time trying to determine
> >> revisions, etc.
> >>
> >
> >Could not agree more. That and the fact that it puts additional
> >constraints on the hardware and means servers with bigger specs (££££)
> >
> >
> >>
> >> It is my intention to build different Vagrant flavours for each Nutch
> >> 2.X stack.
> >> https://issues.apache.org/jira/browse/NUTCH-1812
> >>
> >> If ANYONE on this list is interested in helping with this effort then I
> >> would dedicate some time to document the process on the wiki so that it
> >> can be reproduced for everyone's benefit. I feel that this would be a
> >> huge move forward for the 2.X branch.
> >>
> >
> > Thanks for your enthusiasm and efforts Lewis!
> >
> >For anyone interested in 2.x - there are quite a few issues you can help
> >with if you feel so inclined, see
> >
> >https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel
> >
> >Julien
> >
> >--
> >
> >Open Source Solutions for Text Engineering
> >
> >http://digitalpebble.blogspot.com/
> >http://www.digitalpebble.com
> >http://twitter.com/digitalpebble
>
>
