Hi Lewis,

A few comments below.

> I use Nutch 2.x as it enables me to do analytics over the data I am
> crawling. This is my justification for trying to maintain and further the
> development on that branch over the last while.

Just out of interest, what sort of analytics do you do, and why is it better to do it in 2.x than in 1.x?

> I am also extremely interested in the technologies supported within the
> Nutch 2.X stack and I like keeping up with their development and using them
> to fix my problems if and when the problems arise.
> I like having fine grained control over my storage architecture. This is
> also a pro for me.

Another way to look at it is that having to maintain 2 versions of Nutch is an absolute pain, especially given that there aren't very many active committers. IMHO the mistake we made a few years ago was to name the GORA-based branch '2.x', as it leads people to think that it is an improvement over 1.x. We should have called it Nutch-GORA or something along those lines (the original version was called NutchBase) to underline that it is a different beast, not necessarily a better one. Most users are probably not that bothered about the underlying technologies and just want the stuff to work, not to fix problems. In my view 2.x is not production ready; it is an experimental branch.

> The performance Julien talks about (and please correct me if I am wrong
> Julien) is not so much Nutch related as it is Gora. Different Gora backends
> perform differently, this is itself driven by who wishes to maintain them.

Not really. The overall performance has improved a bit with the latest version of GORA but is not that different from what we reported in http://digitalpebble.blogspot.co.uk/2013/09/nutch-fight-17-vs-221.html.
Some backends are probably better than others indeed, but all of them are atrocious compared to 1.x. I think the reason is that these NoSQL tools are optimized to provide random reads/writes to the data, whereas in Nutch we use them mostly in a sequential manner. Whether the functionalities we gain are worth the effort depends on everyone's use case.

> On another note, we've identified that for users, Nutch 2.X is a bloody
> pain to provision and get running. This is a problem for this branch and
> for the people that invest and possibly waste time trying to determine
> revisions, etc.

Could not agree more. That, and the fact that it puts additional constraints on the hardware and means servers with bigger specs (££££).

> It is my intention to build different Vagrant flavours for each Nutch 2.X
> stack.
> https://issues.apache.org/jira/browse/NUTCH-1812
>
> If ANYONE on this list is interested in helping with this effort then I
> would dedicate some time to document the process on the wiki so that it can
> be reproduced for everyone's benefit. I feel that this would be a huge move
> forward for the 2.X branch.

Thanks for your enthusiasm and efforts Lewis! For anyone interested in 2.x - there are quite a few issues you can help with if you feel so inclined, see
https://issues.apache.org/jira/browse/NUTCH/fixforversion/12324325/?selectedTab=com.atlassian.jira.jira-projects-plugin:version-summary-panel

Julien

-- 
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

