Hi Lewis

> I've been talking to the guys @ Bigtop about taking the Nutch testing
> ecosystem to a new level. This would involve testing Nutch at what it was
> originally set up to do... large scale web crawling.

You seem to imply that this is not the case, but some of us do (or have done) large crawls :-)

> Testing in an
> integrated testing environment would enable us to continuously test Nutch
> on top of all supported versions of Hadoop as Bigtop is designed to test the
> packaging and interoperability testing of Hadoop-related projects,
> therefore the testing would be done in a highly distributed environment,
> therefore enabling us to determine which versions of Hadoop work best with
> and which ones maybe don't.

We'd probably end up bogged down in endless discussions about parameter tuning, feature comparison etc...

> At the moment, it would be nice to know, as a community, which versions of
> Hadoop we are all running Nutch on. Are there any preferences,
>
> [1] http://incubator.apache.org/bigtop/

I did not know about Bigtop, thanks for the pointer! Who would provide the cluster for running the tests? Doing large scale crawls is not just about setting it up and watching it work: it does involve a fair amount of monitoring (unless you don't mind having 90% of your crawlDB filled by porn/junk etc...). Not sure who would find the time to do that.

Thanks

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com

