Hi Julien,

> you seem to imply that this is not the case but some of us do (or have
> done) large crawls :-)

Not at all. I unreservedly take my words back if they came across like that; I am well aware of the work some of you guys are doing. Let me rephrase: when was the last time we were able to say we had benchmarked Nutch performance, given some environmental factors X and some Hadoop distribution Y? Although this might not directly benefit the community in terms of improving the Nutch codebase, it might give us an indication of where we can say Nutch works and where it doesn't. Take NUTCH-839 for example.

> We'd probably end up bogged down in endless discussions about parameters
> tuning, feature comparison etc...

OK, I agree with you here, but since this would not be mission-critical for Nutch, I think it would still be nice to try to set up some 'consistent' Nutch deployment on top of many distros of Hadoop and build recursively.

> I did not know about bigtop, thanks for the pointer!
>
> Who would provide the cluster for running the tests? Doing large scale
> crawls is not just about setting it up and watching it work : it does
> involve a fair amount of monitoring (unless you don't mind having 90% of
> your crawlDB filled by porn/junk etc... ). Not sure who would find the time
> to do that.

OK, so the actual testing is hosted by Cloudera on their own Jenkins area. From speaking to the guys here, they mentioned that the Apache infrastructure was not quite sufficient to handle the testing environment, so they are now working from Cloudera's infrastructure. From my discussions so far, they want as many applications as possible running in a distributed fashion on the platform, as it will also help them identify where bugs lie in their own code. We all know Nutch is an ideal candidate for this type of job, so hopefully we can find some common ground beneficial to both projects. I will post the Cloudera Jenkins URL when I find Roman and get it from him.

In terms of maintenance, I really don't know how that would work, Julien, but I know that I've got another two days of face-to-face opportunity with these guys, so there will be no better time to try and sort this kind of thing out.

Thank you

