> Hi Lewis,
>
> My comment was not to be taken too seriously :-)
>
> I think our official position is to support only the Apache distrib of
> Hadoop (correct me if I am wrong), but when we can and if it does not take
> much effort, getting Nutch to work on other distros would be a bonus as it
> would facilitate its adoption. Sounds like the infra issue would be taken
> care of, as for the 'quality' of the crawldb this is not so relevant here.
>
> Thoughts from anyone else?

I agree that we should officially support only Apache's own Hadoop dist, which can be difficult enough between versions. However, contribs for other dists could be shipped along with Nutch releases, although I don't see how different APIs could easily be integrated if that's the case.

>
> Thanks Lewis!
>
> Julien
>
>
> On 10 November 2011 16:13, Lewis John Mcgibbney <[email protected]> wrote:
> > Hi Julien,
> >
> > > you seem to imply that this is not the case but some of us do (or have
> > > done) large crawls :-)
> >
> > Not at all, I unreservedly take my words back if they came across like
> > that, I am well aware of the work some of you guys are doing. Please let
> > me rephrase: when was the last time we said we were able to benchmark
> > Nutch performance, given some environmental factors X, and some Hadoop
> > distribution Y? Although this might not directly benefit the community in
> > terms of improving the Nutch codebase, it might give us an indication of
> > where we can say Nutch works and where it doesn't. Take NUTCH-839 for
> > example.
> >
> > > We'd probably end up bogged down in endless discussions about
> > > parameters tuning, feature comparison etc...
> >
> > Ok I agree with you here, but I think as the purpose of this would not be
> > mission critical for Nutch, it would however be nice to try and set some
> > 'consistent' Nutch deployment on top of many distros of Hadoop and build
> > recursively.
> >
> > > I did not know about bigtop, thanks for the pointer!
> > >
> > > Who would provide the cluster for running the tests? Doing large scale
> > > crawls is not just about setting it up and watching it work : it does
> > > involve a fair amount of monitoring (unless you don't mind having 90%
> > > of your crawlDB filled by porn/junk etc... ). Not sure who would find
> > > the time to do that.
> >
> > OK so the actual testing is hosted by Cloudera on their own Jenkins area.
> > From speaking to the guys here, they mentioned that Apache infrastructure
> > was not quite sufficient to handle the testing environment so they
> > are now working from Cloudera's infrastructure. From my discussions so
> > far, they want as many applications as possible running in a distributed
> > fashion on the platform, as it will also help them to
> > identify where bugs lie in their own code. We all know Nutch is an ideal
> > candidate for this type of job so hopefully we can find some common
> > ground beneficial to both projects. I will post the Cloudera Jenkins URL
> > when I find Roman and get it from him. In terms of maintenance, I really
> > don't know how that would work Julien, but I know that I've got another
> > two days of face-to-face opportunity with these guys so there will be no
> > better time to try and sort this kind of thing out.
> >
> > Thank you

