So the Jenkins area for Bigtop is here: http://bigtop01.cloudera.org:8080
I'll try to push through with this, so here's hoping :0)

Thanks

On Thu, Nov 10, 2011 at 12:02 PM, Markus Jelsma <[email protected]> wrote:

> > Hi Lewis,
> >
> > My comment was not to be taken too seriously :-)
> >
> > I think our official position is to support only the Apache distrib of
> > Hadoop (correct me if I am wrong), but when we can and if it does not
> > take much effort, getting Nutch to work on other distros would be a
> > bonus as it would facilitate its adoption. Sounds like the infra issue
> > would be taken care of; as for the 'quality' of the crawldb, this is
> > not so relevant here.
> >
> > Thoughts from anyone else?
>
> I would agree only to officially support Apache's own Hadoop dist, which
> can be difficult enough between versions. However, contribs for other
> dists could be shipped along with Nutch releases, although I don't see
> how different APIs could easily be integrated if that's the case.
>
> > Thanks Lewis!
> >
> > Julien
> >
> > On 10 November 2011 16:13, Lewis John Mcgibbney
> > <[email protected]> wrote:
> > > Hi Julien,
> > >
> > > > you seem to imply that this is not the case but some of us do (or
> > > > have done) large crawls :-)
> > >
> > > Not at all, I unreservedly take my words back if they came across
> > > like that; I am well aware of the work some of you guys are doing.
> > > Please let me rephrase: when was the last time we said we were able
> > > to benchmark Nutch performance, given some environmental factors X
> > > and some Hadoop distribution Y? Although this might not directly
> > > benefit the community in terms of improving the Nutch codebase, it
> > > might give us an indication of where we can say Nutch works and
> > > where it doesn't. Take NUTCH-839 for example.
> > >
> > > > We'd probably end up bogged down in endless discussions about
> > > > parameter tuning, feature comparison etc...
> > >
> > > OK, I agree with you here, but I think, as the purpose of this would
> > > not be mission critical for Nutch, it would however be nice to try
> > > and set some 'consistent' Nutch deployment on top of many distros of
> > > Hadoop and build recursively.
> > >
> > > > I did not know about bigtop, thanks for the pointer!
> > > >
> > > > Who would provide the cluster for running the tests? Doing large
> > > > scale crawls is not just about setting it up and watching it work:
> > > > it does involve a fair amount of monitoring (unless you don't mind
> > > > having 90% of your crawlDB filled by porn/junk etc...). Not sure
> > > > who would find the time to do that.
> > >
> > > OK, so the actual testing is hosted by Cloudera on their own Jenkins
> > > area. From speaking to the guys here, they mentioned that Apache
> > > infrastructure was not quite sufficient to handle the testing
> > > environment, so they are now working from Cloudera's infrastructure.
> > > From my discussions so far, they want as many applications as
> > > possible running in a distributed fashion on the platform, as it
> > > will also help them to identify where bugs lie in their own code. We
> > > all know Nutch is an ideal candidate for this type of job, so
> > > hopefully we can find some common ground beneficial to both
> > > projects. I will post the Cloudera Jenkins URL when I find Roman and
> > > get it from him. In terms of maintenance, I really don't know how
> > > that would work, Julien, but I know that I've got another two days
> > > of face-to-face opportunity with these guys, so there will be no
> > > better time to try and sort this kind of thing out.
> > >
> > > Thank you

--
*Lewis*

