Re: Problems with running Nutch on different Hadoop distro's

Markus Jelsma Thu, 10 Nov 2011 12:03:07 -0800

> Hi Lewis,
> 
> My comment was not to be taken too seriously :-)
> 
> I think our official position is to support only the Apache distrib of
> Hadoop (correct me if I am wrong), but when we can and if it does not take
> much effort, getting Nutch to work on other distros would be a bonus as it
> would facilitate its adoption. Sounds like the infra issue would be taken
> care of; as for the 'quality' of the crawldb, this is not so relevant here.
> 
> Thoughts from anyone else?


I would agree to officially support only Apache's own Hadoop dist, which can 
be difficult enough between versions. However, contribs for other dists could 
be shipped along with Nutch releases, although I don't see how different APIs 
could easily be integrated if that's the case.

> 
> Thanks Lewis!
> 
> Julien
> 
> 
> On 10 November 2011 16:13, Lewis John Mcgibbney <[email protected]> wrote:
> > Hi Julien,
> > 
> > > you seem to imply that this is not the case but some of us do (or have
> > > done) large crawls :-)
> > 
> > Not at all, I unreservedly take my words back if they came across like
> > that, I am well aware of the work some of you guys are doing. Please let
> > me rephrase, when was the last time we said we were able to benchmark
> > > Nutch performance, given some environmental factors X, and some Hadoop
> > distribution Y? Although this might not directly benefit the community in
> > terms of improving Nutch codebase, it might give us an indication of
> > where we can say Nutch works and where it doesn't. Take NUTCH-839 for
> > example.
> > 
> > > We'd probably end up bogged down in endless discussions about
> > > parameters tuning, feature comparison etc...
> > 
> > Ok I agree with you here, but I think as the purpose of this would not be
> > mission critical for Nutch, it would however be nice to try and set some
> > > 'consistent' Nutch deployment on top of many distros of Hadoop and build
> > recursively.
> > 
> > > I did not know about bigtop, thanks for the pointer!
> > > 
> > > Who would provide the cluster for running the tests? Doing large scale
> > > crawls is not just about setting it up and watching it work : it does
> > > involve a fair amount of monitoring (unless you don't mind having 90%
> > > of your crawlDB filled by porn/junk etc... ). Not sure who would find
> > > > the time to do that.
> > 
> > > OK so the actual testing is hosted by Cloudera on their own Jenkins area.
> > > From speaking to the guys here, they mentioned that Apache infrastructure
> > > was not quite sufficient to handle the testing environment so they
> > > are now working from Cloudera's infrastructure. From my discussions so
> > > far, they want as many applications running in a distributed
> > > fashion on the platform as possible as it will also help them to
> > > identify where bugs lie in their own code. We all know Nutch is an ideal
> > > candidate for this type of job so hopefully we can find some common
> > > ground beneficial to both projects. I will post the Cloudera Jenkins URL
> > > when I find Roman and get it from him. In terms of maintenance, I really
> > > don't know how that would work Julien, but I know that I've got another
> > > two days of face-to-face opportunity with these guys so there will be no
> > > better time to try and sort this kind of thing out.
> > 
> > Thank you
