Hi Lewis,

My comment was not to be taken too seriously :-)

I think our official position is to support only the Apache distribution of
Hadoop (correct me if I am wrong), but when we can, and if it does not take
much effort, getting Nutch to work on other distros would be a bonus, as it
would facilitate its adoption. It sounds like the infra issue would be taken
care of. As for the 'quality' of the crawldb, that is not so relevant here.

Thoughts from anyone else?

Thanks Lewis!

Julien


On 10 November 2011 16:13, Lewis John Mcgibbney
<[email protected]> wrote:

> Hi Julien,
>
> >
> >
> > you seem to imply that this is not the case but some of us do (or have
> > done) large crawls :-)
> >
>
> Not at all, I unreservedly take my words back if they came across like
> that; I am well aware of the work some of you guys are doing. Please let me
> rephrase: when was the last time we said we were able to benchmark Nutch
> performance, given some environmental factors X and some Hadoop
> distribution Y? Although this might not directly benefit the community in
> terms of improving the Nutch codebase, it might give us an indication of
> where we can say Nutch works and where it doesn't. Take NUTCH-839 for example.
>
> >
> >
> >
> > We'd probably end up bogged down in endless discussions about parameters
> > tuning, feature comparison etc...
> >
>
> OK, I agree with you here. Although this would not be mission critical for
> Nutch, I think it would be nice to try to set up some 'consistent' Nutch
> deployment on top of many distros of Hadoop and build recursively.
>
>
>
> > I did not know about bigtop, thanks for the pointer!
> >
> > Who would provide the cluster for running the tests? Doing large scale
> > crawls is not just about setting it up and watching it work: it does
> > involve a fair amount of monitoring (unless you don't mind having 90% of
> > your crawlDB filled by porn/junk etc... ). Not sure who would find the
> > time to do that.
> >
>
> OK, so the actual testing is hosted by Cloudera on their own Jenkins area.
> From speaking to the guys here, they mentioned that the Apache infrastructure
> was not quite sufficient to handle the testing environment, so they
> are now working from Cloudera's infrastructure. From my discussions so far,
> they want as many applications running in a distributed fashion on
> the platform as possible, as it will also help them to identify where bugs
> lie in their own code. We all know Nutch is an ideal candidate for this
> type of job, so hopefully we can find some common ground beneficial to both
> projects. I will post the Cloudera Jenkins URL when I find Roman and get it
> from him. In terms of maintenance, I really don't know how that would work,
> Julien, but I know that I've got another two days of face-to-face
> opportunity with these guys, so there will be no better time to try and sort
> this kind of thing out.
>
> Thank you
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
