Re: merge mapred to trunk

2005-09-15 Thread Doug Cutting
I will postpone the merge of the mapred branch into trunk until I have a 
chance to (a) add some MapReduce documentation; and (b) implement 
MapReduce-based dedup.


Doug

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?
>
> Doug


Re: merge mapred to trunk

2005-08-31 Thread Kelvin Tan


On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:
>[EMAIL PROTECTED] wrote:
>> I, too, am looking forward to this, but I am wondering what that
>> will do to Kelvin Tan's recent contribution, especially since I
>> saw that both MapReduce and Kelvin's code change how
>> FetchListEntry works.  If merging mapred to trunk means losing
>> Kelvin's changes, then I suggest one of Nutch developers
>> evaluates Kelvin's modifications and, if they are good, commits
>> them to trunk, and then makes the final pre-mapred release (e.g.
>> release-0.8).
>>
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.
>
> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-
> based code diverges even further.  It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.
>
> Doug

Agreed. I have some time in the coming weeks, and will work full-time to evolve
the patch to be more compatible with Nutch, especially map-red.

k



Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
--- Doug Cutting <[EMAIL PROTECTED]> wrote:

> [EMAIL PROTECTED] wrote:
> > I, too, am looking forward to this, but I am wondering what that
> > will do to Kelvin Tan's recent contribution, especially since I saw
> > that both MapReduce and Kelvin's code change how FetchListEntry
> > works.  If merging mapred to trunk means losing Kelvin's changes,
> > then I suggest one of Nutch developers evaluates Kelvin's
> > modifications and, if they are good, commits them to trunk, and then
> > makes the final pre-mapred release (e.g. release-0.8).
> 
> It won't lose Kelvin's patch: it will still be a patch to 0.7.

Ah, right, we could always make a 0.7.* release from release 0.7.

> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the
> mapred-based code diverges even further.  It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.

I agree.  I'll actually see Kelvin in person tomorrow, so we'll see if
this is something he can do.  It looks like he added some much-needed
functionality in his patch, so it'd be good to keep it.

Otis

. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.


Re: merge mapred to trunk

2005-08-31 Thread Doug Cutting

[EMAIL PROTECTED] wrote:

> I, too, am looking forward to this, but I am wondering what that will
> do to Kelvin Tan's recent contribution, especially since I saw that
> both MapReduce and Kelvin's code change how FetchListEntry works.  If
> merging mapred to trunk means losing Kelvin's changes, then I suggest
> one of Nutch developers evaluates Kelvin's modifications and, if they
> are good, commits them to trunk, and then makes the final pre-mapred
> release (e.g. release-0.8).


It won't lose Kelvin's patch: it will still be a patch to 0.7.

What I worry about is the alternate scenario: that Kelvin & others 
invest a lot of effort making this work with 0.7, while the mapred-based 
code diverges even further.  It would be best if Kelvin's patch is 
ported to the mapred branch sooner rather than later, then maintained there.


Doug


Re: merge mapred to trunk

2005-08-31 Thread ogjunk-nutch
> Currently we have three versions of nutch: trunk, 0.7 and mapred.
> This increases the chances for conflicts.  I would thus like to merge
> the mapred branch into trunk soon.  The soonest I could actually start
> this is next week.  Are there any objections?

I, too, am looking forward to this, but I am wondering what that will
do to Kelvin Tan's recent contribution, especially since I saw that
both MapReduce and Kelvin's code change how FetchListEntry works.  If
merging mapred to trunk means losing Kelvin's changes, then I suggest
one of Nutch developers evaluates Kelvin's modifications and, if they
are good, commits them to trunk, and then makes the final pre-mapred
release (e.g. release-0.8).

Otis



Re: merge mapred to trunk

2005-08-31 Thread Andrzej Bialecki

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?


++1 :-)


--
Best regards,
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Re: merge mapred to trunk

2005-08-31 Thread Doug Cutting

Jérôme Charron wrote:

> I haven't taken a look at the mapred branch yet.
> It is going to be a good surprise to discover it in the trunk... ;-)


I will make some effort to document things more before I merge to trunk, 
so that folks know what they're getting.  Many things have changed 
(e.g., segment format).  Several things have not yet been fully worked 
out and/or implemented (e.g., segment merging).  But the basics are all 
working (intranet & whole-web crawling, indexing & search), both in
standalone and distributed configurations.  My focus has been stress 
testing the distributed infrastructure (NDFS & MapReduce).  We've 
discovered and fixed a number of bugs in this over recent weeks, so it 
is getting ever more stable.  I'm hoping that others can help fill in 
the gaps in tools.


Once the merge is done I'd like to make a few other changes.

These are:

  1. Remove most static references to NutchConf outside of main() 
routines.  The MapReduce-based versions of the command line tools have 
no such references.  The biggest change here will be to plugins. 
Plugin APIs should probably all be modified to use a factory, and the
factory should be constructed from a NutchConf, e.g., something like:

  public static PluginXFactory PluginXFactory.getFactory(NutchConf);
  public PluginX PluginXFactory.getPlugin(...);

This should permit folks to more easily configure things programmatically
(think JMX) and to run multiple configurations in a single JVM.


  2. FetchListEntry has been mostly replaced with a new, simpler
data structure called a CrawlDatum.  FetchListEntry is used in the
IndexingFilter API to pass the url, fetch date and incoming anchors. 
Currently, in the mapred branch, the indexer creates a dummy 
FetchListEntry to pass to plugins.  But instead the IndexingFilter API 
should probably be altered to pass the CrawlDatum, anchors and url.
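
To make the factory idea in item 1 concrete, here is a minimal sketch.
It is not code from the mapred branch: Conf is a hypothetical stand-in
for NutchConf, and PluginX, getPlugin(String) and the "pluginx.mode"
property are names invented purely for illustration.  The point is only
that the factory holds its own configuration instance instead of reading
a static one.

```java
import java.util.HashMap;
import java.util.Map;

public class PluginFactorySketch {

  /** Hypothetical stand-in for NutchConf: a plain bag of string properties. */
  static class Conf {
    private final Map<String, String> props = new HashMap<>();

    Conf set(String key, String value) {
      props.put(key, value);
      return this;
    }

    String get(String key, String defaultValue) {
      return props.getOrDefault(key, defaultValue);
    }
  }

  /** Hypothetical plugin interface, standing in for "PluginX". */
  interface PluginX {
    String describe();
  }

  /** Factory built from a Conf instance; it never touches static configuration. */
  static class PluginXFactory {
    private final Conf conf;

    private PluginXFactory(Conf conf) {
      this.conf = conf;
    }

    static PluginXFactory getFactory(Conf conf) {
      return new PluginXFactory(conf);
    }

    PluginX getPlugin(String name) {
      // The setting comes from this factory's own configuration.
      final String mode = conf.get("pluginx.mode", "default");
      return () -> name + ":" + mode;
    }
  }

  public static void main(String[] args) {
    // Two independent configurations living in the same JVM.
    Conf a = new Conf().set("pluginx.mode", "intranet");
    Conf b = new Conf().set("pluginx.mode", "whole-web");
    System.out.println(PluginXFactory.getFactory(a).getPlugin("demo").describe());
    System.out.println(PluginXFactory.getFactory(b).getPlugin("demo").describe());
  }
}
```

Because each factory carries its own Conf, the two plugins above see
different settings even though they run side by side, which is the
"multiple configurations in a single JVM" case.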


I have avoided making these changes since they would make it difficult 
to merge improvements to plugins into the mapred branch.  But, once we 
have moved mapred to trunk, we should make them soon.  Incompatible API 
changes are best made early, so that folks have more time to work with them.
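
As a rough illustration of the IndexingFilter change in item 2, the
sketch below passes the url, the CrawlDatum and the anchors directly, so
no dummy FetchListEntry is needed.  Everything here is assumed for the
example: the CrawlDatum class is a stand-in with a single made-up field,
and the filter signature and the list-of-strings "document" are not the
actual Nutch API.

```java
import java.util.ArrayList;
import java.util.List;

public class IndexingFilterSketch {

  /** Hypothetical stand-in for CrawlDatum; only a fetch time is modelled. */
  static class CrawlDatum {
    final long fetchTime;

    CrawlDatum(long fetchTime) {
      this.fetchTime = fetchTime;
    }
  }

  /** Hypothetical filter API taking the url, datum and anchors directly. */
  interface IndexingFilter {
    List<String> filter(List<String> fields, String url,
                        CrawlDatum datum, String[] anchors);
  }

  /** Example filter: records url, fetch time and each incoming anchor. */
  static class BasicFilter implements IndexingFilter {
    @Override
    public List<String> filter(List<String> fields, String url,
                               CrawlDatum datum, String[] anchors) {
      fields.add("url=" + url);
      fields.add("fetchTime=" + datum.fetchTime);
      for (String anchor : anchors) {
        fields.add("anchor=" + anchor);
      }
      return fields;
    }
  }

  public static void main(String[] args) {
    List<String> fields = new BasicFilter().filter(
        new ArrayList<>(), "http://example.com/",
        new CrawlDatum(1125532800000L), new String[] {"Example", "Home"});
    for (String field : fields) {
      System.out.println(field);
    }
  }
}
```

With a signature along these lines, the indexer hands plugins exactly
the three pieces of information they used to pull out of the
FetchListEntry.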


Does this all sound reasonable?

Doug



Re: merge mapred to trunk

2005-08-31 Thread Jérôme Charron
On 8/31/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
> 
> Doug Cutting wrote:
> > Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> > increases the chances for conflicts. I would thus like to merge the
> > mapred branch into trunk soon. The soonest I could actually start this
> > is next week. Are there any objections?

+1
I haven't taken a look at the mapred branch yet.
It is going to be a good surprise to discover it in the trunk... ;-)

Jérôme

-- 
http://motrech.free.fr/
http://www.frutch.org/


Re: merge mapred to trunk

2005-08-31 Thread Piotr Kosiorowski

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred.  This
> increases the chances for conflicts.  I would thus like to merge the
> mapred branch into trunk soon.  The soonest I could actually start this
> is next week.  Are there any objections?
>
> Doug


+1
P.



merge mapred to trunk

2005-08-31 Thread Doug Cutting
Currently we have three versions of nutch: trunk, 0.7 and mapred.  This 
increases the chances for conflicts.  I would thus like to merge the 
mapred branch into trunk soon.  The soonest I could actually start this 
is next week.  Are there any objections?


Doug