Re: merge mapred to trunk
I will postpone the merge of the mapred branch into trunk until I have a chance to (a) add some MapReduce documentation; and (b) implement MapReduce-based dedup.

Doug

Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> increases the chances for conflicts. I would thus like to merge the
> mapred branch into trunk soon. The soonest I could actually start this
> is next week. Are there any objections?
>
> Doug
Re: merge mapred to trunk
On Wed, 31 Aug 2005 14:37:54 -0700, Doug Cutting wrote:
> [EMAIL PROTECTED] wrote:
> > I, too, am looking forward to this, but I am wondering what that
> > will do to Kelvin Tan's recent contribution, especially since I
> > saw that both MapReduce and Kelvin's code change how
> > FetchListEntry works. If merging mapred to trunk means losing
> > Kelvin's changes, then I suggest one of Nutch developers
> > evaluates Kelvin's modifications and, if they are good, commits
> > them to trunk, and then makes the final pre-mapred release (e.g.
> > release-0.8).
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.
>
> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-
> based code diverges even further. It would be best if Kelvin's
> patch is ported to the mapred branch sooner rather than later, then
> maintained there.
>
> Doug

Agreed. I have some time in the coming weeks, and will work full-time to evolve the patch to be more compatible with Nutch, especially map-red.

k
Re: merge mapred to trunk
--- Doug Cutting <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
> > I, too, am looking forward to this, but I am wondering what that will
> > do to Kelvin Tan's recent contribution, especially since I saw that
> > both MapReduce and Kelvin's code change how FetchListEntry works. If
> > merging mapred to trunk means losing Kelvin's changes, then I suggest
> > one of Nutch developers evaluates Kelvin's modifications and, if they
> > are good, commits them to trunk, and then makes the final pre-mapred
> > release (e.g. release-0.8).
>
> It won't lose Kelvin's patch: it will still be a patch to 0.7.

Ah, right, we could always make a 0.7.* release from release 0.7.

> What I worry about is the alternate scenario: that Kelvin & others
> invest a lot of effort making this work with 0.7, while the mapred-based
> code diverges even further. It would be best if Kelvin's patch is
> ported to the mapred branch sooner rather than later, then maintained
> there.

I agree. I'll actually see Kelvin in person tomorrow, so we'll see if this is something he can do. It looks like he added some much-needed functionality in his patch, so it'd be good to keep it.

Otis
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.
Re: merge mapred to trunk
[EMAIL PROTECTED] wrote:
> I, too, am looking forward to this, but I am wondering what that will
> do to Kelvin Tan's recent contribution, especially since I saw that
> both MapReduce and Kelvin's code change how FetchListEntry works. If
> merging mapred to trunk means losing Kelvin's changes, then I suggest
> one of Nutch developers evaluates Kelvin's modifications and, if they
> are good, commits them to trunk, and then makes the final pre-mapred
> release (e.g. release-0.8).

It won't lose Kelvin's patch: it will still be a patch to 0.7.

What I worry about is the alternate scenario: that Kelvin & others invest a lot of effort making this work with 0.7, while the mapred-based code diverges even further. It would be best if Kelvin's patch is ported to the mapred branch sooner rather than later, then maintained there.

Doug
Re: merge mapred to trunk
> Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> increases the chances for conflicts. I would thus like to merge the
> mapred branch into trunk soon. The soonest I could actually start
> this is next week. Are there any objections?

I, too, am looking forward to this, but I am wondering what that will do to Kelvin Tan's recent contribution, especially since I saw that both MapReduce and Kelvin's code change how FetchListEntry works. If merging mapred to trunk means losing Kelvin's changes, then I suggest one of Nutch developers evaluates Kelvin's modifications and, if they are good, commits them to trunk, and then makes the final pre-mapred release (e.g. release-0.8).

Otis
Simpy -- http://www.simpy.com/ -- Find it. Tag it. Share it.
Re: merge mapred to trunk
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> increases the chances for conflicts. I would thus like to merge the
> mapred branch into trunk soon. The soonest I could actually start this
> is next week. Are there any objections?

++1 :-)

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
Re: merge mapred to trunk
Jérôme Charron wrote:
> I haven't taken a look at the mapred branch yet. It is going to be a
> good surprise to discover it in the trunk... ;-)

I will make some effort to document things more before I merge to trunk, so that folks know what they're getting. Many things have changed (e.g., segment format). Several things have not yet been fully worked out and/or implemented (e.g., segment merging). But the basics are all working (intranet & whole-web crawling, indexing & search), both in standalone and distributed configurations.

My focus has been stress testing the distributed infrastructure (NDFS & MapReduce). We've discovered and fixed a number of bugs in this over recent weeks, so it is getting ever more stable. I'm hoping that others can help fill in the gaps in tools.

Once the merge is done I'd like to make a few other changes. These are:

1. Remove most static references to NutchConf outside of main() routines. The MapReduce-based versions of the command line tools have no such references. The biggest change here will be to plugins. Plugin APIs should probably all be modified to use a factory, and the factory should be constructed from a NutchConf, e.g., something like:

    public static PluginXFactory PluginXFactory.getFactory(NutchConf);
    public PluginX PluginXFactory.getPlugin(...);

This should permit folks to more easily configure things programmatically (think JMX) and to run multiple configurations in a single JVM.

2. FetchListEntry has been mostly replaced with a new, simpler data structure called a CrawlDatum. FetchListEntry is used in the IndexingFilter API to pass the url, fetch date and incoming anchors. Currently, in the mapred branch, the indexer creates a dummy FetchListEntry to pass to plugins. But instead the IndexingFilter API should probably be altered to pass the CrawlDatum, anchors and url.

I have avoided making these changes since they would make it difficult to merge improvements to plugins into the mapred branch. But, once we have moved mapred to trunk, we should make them soon. Incompatible API changes are best made early, so that folks have more time to work with them.

Does this all sound reasonable?

Doug
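
[Editor's sketch] The factory idea in item 1 of the message above might look roughly like the following. This is only an illustration under stated assumptions, not code from the mapred branch: SketchConf is a hypothetical stand-in for NutchConf, PluginX and PluginXFactory follow the placeholder names used in the message, and the "sketch.agent" property key and the String argument to getPlugin are invented for the example. The point it shows is that each factory carries its own configuration instance, so two configurations can coexist in one JVM without static state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf: an instance-level key/value configuration.
final class SketchConf {
    private final Map<String, String> props = new HashMap<>();

    SketchConf set(String key, String value) {
        props.put(key, value);
        return this;
    }

    String get(String key, String defaultValue) {
        return props.getOrDefault(key, defaultValue);
    }
}

// Placeholder plugin interface, named after "PluginX" in the message.
interface PluginX {
    String describe();
}

// Factory constructed from a configuration instance instead of static state.
final class PluginXFactory {
    private final SketchConf conf;

    private PluginXFactory(SketchConf conf) {
        this.conf = conf;
    }

    static PluginXFactory getFactory(SketchConf conf) {
        return new PluginXFactory(conf);
    }

    // The String argument is an assumption; the message leaves the parameters open.
    PluginX getPlugin(String name) {
        final String agent = conf.get("sketch.agent", "unset");
        return () -> name + " configured with agent=" + agent;
    }
}

public class FactorySketch {
    public static void main(String[] args) {
        // Two independent configurations living side by side in one JVM.
        SketchConf a = new SketchConf().set("sketch.agent", "crawler-a");
        SketchConf b = new SketchConf().set("sketch.agent", "crawler-b");

        System.out.println(PluginXFactory.getFactory(a).getPlugin("parse").describe());
        System.out.println(PluginXFactory.getFactory(b).getPlugin("parse").describe());
    }
}
```

Because nothing here reads a static configuration, a management layer (JMX, for instance) could in principle build or replace a SketchConf at runtime and hand it to a fresh factory.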
Re: merge mapred to trunk
On 8/31/05, Piotr Kosiorowski <[EMAIL PROTECTED]> wrote:
>
> Doug Cutting wrote:
> > Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> > increases the chances for conflicts. I would thus like to merge the
> > mapred branch into trunk soon. The soonest I could actually start this
> > is next week. Are there any objections?

+1

I haven't taken a look at the mapred branch yet. It is going to be a good surprise to discover it in the trunk... ;-)

Jérôme

--
http://motrech.free.fr/
http://www.frutch.org/
Re: merge mapred to trunk
Doug Cutting wrote:
> Currently we have three versions of nutch: trunk, 0.7 and mapred. This
> increases the chances for conflicts. I would thus like to merge the
> mapred branch into trunk soon. The soonest I could actually start this
> is next week. Are there any objections?
>
> Doug

+1

P.
merge mapred to trunk
Currently we have three versions of nutch: trunk, 0.7 and mapred. This increases the chances for conflicts. I would thus like to merge the mapred branch into trunk soon. The soonest I could actually start this is next week. Are there any objections?

Doug