Yes, big time interest, Breno! Thanks and would appreciate your
contribution. Instructions are here if you use GitHub:

http://github.com/apache/nutch/#contributing, otherwise, JIRA and SVN patch
would be fine too.

Thanks!

Cheers,
Chris

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Chief Architect
Instrument Software and Science Data Systems Section (398)
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 168-519, Mailstop: 168-527
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Associate Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

-----Original Message-----
From: Breno Faria <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 3, 2015 at 1:13 AM
To: "[email protected]" <[email protected]>
Subject: AW: Deduplication -- custom Signature

>Hi Lewis,
>
>Thanks for the explanation!
>
>> I understand that and I would say it makes sense, however as I said
>>before Signatures are usually part of the core codebase and not
>>implemented as plugins (at least I've never implemented a signature as a
>>plugin).
>
>Is there interest in including the domain aware Signature in nutch? I
>would gladly contribute.
>
>Cheers
>
>Breno Faria
>Software Architect – Text Analytics
>Intrafind Software AG
>Tel: +49 (89) 3090446-26
>Web: http://www.intrafind.de
>
>
>-----Original Message-----
>From: Lewis John Mcgibbney [mailto:[email protected]]
>Sent: Tuesday, June 2, 2015 19:06
>To: [email protected]
>Subject: Re: Deduplication -- custom Signature
>
>Hi Breno,
>
>On Tue, Jun 2, 2015 at 1:38 AM, <[email protected]> wrote:
>
>>
>> We are indexing several domains for a specific project, which may
>> contain duplicated content (e.g. pdf files). The users of the system
>> come from different organisations and wonder why the content is not
>> appearing under certain domains.
>> It's a usability issue (with a political aftertaste).
>>
>
>Thanks for explanation.
>
>
>>
>> Yes, I extended Signature, and I'm also able to use it through the
>> db.signature.class property, if I pack the class into its own jar and
>> put it into nutch/lib. I'd much rather like to include it in our
>> existing plugin jar, though.
>
>
>This is rather strange as Signatures are part of the *core* codebase e.g.
>/src/java and not /src/plugins. Does this make sense?
>
>
>> I'm not sure what you mean by ".job jar.".
>
>
>If you build the Nutch source, you'll see /runtime/deploy/nutch.XXX.job
>this is the main artifact sent to deployment clusters (JobTracker).
>
>
>> We have been developing our plugin outside of nutch and placing the
>> corresponding jars into a plugin directory together with the
>> plugin.xml. Is there any "magic" happening regarding the classpath
>> when one has ant building it inside nutch?
>
>
>In general our documentation can be seen here
>http://wiki.apache.org/nutch/PluginCentral
>Specifically, you can see here
>http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading
>This is why I think it is a bit strange that you've implemented your
>signature as a plugin and not as part of the core codebase.
>
>
>
>> Is there a naming convention regarding the plugin name and
>> corresponding jar? Do they have to match?
>>
>
>For plugins, accompanying and required files and naming conventions
>please see http://wiki.apache.org/nutch/WritingPluginExample
>
>
>>
>> The reason behind developing our plugin outside of nutch and
>> decoupling the build environment is to make updates of nutch easier.
>> That way we can simply download the binary release and overlay our
>> plugin. I realize now this seems to be a little off the usual way of
>>writing plugins for nutch.
>>
>
>I understand that and I would say it makes sense, however as I said
>before Signatures are usually part of the core codebase and not
>implemented as plugins (at least I've never implemented a signature as a
>plugin).
>hth
>Lewis
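
For readers following the thread, here is a minimal sketch of what a "domain aware" signature could look like. It is not Breno's implementation, which is not shown in the thread, and it deliberately avoids any Nutch classes so it runs on its own. The class name DomainAwareSignatureSketch and the method calculate(String, byte[]) are hypothetical, as is the choice of MD5 over host plus raw content. In Nutch itself the same idea would presumably live in a class extending Signature and be selected through the db.signature.class property mentioned above; that wiring is an assumption and is left out.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

/** Hypothetical, Nutch-independent sketch of a domain-aware content signature. */
public class DomainAwareSignatureSketch {

    /**
     * Returns an MD5 digest over the lower-cased host of the URL, a zero
     * separator byte, and the raw content bytes. Identical content on
     * different hosts therefore gets different signatures.
     */
    public static byte[] calculate(String url, byte[] content) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // assumption: fall back to a content-only signature
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            md5.update((byte) 0); // separator between host and content
            md5.update(content);
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "same bytes".getBytes(StandardCharsets.UTF_8);
        byte[] a = calculate("http://a.example.org/doc.pdf", pdf);
        byte[] b = calculate("http://b.example.org/doc.pdf", pdf);
        byte[] a2 = calculate("http://a.example.org/copy.pdf", pdf);
        // Same content, different hosts: signatures differ.
        System.out.println(Arrays.equals(a, b));
        // Same content, same host: still treated as duplicates.
        System.out.println(Arrays.equals(a, a2));
    }
}
```

With a signature like this, the same PDF published under two organisations' domains would no longer be collapsed into one entry by deduplication, while copies within a single host would still be detected.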

