Hi Lewis,

Thanks for the explanation!

> I understand that and I would say it makes sense, however as I said before
> Signatures are usually part of the core codebase and not implemented as
> plugins (at least I've never implemented a signature as a plugin).

Is there interest in including the domain-aware Signature in Nutch? I would gladly contribute.

Cheers,
Breno Faria
Software Architect – Text Analytics
Intrafind Software AG
Tel: +49 (89) 3090446-26
Web: http://www.intrafind.de

-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]]
Sent: Tuesday, 2 June 2015 19:06
To: [email protected]
Subject: Re: Deduplication -- custom Signature

Hi Breno,

On Tue, Jun 2, 2015 at 1:38 AM, <[email protected]> wrote:

> We are indexing several domains for a specific project, which may
> contain duplicated content (e.g. pdf files). The users of the system
> come from different organisations and wonder why the content is not
> appearing under certain domains. It's a usability issue (with a political
> aftertaste).

Thanks for the explanation.

> Yes, I extended Signature, and I'm also able to use it through the
> db.signature.class property, if I pack the class into its own jar and
> put it into nutch/lib. I'd much rather like to include it in our
> existing plugin jar, though.

This is rather strange, as Signatures are part of the *core* codebase, e.g. /src/java and not /src/plugins. Does this make sense?

> I'm not sure what you mean by ".job jar.".

If you build the Nutch source, you'll see /runtime/deploy/nutch.XXX.job; this is the main artifact sent to deployment clusters (JobTracker).

> We have been developing our plugin outside of nutch and placing the
> corresponding jars into a plugin directory together with the
> plugin.xml. Is there any "magic" happening regarding the classpath
> when one has ant building it inside nutch?

In general, our documentation can be seen here:
http://wiki.apache.org/nutch/PluginCentral
Specifically, you can see here:
http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading
This is why I think it is a bit strange that you've implemented your signature as a plugin and not as part of the core codebase.

> Is there a naming convention regarding the plugin name and
> corresponding jar? Do they have to match?

For plugins, accompanying and required files, and naming conventions, please see
http://wiki.apache.org/nutch/WritingPluginExample

> The reason behind developing our plugin outside of nutch and
> decoupling the build environment is to make updates of nutch easier.
> That way we can simply download the binary release and overlay our
> plugin. I realize now this seems to be a little off the usual way of writing
> plugins for nutch.

I understand that and I would say it makes sense, however as I said before Signatures are usually part of the core codebase and not implemented as plugins (at least I've never implemented a signature as a plugin).

hth
Lewis
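
For readers following the thread: the domain-aware signature discussed above is never shown, so the following is only a rough sketch of the general idea, not Breno's implementation. It assumes that "domain aware" means mixing the URL's host into the content hash, so that identical documents served from different hosts are no longer treated as duplicates. The class name DomainAwareSignatureSketch and its calculate(url, content) method are hypothetical and deliberately do not depend on any Nutch classes; in a real setup the class would extend Nutch's Signature and be selected through the db.signature.class property mentioned in the thread.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical, Nutch-independent sketch of a "domain-aware" signature.
class DomainAwareSignatureSketch {

    // Returns MD5(host + 0x00 + content). Assumption: the host part of the URL
    // stands in for the "domain"; a real implementation might use the
    // registered domain instead.
    static byte[] calculate(String url, byte[] content) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // URLs without a host fall back to a content-only hash input
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            md5.update((byte) 0); // separator between host and content bytes
            md5.update(content);
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "same pdf bytes".getBytes(StandardCharsets.UTF_8);
        byte[] a = calculate("http://a.example.org/doc.pdf", pdf);
        byte[] b = calculate("http://b.example.org/doc.pdf", pdf);
        // Same content on two hosts now yields two different signatures.
        System.out.println("signatures equal: " + Arrays.equals(a, b));
    }
}
```

With a signature like this, deduplication would still collapse copies of a document within one host, while the same PDF published under two organisations' hosts would be kept under both.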

