Hi Lewis,

Thanks for the explanation!

> I understand that and I would say it makes sense, however as I said before 
> Signatures are usually part of the core codebase and not implemented as 
> plugins (at least I've never implemented a signature as a plugin).

Is there interest in including the domain-aware Signature in Nutch? I would 
gladly contribute it.
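
For context, here is a minimal sketch of what "domain-aware" could mean. It is 
not the actual implementation: the class name DomainAwareSignatureSketch and 
its static calculate(url, content) helper are hypothetical, and the URL host 
stands in for "domain". In Nutch the same logic would presumably live in a 
subclass of org.apache.nutch.crawl.Signature, inside 
calculate(Content content, Parse parse), with the URL and the raw bytes taken 
from the Content object. The idea is to mix the host into the digest so that 
identical files served under different domains get different signatures and 
are not deduplicated against each other.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Locale;

// Hypothetical, stand-alone sketch; it does not depend on Nutch classes.
final class DomainAwareSignatureSketch {

    // Returns an MD5 digest over (lower-cased host, separator byte, content).
    static byte[] calculate(String url, byte[] content) {
        String host = URI.create(url).getHost();
        if (host == null) {
            host = ""; // assumption: URLs without a host share one bucket
        }
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(host.toLowerCase(Locale.ROOT).getBytes(StandardCharsets.UTF_8));
            md5.update((byte) 0); // separator between host and content
            md5.update(content);
            return md5.digest();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        byte[] pdf = "same bytes".getBytes(StandardCharsets.UTF_8);
        byte[] a = calculate("http://org-a.example/report.pdf", pdf);
        byte[] b = calculate("http://org-b.example/report.pdf", pdf);
        byte[] c = calculate("http://org-a.example/copy/report.pdf", pdf);
        // Same content, different hosts: signatures differ.
        System.out.println("a equals b: " + Arrays.equals(a, b));
        // Same content, same host: signatures match, so dedup still applies.
        System.out.println("a equals c: " + Arrays.equals(a, c));
    }
}
```

With a signature built this way, duplicates within one domain would still 
collapse, while each domain keeps its own copy of a shared document.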

Cheers

Breno Faria
Software Architect – Text Analytics
Intrafind Software AG
Tel:      +49 (89) 3090446-26 
Web:    http://www.intrafind.de


-----Original Message-----
From: Lewis John Mcgibbney [mailto:[email protected]] 
Sent: Tuesday, 2 June 2015 19:06
To: [email protected]
Subject: Re: Deduplication -- custom Signature

Hi Breno,

On Tue, Jun 2, 2015 at 1:38 AM, <[email protected]> wrote:

>
> We are indexing several domains for a specific project, which may 
> contain duplicated content (e.g. pdf files). The users of the system 
> come from different organisations and wonder why the content is not 
> appearing under certain domains. It's a usability issue (with a political 
> aftertaste).
>

Thanks for the explanation.


>
> Yes, I extended Signature, and I'm also able to use it through the 
> db.signature.class property, if I pack the class into its own jar and 
> put it into nutch/lib. I'd much rather like to include it in our 
> existing plugin jar, though.


This is rather strange as Signatures are part of the *core* codebase e.g.
/src/java and not /src/plugins. Does this make sense?


> I'm not sure what you mean by ".job jar.".


If you build the Nutch source, you'll see /runtime/deploy/nutch.XXX.job. This is 
the main artifact sent to deployment clusters (JobTracker).


> We have been developing our plugin outside of nutch and placing the 
> corresponding jars into a plugin directory together with the 
> plugin.xml. Is there any "magic" happening regarding the classpath 
> when one has ant building it inside nutch?


In general our documentation can be seen here 
http://wiki.apache.org/nutch/PluginCentral
Specifically, you can see here
http://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading
This is why I think it is a bit strange that you've implemented your signature 
as a plugin and not as part of the core codebase.



> Is there a naming convention regarding the plugin name and 
> corresponding jar? Do they have to match?
>

For plugins, accompanying and required files and naming conventions please see 
http://wiki.apache.org/nutch/WritingPluginExample


>
> The reason behind developing our plugin outside of nutch and 
> decoupling the build environment is to make updates of nutch easier. 
> That way we can simply download the binary release and overlay our 
> plugin. I realize now this seems to be a little off the usual way of writing 
> plugins for nutch.
>

I understand that, and I would say it makes sense. However, as I said before, 
Signatures are usually part of the core codebase and not implemented as plugins 
(at least I've never implemented a signature as a plugin).
hth
Lewis
