>Hi Hiran,
>
>ok, got it. - the problem is already given in 
>https://issues.apache.org/jira/browse/NUTCH-427 :)

Indeed. Rereading that issue, it describes exactly what I am seeing.

>In this case, you're right. The plugin system wasn't designed to manipulate 
>Java system properties.

If the plugin system does not touch them, then setting the system property 
myself when running the crawl script should have helped. But I probably missed 
putting the jar on the system classpath.

> But it should be possible to do it by adding a static hook which is called 
> before instantiation.

If you look at the protocol-smb plugin, it already comes with such a static 
hook, but since it is never executed it does not help.
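
For illustration, a minimal sketch of what such a static hook might look like. 
This is not the code from protocol-smb; the class name HandlerPackageHook and 
its methods are made up for this mail. The only facts it relies on are that 
java.protocol.handler.pkgs is a "|"-separated list of package prefixes and that 
a static initializer runs only when its class is initialized. If nothing 
touches the class before the first smb:// URL is parsed, the hook has no 
effect, which matches what I see.

```java
// Hypothetical sketch, not the actual protocol-smb code.
final class HandlerPackageHook {
    static final String KEY = "java.protocol.handler.pkgs";

    /** Returns the property value with pkg appended, unless it is already listed. */
    static String appendPackage(String current, String pkg) {
        if (current == null || current.isEmpty()) {
            return pkg;
        }
        // The property holds package prefixes separated by '|'.
        for (String existing : current.split("\\|")) {
            if (existing.equals(pkg)) {
                return current;
            }
        }
        return current + "|" + pkg;
    }

    /** Adds pkg to the system property, keeping entries that are already there. */
    static void register(String pkg) {
        System.setProperty(KEY, appendPackage(System.getProperty(KEY), pkg));
    }

    // The static hook itself: it runs when this class is initialized,
    // so it only helps if the class is used before the URL is parsed.
    static {
        register("jcifs");
    }
}
```

Even when the hook runs, java.net.URL still has to be able to load 
jcifs.smb.Handler itself, so the jar must be visible outside the plugin's 
class loader.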

> The second problem would be the class loader encapsulation: the class 
> java.net.URL is used in many places and the protocol handler 
> (jcifs.smb.Handler) must be globally available.

True. That is why I had assumed that the Nutch configuration code would, at 
some point, collect all the protocol plugins (everything registered to the 
protocol extension point) and set the system property globally. I could not 
find any such code, though.
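
To make clearer what I was looking for, here is a rough sketch of a global 
registration. It swaps in a different technique than the system property: a 
URLStreamHandlerFactory installed through URL.setURLStreamHandlerFactory, which 
the JDK allows at most once per JVM. I am not claiming Nutch does this; the 
names PluginProtocolRegistry and ParseOnlyHandler are invented, and the handler 
only lets URLs parse without opening connections. The point is that the factory 
holds handler instances, so java.net.URL does not need to load the handler 
class by name itself.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one JVM-wide registry that plugins add handlers to.
final class PluginProtocolRegistry implements URLStreamHandlerFactory {
    private final Map<String, URLStreamHandler> handlers = new ConcurrentHashMap<>();

    /** Called once per protocol plugin, with a handler the plugin created. */
    void register(String protocol, URLStreamHandler handler) {
        handlers.put(protocol.toLowerCase(Locale.ROOT), handler);
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        // Returning null lets the JDK continue with its normal lookup (http, file, ...).
        return handlers.get(protocol.toLowerCase(Locale.ROOT));
    }

    /** Handler that only makes URLs of a scheme parseable; it cannot connect. */
    static final class ParseOnlyHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("no connection support for " + u.getProtocol());
        }
    }

    /** Convenience for experiments: parse a URL, or return null if the scheme is unknown. */
    static URL parseOrNull(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            return null;
        }
    }
}
```

With a handler registered for "foo" and the registry installed through 
URL.setURLStreamHandlerFactory, a URL like foo://bar/baz should parse anywhere 
in the JVM, regardless of which class loader created the handler.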

> But be pragmatic - protocol-smb will not make it into the "official" Nutch 
> package because of the LGPL license [1]. 

That is understood. Although I could think of two other exercises that would 
help:
- create a tutorial to add some arbitrary protocol (e.g. the foo://bar/baz URL)
- modify the protocol-smb plugin to make use of the smbclient binary.

I'd be willing to do the latter but would like to see a less clumsy behaviour 
for plugins. Adding the plugin plus modifying config files should be enough in 
my eyes.

> To make protocol-smb working for "your" Nutch package:
> 
> 1. set the system property accordingly. If you use bin/nutch, modify it or 
> pass it via the environment variable
>    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
>
> 2. make sure that the jcifs jar is added as global dependency
>    - add it to ivy/ivy.xml
>    - or copy it to runtime/local/lib/  (local mode for quick testing)
>   (or alternatively copy the jcifs/smb/Handler.java and dependencies
>    to your source tree)

So far I used

  bin/crawl --index -D solr.server.url=http://172.17.0.9:8983/solr/nutch -D java.protocol.handler.pkgs=jcifs urls crawl


but I will try your hints. Will need a few days for this.
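
Before running a full crawl I will probably check with something like the 
following whether java.net.URL can resolve the scheme at all. It is only a 
sketch; the class name ProtocolCheck is made up, and it relies on nothing 
beyond the fact that the URL constructor throws MalformedURLException for a 
protocol without a handler.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper: tells whether the JVM can find a handler for a URL's scheme.
final class ProtocolCheck {
    static boolean isSupported(String spec) {
        try {
            new URL(spec); // fails if no handler is registered for the protocol
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String spec : args) {
            System.out.println(spec + " -> " + isSupported(spec));
        }
    }
}
```

If the property and the jar are set up as you describe, I would expect an 
smb:// URL to be reported as supported, and as unsupported otherwise.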

Hiran 
