Hi Hiran,

OK, got it. The problem is already described in
https://issues.apache.org/jira/browse/NUTCH-427 :)

In this case, you're right: the plugin system wasn't designed to manipulate
Java system properties.
But it should be possible to do it by adding a static hook which is called
before plugin instantiation.
The second problem would be the class loader encapsulation: the class
java.net.URL is used in many places, and the protocol handler
(jcifs.smb.Handler) must be globally available.
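
To illustrate the static hook idea, here is a minimal sketch. It is not
existing Nutch code: the class name HandlerPkgs and its methods are made up
for this mail. Only the property name java.protocol.handler.pkgs and its
"|"-separated format come from the JDK. I assume the hook would have to run
before the first smb:// URL is constructed, and the handler class would still
need to be visible outside the plugin class loader:

```java
public final class HandlerPkgs {

    // System property read by java.net.URL when looking up protocol handlers.
    private static final String KEY = "java.protocol.handler.pkgs";

    private HandlerPkgs() {
    }

    /**
     * Returns the "|"-separated package list with pkg included exactly once.
     * Pure helper without side effects, so it is easy to test.
     */
    public static String withPackage(String current, String pkg) {
        if (current == null || current.trim().isEmpty()) {
            return pkg;
        }
        for (String p : current.split("\\|")) {
            if (p.trim().equals(pkg)) {
                return current; // already registered, keep the value as-is
            }
        }
        return current + "|" + pkg;
    }

    /** Hypothetical hook: adds pkg to the JVM-wide handler package list. */
    public static void register(String pkg) {
        System.setProperty(KEY, withPackage(System.getProperty(KEY), pkg));
    }

    public static void main(String[] args) {
        register("jcifs");
        System.out.println(KEY + "=" + System.getProperty(KEY));
    }
}
```

Appending instead of overwriting keeps any handler packages that were
already configured on the command line.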

But be pragmatic: protocol-smb will not make it into the "official" Nutch
package because of the LGPL license [1].  To make protocol-smb work in
"your" Nutch package:

1. Set the system property accordingly. If you use bin/nutch, either modify
the script or pass the property via the environment variable NUTCH_OPTS:
    export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs

2. Make sure that the jcifs jar is added as a global dependency:
    - add it to ivy/ivy.xml
    - or copy it to runtime/local/lib/  (local mode for quick testing)
    - or alternatively copy jcifs/smb/Handler.java and its dependencies
      to your source tree
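
To verify both steps, something like the following could be run with the
same JVM options and classpath that your Nutch setup uses. It is only a
sketch: the class name ProtocolCheck is hypothetical, and it merely tests
whether java.net.URL accepts the scheme, not whether the share is reachable.
The default URL is the one from your log:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class ProtocolCheck {

    /** Returns true if java.net.URL finds a handler for the URL's scheme. */
    public static boolean isSupported(String url) {
        try {
            new URL(url);
            return true;
        } catch (MalformedURLException e) {
            // Thrown e.g. with the message "unknown protocol: smb".
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("java.protocol.handler.pkgs="
            + System.getProperty("java.protocol.handler.pkgs"));
        String[] urls = args.length > 0 ? args : new String[] {"smb://nas/Documents"};
        for (String url : urls) {
            System.out.println(url + " -> "
                + (isSupported(url) ? "handler found" : "unknown protocol"));
        }
    }
}
```

If smb:// is still reported as unknown although the property is set, the
jcifs jar is most likely not on the global classpath.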

Best,
Sebastian

[1] https://www.apache.org/legal/resolved.html#category-x

On 09/18/2017 09:42 PM, Hiran CHAUDHURI wrote:
> Hello Sebastian,
> 
>>> Is it possible that the plugin lifecycle is broken or at least buggy?
>>
>> The Nutch plugin system is complex but in general a good idea 
>> (https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely 
>> not broken, 
>> although there may be issues (e.g., the recently fixed NUTCH-2378).
>>
>> Regarding the protocol plugins: I haven't tried protocol-smb but other 
>> protocol plugins
>> (protocol-file or protocol-ftp) use the same mechanism to register the 
>> supported protocol:
> 
> I'm afraid the protocols file and ftp are no good examples, as they are known 
> to the Java platform out of the box.
> I just tried this sample application:
> 
> ----8<----------------------------------------------------
> package test;
> 
> import java.net.URL;
> 
> public class Test {
> 
>     public static void main(String[] args) throws Exception {
>         new URL("http://foo/bar");
>         new URL("https://foo/bar");
>         new URL("file://foo/blar");
>         new URL("ftp://foo/bar");
>         new URL("smb://foo/bar");
>         new URL("foo://bar/baz");
>     }
>     
> }
> ---------------------------------------->8----------------
> 
> The output is, as expected, "Exception in thread "main" 
> java.net.MalformedURLException: unknown protocol: smb".
> The smb protocol, as well as the foo protocol, needs to be installed in the 
> JVM by setting the system property java.protocol.handler.pkgs.
> An example is visible on https://jcifs.samba.org/src/src/jcifs/Config.java, 
> in the method registerSmbURLHandler().
> 
>> The plugin.xml defines the supported protocol:
>>
>> <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
>>             point="org.apache.nutch.protocol.Protocol">
>>    <implementation id="org.apache.nutch.protocol.smb.SMB"
>>                    class="org.apache.nutch.protocol.smb.SMB">
>>      <parameter name="protocolName" value="smb" />
>>    </implementation>
>>  </extension>
>>
>> The check whether a protocol is supported by one of the registered plugins 
>> is done without any protocol plugin instantiated just using the plugin.xml.
> 
> My feeling is that this check happens or does not happen, but at some point 
> in time Nutch tries to run the URL() constructor, which does not rely on the 
> PluginRepository but the JVM factory methods which are unaware of the new 
> protocol.
> 
>> If the protocol "smb" is not supported you should find a message:
>>  org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
> 
>> If you see a MalformedURLException the problem is located somewhere else.
>> What is the exact error message and the full stack trace?
> 
> Here is what I get:
> ----8<----------------------------------------------------
> Executing bin/crawl --index -D 
> solr.server.url=http://172.17.0.9:8983/solr/nutch -D 
> java.protocol.handler.pkgs=jcifs urls crawl 1
> Injecting seed URLs
> /nutch/bin/nutch inject crawl/crawldb urls
> 2017-09-18 19:25:20,324 INFO  org.apache.nutch.crawl.Injector - Injector: 
> starting at 2017-09-18 19:25:20
> 2017-09-18 19:25:20,326 INFO  org.apache.nutch.crawl.Injector - Injector: 
> crawlDb: crawl/crawldb
> 2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: 
> urlDir: urls
> 2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: 
> Converting injected urls to crawl db entries.
> 2017-09-18 19:25:20,610 WARN  org.apache.hadoop.util.NativeCodeLoader - 
> Unable to load native-hadoop library for your platform... using builtin-java 
> classes where applicable
> 2017-09-18 19:25:22,620 INFO  org.apache.nutch.plugin.PluginRepository - 
> Plugins: looking in: /nutch/plugins
> 2017-09-18 19:25:22,851 WARN  org.apache.nutch.plugin.PluginRepository - 
> Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` 
> java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No 
> such file or directory)
> 2017-09-18 19:25:22,904 WARN  org.apache.nutch.plugin.PluginRepository - 
> Error while loading plugin `/nutch/plugins/plugin/plugin.xml` 
> java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file 
> or directory)
> 2017-09-18 19:25:22,956 WARN  org.apache.nutch.plugin.PluginRepository - 
> Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` 
> java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No 
> such file or directory)
> 2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - 
> Plugin Auto-activation mode: [true]
> 2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - 
> Registered Plugins:
> 2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Regex URL Filter (urlfilter-regex)
> 2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Html Parse Plug-in (parse-html)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   HTTP Framework (lib-http)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   the nutch core extension points (nutch-extensionpoints)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Basic Indexing Filter (index-basic)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Anchor Indexing Filter (index-anchor)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Tika Parser Plug-in (parse-tika)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Regex URL Filter Framework (lib-regex-filter)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   OPIC Scoring Plug-in (scoring-opic)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   SMB Protocol Plug-in (protocol-smb)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Http Protocol Plug-in (protocol-http)
> 2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -      
>   File Protocol Plug-in (protocol-file)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   SolrIndexWriter (indexer-solr)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository - 
> Registered Extension-Points:
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -      
>   Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2017-09-18 19:25:23,131 INFO  
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find 
> rules for scope 'inject', using default
> 2017-09-18 19:25:23,469 WARN  org.apache.nutch.crawl.Injector - Skipping 
> smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
> 2017-09-18 19:25:23,473 INFO  
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find 
> rules for scope 'inject', using default
> 2017-09-18 19:25:23,771 INFO  org.apache.nutch.crawl.Injector - Injector: 
> overwrite: false
> 2017-09-18 19:25:23,772 INFO  org.apache.nutch.crawl.Injector - Injector: 
> update: false
> 2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: 
> Total urls rejected by filters: 2
> 2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: 
> Total urls injected after normalization and filtering: 0
> 2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: 
> Total urls injected but already in CrawlDb: 0
> 2017-09-18 19:25:24,286 INFO  org.apache.nutch.crawl.Injector - Injector: 
> Total new urls injected: 0
> 2017-09-18 19:25:24,288 INFO  org.apache.nutch.crawl.Injector - Injector: 
> finished at 2017-09-18 19:25:24, elapsed: 00:00:03
> ---------------------------------------->8----------------
> 
> Mind the fact that both the plugin and the extension point are listed, and 
> still there is this warning line with the hint for a MalformedURLException.
> 
> Hiran
> 
