Hello Sebastian,

>> Is it possible that the plugin lifecycle is broken or at least buggy?
>
>The Nutch plugin system is complex but in general a good idea 
>(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely not 
>broken, 
>although there may be issues (e.g., the recently fixed NUTCH-2378).
>
> Regarding the protocol plugins: I haven't tried protocol-smb but other 
> protocol plugins
> (protocol-file or protocol-ftp) use the same mechanism to register the 
> supported protocol:

I'm afraid protocol-file and protocol-ftp are not good examples, because the 
file and ftp protocols are known to the Java platform out of the box.
I just tried this sample application:

----8<----------------------------------------------------
package test;

import java.net.URL;

public class Test {

    public static void main(String[] args) throws Exception {
        new URL("http://foo/bar");
        new URL("https://foo/bar");
        new URL("file://foo/blar");
        new URL("ftp://foo/bar");
        new URL("smb://foo/bar");
        new URL("foo://bar/baz");
    }
    
}
---------------------------------------->8----------------

The output is, as expected, "Exception in thread "main" 
java.net.MalformedURLException: unknown protocol: smb".
The smb protocol, as well as the foo protocol, needs to be made known to the 
JVM, for example by setting the system property java.protocol.handler.pkgs.
An example can be found at https://jcifs.samba.org/src/src/jcifs/Config.java, 
in the method registerSmbURLHandler().
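
A minimal sketch of such a registration, to make the mechanism concrete. It is 
not the jCIFS code itself: the class HandlerPkgsSketch and both of its methods 
are hypothetical names. It assumes the JVM looks for a class named 
<package>.<protocol>.Handler in each package listed in the property, so smb 
only becomes known if a class such as jcifs.smb.Handler can actually be loaded.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Arrays;

// Hypothetical helper class; neither the name nor the methods come from Nutch or jCIFS.
class HandlerPkgsSketch {

    static final String KEY = "java.protocol.handler.pkgs";

    /** Prepends pkg to the '|'-separated package list unless it is already listed. */
    static String registerHandlerPackage(String pkg) {
        String current = System.getProperty(KEY);
        String updated;
        if (current == null || current.isEmpty()) {
            updated = pkg;
        } else if (Arrays.asList(current.split("\\|")).contains(pkg)) {
            updated = current;
        } else {
            updated = pkg + "|" + current;
        }
        System.setProperty(KEY, updated);
        return updated;
    }

    /** Returns true if the JVM finds a stream handler for the scheme of spec. */
    static boolean isKnownToJvm(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        registerHandlerPackage("jcifs");
        // Assumption: true only if the jCIFS jar (jcifs.smb.Handler) is on the classpath.
        System.out.println("smb known to JVM: " + isKnownToJvm("smb://nas/Documents"));
    }
}
```

Setting the property alone is not sufficient: the handler class must also be 
visible to the class loader that the JVM uses for this lookup.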

>The plugin.xml defines the supported protocol:
>
> <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
>             point="org.apache.nutch.protocol.Protocol">
>    <implementation id="org.apache.nutch.protocol.smb.SMB"
>                    class="org.apache.nutch.protocol.smb.SMB">
>      <parameter name="protocolName" value="smb" />
>    </implementation>
>  </extension>
>
>The check whether a protocol is supported by one of the registered plugins is 
>done without any protocol plugin instantiated just using the plugin.xml.

My feeling is that, whether or not this check happens, at some point Nutch 
calls the URL() constructor, which does not consult the PluginRepository but 
the JVM's own handler lookup, and that lookup is unaware of the new protocol.
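
To illustrate the JVM side of this, here is a sketch of a 
URLStreamHandlerFactory that makes a scheme parseable by java.net.URL. It is 
not how Nutch works and not a proposed fix: the class ParseOnlyFactorySketch 
and its parses() helper are hypothetical, and the handler it returns can only 
parse URLs, not open connections.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical class: a factory whose handlers only allow URLs to be parsed.
class ParseOnlyFactorySketch implements URLStreamHandlerFactory {

    private final Set<String> schemes;

    ParseOnlyFactorySketch(String... schemes) {
        this.schemes = new HashSet<>(Arrays.asList(schemes));
    }

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (!schemes.contains(protocol)) {
            return null; // let the JVM continue with its normal lookup
        }
        return new URLStreamHandler() {
            @Override
            protected URLConnection openConnection(URL u) throws IOException {
                throw new IOException("parse-only handler, cannot open " + u);
            }
        };
    }

    /** Parses spec with the factory's handler, without touching global JVM state. */
    static boolean parses(URLStreamHandlerFactory factory, String spec) {
        int colon = spec.indexOf(':');
        if (colon < 0) {
            return false;
        }
        try {
            // A null handler makes the constructor fall back to the JVM lookup.
            new URL(null, spec, factory.createURLStreamHandler(spec.substring(0, colon)));
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        // Installs the factory JVM-wide, so plain new URL("smb://...") parses afterwards.
        URL.setURLStreamHandlerFactory(new ParseOnlyFactorySketch("smb"));
        System.out.println(new URL("smb://nas/Documents").getHost());
    }
}
```

Note that URL.setURLStreamHandlerFactory() may be called at most once per JVM, 
so this approach conflicts with any other code that installs its own factory.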

>If the protocol "smb" is not supported you should find a message:
>  org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

>If you see a MalformedURLException the problem is located somewhere else.
>What is the exact error message and the full stack trace?

Here is what I get:
----8<----------------------------------------------------
Executing bin/crawl --index -D 
solr.server.url=http://172.17.0.9:8983/solr/nutch -D 
java.protocol.handler.pkgs=jcifs urls crawl 1
Injecting seed URLs
/nutch/bin/nutch inject crawl/crawldb urls
2017-09-18 19:25:20,324 INFO  org.apache.nutch.crawl.Injector - Injector: 
starting at 2017-09-18 19:25:20
2017-09-18 19:25:20,326 INFO  org.apache.nutch.crawl.Injector - Injector: 
crawlDb: crawl/crawldb
2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: 
urlDir: urls
2017-09-18 19:25:20,327 INFO  org.apache.nutch.crawl.Injector - Injector: 
Converting injected urls to crawl db entries.
2017-09-18 19:25:20,610 WARN  org.apache.hadoop.util.NativeCodeLoader - Unable 
to load native-hadoop library for your platform... using builtin-java classes 
where applicable
2017-09-18 19:25:22,620 INFO  org.apache.nutch.plugin.PluginRepository - 
Plugins: looking in: /nutch/plugins
2017-09-18 19:25:22,851 WARN  org.apache.nutch.plugin.PluginRepository - Error 
while loading plugin `/nutch/plugins/parse-replace/plugin.xml` 
java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such 
file or directory)
2017-09-18 19:25:22,904 WARN  org.apache.nutch.plugin.PluginRepository - Error 
while loading plugin `/nutch/plugins/plugin/plugin.xml` 
java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file 
or directory)
2017-09-18 19:25:22,956 WARN  org.apache.nutch.plugin.PluginRepository - Error 
while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` 
java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No 
such file or directory)
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - Plugin 
Auto-activation mode: [true]
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository - 
Registered Plugins:
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -        
Regex URL Filter (urlfilter-regex)
2017-09-18 19:25:23,052 INFO  org.apache.nutch.plugin.PluginRepository -        
Html Parse Plug-in (parse-html)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
HTTP Framework (lib-http)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
the nutch core extension points (nutch-extensionpoints)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Basic Indexing Filter (index-basic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Anchor Indexing Filter (index-anchor)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Tika Parser Plug-in (parse-tika)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Basic URL Normalizer (urlnormalizer-basic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Regex URL Filter Framework (lib-regex-filter)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Regex URL Normalizer (urlnormalizer-regex)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
CyberNeko HTML Parser (lib-nekohtml)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
OPIC Scoring Plug-in (scoring-opic)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Pass-through URL Normalizer (urlnormalizer-pass)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
SMB Protocol Plug-in (protocol-smb)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
Http Protocol Plug-in (protocol-http)
2017-09-18 19:25:23,053 INFO  org.apache.nutch.plugin.PluginRepository -        
File Protocol Plug-in (protocol-file)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
SolrIndexWriter (indexer-solr)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository - 
Registered Extension-Points:
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Content Parser (org.apache.nutch.parse.Parser)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch URL Filter (org.apache.nutch.net.URLFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2017-09-18 19:25:23,054 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Protocol (org.apache.nutch.protocol.Protocol)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2017-09-18 19:25:23,055 INFO  org.apache.nutch.plugin.PluginRepository -        
Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2017-09-18 19:25:23,131 INFO  
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules 
for scope 'inject', using default
2017-09-18 19:25:23,469 WARN  org.apache.nutch.crawl.Injector - Skipping 
smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
2017-09-18 19:25:23,473 INFO  
org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules 
for scope 'inject', using default
2017-09-18 19:25:23,771 INFO  org.apache.nutch.crawl.Injector - Injector: 
overwrite: false
2017-09-18 19:25:23,772 INFO  org.apache.nutch.crawl.Injector - Injector: 
update: false
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total 
urls rejected by filters: 2
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total 
urls injected after normalization and filtering: 0
2017-09-18 19:25:24,285 INFO  org.apache.nutch.crawl.Injector - Injector: Total 
urls injected but already in CrawlDb: 0
2017-09-18 19:25:24,286 INFO  org.apache.nutch.crawl.Injector - Injector: Total 
new urls injected: 0
2017-09-18 19:25:24,288 INFO  org.apache.nutch.crawl.Injector - Injector: 
finished at 2017-09-18 19:25:24, elapsed: 00:00:03
---------------------------------------->8----------------

Note that both the plugin and the extension point are listed, and yet there is 
still this warning line reporting a MalformedURLException.

Hiran
