Hello Sebastian,
>> Is it possible that the plugin lifecycle is broken or at least buggy?
>
>The Nutch plugin system is complex but in general a good idea
>(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely not
>broken, although there may be issues (e.g., the recently fixed NUTCH-2378).
>
> Regarding the protocol plugins: I haven't tried protocol-smb but other
> protocol plugins (protocol-file or protocol-ftp) use the same mechanism to
> register the supported protocol:
I'm afraid the file and ftp protocols are not good examples, as the Java
platform knows them out of the box.
I just tried this sample application:
----8<----------------------------------------------------
package test;

import java.net.URL;

public class Test {
    public static void main(String[] args) throws Exception {
        // Protocols the JVM knows out of the box
        new URL("http://foo/bar");
        new URL("https://foo/bar");
        new URL("file://foo/blar");
        new URL("ftp://foo/bar");
        // Protocols without a registered handler; the first one already throws
        new URL("smb://foo/bar");
        new URL("foo://bar/baz");
    }
}
---------------------------------------->8----------------
The output is, as expected, "Exception in thread "main"
java.net.MalformedURLException: unknown protocol: smb".
The smb protocol, as well as the foo protocol, needs a handler installed in the
JVM, for example by listing the handler's package prefix in the system property
java.protocol.handler.pkgs.
An example can be found at https://jcifs.samba.org/src/src/jcifs/Config.java,
in the method registerSmbURLHandler().
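To illustrate the point about JVM-level registration, here is a minimal sketch. Note that it uses URL.setURLStreamHandlerFactory() rather than the java.protocol.handler.pkgs property, only because that fits into a single file. The class SmbUrlSketch, the StubHandler and the helper methods are hypothetical names made up for this illustration. They are not part of jcifs or Nutch, and the stub cannot actually open SMB connections.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;

public class SmbUrlSketch {

    /** Hypothetical stand-in handler: lets smb:// URLs parse, but cannot open them. */
    static class StubHandler extends URLStreamHandler {
        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            throw new IOException("stub handler, no connection support: " + u);
        }
    }

    private static boolean installed = false;

    /** Returns true if the JVM accepts the given URL string. */
    static boolean parses(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    /**
     * Registers a JVM-wide factory that knows only "smb".
     * The JVM allows the factory to be set once, hence the guard.
     */
    static synchronized void install() {
        if (installed) {
            return;
        }
        URL.setURLStreamHandlerFactory(
                protocol -> "smb".equals(protocol) ? new StubHandler() : null);
        installed = true;
    }

    public static void main(String[] args) {
        System.out.println("smb before install: " + parses("smb://nas/Documents"));
        install();
        System.out.println("smb after install:  " + parses("smb://nas/Documents"));
        System.out.println("foo after install:  " + parses("foo://bar/baz"));
    }
}
```

Returning null from the factory for other protocols lets the JVM fall back to its built-in handlers, so http, https, file and ftp keep working. Whether a factory like this is an appropriate fix inside Nutch is a separate question that this sketch does not answer.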
>The plugin.xml defines the supported protocol:
>
> <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
> point="org.apache.nutch.protocol.Protocol">
> <implementation id="org.apache.nutch.protocol.smb.SMB"
> class="org.apache.nutch.protocol.smb.SMB">
> <parameter name="protocolName" value="smb" />
> </implementation>
> </extension>
>
>The check whether a protocol is supported by one of the registered plugins is
>done without any protocol plugin instantiated just using the plugin.xml.
My feeling is that, whether or not this check happens, at some point Nutch
calls the URL() constructor, which relies not on the PluginRepository but on
the JVM's own handler lookup, and that lookup is unaware of the new protocol.
>If the protocol "smb" is not supported you should find a message:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
>If you see a MalformedURLException the problem is located somewhere else.
>What is the exact error message and the full stack trace?
Here is what I get:
----8<----------------------------------------------------
Executing bin/crawl --index -D solr.server.url=http://172.17.0.9:8983/solr/nutch -D java.protocol.handler.pkgs=jcifs urls crawl 1
Injecting seed URLs
/nutch/bin/nutch inject crawl/crawldb urls
2017-09-18 19:25:20,324 INFO org.apache.nutch.crawl.Injector - Injector: starting at 2017-09-18 19:25:20
2017-09-18 19:25:20,326 INFO org.apache.nutch.crawl.Injector - Injector: crawlDb: crawl/crawldb
2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector: urlDir: urls
2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector: Converting injected urls to crawl db entries.
2017-09-18 19:25:20,610 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-09-18 19:25:22,620 INFO org.apache.nutch.plugin.PluginRepository - Plugins: looking in: /nutch/plugins
2017-09-18 19:25:22,851 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
2017-09-18 19:25:22,904 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
2017-09-18 19:25:22,956 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Plugin Auto-activation mode: [true]
2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Registered Plugins:
2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Html Parse Plug-in (parse-html)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - HTTP Framework (lib-http)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Basic Indexing Filter (index-basic)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - SMB Protocol Plug-in (protocol-smb)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - File Protocol Plug-in (protocol-file)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - SolrIndexWriter (indexer-solr)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Registered Extension-Points:
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2017-09-18 19:25:23,131 INFO org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2017-09-18 19:25:23,469 WARN org.apache.nutch.crawl.Injector - Skipping smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
2017-09-18 19:25:23,473 INFO org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2017-09-18 19:25:23,771 INFO org.apache.nutch.crawl.Injector - Injector: overwrite: false
2017-09-18 19:25:23,772 INFO org.apache.nutch.crawl.Injector - Injector: update: false
2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls rejected by filters: 2
2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
2017-09-18 19:25:24,286 INFO org.apache.nutch.crawl.Injector - Injector: Total new urls injected: 0
2017-09-18 19:25:24,288 INFO org.apache.nutch.crawl.Injector - Injector: finished at 2017-09-18 19:25:24, elapsed: 00:00:03
---------------------------------------->8----------------
Note that both the plugin and the extension point are listed, and yet the
Injector still logs a warning with a MalformedURLException for the smb URL.
Hiran