Hi Hiran,

ok, got it - the problem is already described in
https://issues.apache.org/jira/browse/NUTCH-427 :)

In this case, you're right. The plugin system wasn't designed to manipulate
Java system properties. But it should be possible to do it by adding a static
hook which is called before instantiation. The second problem would be the
class loader encapsulation: the class java.net.URL is used in many places and
the protocol handler (jcifs.smb.Handler) must be globally available.

But be pragmatic - protocol-smb will not make it into the "official" Nutch
package because of the LGPL license [1].

To make protocol-smb work for "your" Nutch package:

1. set the system property accordingly. If you use bin/nutch, modify it or
   pass the property via the environment variable
     export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
2. make sure that the jcifs jar is added as a global dependency
   - add it to ivy/ivy.xml
   - or copy it to runtime/local/lib/ (local mode for quick testing)
   (or alternatively copy jcifs/smb/Handler.java and its dependencies to
   your source tree)

Best,
Sebastian

[1] https://www.apache.org/legal/resolved.html#category-x

On 09/18/2017 09:42 PM, Hiran CHAUDHURI wrote:
> Hello Sebastian,
>
>>> Is it possible that the plugin lifecycle is broken or at least buggy?
>>
>> The Nutch plugin system is complex but in general a good idea
>> (https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely
>> not broken,
>> although there may be issues (e.g., the recently fixed NUTCH-2378).
>>
>> Regarding the protocol plugins: I haven't tried protocol-smb but other
>> protocol plugins
>> (protocol-file or protocol-ftp) use the same mechanism to register the
>> supported protocol:
>
> I'm afraid the protocols file and ftp are no good examples, as they are known
> to the Java platform out of the box.
> I just tried this sample application:
>
> ----8<----------------------------------------------------
> package test;
>
> import java.net.URL;
>
> public class Test {
>
>     public static void main(String[] args) throws Exception {
>         new URL("http://foo/bar");
>         new URL("https://foo/bar");
>         new URL("file://foo/blar");
>         new URL("ftp://foo/bar");
>         new URL("smb://foo/bar");
>         new URL("foo://bar/baz");
>     }
>
> }
> ---------------------------------------->8----------------
>
> The output is, as expected "Exception in thread "main"
> java.net.MalformedURLException: unknown protocol: smb".
> The smb protocol, as well as the foo protocol need to be installed in the JVM
> by setting the system property java.protocol.handler.pkgs.
> An example is visible on https://jcifs.samba.org/src/src/jcifs/Config.java,
> in the method registerSmbURLHandler().
>
>> The plugin.xml defines the supported protocol:
>>
>>   <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
>>              point="org.apache.nutch.protocol.Protocol">
>>     <implementation id="org.apache.nutch.protocol.smb.SMB"
>>                     class="org.apache.nutch.protocol.smb.SMB">
>>       <parameter name="protocolName" value="smb" />
>>     </implementation>
>>   </extension>
>>
>> The check whether a protocol is supported by one of the registered plugins
>> is done without any protocol plugin instantiated just using the plugin.xml.
>
> My feeling is that this check happens or does not happen, but at some point
> in time Nutch tries to run the URL() constructor, which does not rely on the
> PluginRepository but the JVM factory methods which are unaware of the new
> protocol.
>
>> If the protocol "smb" is not supported you should find a message:
>>   org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
>>
>> If you see a MalformedURLException the problem is located somewhere else.
>> What is the exact error message and the full stack trace?
>
> Here is what I get:
> ----8<----------------------------------------------------
> Executing bin/crawl --index -D solr.server.url=http://172.17.0.9:8983/solr/nutch -D java.protocol.handler.pkgs=jcifs urls crawl 1
> Injecting seed URLs
> /nutch/bin/nutch inject crawl/crawldb urls
> 2017-09-18 19:25:20,324 INFO org.apache.nutch.crawl.Injector - Injector: starting at 2017-09-18 19:25:20
> 2017-09-18 19:25:20,326 INFO org.apache.nutch.crawl.Injector - Injector: crawlDb: crawl/crawldb
> 2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector: urlDir: urls
> 2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector: Converting injected urls to crawl db entries.
> 2017-09-18 19:25:20,610 WARN org.apache.hadoop.util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2017-09-18 19:25:22,620 INFO org.apache.nutch.plugin.PluginRepository - Plugins: looking in: /nutch/plugins
> 2017-09-18 19:25:22,851 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No such file or directory)
> 2017-09-18 19:25:22,904 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/plugin/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file or directory)
> 2017-09-18 19:25:22,956 WARN org.apache.nutch.plugin.PluginRepository - Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml` java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No such file or directory)
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Registered Plugins:
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - HTTP Framework (lib-http)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - SMB Protocol Plug-in (protocol-smb)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository - File Protocol Plug-in (protocol-file)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - SolrIndexWriter (indexer-solr)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Registered Extension-Points:
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2017-09-18 19:25:23,131 INFO org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2017-09-18 19:25:23,469 WARN org.apache.nutch.crawl.Injector - Skipping smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
> 2017-09-18 19:25:23,473 INFO org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
> 2017-09-18 19:25:23,771 INFO org.apache.nutch.crawl.Injector - Injector: overwrite: false
> 2017-09-18 19:25:23,772 INFO org.apache.nutch.crawl.Injector - Injector: update: false
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls rejected by filters: 2
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls injected after normalization and filtering: 0
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector: Total urls injected but already in CrawlDb: 0
> 2017-09-18 19:25:24,286 INFO org.apache.nutch.crawl.Injector - Injector: Total new urls injected: 0
> 2017-09-18 19:25:24,288 INFO org.apache.nutch.crawl.Injector - Injector: finished at 2017-09-18 19:25:24, elapsed: 00:00:03
> ---------------------------------------->8----------------
>
> Mind the fact that both the plugin and the extension point are listed, and
> still there is this warning line with the hint for a MalformedURLException.
>
> Hiran
>
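
P.S. For illustration, a minimal sketch of what a static hook for the system property could look like. It only appends a package prefix to java.protocol.handler.pkgs (a list separated by "|") unless it is already listed. The class SmbHandlerHook and its methods are hypothetical and not part of Nutch or jcifs, and where Nutch would call such a hook is left open. Setting the property alone is not sufficient: an smb:// URL is accepted only if jcifs.smb.Handler can also be loaded when java.net.URL looks it up, which is the class loader problem mentioned in the reply.

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical helper, not part of Nutch or jcifs.
class SmbHandlerHook {

    private static final String PKGS_PROPERTY = "java.protocol.handler.pkgs";

    /**
     * Appends a package prefix (e.g. "jcifs") to the "|"-separated
     * java.protocol.handler.pkgs property unless it is already listed.
     * Intended to be called before smb:// URLs are constructed.
     */
    public static synchronized void register(String pkg) {
        String current = System.getProperty(PKGS_PROPERTY);
        if (current == null || current.isEmpty()) {
            System.setProperty(PKGS_PROPERTY, pkg);
            return;
        }
        for (String entry : current.split("\\|")) {
            if (entry.trim().equals(pkg)) {
                return; // already registered, keep the property unchanged
            }
        }
        System.setProperty(PKGS_PROPERTY, current + "|" + pkg);
    }

    /** Returns true if java.net.URL accepts the given URL string. */
    public static boolean canParse(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        register("jcifs");
        System.out.println(PKGS_PROPERTY + " = " + System.getProperty(PKGS_PROPERTY));
        // Whether this prints true depends on jcifs.smb.Handler being loadable.
        System.out.println("smb URL accepted: " + canParse("smb://nas/Documents"));
    }
}
```

Running main with and without the jcifs jar on the global classpath should show whether the handler is visible to java.net.URL, independent of the Nutch plugin system.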