Hi Hiran,
OK, got it. The problem is already described in
https://issues.apache.org/jira/browse/NUTCH-427 :)
In this case, you're right: the plugin system wasn't designed to manipulate
Java system properties.
But it should be possible to do this by adding a static hook which is called
before the plugin is instantiated.
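A minimal sketch of what such a static hook could look like, assuming it is
invoked before the first smb:// URL is constructed. The class SmbHandlerHook
and its methods are hypothetical (they exist neither in Nutch nor in jcifs);
only the system property java.protocol.handler.pkgs and its "|"-separated
format are standard Java:

```java
/**
 * Hypothetical static hook that adds the "jcifs" package to the JVM's
 * protocol handler search path. Not part of Nutch or jcifs.
 */
public final class SmbHandlerHook {

    /** Standard system property consulted by java.net.URL for handlers. */
    private static final String PKGS_PROPERTY = "java.protocol.handler.pkgs";

    private SmbHandlerHook() {
    }

    /**
     * Returns the property value with pkg appended ("|"-separated).
     * The value is returned unchanged if pkg is already listed.
     */
    public static String withPackage(String current, String pkg) {
        if (current == null || current.trim().isEmpty()) {
            return pkg;
        }
        for (String entry : current.split("\\|")) {
            if (entry.trim().equals(pkg)) {
                return current;
            }
        }
        return current + "|" + pkg;
    }

    /** Registers the jcifs handler package; safe to call more than once. */
    public static void register() {
        String current = System.getProperty(PKGS_PROPERTY);
        System.setProperty(PKGS_PROPERTY, withPackage(current, "jcifs"));
    }
}
```

Setting the property only tells the JVM where to look. The handler class
itself still has to be loadable, which is the second problem.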
The second problem would be the class loader encapsulation: the class
java.net.URL is used in many places, so the protocol handler
(jcifs.smb.Handler) must be globally available and not only visible to the
plugin's own class loader.
But be pragmatic: protocol-smb will not make it into the "official" Nutch
package because of the LGPL license [1]. To make protocol-smb work in "your"
Nutch package:
1. Set the system property accordingly. If you use bin/nutch, either modify
   the script or pass the property via the environment variable:
     export NUTCH_OPTS=-Djava.protocol.handler.pkgs=jcifs
2. Make sure that the jcifs jar is added as a global dependency:
   - add it to ivy/ivy.xml
   - or copy it to runtime/local/lib/ (local mode, for quick testing)
   (Alternatively, copy jcifs/smb/Handler.java and its dependencies
   to your source tree.)
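To check whether both steps took effect, a small stand-alone test along the
lines of your sample application may help. This is only a sketch: the class
name ProtocolCheck and the method canCreateUrl are made up for illustration,
and the default URL is the one from your log.

```java
import java.net.MalformedURLException;
import java.net.URL;

/** Hypothetical helper: reports whether the JVM accepts a given URL string. */
public final class ProtocolCheck {

    private ProtocolCheck() {
    }

    /**
     * Returns true if java.net.URL can be constructed from spec.
     * Returns false on any MalformedURLException, which includes
     * (but is not limited to) the "unknown protocol" case.
     */
    @SuppressWarnings("deprecation") // new URL(String) is deprecated in recent JDKs
    public static boolean canCreateUrl(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // URLs to test: command-line arguments, or the seed URL from the log.
        String[] specs = args.length > 0 ? args : new String[] {"smb://nas/Documents"};
        for (String spec : specs) {
            String result = canCreateUrl(spec) ? "handler found" : "rejected";
            System.out.println(spec + " -> " + result);
        }
    }
}
```

Run it with the same -Djava.protocol.handler.pkgs=jcifs option and the same
class path that Nutch uses. If the smb URL is still rejected there, the
problem is in the JVM setup rather than in the plugin.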
Best,
Sebastian
[1] https://www.apache.org/legal/resolved.html#category-x
On 09/18/2017 09:42 PM, Hiran CHAUDHURI wrote:
> Hello Sebastian,
>
>>> Is it possible that the plugin lifecycle is broken or at least buggy?
>>
>> The Nutch plugin system is complex but in general a good idea
>> (https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely
>> not broken,
>> although there may be issues (e.g., the recently fixed NUTCH-2378).
>>
>> Regarding the protocol plugins: I haven't tried protocol-smb but other
>> protocol plugins
>> (protocol-file or protocol-ftp) use the same mechanism to register the
>> supported protocol:
>
> I'm afraid the protocols file and ftp are no good examples, as they are known
> to the Java platform out of the box.
> I just tried this sample application:
>
> ----8<----------------------------------------------------
> package test;
>
> import java.net.URL;
>
> public class Test {
>
>     public static void main(String[] args) throws Exception {
>         new URL("http://foo/bar");
>         new URL("https://foo/bar");
>         new URL("file://foo/blar");
>         new URL("ftp://foo/bar");
>         new URL("smb://foo/bar");
>         new URL("foo://bar/baz");
>     }
>
> }
> ---------------------------------------->8----------------
>
> The output is, as expected, "Exception in thread "main"
> java.net.MalformedURLException: unknown protocol: smb".
> The smb protocol, as well as the foo protocol, needs to be installed in the
> JVM by setting the system property java.protocol.handler.pkgs.
> An example can be seen at https://jcifs.samba.org/src/src/jcifs/Config.java,
> in the method registerSmbURLHandler().
>
>> The plugin.xml defines the supported protocol:
>>
>> <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
>>            point="org.apache.nutch.protocol.Protocol">
>>   <implementation id="org.apache.nutch.protocol.smb.SMB"
>>                   class="org.apache.nutch.protocol.smb.SMB">
>>     <parameter name="protocolName" value="smb" />
>>   </implementation>
>> </extension>
>>
>> The check whether a protocol is supported by one of the registered plugins
>> is done without any protocol plugin instantiated just using the plugin.xml.
>
> My feeling is that, whether or not this check happens, at some point Nutch
> calls the URL() constructor, which does not rely on the PluginRepository but
> on the JVM factory methods, and those are unaware of the new protocol.
>
>> If the protocol "smb" is not supported you should find a message:
>> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb
>
>> If you see a MalformedURLException the problem is located somewhere else.
>> What is the exact error message and the full stack trace?
>
> Here is what I get:
> ----8<----------------------------------------------------
> Executing bin/crawl --index -D
> solr.server.url=http://172.17.0.9:8983/solr/nutch -D
> java.protocol.handler.pkgs=jcifs urls crawl 1
> Injecting seed URLs
> /nutch/bin/nutch inject crawl/crawldb urls
> 2017-09-18 19:25:20,324 INFO org.apache.nutch.crawl.Injector - Injector:
> starting at 2017-09-18 19:25:20
> 2017-09-18 19:25:20,326 INFO org.apache.nutch.crawl.Injector - Injector:
> crawlDb: crawl/crawldb
> 2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector:
> urlDir: urls
> 2017-09-18 19:25:20,327 INFO org.apache.nutch.crawl.Injector - Injector:
> Converting injected urls to crawl db entries.
> 2017-09-18 19:25:20,610 WARN org.apache.hadoop.util.NativeCodeLoader -
> Unable to load native-hadoop library for your platform... using builtin-java
> classes where applicable
> 2017-09-18 19:25:22,620 INFO org.apache.nutch.plugin.PluginRepository -
> Plugins: looking in: /nutch/plugins
> 2017-09-18 19:25:22,851 WARN org.apache.nutch.plugin.PluginRepository -
> Error while loading plugin `/nutch/plugins/parse-replace/plugin.xml`
> java.io.FileNotFoundException: /nutch/plugins/parse-replace/plugin.xml (No
> such file or directory)
> 2017-09-18 19:25:22,904 WARN org.apache.nutch.plugin.PluginRepository -
> Error while loading plugin `/nutch/plugins/plugin/plugin.xml`
> java.io.FileNotFoundException: /nutch/plugins/plugin/plugin.xml (No such file
> or directory)
> 2017-09-18 19:25:22,956 WARN org.apache.nutch.plugin.PluginRepository -
> Error while loading plugin `/nutch/plugins/publish-rabitmq/plugin.xml`
> java.io.FileNotFoundException: /nutch/plugins/publish-rabitmq/plugin.xml (No
> such file or directory)
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository -
> Plugin Auto-activation mode: [true]
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository -
> Registered Plugins:
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository -
> Regex URL Filter (urlfilter-regex)
> 2017-09-18 19:25:23,052 INFO org.apache.nutch.plugin.PluginRepository -
> Html Parse Plug-in (parse-html)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> HTTP Framework (lib-http)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> the nutch core extension points (nutch-extensionpoints)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Basic Indexing Filter (index-basic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Anchor Indexing Filter (index-anchor)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Tika Parser Plug-in (parse-tika)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Regex URL Filter Framework (lib-regex-filter)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> OPIC Scoring Plug-in (scoring-opic)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> SMB Protocol Plug-in (protocol-smb)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> Http Protocol Plug-in (protocol-http)
> 2017-09-18 19:25:23,053 INFO org.apache.nutch.plugin.PluginRepository -
> File Protocol Plug-in (protocol-file)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> SolrIndexWriter (indexer-solr)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> Registered Extension-Points:
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Content Parser (org.apache.nutch.parse.Parser)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch URL Filter (org.apache.nutch.net.URLFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
> 2017-09-18 19:25:23,054 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Publisher (org.apache.nutch.publisher.NutchPublisher)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Protocol (org.apache.nutch.protocol.Protocol)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch URL Ignore Exemption Filter (org.apache.nutch.net.URLExemptionFilter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Index Writer (org.apache.nutch.indexer.IndexWriter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
> 2017-09-18 19:25:23,055 INFO org.apache.nutch.plugin.PluginRepository -
> Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
> 2017-09-18 19:25:23,131 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find
> rules for scope 'inject', using default
> 2017-09-18 19:25:23,469 WARN org.apache.nutch.crawl.Injector - Skipping
> smb://nas/Documents:java.net.MalformedURLException: unknown protocol: smb
> 2017-09-18 19:25:23,473 INFO
> org.apache.nutch.net.urlnormalizer.regex.RegexURLNormalizer - can't find
> rules for scope 'inject', using default
> 2017-09-18 19:25:23,771 INFO org.apache.nutch.crawl.Injector - Injector:
> overwrite: false
> 2017-09-18 19:25:23,772 INFO org.apache.nutch.crawl.Injector - Injector:
> update: false
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector:
> Total urls rejected by filters: 2
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector:
> Total urls injected after normalization and filtering: 0
> 2017-09-18 19:25:24,285 INFO org.apache.nutch.crawl.Injector - Injector:
> Total urls injected but already in CrawlDb: 0
> 2017-09-18 19:25:24,286 INFO org.apache.nutch.crawl.Injector - Injector:
> Total new urls injected: 0
> 2017-09-18 19:25:24,288 INFO org.apache.nutch.crawl.Injector - Injector:
> finished at 2017-09-18 19:25:24, elapsed: 00:00:03
> ---------------------------------------->8----------------
>
> Note that both the plugin and the extension point are listed, and yet there
> is still this warning line pointing to a MalformedURLException.
>
> Hiran
>