Hi Hiran,
Your code calls setURLStreamHandlerFactory, the documentation for which says
"This method can be called at most once in a given Java Virtual Machine". Isn't
this going to be a problem?
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
Additionally, does this URLStreamHandlerFactory successfully load the standard
handlers (HTTP, HTTPS...)? I would expect it to fail on these.
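To make both questions concrete, here is a minimal sketch. It is not taken from
your commit: FallbackFactoryDemo, FooOnlyFactory, tryInstall and the foo://
protocol are hypothetical names for illustration only. As far as I understand
the JDK, whether the standard handlers still load depends on what the factory
returns for protocols it does not know: returning null should let URL fall
back to its built-in lookup, while a second call to setURLStreamHandlerFactory
throws java.lang.Error.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class FallbackFactoryDemo {

    /** Hypothetical factory: handles only "foo" and declines everything else. */
    static class FooOnlyFactory implements URLStreamHandlerFactory {
        @Override
        public URLStreamHandler createURLStreamHandler(String protocol) {
            if (!"foo".equals(protocol)) {
                // Returning null leaves http, https, file, ... to the JVM's own lookup.
                return null;
            }
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) throws IOException {
                    // Placeholder: a real plugin would open its connection here.
                    throw new IOException("foo:// is only a placeholder protocol");
                }
            };
        }
    }

    /** Returns true if the factory was installed, false if one was already set. */
    static boolean tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return true;
        } catch (Error alreadyDefined) {
            // The JDK signals "factory already defined" with a plain java.lang.Error.
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("first install:  " + tryInstall(new FooOnlyFactory()));
        System.out.println("second install: " + tryInstall(new FooOnlyFactory()));
        // Parsing only; no connection is opened.
        System.out.println(new URL("foo://bar/baz").getProtocol());
        System.out.println(new URL("http://example.org/").getProtocol());
    }
}
```

If anything else in the same JVM (another library, a container) has already set
a factory, the install fails, which is the situation I am worried about.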
To be able to create a pull request, your repository needs to be a fork of the
original repository, which does not seem to be the case here.
Yossi.
-----Original Message-----
From: Hiran CHAUDHURI [mailto:[email protected]]
Sent: 22 September 2017 11:54
To: [email protected]
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?
Hello all.
This time following up on my own post...
>>> When you look at the protocol-smb hook it comes with this static
>>> hook, but as it is never executed does not help.
>>
>>Yes, it has to be called.
>
>So when would Nutch call this static hook? In practice this does not happen
>before the plugin is required, but then it is too late as the
>MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.
It seems Nutch would never call this static hook. That is why I patched the
PluginRepository class.
>>> - create a tutorial to add some arbitrary protocol (e.g. the
>>> foo://bar/baz url)
>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>
>>> I'd be willing to do the latter but would like to see a less clumsy
>>> behaviour for plugins.
>>
>>Great! Nutch could not exist without voluntary work. Thanks!
>>
>>Sorry, that integration will not be that easy. The problem has indeed been
>>known for a long time and should have been better tested, see also [1] and
>>[2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has
>>been lost, you'll find it in the zip file attached to NUTCH-714.
>>
>>However, encapsulation and lazy instantiation I would not call "clumsy
>>behavior", it's useful for heavy-weight plugins (e.g., parse-tika which
>>brings 50 MB dependencies).
>
>Both concepts, encapsulation and lazy instantiation, are great. What I call
>clumsy is that the encapsulation does not work. Look at it from a user
>perspective of the protocol-smb plugin.
>It comes as a (set of) jars, together with an XML descriptor. This could be
>nicely wrapped in a zip file and thus is one artifact that can easily be
>versioned and distributed.
>
>But as soon as I want to install it, I have to
>1 - put the artifact into the plugins directory
>2 - modify Nutch configuration files to allow smb:// urls and add
>the plugin to the list of loaded plugins
>3 - extract jcifs.jar and place it on the system classpath
>4 - run nutch with the correct system property
>
>While items 1 and 2 can be understood easily and maybe one day come with a
>nice management interface, items 3 and 4 require knowledge about the internals
>of the plugin.
>Where did the encapsulation go? This is where I'd like to improve, and I have
>an idea how that could be established. Need to test it though.
I have a solution that makes steps 3 and 4 obsolete.
>I would need the first to test modifications to the plugin system.
>Then with the second I would create a smb plugin that would suffer
>other limitations than the LGPL. ;-)
So here is the solution to the first step - the modified plugin system. It is
available here, however I am not sure how to create the pull request...
https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28
Next will be one example plugin and the mentioned protocol-smb.
Hiran