Hi Hiran,

Your code calls setURLStreamHandlerFactory, the documentation for which says 
"This method can be called at most once in a given Java Virtual Machine". Isn't 
this going to be a problem? 
https://docs.oracle.com/javase/8/docs/api/java/net/URL.html#setURLStreamHandlerFactory-java.net.URLStreamHandlerFactory-
Additionally, does this URLStreamHandlerFactory successfully load the standard 
handlers (HTTP, HTTPS...)? I would expect it to fail on these.
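
To make the question concrete, here is a minimal sketch (not your code; the 
class name FactorySketch and the foo protocol are made up for illustration). 
As far as I understand java.net.URL, a factory that returns null for protocols 
it does not own lets the JVM fall back to its built-in handlers, while a second 
call to setURLStreamHandlerFactory is rejected with an Error:

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

public class FactorySketch implements URLStreamHandlerFactory {

    // Hypothetical protocol name, used only for this illustration.
    static final String CUSTOM_PROTOCOL = "foo";

    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
        if (CUSTOM_PROTOCOL.equals(protocol)) {
            return new URLStreamHandler() {
                @Override
                protected URLConnection openConnection(URL u) throws IOException {
                    // The sketch only makes the URL parseable; it cannot fetch anything.
                    throw new IOException("foo:// connections are not implemented in this sketch");
                }
            };
        }
        // Returning null hands the lookup back to java.net.URL's default
        // mechanism, which knows http, https, file and so on.
        return null;
    }

    public static void main(String[] args) throws Exception {
        URL.setURLStreamHandlerFactory(new FactorySketch());

        // Both URLs can be constructed: one via the factory, one via the fallback.
        System.out.println(new URL("foo://bar/baz").getProtocol());
        System.out.println(new URL("http://example.org/").getProtocol());

        try {
            // The JVM accepts only one factory; a second registration fails.
            URL.setURLStreamHandlerFactory(new FactorySketch());
        } catch (Error e) {
            System.out.println("second call rejected: " + e.getMessage());
        }
    }
}
```

If your factory behaves differently for unknown protocols (for example by 
throwing instead of returning null), the standard handlers would be affected.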

To be able to create a pull request, your repository needs to be a fork of the 
original repository, which does not seem to be the case here.

        Yossi.

-----Original Message-----
From: Hiran CHAUDHURI [mailto:[email protected]] 
Sent: 22 September 2017 11:54
To: [email protected]
Subject: RE: [EXT] Re: Nutch Plugin Lifecycle broken due to lazy loading?

Hello all.

This time following up on my own post...

>>> When you look at the protocol-smb hook it comes with this static 
>>> hook, but as it is never executed does not help.
>>
>>Yes, it has to be called.
>
>So when would Nutch call this static hook? In practice this does not happen 
>before the plugin is required, but then it is too late as the 
>MalformedURLException is thrown already.
>And this approach cannot cover the classpath issue.

It seems Nutch would never call this static hook. That is why I patched the 
PluginRepository class.
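
The failure itself can be reproduced outside Nutch. The following is a minimal 
sketch, not Nutch code (the class name UnknownProtocolSketch and the example 
URLs are made up). It assumes a plain JVM with no smb handler registered and 
no jcifs on the classpath; in that case the URL constructor rejects the smb 
URL before any plugin code gets a chance to run:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolSketch {

    // Returns true if the java.net.URL constructor accepts the given string.
    // For an otherwise well-formed URL this comes down to whether a stream
    // handler for its protocol can be found.
    static boolean isParseable(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("http: " + isParseable("http://example.org/"));
        System.out.println("smb:  " + isParseable("smb://host/share/file.txt"));
    }
}
```

A static hook inside the plugin only helps if something runs it before the 
first such URL is constructed.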

>>> - create a tutorial to add some arbitrary protocol (e.g. the 
>>> foo://bar/baz url)
>>> - modify the protocol-smb plugin to make use of the smbclient binary.
>>>
>>> I'd be willing to do the latter but would like to see a less clumsy 
>>> behaviour for plugins.
>>
>>Great! Nutch could not exist without voluntary work. Thanks!
>>
>>Sorry, that integration will not be that easy. The problem has indeed been 
>>known for a long time and should have been better tested, see also [1] and 
>>[2] - the class org.apache.nutch.protocol.sftp.Handler (a dummy handler) has 
>>been lost, you'll find it in the zip file attached to NUTCH-714.
>>
>>However, encapsulation and lazy instantiation I would not call "clumsy 
>>behavior", it's useful for heavy-weight plugins (e.g., parse-tika which 
>>brings 50 MB dependencies).
>
>Both concepts, encapsulation and lazy instantiation are great. What I call 
>clumsy is that the encapsulation does not work. Look at it from a user 
>perspective of the protocol-smb plugin.
>It comes as a (set of) jars, together with an XML descriptor. This could be 
>nicely wrapped in a zip file and thus is one artifact that can easily be 
>versioned and distributed.
>
>But as soon as I want to install it, I have to
>1 - put the artifact into the plugins directory
>2 - modify Nutch configuration files to allow smb:// urls and add 
>the plugin to the list of loaded plugins
>3 - extract jcifs.jar and place it on the system classpath
>4 - run nutch with the correct system property
>
>While items 1 and 2 can be understood easily and maybe one day come with a 
>nice management interface, items 3 and 4 require knowledge about the internals 
>of the plugin. 
>Where did the encapsulation go? This is where I'd like to improve, and I have 
>an idea how that could be established. Need to test it though.

I have a solution that makes steps 3 and 4 obsolete.

>I would need the first to test modifications to the plugin system.
>Then with the second I would create a smb plugin that would suffer 
>other limitations than the LGPL. ;-)

So here is the solution to the first step - the modified plugin system. It is 
available here, however I am not sure how to create the pull request...
https://github.com/HiranChaudhuri/nutch/commit/dc9cbeb3da7ca021e2cce322482d2eaa1ec15b28

Next will be one example plugin and the mentioned protocol-smb.

Hiran
