Hi Hiran,

> Is it possible that the plugin lifecycle is broken or at least buggy?

The Nutch plugin system is complex but in general a good idea
(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem). It's definitely not
broken, although there may be issues (e.g., the recently fixed NUTCH-2378).

Regarding the protocol plugins: I haven't tried protocol-smb, but other protocol
plugins (protocol-file or protocol-ftp) use the same mechanism to register the
supported protocol.

The plugin.xml defines the supported protocol:

  <extension id="org.apache.nutch.protocol.smb" name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="smb" />
    </implementation>
  </extension>

The check whether a protocol is supported by one of the registered plugins is
done using only the plugin.xml, without instantiating any protocol plugin.
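
To illustrate the idea (this is not the actual Nutch code, only a minimal
sketch using the JDK's XML parser): the supported protocol names can be read
from the plugin.xml alone, without loading any plugin class. The class name
PluginXmlProtocols, the method declaredProtocols and the simplified <plugin>
wrapper element are assumptions made for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

// Hypothetical helper, NOT part of Nutch: reads protocol names from a plugin.xml.
class PluginXmlProtocols {

  // Extension point used by protocol plugins (taken from the plugin.xml above).
  static final String PROTOCOL_POINT = "org.apache.nutch.protocol.Protocol";

  // Returns the values of all "protocolName" parameters declared by
  // extensions of the protocol extension point. No plugin class is loaded.
  static List<String> declaredProtocols(String pluginXml) {
    Document doc;
    try {
      doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
          .parse(new ByteArrayInputStream(pluginXml.getBytes(StandardCharsets.UTF_8)));
    } catch (Exception e) {
      throw new IllegalArgumentException("cannot parse plugin.xml", e);
    }
    List<String> protocols = new ArrayList<>();
    NodeList extensions = doc.getElementsByTagName("extension");
    for (int i = 0; i < extensions.getLength(); i++) {
      Element ext = (Element) extensions.item(i);
      if (!PROTOCOL_POINT.equals(ext.getAttribute("point"))) {
        continue; // some other extension point, e.g. a parser
      }
      NodeList params = ext.getElementsByTagName("parameter");
      for (int j = 0; j < params.getLength(); j++) {
        Element param = (Element) params.item(j);
        if ("protocolName".equals(param.getAttribute("name"))) {
          protocols.add(param.getAttribute("value"));
        }
      }
    }
    return protocols;
  }

  public static void main(String[] args) {
    // Simplified plugin.xml: only the extension element shown above,
    // wrapped in a bare <plugin> root element for the sake of the example.
    String xml =
        "<plugin>"
      + "<extension id=\"org.apache.nutch.protocol.smb\" name=\"SMBProtocol\""
      + " point=\"org.apache.nutch.protocol.Protocol\">"
      + "<implementation id=\"org.apache.nutch.protocol.smb.SMB\""
      + " class=\"org.apache.nutch.protocol.smb.SMB\">"
      + "<parameter name=\"protocolName\" value=\"smb\" />"
      + "</implementation>"
      + "</extension>"
      + "</plugin>";
    System.out.println(declaredProtocols(xml)); // prints [smb]
  }
}
```

If the list printed for your protocol-smb plugin.xml does not contain "smb",
the protocolName parameter is the first thing to check.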

If the protocol "smb" is not supported, you should find a message:
  org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

You can try this via (here for file:// URLs):

  # file:// not supported (ProtocolNotFound exception)
  bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' file://...
  Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound:
    protocol not found for url=file
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)

  # enable protocol-file and retry:
  bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-html' file://...

If you see a MalformedURLException, the problem is located somewhere else.
What is the exact error message and the full stack trace?
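
For comparison, a MalformedURLException is thrown by java.net.URL itself when
the JVM has no stream handler for the URL scheme, which is independent of the
ProtocolNotFound check above. A minimal sketch, assuming a plain JVM without
jcifs or any other smb handler registered (the class name UrlSchemeCheck and
the method tryParse are made up for this example):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical demo class, not part of Nutch.
class UrlSchemeCheck {

  // Returns "ok" if java.net.URL accepts the string, otherwise a short
  // description of the MalformedURLException that was thrown.
  static String tryParse(String spec) {
    try {
      new URL(spec);
      return "ok";
    } catch (MalformedURLException e) {
      return "MalformedURLException: " + e.getMessage();
    }
  }

  public static void main(String[] args) {
    // file: has a built-in handler in the JDK, so this parses fine.
    System.out.println(tryParse("file:///tmp/test.txt"));
    // smb: has no built-in handler; without a registered URLStreamHandler
    // the URL constructor rejects it.
    System.out.println(tryParse("smb://server/share"));
  }
}
```

So the two errors point to different places: ProtocolNotFound to the Nutch
plugin configuration, MalformedURLException to the URL handling in the JVM.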

Thanks,
Sebastian

On 09/14/2017 09:35 PM, Hiran CHAUDHURI wrote:
> Hello there.
> 
>  
> 
> Is it possible that the plugin lifecycle is broken or at least buggy?
> 
>  
> 
> I'm trying to setup Nutch 1.13 on Solr 6.6.1 such that it crawls the intranet.
> 
> That said, a lot of our documents are accessed via SMB, and to make the URLs
> in the search result actually clickable, I want to enable Nutch to fetch the
> documents via SMB/jcifs.
> 
>  
> 
> So first I configured Nutch to scan urls like smb://server/share.
> 
> Nutch writes into the logs that the smb protocol is unknown and therefore the
> url is skipped (yes, it already passed all the regex filters)
> 
> Then I installed the protocol-smb plugin from here: 
> https://issues.apache.org/jira/browse/NUTCH-427
> 
> Nutch confirms that protocol-smb is loaded on startup and registered in the 
> PluginRepository.
> 
> But right after that Nutch writes into the logs that the smb protocol is
> unknown and therefore the url is skipped....
> 
>  
> 
> So I was wondering what may have happened here and I went to check the plugin 
> source code.
> 
> It seems as soon as the protocol-smb plugin is instantiated, it writes a log
> message indicating this fact. Then it tries to register the SMB protocol
> URLHandler with the JVM and again writes a log message. I have not seen any
> of these two messages.
> 
>  
> 
> Then I checked the Nutch 1.13 source code, especially the PluginRepository
> class. It detects and successfully registers the plugins, and the code is
> commented as being sparse on resources by only instantiating plugins when
> they are required. So it is intentional that the protocol-smb plugin is
> registered but not instantiated. Which invokes a chicken-egg problem.
> 
>  
> 
> If the protocol plugin does not get instantiated, it cannot register its
> protocol. So although the plugin is registered, the smb://.... urls will
> throw MalformedURLExceptions.
> 
> And more generally speaking: Plugins are not able to initialize after being
> registered, only just before they are being loaded. My feeling is something
> is missing the plugin lifecycle....
> 
>  
> 
> Any ideas? Or should this post go to the developer's list?
> 
>  
> 
> Hiran
> 
>  
> 
>  
> 
> Hiran Chaudhuri
> Principal Support Engineer
> 
> Service Reliability Engineering - Custom
> 
> Amadeus Data Processing GmbH
> Berghamer Strasse 6
> 85435 Erding
> T: +49-8122-43x3662
> [email protected]
> http://amadeus.com
> 
>  
> 
