Hi Hiran,

> Is it possible that the plugin lifecycle is broken or at least buggy?

The Nutch plugin system is complex but in general a good idea
(https://wiki.apache.org/nutch/WhyNutchHasAPluginSystem).
It's definitely not broken, although there may be issues
(e.g., the recently fixed NUTCH-2378).

Regarding the protocol plugins: I haven't tried protocol-smb, but other
protocol plugins (protocol-file or protocol-ftp) use the same mechanism
to register the supported protocol. The plugin.xml defines the supported
protocol:

   <extension id="org.apache.nutch.protocol.smb"
              name="SMBProtocol"
              point="org.apache.nutch.protocol.Protocol">
      <implementation id="org.apache.nutch.protocol.smb.SMB"
                      class="org.apache.nutch.protocol.smb.SMB">
         <parameter name="protocolName" value="smb" />
      </implementation>
   </extension>

The check whether a protocol is supported by one of the registered plugins
is done using the plugin.xml alone, without instantiating any protocol
plugin. If the protocol "smb" is not supported, you should find a message:

   org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=smb

You can try this via (here for file:// URLs):

   # file:// not supported (ProtocolNotFound exception)
   bin/nutch parsechecker -Dplugin.includes='protocol-http|parse-html' file://...
   Exception in thread "main" org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=file
        at org.apache.nutch.protocol.ProtocolFactory.getProtocol(ProtocolFactory.java:85)
        at org.apache.nutch.parse.ParserChecker.run(ParserChecker.java:137)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.nutch.parse.ParserChecker.main(ParserChecker.java:268)

   # enable protocol-file and retry:
   bin/nutch parsechecker -Dplugin.includes='protocol-file|parse-html' file://...

If you see a MalformedURLException, the problem is located somewhere else.
What is the exact error message and the full stack trace?

Thanks,
Sebastian

On 09/14/2017 09:35 PM, Hiran CHAUDHURI wrote:
> Hello there.
>
> Is it possible that the plugin lifecycle is broken or at least buggy?
>
> I'm trying to set up Nutch 1.13 on Solr 6.6.1 such that it crawls the intranet.
>
> That said, a lot of our documents are accessed via SMB, and to make the URLs
> in the search results actually clickable, I want to enable Nutch to fetch the
> documents via SMB/jcifs.
>
> So first I configured Nutch to scan URLs like smb://server/share.
>
> Nutch writes into the logs that the smb protocol is unknown and therefore the
> URL is skipped (yes, it already passed all the regex filters).
>
> Then I installed the protocol-smb plugin from here:
> https://issues.apache.org/jira/browse/NUTCH-427
>
> Nutch confirms that protocol-smb is loaded on startup and registered in the
> PluginRepository.
>
> But right after that Nutch writes into the logs that the smb protocol is
> unknown and therefore the URL is skipped....
>
> So I was wondering what may have happened here and I went to check the plugin
> source code.
>
> It seems that as soon as the protocol-smb plugin is instantiated, it writes a
> log message indicating this fact. Then it tries to register the SMB protocol
> URLHandler with the JVM and again writes a log message. I have not seen either
> of these two messages.
>
> Then I checked the Nutch 1.13 source code, especially the PluginRepository
> class. It detects and successfully registers the plugins, and the code is
> commented as being sparing with resources by only instantiating plugins when
> they are required. So it is intentional that the protocol-smb plugin is
> registered but not instantiated. This creates a chicken-and-egg problem.
>
> If the protocol plugin does not get instantiated, it cannot register its
> protocol. So although the plugin is registered, the smb://.... URLs will throw
> MalformedURLExceptions.
>
> And more generally speaking: plugins are not able to initialize after being
> registered, only just before they are being loaded. My feeling is that
> something is missing in the plugin lifecycle....
>
> Any ideas? Or should this post go to the developer's list?
>
> Hiran
>
> Hiran Chaudhuri
> Principal Support Engineer
>
> Service Reliability Engineering - Custom
>
> Amadeus Data Processing GmbH
> Berghamer Strasse 6
> 85435 Erding
> T: +49-8122-43x3662
> [email protected]
> http://amadeus.com
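
To illustrate the point in the reply that protocol support is decided from the plugin.xml descriptor alone, here is a minimal Python sketch. It is not Nutch code and does not reproduce Nutch's actual lookup, which is done in Java. The function names `declared_protocols` and `find_protocol` are hypothetical, the `<plugin>` wrapper element is an assumption added so the quoted fragment parses on its own, and the error text only mimics the ProtocolNotFound message shown above.

```python
import xml.etree.ElementTree as ET

# The <extension> fragment quoted in the reply. The surrounding <plugin>
# root element is an assumption, added so the fragment parses standalone.
PLUGIN_XML = """
<plugin id="protocol-smb">
  <extension id="org.apache.nutch.protocol.smb"
             name="SMBProtocol"
             point="org.apache.nutch.protocol.Protocol">
    <implementation id="org.apache.nutch.protocol.smb.SMB"
                    class="org.apache.nutch.protocol.smb.SMB">
      <parameter name="protocolName" value="smb" />
    </implementation>
  </extension>
</plugin>
"""

PROTOCOL_POINT = "org.apache.nutch.protocol.Protocol"


def declared_protocols(plugin_xml):
    """Map each declared protocolName to its implementation class name.

    Only the descriptor text is read; nothing is loaded or instantiated.
    """
    root = ET.fromstring(plugin_xml.strip())
    found = {}
    for ext in root.iter("extension"):
        # Ignore extensions that target other extension points.
        if ext.get("point") != PROTOCOL_POINT:
            continue
        for impl in ext.iter("implementation"):
            for param in impl.iter("parameter"):
                if param.get("name") == "protocolName":
                    found[param.get("value")] = impl.get("class")
    return found


def find_protocol(scheme, protocols):
    """Return the class name registered for a URL scheme (hypothetical helper)."""
    if scheme not in protocols:
        raise LookupError("protocol not found for url=" + scheme)
    return protocols[scheme]


if __name__ == "__main__":
    protocols = declared_protocols(PLUGIN_XML)
    print(find_protocol("smb", protocols))
```

In this sketch a lookup for "file" raises LookupError, because only "smb" is declared in the descriptor. That is analogous to running parsechecker without protocol-file in plugin.includes.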

