Hi Hiran,

> From the log I have it seems the fetcher tries to resolve URLs
> before the PluginRepository is initialized.
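
The underlying JVM behaviour can be reproduced outside Nutch. The following is only a minimal sketch, not Nutch code: UrlFactoryDemo, FooHandler, FooFactory and the port 4242 are made-up names and values for illustration. It shows that java.net.URL rejects foo:// until a URLStreamHandlerFactory that knows the scheme is registered, and that a factory can be registered only once per JVM.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;

// Hypothetical demo class, not part of Nutch.
class UrlFactoryDemo {

  /** Minimal handler: enough for URL parsing, no real I/O. */
  static class FooHandler extends URLStreamHandler {
    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      throw new IOException("foo:// connections are not implemented in this sketch");
    }

    @Override
    protected int getDefaultPort() {
      return 4242; // arbitrary value, for illustration only
    }
  }

  /** Knows only the made-up "foo" scheme; null means "use the JVM default". */
  static class FooFactory implements URLStreamHandlerFactory {
    @Override
    public URLStreamHandler createURLStreamHandler(String protocol) {
      return "foo".equals(protocol) ? new FooHandler() : null;
    }
  }

  /** Returns true if java.net.URL accepts the given string. */
  static boolean canParse(String spec) {
    try {
      new URL(spec);
      return true;
    } catch (MalformedURLException e) {
      return false; // e.g. "unknown protocol: foo"
    }
  }

  /** Tries to register the factory; returns false if one is already set. */
  static boolean installFactory() {
    try {
      URL.setURLStreamHandlerFactory(new FooFactory());
      return true;
    } catch (Error e) {
      return false; // java.lang.Error: factory already defined
    }
  }

  public static void main(String[] args) {
    System.out.println(canParse("foo://example.com"));  // before registration
    System.out.println(installFactory());               // first registration
    System.out.println(canParse("foo://example.com"));  // after registration
    System.out.println(installFactory());               // second registration
    System.out.println(canParse("http://example.com")); // built-in scheme
  }
}
```

In a fresh JVM this should print false, true, true, false, true: the first URL is created too early, and the second call to setURLStreamHandlerFactory is refused.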

The Fetcher is highly concurrent: it may (and in fact has to) start feeding
the fetch queues before fetching can start. The PluginRepository is
initialized when the first plugin instance is required (one of the protocol
plugins).

We could instantiate the PluginRepository beforehand, e.g. in
NutchConfiguration.create(). However, it's not guaranteed that the
configuration is not changed afterwards. Indeed, that's done sometimes,
esp. in unit tests.

What's worse is that there are definitely two cases
 - in unit tests
 - in Nutch server
where more than one Configuration is used, every configuration with its own
PluginRepository! That contradicts the "one and only" JVM-wide
URLStreamHandlerFactory. When running the unit tests ("ant test") we
already get the exception

  Caused by: java.lang.Error: factory already defined
        at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)

I see two ways to go:

1. be pragmatic
   - instantiate PluginRepository in NutchConfiguration.create()
   - set this instance as URLStreamHandlerFactory in the static method
     PluginRepository.get(config) to make sure that the method
     URL.setURLStreamHandlerFactory(..) is called exactly once

   The default usage (one MapReduce job running in its own JVM) will work
   this way. Unit tests should be easily fixed. But yes, the Nutch server
   would require that protocol plugins stay the same. It's not really a
   problem, since it's easy to filter away undesired URLs using URL filters.

2. think of protocol handlers as static and more low-level, e.g., implement
   them all as org.apache.nutch.protocol.<protocol>.Handler and implement
   only the minimally required methods (e.g. getDefaultPort()).
   Plugins are dynamic but URLStreamHandlers are not - they cannot be
   changed.

What do you think?

Best,
Sebastian

On 09/23/2017 08:23 AM, Hiran CHAUDHURI wrote:
> When trying to run the example protocol-foo plugin (I am writing it), I was
> able to pass the injector and generator phases, but it seems the fetch phase
> fails.
>
> From the log I have it seems the fetcher tries to resolve URLs before the
> PluginRepository is initialized. Such behaviour would of course render the
> whole protocol plugins useless...
>
> So yes, the whole construct still needs to be tested carefully.
>
> 2017-09-23 08:13:06,783 INFO fetcher.FetchItemQueues - Using queue mode : byHost
> 2017-09-23 08:13:06,785 INFO fetcher.Fetcher - Fetcher: threads: 50
> 2017-09-23 08:13:06,785 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2017-09-23 08:13:06,836 INFO plugin.PluginRepository - Plugins: looking in: /home/hiran/dev/nutch/runtime/local/plugins
> 2017-09-23 08:13:06,845 WARN fetcher.FetchItem - Cannot parse url: foo://example.com
> java.net.MalformedURLException: unknown protocol: foo
>     at java.net.URL.<init>(URL.java:600)
>     at java.net.URL.<init>(URL.java:490)
>     at java.net.URL.<init>(URL.java:439)
>     at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:71)
>     at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:63)
>     at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:87)
>     at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:91)
> 2017-09-23 08:13:06,899 INFO fetcher.QueueFeeder - QueueFeeder finished: total 2 records + hit by time limit :0
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Registered Plugins:
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Foo Protocol Example Plug-in (protocol-foo)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - SolrIndexWriter (indexer-solr)
> 2
>