Hi Hiran,

> From the log I have it seems the fetcher tries to resolve URLs
> before the PluginRepository is initialized.

The Fetcher is highly concurrent: it may (and in fact has to) start feeding
the fetch queues before fetching can start. The PluginRepository is only
initialized when the first plugin instance is required (one of the protocol
plugins).

We could instantiate the PluginRepository beforehand, e.g. in
NutchConfiguration.create(). However, there is no guarantee that the
configuration is not changed afterwards. Indeed, that happens sometimes,
esp. in unit tests.

What's worse is that there are definitely two cases
 - in unit tests
 - in Nutch server
where more than one Configuration is used, every configuration with its own
PluginRepository! That's in contradiction with the "one and only" JVM-wide
URLStreamHandlerFactory.
When running the unit tests ("ant test") we already get the exception
  Caused by: java.lang.Error: factory already defined
        at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)
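
The JVM-wide restriction is easy to reproduce outside of Nutch. A minimal
sketch (the class name FactoryTwice and the do-nothing factory are made up
for illustration, this is not Nutch code): the second call to
URL.setURLStreamHandlerFactory(..) should fail with the Error shown above.

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class FactoryTwice {

  /** Tries to install a factory; returns null on success, else the Error message. */
  static String install(URLStreamHandlerFactory factory) {
    try {
      URL.setURLStreamHandlerFactory(factory);
      return null;
    } catch (Error e) {
      // thrown by java.net.URL if a factory has been set before in this JVM
      return e.getMessage();
    }
  }

  public static void main(String[] args) {
    // Hypothetical factory which knows no protocol at all: returning null
    // lets the JVM fall back to its built-in handlers (http, file, ...).
    URLStreamHandlerFactory none = protocol -> null;
    System.out.println("first call:  " + install(none));
    System.out.println("second call: " + install(none));
  }
}
```

Two PluginRepository instances in one JVM would hit the same wall if each of
them tried to register itself as the factory.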

I see two ways to go:

1. be pragmatic
   - instantiate PluginRepository in NutchConfiguration.create()
   - set this instance as URLStreamHandlerFactory in the static method
     PluginRepository.get(config) to make sure that the method
     URL.setURLStreamHandlerFactory(..) is called exactly once
   The default usage (one MapReduce job running in its own JVM)
   will work this way. Unit tests should be easily fixed.
   But yes, the Nutch server would require that the protocol plugins
   stay the same. That's not really a problem, since it's easy to
   filter out undesired URLs using URL filters.

2. think of protocol handlers as static and more low-level,
   e.g., implement them all as org.apache.nutch.protocol.<protocol>.Handler
   and implement only the minimally required methods (e.g. getDefaultPort()).
   Plugins are dynamic but URLStreamHandler-s are not: once registered
   they cannot be changed.
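
To make both options a bit more concrete, here is a rough sketch. All names
(HandlerSketch, FooHandler, OnceInstaller, the port 4711) are hypothetical
and nothing of this exists in Nutch. FooHandler shows what a minimal static
handler for option 2 could look like. OnceInstaller shows the "exactly once"
guard of option 1; the installing call is passed in, so that the sketch can
be tried without touching the JVM-wide factory. In Nutch it would presumably
be URL::setURLStreamHandlerFactory.

```java
import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

public class HandlerSketch {

  /** Option 2: static, minimal handler which only supports URL parsing. */
  static class FooHandler extends URLStreamHandler {
    @Override
    protected int getDefaultPort() {
      return 4711; // hypothetical default port of the example protocol
    }

    @Override
    protected URLConnection openConnection(URL u) throws IOException {
      // fetching is left to the protocol plugin, not to java.net.URL
      throw new IOException("not supported: " + u.getProtocol());
    }
  }

  /** Option 1: make sure the JVM-wide factory is installed at most once. */
  static class OnceInstaller {
    private final AtomicBoolean installed = new AtomicBoolean(false);
    private final Consumer<URLStreamHandlerFactory> installer;

    OnceInstaller(Consumer<URLStreamHandlerFactory> installer) {
      this.installer = installer;
    }

    /** Returns true if this call installed the factory, false if done before. */
    boolean install(URLStreamHandlerFactory factory) {
      if (!installed.compareAndSet(false, true)) {
        return false;
      }
      installer.accept(factory);
      return true;
    }
  }

  /** Parses a URL with an explicitly passed handler (no JVM-wide state involved). */
  static URL parse(String spec) {
    try {
      return new URL(null, spec, new FooHandler());
    } catch (MalformedURLException e) {
      throw new IllegalArgumentException(e);
    }
  }

  public static void main(String[] args) {
    URL u = parse("foo://example.com");
    System.out.println(u.getHost() + ", default port " + u.getDefaultPort());

    // Stand-in installer which only logs; see the note above.
    OnceInstaller once = new OnceInstaller(f -> System.out.println("installing factory"));
    URLStreamHandlerFactory factory =
        protocol -> "foo".equals(protocol) ? new FooHandler() : null;
    System.out.println(once.install(factory));
    System.out.println(once.install(factory));
  }
}
```

The guard only helps if all callers go through it. Any other code calling
URL.setURLStreamHandlerFactory(..) directly would still trigger the Error.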

What do you think?

Best,
Sebastian


On 09/23/2017 08:23 AM, Hiran CHAUDHURI wrote:
> When trying to run the example protocol-foo plugin (I am writing it), I was
> able to pass the injector and generator phases, but it seems the fetch phase
> fails.
> 
> From the log I have it seems the fetcher tries to resolve URLs before the
> PluginRepository is initialized. Such behaviour would of course render the
> whole protocol plugins useless...
> 
> So yes, the whole construct still needs to be tested carefully.
> 
> 2017-09-23 08:13:06,783 INFO  fetcher.FetchItemQueues - Using queue mode : byHost
> 2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: threads: 50
> 2017-09-23 08:13:06,785 INFO  fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2017-09-23 08:13:06,836 INFO  plugin.PluginRepository - Plugins: looking in: /home/hiran/dev/nutch/runtime/local/plugins
> 2017-09-23 08:13:06,845 WARN  fetcher.FetchItem - Cannot parse url: foo://example.com
> java.net.MalformedURLException: unknown protocol: foo
>         at java.net.URL.<init>(URL.java:600)
>         at java.net.URL.<init>(URL.java:490)
>         at java.net.URL.<init>(URL.java:439)
>         at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:71)
>         at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:63)
>         at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:87)
>         at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:91)
> 2017-09-23 08:13:06,899 INFO  fetcher.QueueFeeder - QueueFeeder finished: total 2 records + hit by time limit :0
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository - Registered Plugins:
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter (urlfilter-regex)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Html Parse Plug-in (parse-html)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         HTTP Framework (lib-http)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         the nutch core extension points (nutch-extensionpoints)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic Indexing Filter (index-basic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Anchor Indexing Filter (index-anchor)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Tika Parser Plug-in (parse-tika)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Filter Framework (lib-regex-filter)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         OPIC Scoring Plug-in (scoring-opic)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Http Protocol Plug-in (protocol-http)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         Foo Protocol Example Plug-in (protocol-foo)
> 2017-09-23 08:13:07,508 INFO  plugin.PluginRepository -         SolrIndexWriter (indexer-solr)
> 2
> 
