Hi Hiran,
> From the log I have it seems the fetcher tries to resolve URLs
> before the PluginRepository is initialized.
The Fetcher is highly concurrent: it may (and in fact has to) start feeding
the fetch queues before fetching can start. The PluginRepository is only
initialized when the first plugin instance is required (one of the protocol
plugins).
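
To illustrate what the QueueFeeder runs into, here is a minimal sketch in
plain Java (no Nutch classes; the class and method names are made up for
illustration). It assumes a JVM in which nothing has registered a handler for
the "foo" scheme yet, which is the situation before the PluginRepository is
initialized:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UnknownProtocolDemo {

    /** Returns true if java.net.URL can parse the given string right now. */
    static boolean canParse(String spec) {
        try {
            new URL(spec);
            return true;
        } catch (MalformedURLException e) {
            // Thrown e.g. with "unknown protocol: foo" when no handler
            // for the scheme is known to the JVM at this point.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("http parses: " + canParse("http://example.com"));
        System.out.println("foo parses:  " + canParse("foo://example.com"));
    }
}
```

The http URL parses because the JVM ships a handler for it; the foo URL does
not, regardless of whether a protocol-foo plugin is sitting in the plugins
folder.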
We could instantiate the PluginRepository beforehand, e.g. in
NutchConfiguration.create(). However, there is no guarantee that the
configuration is not changed afterwards. Indeed, that is sometimes done,
esp. in unit tests.
What's worse, there are definitely two cases
- in unit tests
- in Nutch server
where more than one Configuration is used, each configuration with its own
PluginRepository! That contradicts the "one and only" JVM-wide
URLStreamHandlerFactory.
When running the unit tests ("ant test") we already get the exception:
  Caused by: java.lang.Error: factory already defined
      at java.net.URL.setURLStreamHandlerFactory(URL.java:1112)
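
The same Error can be reproduced outside of Nutch. The following is only a
sketch with made-up names (FactoryOnceDemo, tryInstall); the one real API call
is URL.setURLStreamHandlerFactory, and the two do-nothing factories stand in
for two PluginRepository instances:

```java
import java.net.URL;
import java.net.URLStreamHandlerFactory;

public class FactoryOnceDemo {

    /**
     * Tries to install a factory. Returns null on success, or the message
     * of the Error thrown by the JVM if a factory is already installed.
     */
    static String tryInstall(URLStreamHandlerFactory factory) {
        try {
            URL.setURLStreamHandlerFactory(factory);
            return null;
        } catch (Error e) {
            // Catching Error is done here only to show the message.
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        // A factory returning null for every protocol defers to the
        // JVM's built-in handlers, so it changes nothing functionally.
        URLStreamHandlerFactory first = protocol -> null;
        URLStreamHandlerFactory second = protocol -> null;
        System.out.println("first:  " + tryInstall(first));
        System.out.println("second: " + tryInstall(second));
    }
}
```

Whichever call comes second in the JVM's lifetime fails, no matter which
Configuration it belongs to.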
I see two ways to go:

1. be pragmatic
   - instantiate PluginRepository in NutchConfiguration.create()
   - set this instance as URLStreamHandlerFactory in the static method
     PluginRepository.get(config) to make sure that the method
     URL.setURLStreamHandlerFactory(..) is called exactly once

   The default usage (one MapReduce job running in its own JVM)
   will work this way. Unit tests should be easily fixed.
   But yes, the Nutch server would require that protocol plugins
   stay the same. It's not really a problem, since it's easy to
   filter away undesired URLs using URL filters.

2. think of protocol handlers as static and more low-level,
   e.g., implement them all as org.apache.nutch.protocol.<protocol>.Handler
   and implement only the minimally required methods (e.g. getDefaultPort()).
   Plugins are dynamic but URLStreamHandler-s are not - they cannot
   be changed.
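
To make both options a bit more concrete, here is a rough sketch. It does not
use the real PluginRepository; all names (StreamHandlerSetupSketch,
FooHandler, installOnce, fooFactory) and the port number are made up. It only
shows a once-per-JVM guard around URL.setURLStreamHandlerFactory (option 1)
and a handler that implements nothing beyond what is needed to parse a URL
(option 2):

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLConnection;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.concurrent.atomic.AtomicBoolean;

public class StreamHandlerSetupSketch {

    /** Option 2 flavour: a static, minimal handler that only lets URLs parse. */
    static class FooHandler extends URLStreamHandler {
        @Override
        protected int getDefaultPort() {
            return 1234; // hypothetical default port for the example scheme
        }

        @Override
        protected URLConnection openConnection(URL u) throws IOException {
            // Assumption: actual fetching goes through the protocol plugin,
            // so java.net never needs to open a connection here.
            throw new IOException("not supported: " + u);
        }
    }

    private static final AtomicBoolean INSTALLED = new AtomicBoolean(false);

    /** Hypothetical factory that knows only the "foo" scheme. */
    static URLStreamHandlerFactory fooFactory() {
        return protocol -> "foo".equals(protocol) ? new FooHandler() : null;
    }

    /**
     * Option 1 flavour: install the factory at most once per JVM.
     * Returns true only for the call that actually installed it.
     */
    static boolean installOnce(URLStreamHandlerFactory factory) {
        if (!INSTALLED.compareAndSet(false, true)) {
            return false; // an earlier call already went through
        }
        URL.setURLStreamHandlerFactory(factory);
        return true;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(installOnce(fooFactory()));
        System.out.println(installOnce(fooFactory()));
        URL url = new URL("foo://example.com");
        System.out.println(url.getProtocol() + " " + url.getDefaultPort());
    }
}
```

Note that the guard only protects against repeated calls from our own code.
If some other library installs a factory first, the JVM still throws
"factory already defined".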
What do you think?
Best,
Sebastian
On 09/23/2017 08:23 AM, Hiran CHAUDHURI wrote:
> When trying to run the example protocol-foo plugin (I am writing it), I was
> able to pass the injector and generator phases, but it seems the fetch phase
> fails.
>
> From the log I have it seems the fetcher tries to resolve URLs before the
> PluginRepository is initialized. Such behaviour would of course render the
> whole protocol plugins useless...
>
> So yes, the whole construct still needs to be tested carefully.
>
> 2017-09-23 08:13:06,783 INFO fetcher.FetchItemQueues - Using queue mode : byHost
> 2017-09-23 08:13:06,785 INFO fetcher.Fetcher - Fetcher: threads: 50
> 2017-09-23 08:13:06,785 INFO fetcher.Fetcher - Fetcher: time-out divisor: 2
> 2017-09-23 08:13:06,836 INFO plugin.PluginRepository - Plugins: looking in: /home/hiran/dev/nutch/runtime/local/plugins
> 2017-09-23 08:13:06,845 WARN fetcher.FetchItem - Cannot parse url: foo://example.com
> java.net.MalformedURLException: unknown protocol: foo
>     at java.net.URL.<init>(URL.java:600)
>     at java.net.URL.<init>(URL.java:490)
>     at java.net.URL.<init>(URL.java:439)
>     at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:71)
>     at org.apache.nutch.fetcher.FetchItem.create(FetchItem.java:63)
>     at org.apache.nutch.fetcher.FetchItemQueues.addFetchItem(FetchItemQueues.java:87)
>     at org.apache.nutch.fetcher.QueueFeeder.run(QueueFeeder.java:91)
> 2017-09-23 08:13:06,899 INFO fetcher.QueueFeeder - QueueFeeder finished: total 2 records + hit by time limit :0
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Registered Plugins:
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - HTTP Framework (lib-http)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - Foo Protocol Example Plug-in (protocol-foo)
> 2017-09-23 08:13:07,508 INFO plugin.PluginRepository - SolrIndexWriter (indexer-solr)
> 2
>