Hi Hiran, great! And bringing the discussion back to @user - sorry, wrong reply button...

> Sounds as if it were not a real problem other than a convention.

Just a decision made without having the URL protocol handlers on the radar.

> I wish there were some guide how to write protocol plugins. But that is why I
> am creating this dummy - it might help document the minimum tasks for a
> plugin developer.

Yes, it's not part of
  https://wiki.apache.org/nutch/WritingPluginExample-1.2
Parse and indexing filter plugins are the most common ones.

Thanks for your work,
Sebastian

On 09/26/2017 10:36 PM, Hiran CHAUDHURI wrote:
> Hello Sebastian,
>
>> The only value is that there is no need to load it explicitly (no call
>> needed) - at the price, that it's harder to control when the loading is done.
>
> Sounds as if it were not a real problem other than a convention. Maybe we can
> find some way to automatically have the PluginRepository initialized.
>
>> For unit tests it's just handy to test the same method/class/plugin with
>> various configurations.
>> Nutch server allows to run multiple jobs in a single JVM. It's possible but
>> not necessarily a good idea to run jobs with different sets of plugins.
>
> Oh yes, unit tests. They usually run within the same JVM.
>
>>> Does the configuration differ that much?
>> Rarely and if we just shouldn't care because there is no way around with
>> JVM-wide URL handlers.
>
> This is what I see as the tricky part. If we'd like to have different
> PluginRepositories with different configuration, how would we find the
> correct one for instantiating the next URLStreamHandler?
>
>>>> We could instantiate the PluginRepository beforehand, e.g. in
>>>> NutchConfiguration.create().
>> That's the wrong place - too early, initialization must wait until
>> command-line arguments (properties set via -Dproperty=value) are processed.
>
> I will have to trust you here. Although I drilled into one part of nutch I do
> not have a full architecture overview.
>
>>> Since we register the PluginRepository in a 1:1 relationship with the
>>> JVM, this class should become a singleton I guess.
>> That was also my first thought, however for unit tests we need multiple
>> PluginRepository-s based on different configurations.
>>
>> I've tried it just with the first instance in
>>   https://github.com/sebastian-nagel/nutch/tree/NUTCH-2429
>>   https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2429
>> (feel free to pull or cherry-pick any of my commits!)
>>
>> ... fetching now fails for foo:// URLs because the protocol-foo is (by now)
>> only a dummy:
>
> That is good news already. :-)
>
> I fixed the issue that PluginRepository would have to be a singleton called
> URLStreamHandlerFactory. It keeps references to PluginRepository instances. I
> made them WeakReferences so the PluginRepository instances can get garbage
> collected if no longer needed.
>
> Then I applied you modification to NutchTool so the PluginRepository would
> get initialized in time for the fetch phase. This seems to work, after all I
> see the same exceptions you mention as the protocol-foo plugin is a dummy.
>
> I am trying to fill this gap now. I wish there were some guide how to write
> protocol plugins. But that is why I am creating this dummy - it might help
> document the minimum tasks for a plugin developer.
>
>> Some more things to do, esp. fix all 33 classes implementing
>> org.apache.hadoop.util.Tool but not org.apache.nutch.util.NutchTool. :)
>
> Sounds like we are getting somewhere. So far I tested running the crawl
> script. Enough errors there that needed to be fixed....
>
> Hiran
>
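
For readers following the thread, the weak-reference idea from the quoted message might look roughly like the Java sketch below. It is not the code from the NUTCH-2429 branch: `HandlerSource` is a hypothetical stand-in for PluginRepository, and `WeakHandlerFactory` and `register` are names made up for illustration. The only assumption about the JDK is the standard `java.net.URLStreamHandlerFactory` interface.

```java
import java.lang.ref.WeakReference;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for PluginRepository: resolves a protocol to a handler. */
interface HandlerSource {
    /** Returns a handler for the protocol, or null if this source has none. */
    URLStreamHandler handlerFor(String protocol);
}

/** JVM-wide singleton factory that only weakly references its handler sources. */
final class WeakHandlerFactory implements URLStreamHandlerFactory {
    private static final WeakHandlerFactory INSTANCE = new WeakHandlerFactory();

    // Weak references let a source be garbage collected once nothing else uses it.
    private final List<WeakReference<HandlerSource>> sources = new ArrayList<>();

    private WeakHandlerFactory() {}

    static WeakHandlerFactory getInstance() {
        return INSTANCE;
    }

    /** Registers a source; the caller must keep its own strong reference to it. */
    synchronized void register(HandlerSource source) {
        sources.add(new WeakReference<>(source));
    }

    @Override
    public synchronized URLStreamHandler createURLStreamHandler(String protocol) {
        Iterator<WeakReference<HandlerSource>> it = sources.iterator();
        while (it.hasNext()) {
            HandlerSource source = it.next().get();
            if (source == null) {
                it.remove(); // source was collected, drop the stale reference
                continue;
            }
            URLStreamHandler handler = source.handlerFor(protocol);
            if (handler != null) {
                return handler; // first registered source that knows the protocol wins
            }
        }
        return null; // unknown here: the JVM falls back to its built-in handlers
    }
}
```

Such a factory would be installed once with `URL.setURLStreamHandlerFactory(WeakHandlerFactory.getInstance())`, which the JVM permits only a single time. Note that "first registered source wins" does not answer the open question above of how to pick the correct repository when several configurations are alive at once.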

