Hi Hiran,

great! And bringing the discussion back to @user - sorry, wrong reply button...

> Sounds as if it were not a real problem other than a convention.

Just a decision made without having the URL protocol handlers on the radar.

> I wish there were a guide on how to write protocol plugins. But that is why I
> am creating this dummy - it might help document the minimum tasks for a
> plugin developer.

Yes, it's not part of
  https://wiki.apache.org/nutch/WritingPluginExample-1.2
Parse and indexing filter plugins are the most common ones.

Thanks for your work,
Sebastian


On 09/26/2017 10:36 PM, Hiran CHAUDHURI wrote:
> Hello Sebastian,
> 
>> The only value is that there is no need to load it explicitly (no call
>> needed) - at the price that it's harder to control when the loading is done.
> 
> Sounds as if it were not a real problem other than a convention. Maybe we can 
> find some way to automatically have the PluginRepository initialized.
> 
>> For unit tests it's just handy to test the same method/class/plugin with
>> various configurations.
>> Nutch server allows running multiple jobs in a single JVM.  It's possible but
>> not necessarily a good idea to run jobs with different sets of plugins.
> 
> Oh yes, unit tests. They usually run within the same JVM.
> 
>>> Does the configuration differ that much?
>> Rarely, and if it does we just shouldn't care, because there is no way
>> around it with JVM-wide URL handlers.
> 
> This is what I see as the tricky part. If we'd like to have different 
> PluginRepositories with different configuration, how would we find the 
> correct one for instantiating the next URLStreamHandler?
> 
>>>> We could instantiate the PluginRepository beforehand, e.g. in
>>>> NutchConfiguration.create().
>> That's the wrong place - too early, initialization must wait until
>> command-line arguments (properties set via -Dproperty=value) are processed.
> 
> I will have to trust you here. Although I drilled into one part of nutch I do 
> not have a full architecture overview.
> 
>>> Since we register the PluginRepository in a 1:1 relationship with the 
>>> JVM, this class should become a singleton I guess.
>> That was also my first thought, however for unit tests we need multiple
>> PluginRepository-s based on different configurations.
>>
>> I've tried it just with the first instance in
>>  https://github.com/sebastian-nagel/nutch/tree/NUTCH-2429
>>  https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2429
>> (feel free to pull or cherry-pick any of my commits!)
>>
>> ... fetching now fails for foo:// URLs because the protocol-foo is (by now)
>> only a dummy:
> 
> That is good news already. :-)
> 
> I fixed the issue that PluginRepository would have to be a singleton: the
> singleton is now a separate class called URLStreamHandlerFactory. It keeps
> references to PluginRepository instances. I made them WeakReferences so the
> PluginRepository instances can get garbage collected if no longer needed.
> 
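
A minimal sketch of the weak-reference idea described above, under stated
assumptions: `HandlerSource` is a hypothetical stand-in for PluginRepository
(not its real API), and `WeakHandlerFactory` is an illustrative name, not the
class in the branch. It only shows how a JVM-wide factory might consult several
repositories without keeping them alive.

```java
import java.lang.ref.WeakReference;
import java.net.URLStreamHandler;
import java.net.URLStreamHandlerFactory;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for PluginRepository; not the real Nutch API. */
interface HandlerSource {
    /** Returns a handler for the protocol, or null if no plugin provides one. */
    URLStreamHandler handlerFor(String protocol);
}

/** JVM-wide factory that holds its handler sources only weakly. */
final class WeakHandlerFactory implements URLStreamHandlerFactory {
    private static final WeakHandlerFactory INSTANCE = new WeakHandlerFactory();

    // Weak references let unused sources be garbage collected.
    private final List<WeakReference<HandlerSource>> sources = new ArrayList<>();

    private WeakHandlerFactory() {}

    static WeakHandlerFactory getInstance() {
        return INSTANCE;
    }

    synchronized void register(HandlerSource source) {
        sources.add(new WeakReference<>(source));
    }

    @Override
    public synchronized URLStreamHandler createURLStreamHandler(String protocol) {
        Iterator<WeakReference<HandlerSource>> it = sources.iterator();
        while (it.hasNext()) {
            HandlerSource source = it.next().get();
            if (source == null) {
                it.remove(); // the source was collected; drop the stale entry
                continue;
            }
            URLStreamHandler handler = source.handlerFor(protocol);
            if (handler != null) {
                return handler; // first registered source that knows the protocol wins
            }
        }
        return null; // null lets the JVM fall back to its built-in handlers
    }
}
```

Such a factory would be installed once with
`URL.setURLStreamHandlerFactory(WeakHandlerFactory.getInstance())`. The JVM
rejects a second call, which is why the factory, rather than each
PluginRepository, has to be the singleton. Picking the first live source is an
assumption of this sketch; it does not answer which repository is the
"correct" one when several are registered.
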
> Then I applied your modification to NutchTool so the PluginRepository would
> get initialized in time for the fetch phase. This seems to work; after all, I
> see the same exceptions you mention, as the protocol-foo plugin is a dummy.
> 
> I am trying to fill this gap now. I wish there were a guide on how to write
> protocol plugins. But that is why I am creating this dummy - it might help
> document the minimum tasks for a plugin developer.
> 
>> Some more things to do, esp. fix all 33 classes implementing
>> org.apache.hadoop.util.Tool but not org.apache.nutch.util.NutchTool. :)
> 
> Sounds like we are getting somewhere. So far I tested running the crawl 
> script. Enough errors there that needed to be fixed....
> 
> Hiran
> 
