Hello Lewis.
I had a look, and it seems the pages in PluginCentral cover questions
that were raised at some point in time, but they do not add up to
complete plugin documentation.
There are two lists of pages: Information about the plugin system/plugin
development, and a list of plugins you can download. While the latter
contains example plugins I'd expect the groundwork to be covered in the
first list. So let's look at that:
There is an introduction on why there is a plugin system at all, and a
page with general information about plugins. These two pages could
actually be merged into one high-level introduction.
The technical concepts page is where I'd expect more detail, like the
classloading principle and the life cycle of plugins. Also the means of
communicating with the outside world: how to access configuration data,
how to store temporary data, what to do with oversized content, and how
to handle fetch dates and 'not modified since' responses. It should
explain the difference between a resource that was 'not found' and one
that 'is gone', and whether the response to getRobotRules() should be
cached and, if so, for how long. This may end up as a common part
followed by specialized pages for each of the plugin types.
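To illustrate the 'not found' vs 'is gone' distinction, here is a minimal
Java sketch of the kind of mapping such a page could spell out. It uses
HTTP status codes only because they are widely known; the FetchOutcome
enum and the classify method are names I made up for illustration and are
not part of the Nutch API.

```java
import java.net.HttpURLConnection;

// Hypothetical sketch, not Nutch code: FetchOutcome and classify() are
// illustrative names only.
class FetchOutcomeSketch {

    enum FetchOutcome { SUCCESS, NOT_MODIFIED, NOT_FOUND, GONE, OTHER }

    // Maps an HTTP status code to a coarse outcome a crawler could act on.
    static FetchOutcome classify(int httpStatus) {
        switch (httpStatus) {
            case HttpURLConnection.HTTP_OK:            // 200
                return FetchOutcome.SUCCESS;
            case HttpURLConnection.HTTP_NOT_MODIFIED:  // 304: unchanged since last fetch
                return FetchOutcome.NOT_MODIFIED;
            case HttpURLConnection.HTTP_NOT_FOUND:     // 404: missing now, may appear later
                return FetchOutcome.NOT_FOUND;
            case HttpURLConnection.HTTP_GONE:          // 410: removed on purpose
                return FetchOutcome.GONE;
            default:
                return FetchOutcome.OTHER;
        }
    }

    public static void main(String[] args) {
        System.out.println(classify(404));
        System.out.println(classify(410));
    }
}
```

The point is that 'not found' and 'is gone' are separate outcomes, so a
crawler could treat them differently. What Nutch actually does with each
is exactly what I would like to see documented.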
The remaining pages describe special problems or tutorials applicable to
special cases: WritingPluginExample covers writing an IndexingFilter
and a ScoringFilter, which is not useful for someone looking into
Protocol plugins. The next page describes writing an indexing filter
(again?). PluginGotchas describes a problem during compilation. Well, I
never had one. And Tika? I'm trying my luck on Protocol plugins.
Thus I can say that PluginCentral does not cover the questions I have
raised so far.
Hiran
On 07.10.24 19:47, Lewis John McGibbney wrote:
Hi Hiran,
If you haven't already please take a look at
https://cwiki.apache.org/confluence/display/NUTCH/PluginCentral and see if any
of your questions are answered. If we need to augment the documentation then we
can do that. Please let us know if this is the case.
lewismc
On 2024/10/06 20:13:35 Hiran Chaudhuri wrote:
I was experimenting with a protocol plugin that connects to and
disconnects from the server for each and every request.
HTTP may be lightweight (or cached in the httpclient code), but other
protocols are not.
My code was ruthless about establishing and tearing down the
connections, but it looked very repetitive for getProtocolOutput and
getRobotRules.
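One way to remove that repetition while keeping deterministic teardown
would be a small helper built on try-with-resources. The following is
only a sketch: DemoConnection and withConnection are hypothetical names,
and the connection is a stand-in, not a real protocol client or a Nutch
class.

```java
import java.util.function.Function;

// Hypothetical stand-in for an expensive protocol connection.
class DemoConnection implements AutoCloseable {
    private final String host;
    private boolean open = true;

    DemoConnection(String host) { this.host = host; }

    String fetch(String path) {
        if (!open) throw new IllegalStateException("connection already closed");
        return host + path;
    }

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; }
}

class WithConnectionSketch {

    // Opens a connection, runs the action, and closes the connection
    // before returning, even if the action throws.
    static <T> T withConnection(String host, Function<DemoConnection, T> action) {
        try (DemoConnection conn = new DemoConnection(host)) {
            return action.apply(conn);
        }
    }

    public static void main(String[] args) {
        // Both call sites share the setup and teardown code.
        String page = withConnection("example.org", c -> c.fetch("/index.html"));
        String robots = withConnection("example.org", c -> c.fetch("/robots.txt"));
        System.out.println(page);
        System.out.println(robots);
    }
}
```

This still opens one connection per call, so it fixes only the
repetition, not the cost of reconnecting.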
Trying to make the functions reusable first of all meant losing full
control over the connection. No worries, they get garbage collected -
don't they?
Well, it seems these connections do get closed and gc'ed, but it takes
too long. In between, the fetcher hits problems and runs into grace
periods of 300 000 milliseconds. The total scan performs poorly
just because I tried to optimize the code. Which leads me to the next
question:
What is the plugin's life cycle? Is there one plugin instance per
server? One per URL? One per thread? Or one in total?
This scope defines whether I can make use of local variables, or
instance fields. Or is there some other mechanism where a plugin could
store data that should survive across the getProtocolOutput calls? Could
a plugin define which scope it wants to be in?
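To make the question concrete, this is roughly what I would like to do if
instance fields are safe to use. It is a sketch under assumptions I
cannot verify: that one plugin instance may be shared by several threads,
and that there is some point at which the plugin can clean up.
FakeConnection and ConnectionCacheSketch are hypothetical names, not
Nutch classes.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for an expensive protocol connection.
class FakeConnection implements AutoCloseable {
    final String host;
    private volatile boolean open = true;

    FakeConnection(String host) { this.host = host; }

    boolean isOpen() { return open; }

    @Override
    public void close() { open = false; }
}

// Keeps one connection per host in an instance field so that it can
// survive across calls. ConcurrentHashMap makes the lookup thread-safe;
// the connection itself would still need its own synchronization if
// several threads used it at the same time.
class ConnectionCacheSketch implements AutoCloseable {
    private final Map<String, FakeConnection> byHost = new ConcurrentHashMap<>();

    FakeConnection connectionFor(String host) {
        // Creates the connection on first use, reuses it afterwards.
        return byHost.computeIfAbsent(host, FakeConnection::new);
    }

    // Explicit teardown instead of waiting for the garbage collector.
    @Override
    public void close() {
        for (FakeConnection c : byHost.values()) {
            c.close();
        }
        byHost.clear();
    }

    public static void main(String[] args) {
        try (ConnectionCacheSketch cache = new ConnectionCacheSketch()) {
            FakeConnection first = cache.connectionFor("ftp.example.org");
            FakeConnection second = cache.connectionFor("ftp.example.org");
            System.out.println("reused: " + (first == second));
        }
    }
}
```

Whether something like this works depends entirely on the answer to the
scope question above, and on whether the plugin is ever told to shut
down.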