Hello everybody,
I am currently working on a requirement to index a website with
Nutch. It should be possible to exclude certain parts of a page from
the index by marking them in the HTML code (additional markup or a custom
attribute). Now I am wondering where I should start with my
implementation. I started reading the wiki and found the following possible
starting points for writing a custom plugin:
* Parser
* HTMLParseFilter
* IndexingFilter
All of these interfaces sound like they could work for me. The
documentation for the filter method of the IndexingFilter interface mentions
that it can manipulate a document that should be parsed (sounds good to me).
On the other hand, the Parser interface sounded reasonable at first too.
So please tell me whether I am heading in the right direction and which
interface/extension point I should choose.
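
To make the requirement more concrete, here is a minimal sketch of the
behaviour I am after. It uses only the JDK's DOM API and is not tied to any
Nutch extension point. The attribute name data-noindex, the class name and the
method names are placeholders made up for illustration, and the sketch assumes
well-formed XHTML input.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class ExcludeMarkedParts {
    // Hypothetical marker attribute; any custom attribute or tag would do.
    static final String MARKER = "data-noindex";

    // Recursively removes every descendant element that carries the marker.
    static void prune(Node node) {
        List<Node> toRemove = new ArrayList<>();
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node child = children.item(i);
            if (child instanceof Element && ((Element) child).hasAttribute(MARKER)) {
                toRemove.add(child);   // drop the whole marked subtree
            } else {
                prune(child);          // otherwise keep looking further down
            }
        }
        // Remove after iterating so the live NodeList is not modified mid-loop.
        for (Node n : toRemove) {
            node.removeChild(n);
        }
    }

    // Returns the text that would remain for indexing after pruning.
    static String indexableText(String xhtml) {
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xhtml)));
            prune(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent().trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Keep this.</p>"
            + "<div data-noindex=\"true\">Skip this.</div></body></html>";
        System.out.println(indexableText(page)); // prints "Keep this."
    }
}
```

Whichever extension point is the right one, I would expect the plugin to do
something like the prune step above on the parsed page before the text is
handed on for indexing.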
Thanks for your help in advance!
With kind regards,
Marcus