Hello everybody,

I am currently working on a requirement to index a website with Nutch. It should be possible to exclude certain parts of a page from the index by marking them in the HTML code (additional markup or a custom attribute). Now I am wondering where I should start with my implementation. I started reading the wiki and found the following possible starting points for writing a custom plugin:

   * Parser
   * HTMLParseFilter
   * IndexingFilter

All of these interfaces sound like they could work for me. The description of the filter method of the IndexingFilter interface mentions that it can manipulate a document that should be parsed, which sounds good to me. On the other hand, the Parser interface sounded reasonable at first too.
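To make the requirement more concrete, here is a rough sketch of the behaviour I am after, written against the plain JDK DOM API and independent of any Nutch interface. The attribute name data-noindex, the class name NoIndexSketch and both method names are only assumptions for illustration, and the sketch expects well-formed XHTML as input, which real pages often are not.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NoIndexSketch {
    // Hypothetical marker attribute; the name is an assumption, not a Nutch convention.
    static final String MARKER = "data-noindex";

    // Appends the text below 'node' to 'out', skipping every element
    // (and its whole subtree) that carries the marker attribute.
    static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute(MARKER)) {
            return; // marked as excluded: ignore this subtree entirely
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null; child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Parses a well-formed XHTML string and returns only the text
    // that should end up in the index.
    static String indexableText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><p>Keep this.</p>"
                + "<div data-noindex=\"true\"><p>Skip this.</p></div>"
                + "<p>Keep this too.</p></body></html>";
        System.out.println(indexableText(page)); // prints "Keep this. Keep this too."
    }
}
```

What I do not know is at which of the extension points above such a DOM walk (or something equivalent) would belong.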

So please tell me whether I am heading in the right direction and which interface/extension point I should choose.

Thanks for your help in advance!

With kind regards,
Marcus
