How to write a plugin to ignore certain parts of an HTML page?

Marcus Böhm Mon, 03 Jan 2011 09:50:58 -0800

Hello everybody,

I am currently working on a requirement to index a website with the help of Nutch. It should be possible to exclude certain parts of a page from the index by marking them in the HTML code (additional markup or a custom attribute). Now I am wondering where I should start with my implementation. I started reading the wiki and found the following possible starting points for writing a custom plugin:


   * Parser
   * HTMLParseFilter
   * IndexingFilter

All of these interfaces sound like they could work for me. The description of the IndexingFilter interface's filter method mentions that it can manipulate a document that is to be parsed, which sounds good to me. On the other hand, the Parser interface sounded reasonable at first too.

So please tell me whether I am heading in the right direction and which interface/extension point I should choose.
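
To make the requirement more concrete, here is a minimal sketch of the kind of filtering I have in mind. It deliberately uses only the standard Java DOM API and no Nutch classes, because which extension point should call it is exactly my question. The attribute name data-noindex and the class name NoIndexStripper are made up for illustration, and the demo input is well-formed XHTML so that the standard XML parser can read it; real pages would need an HTML-tolerant parser.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class NoIndexStripper {

    // Hypothetical marker attribute; any custom attribute or class name would do.
    static final String MARKER = "data-noindex";

    /** Recursively removes every element that carries the marker attribute. */
    static void strip(Node node) {
        NodeList children = node.getChildNodes();
        // Walk backwards: the NodeList is live, so removing a child
        // must not shift the indices we still have to visit.
        for (int i = children.getLength() - 1; i >= 0; i--) {
            Node child = children.item(i);
            if (child instanceof Element && ((Element) child).hasAttribute(MARKER)) {
                node.removeChild(child);
            } else {
                strip(child);
            }
        }
    }

    /** Parses well-formed XHTML, strips marked elements, returns the remaining text. */
    static String textAfterStrip(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            strip(doc.getDocumentElement());
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException("Could not parse input", e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body><p>Keep me.</p>"
                + "<div data-noindex=\"true\"><p>Skip me.</p></div></body></html>";
        System.out.println(textAfterStrip(page));
    }
}
```

If one of the extension points hands the plugin a DOM of the page before the text is extracted, a method like strip could presumably be applied there, but that is an assumption on my side.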


Thanks for your help in advance!

With kind regards,
Marcus
