Hello everybody,

I posted this question two weeks ago but have not received an answer. Could someone please have a look at it?

I just want to know where I should start. It would be great if someone could help me get started.


With best regards,
Marcus

-------- Original Message --------
Subject:        How to write a plugin to ignore certain parts of a HTML Page?
Date:   Mon, 03 Jan 2011 18:40:48 +0100
From:   Marcus Böhm <[email protected]>
Reply-To:       [email protected]
To:     [email protected]



Hello everybody,

I am currently working on a requirement to index a website with the
help of Nutch. It should be possible to exclude certain parts of a
page by marking them in the HTML code (additional markup, a custom
attribute, or a comment). Now I am wondering where I should start with my
implementation. I started reading the wiki and found the following possible
starting points for writing a custom plugin:

    * Parser
    * HTMLParseFilter
    * IndexingFilter

All of these interfaces sound like they could work for me. The
documentation of the filter method of the IndexingFilter interface
mentions that it can manipulate a document that should be parsed, which
sounds good to me. On the other hand, the Parser interface sounded
reasonable at first too.

So please tell me whether I am heading in the right direction and which
interface/extension point I should choose.
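
To make the requirement more concrete, here is a minimal sketch of the
kind of filtering I have in mind. It uses only the plain JDK DOM API and
no Nutch classes, so it does not show how a plugin would be wired in. The
attribute name "data-noindex" and the class and method names are
hypothetical, chosen only for illustration, and the sketch assumes the
page is well-formed XHTML.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class IgnoreMarkedParts {

    // Hypothetical marker attribute (an assumption, not a Nutch convention).
    public static final String MARKER = "data-noindex";

    // Recursively collects text, skipping any element that carries the marker.
    public static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && ((Element) node).hasAttribute(MARKER)) {
            return; // drop this element and everything below it
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String text = node.getNodeValue().trim();
            if (!text.isEmpty()) {
                if (out.length() > 0) {
                    out.append(' ');
                }
                out.append(text);
            }
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collectText(child, out);
        }
    }

    // Parses a well-formed XHTML string and returns the text that is not excluded.
    public static String visibleText(String xhtml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
        StringBuilder out = new StringBuilder();
        collectText(doc.getDocumentElement(), out);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body>"
                + "<p>Index this paragraph.</p>"
                + "<div data-noindex=\"true\"><p>Skip this navigation.</p></div>"
                + "<p>Index this one too.</p>"
                + "</body></html>";
        System.out.println(visibleText(page));
    }
}
```

For the sample page above, only the text of the two unmarked paragraphs
should remain. What I do not know is at which extension point such a
tree walk would belong.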

Thanks for your help in advance!

With kind regards,
Marcus


