Hi,
You should start here:
nutch-1.0/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
in this method:
public ParseResult getParse(Content content)
You will need to work with this variable: DocumentFragment root;
Let Nutch extract the title, outlinks, etc. first. After that, write a
method that walks the parsed HTML and deletes the unwanted sections,
then pass the cleaned-up root variable on to the parse filters:
// run filters on parse
ParseResult filteredParse = this.htmlParseFilters.filter(content,
        parseResult, metaTags, root);
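
The "delete unwanted sections" step could look roughly like the sketch below. It is only an illustration under assumptions: the class name DomPruner and the marker attribute data-noindex are hypothetical (neither is part of Nutch), and it uses nothing beyond the standard org.w3c.dom API. The attribute name your HTML parser actually reports may differ in case, so check that against your setup.

```java
import org.w3c.dom.Element;
import org.w3c.dom.Node;

/** Hypothetical helper, not part of Nutch: prunes marked elements from a DOM tree. */
public class DomPruner {

    // Assumed marker attribute; use whatever markup you add to your pages.
    static final String MARKER_ATTR = "data-noindex";

    /**
     * Removes every element below the given node that carries MARKER_ATTR,
     * together with its whole subtree. A DocumentFragment such as root is
     * itself a Node, so it can be passed in directly.
     */
    public static void prune(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            // Fetch the next sibling first, because removing the child detaches it.
            Node next = child.getNextSibling();
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && ((Element) child).hasAttribute(MARKER_ATTR)) {
                node.removeChild(child);
            } else {
                prune(child); // keep this node, but look inside it
            }
            child = next;
        }
    }
}
```

You would call DomPruner.prune(root) inside getParse before root is handed to the filters. Where exactly to place the call depends on whether the marked sections should also be left out of the extracted text, so check the order in which getParse extracts the text.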
Hope this helps,
Mehdi
> Date: Mon, 17 Jan 2011 15:06:50 +0100
> From: [email protected]
> To: [email protected]
> Subject: Fwd: How to write a plugin to ignore certain parts of a HTML Page?
>
> Hello everybody,
>
> I posted this question two weeks ago, but got no answer. Could
> someone please have a look at it?
>
> I just want to know where I should start. It would be great if someone
> could help me get started.
>
>
> With best regards,
> Marcus
>
> -------- Original Message --------
> Subject: How to write a plugin to ignore certain parts of a HTML Page?
> Date: Mon, 03 Jan 2011 18:40:48 +0100
> From: Marcus Böhm <[email protected]>
> Reply-To: [email protected]
> To: [email protected]
>
>
>
> Hello everybody,
>
> I am currently working on a requirement to index a website with
> Nutch. It should be possible to exclude certain parts of a page by
> marking them somehow in the HTML code (additional markup, a custom
> attribute, or a comment). Now I am wondering where I should start with
> my implementation. I started reading the wiki and found the following
> possible starting points for writing a custom plugin:
>
> * Parser
> * HTMLParseFilter
> * IndexingFilter
>
> All of these interfaces sound like they could work for me. The
> description of the IndexingFilter interface's filter method mentions
> that it can manipulate a document that should be parsed (sounds good to
> me). On the other hand, the Parser interface sounded reasonable at
> first too.
>
> So please tell me whether I am heading in the right direction and which
> interface/extension point I should choose.
>
> Thanks for your help in advance!
>
> With kind regards,
> Marcus
>
>
>