RE: How to write a plugin to ignore certain parts of a HTML Page?

a a Mon, 17 Jan 2011 07:59:23 -0800

Hi,

You have to start here:




nutch-1.0/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java



in this method:



public ParseResult getParse(Content content)



You have to use this variable: DocumentFragment root;



You should first let Nutch extract the title, outlinks, etc. After that,
write a method that walks the parsed HTML and deletes the unwanted
sections, then pass the cleaned root variable on to the parse filters:



    // run filters on parse
    ParseResult filteredParse = this.htmlParseFilters.filter(content, parseResult,
                                                             metaTags, root);
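
The cleanup method itself could look roughly like the sketch below. It is
only an illustration, not code from Nutch: the class name SectionStripper,
the method strip and the marker attribute "data-noindex" are made up for the
example. It uses nothing but the standard org.w3c.dom API, so it can be run
on its own; inside getParse you would call strip(root) just before the
filter call above.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.Node;

class SectionStripper {

    // Hypothetical marker attribute chosen for this example; Nutch does not define it.
    static final String MARKER = "data-noindex";

    /** Recursively removes every element below 'node' that carries the marker attribute. */
    static void strip(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling(); // remember the sibling before a possible removal
            if (child.getNodeType() == Node.ELEMENT_NODE
                    && ((Element) child).hasAttribute(MARKER)) {
                node.removeChild(child);        // drops the element together with its subtree
            } else {
                strip(child);                   // keep the node, but check its descendants
            }
            child = next;
        }
    }

    public static void main(String[] args) throws ParserConfigurationException {
        // Build a tiny stand-in for the DocumentFragment that the HTML parser produces.
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        DocumentFragment root = doc.createDocumentFragment();
        Element body = doc.createElement("body");
        root.appendChild(body);

        Element keep = doc.createElement("p");
        keep.setTextContent("indexed text");
        body.appendChild(keep);

        Element skip = doc.createElement("div");
        skip.setAttribute(MARKER, "true");
        skip.setTextContent("navigation, ads");
        body.appendChild(skip);

        strip(root);
        System.out.println(root.getTextContent()); // prints: indexed text
    }
}
```

If you prefer to mark sections with comments instead of an attribute, the
same traversal applies, but you would look for Node.COMMENT_NODE start and
end markers and remove the siblings between them.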



Hope this helps.


mehdi




> Date: Mon, 17 Jan 2011 15:06:50 +0100
> From: [email protected]
> To: [email protected]
> Subject: Fwd: How to write a plugin to ignore certain parts of a HTML Page?
> 
> Hello everybody,
> 
> I posted this question 2 weeks ago, but I got no answer. Could 
> someone please have a look at it?
> 
> I just want to know where I should start. It would be great if someone 
> could help me get started.
> 
> 
> with best regards,
> Marcus
> 
> -------- Original Message --------
> Subject:      How to write a plugin to ignore certain parts of a HTML Page?
> Date:         Mon, 03 Jan 2011 18:40:48 +0100
> From: Marcus Böhm <[email protected]>
> Reply-To:     [email protected]
> To:   [email protected]
> 
> 
> 
> Hello everybody,
> 
> I am currently working on a requirement to index a website with the
> help of Nutch. It should be possible to exclude certain parts of a
> page by marking them somehow in the HTML code (additional markup, a
> custom attribute, or a comment). Now I am wondering where I should start
> with my implementation. I started reading the Wiki and found the
> following possible starting points for writing a custom plugin:
> 
>      * Parser
>      * HTMLParseFilter
>      * IndexingFilter
> 
> All these interfaces sound somehow like they could work for me. The
> Interface IndexingFilter's method filter mentions that it can manipulate
> a document that should be parsed (sounds good to me). Otherwise the
> Interface Parser sounded reasonable at first too.
> 
> So please tell me if I am heading in the right direction and which
> Interface/Extension Point I should choose.
> 
> Thanks for your help in advance!
> 
> With kind regards,
> Marcus
> 
> 
>
