Changing html indexing content

Žygimantas Medelis Wed, 17 Nov 2010 06:35:26 -0800

Hi,

The web pages that I am indexing contain loads of links with anchor text.
While the links are needed to crawl the pages, the anchor text pollutes my
index, so I want to get rid of it. My first approach was to remove all a.href
elements from the HTML at parse time. This was a bad idea, since the crawler
then has no links to follow. Still working with a parse-time plugin (based on
parse-html), I:


1. get the DocumentFragment produced by NekoHTML parsing,
2. parse the same page again to strip all a.href elements,
3. call setTextContent on the document fragment (from step 1).

Yet the result is the same as with the first approach: all out-links are lost.
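
A likely reason step 3 loses the out-links is that DOM's setTextContent replaces all children of a node with a single text node, so the anchor elements disappear from the tree along with their text. An alternative, sketched below under stated assumptions, is to leave the DOM untouched and skip anchor subtrees only while collecting the text to be indexed. The class name AnchorTextStripper and its methods are hypothetical and not part of Nutch or parse-html. The sketch uses the JDK's XML parser in place of NekoHTML so that it runs on its own; how the resulting text would be wired into a parse plugin is not shown.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Nutch: collects page text while
// ignoring anchor text, without modifying the DOM.
public class AnchorTextStripper {

    /** Returns the text under node, skipping every <a> subtree. */
    public static String textWithoutAnchors(Node node) {
        StringBuilder sb = new StringBuilder();
        collect(node, sb);
        // Collapse runs of whitespace left behind by skipped anchors.
        return sb.toString().replaceAll("\\s+", " ").trim();
    }

    private static void collect(Node node, StringBuilder sb) {
        // Case-insensitive match, since an HTML parser may report
        // element names in upper case ("A").
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            return; // skip the anchor text; the element stays in the tree
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            sb.append(node.getNodeValue()).append(' ');
        }
        for (Node child = node.getFirstChild(); child != null;
                child = child.getNextSibling()) {
            collect(child, sb);
        }
    }

    /** Stand-in for the HTML parser: parses a well-formed XHTML string. */
    public static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Intro "
                + "<a href=\"/next\">read more</a> outro</p></body></html>");
        System.out.println(textWithoutAnchors(doc));
        // The anchor element is still present, so links can still be extracted.
        System.out.println(doc.getElementsByTagName("a").getLength());
    }
}
```

Because the tree is only read, never rewritten, whatever code extracts out-links from the same DocumentFragment afterwards still sees every a.href element.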

Probably the better place for this is an index-time plugin, but the objects I
get through its interface

NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum
datum, Inlinks inlinks)

do not allow modification of the NutchDocument content. Is there a way to do
this?

regards
