Hi,

The web pages that I am indexing contain lots of links with anchor text. The links are needed to crawl the pages, but the anchor text pollutes my index, so I want to get rid of it.

My first approach was to remove all a.href elements from the HTML at parse time. This was a bad idea, since the crawler then has no links to follow.

Still working with a parse-time plugin (based on parse-html), I now:

1. get the DocumentFragment produced by NekoHTML parsing,
2. parse the same page again to strip all a.href elements,
3. call setTextContent on the document fragment from step 1.

Yet the result is the same as with the first approach: all outlinks are lost.

Probably the better place for this is an index-time plugin, but the objects I get through its interface

NutchDocument filter(NutchDocument doc, Parse parse, Text url, CrawlDatum datum, Inlinks inlinks)

do not allow modification of the NutchDocument content.

Is there a way to do this?

Regards
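
P.S. To make the goal concrete, here is a minimal sketch of the behaviour I am after: the text is collected while skipping everything inside anchor elements, and the DOM itself is left untouched so the links can still be read from it. This is only an illustration under assumptions. The names AnchorTextStripper, collectText and collectLinks are made up and are not part of Nutch, and the JDK XML parser stands in for NekoHTML so that the example runs on its own.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper, not a Nutch class.
final class AnchorTextStripper {

    /** Appends the text below node to sb, ignoring everything inside <a> elements. */
    static void collectText(Node node, StringBuilder sb) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            return; // skip the anchor text; the element stays in the tree
        }
        if (node.getNodeType() == Node.TEXT_NODE) {
            String t = node.getNodeValue().trim();
            if (!t.isEmpty()) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(t);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectText(c, sb);
        }
    }

    /** Collects the href values of all <a> elements below node. */
    static void collectLinks(Node node, List<String> links) {
        if (node.getNodeType() == Node.ELEMENT_NODE
                && "a".equalsIgnoreCase(node.getNodeName())) {
            String href = ((Element) node).getAttribute("href");
            if (!href.isEmpty()) {
                links.add(href);
            }
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            collectLinks(c, links);
        }
    }

    /** Parses well-formed XHTML with the JDK parser (stand-in for NekoHTML). */
    static Document parse(String xhtml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xhtml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>Hello "
                + "<a href=\"next.html\">click here</a> world</p></body></html>");

        StringBuilder text = new StringBuilder();
        collectText(doc, text);

        List<String> links = new ArrayList<>();
        collectLinks(doc, links); // the DOM was never modified, so links survive

        System.out.println(text);  // prints: Hello world
        System.out.println(links); // prints: [next.html]
    }
}
```

In other words, I would like the indexed text to come from something like collectText, while the outlinks are still extracted from the unmodified document.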

