David Wallace wrote:
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
Note that this is only done when collecting anchor texts, not when
Hi all,
I've been grubbing around with Nutch for a while now, although I'm
still working with 0.7 code. I notice that when anchors are collected
for a document, they're made unique by domain and by anchor text.
I'm using Nutch for an intranet style search engine, on a single
site, so I don't
Hi,
did you try...
<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>
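As a rough illustration of what this setting controls, here is a small Python sketch. It is not Nutch's code: the function name collect_anchors, the ignore_internal_links parameter, and the example URLs are all made up for the example. It assumes that "internal" means "same host as the target page", and that anchors are then made unique per (source host, anchor text) pair as described earlier in the thread.

```python
from urllib.parse import urlparse


def collect_anchors(page_url, inlinks, ignore_internal_links=True):
    """Return anchor texts for page_url.

    inlinks is an iterable of (source_url, anchor_text) pairs.
    Hypothetical helper for illustration only; not part of Nutch.
    """
    page_host = urlparse(page_url).hostname
    seen = set()      # (source host, anchor text) pairs already kept
    anchors = []
    for source_url, text in inlinks:
        source_host = urlparse(source_url).hostname
        # Assumed meaning of the setting: drop links from the page's own host.
        if ignore_internal_links and source_host == page_host:
            continue
        # Assumed uniqueness rule: one anchor per (host, text) pair.
        key = (source_host, text)
        if key in seen:
            continue
        seen.add(key)
        anchors.append(text)
    return anchors


# Made-up links pointing at an intranet home page.
inlinks = [
    ("http://intranet.example/a", "Home"),
    ("http://intranet.example/b", "Home"),
    ("http://other.example/x", "Home"),
]
```

With the flag on, only the link from the other host survives. With it off, the two same-host "Home" anchors still collapse into one, which is why a single-site setup sees few anchors either way.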
Thank you Stefan, for your speedy response.
I have indeed changed that setting to false. However, that doesn't
deal with my problem. The offending method is getAnchors in
org.apache.nutch.db.WebDBAnchors, which is called from
org.apache.nutch.tools.FetchListTool. This method makes the array