Hi Harald,

> In particular I would like to see all incoming links for the document.
>
> Is it in principle possible to get incoming links in a NutchDocument for
> indexing or is this not even implemented?

It's implemented.

First, is index-anchor activated by property "plugin.includes"?

> I did not find in the code the place where a NutchDocument, as passed to
> IndexWriter.write() is created and filled.

This indexing filter plugin populates the index field "anchor",
cf. http://wiki.apache.org/nutch/IndexStructure.

Second, there are a couple of properties which affect how links and
anchors are stored in linkdb. Most important to check for your purpose:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored. This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>

Sebastian

On 04/28/2014 04:30 PM, Harald Kirsch wrote:
> After implementing my own org.apache.nutch.indexer.IndexWriter I check the
> data coming along and I only see
>
> url
> tstamp
> digest
> boost
> segment
> cache
> host
> title
> content
>
> In particular I would like to see all incoming links for the document.
>
> I think I call the indexer correctly, because the linkdb is given on the
> command line and I see in the logs:
>
> 2014-04-28 15:53:39,963 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: nutch-crawldata/linkdb
>
> I did not find in the code the place where a NutchDocument, as passed to
> IndexWriter.write() is created and filled.
>
> Is it in principle possible to get incoming links in a NutchDocument for
> indexing or is this not even implemented?
>
> Harald.
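For reference, the two checks from the reply might look roughly like this as overrides in conf/nutch-site.xml. This is only a sketch: the plugin list in "plugin.includes" is an assumed example, not taken from this thread, so keep whatever list your installation already uses and just make sure index-anchor is part of it. Setting db.ignore.internal.links to false follows from the property description quoted in the reply (links from the same host are otherwise dropped); the linkdb presumably has to be rebuilt before the change shows up in the indexed documents.

```xml
<!-- Sketch of overrides for conf/nutch-site.xml; values are assumptions, adapt to your setup. -->
<configuration>

  <!-- Assumed example plugin list: the point is only that index-anchor is included. -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>

  <!-- Keep links from the same host in the linkdb, so their anchors can reach the "anchor" field. -->
  <property>
    <name>db.ignore.internal.links</name>
    <value>false</value>
  </property>

</configuration>
```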

