After implementing my own org.apache.nutch.indexer.IndexWriter, I inspected the documents it receives and see only the following fields:

url
tstamp
digest
boost
segment
cache
host
title
content

In particular, I would like to see all incoming links for the document.

I believe I am invoking the indexer correctly, because the linkdb is given on the command line and the log shows:

2014-04-28 15:53:39,963 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: nutch-crawldata/linkdb

I could not find the place in the code where the NutchDocument that is passed to IndexWriter.write() is created and filled.

Is it possible in principle to get incoming links into a NutchDocument for indexing, or is this not implemented at all?

Harald.
