Re: Indexing documents with all incoming links

Sebastian Nagel Mon, 28 Apr 2014 12:18:02 -0700

Hi Harald,

> In particular I would like to see all incoming links for the document.
>
> Is it in principle possible to get incoming links in a NutchDocument for indexing,
> or is this not even implemented?
It's implemented.


First, is the plugin index-anchor activated via the property "plugin.includes"?
> I did not find in the code the place where a NutchDocument, as passed to
> IndexWriter.write(), is created and filled.
The indexing filter plugin index-anchor populates the index field "anchor",
cf. http://wiki.apache.org/nutch/IndexStructure.
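
A minimal sketch of what that could look like in conf/nutch-site.xml. The
plugin list below is only an assumed example, not your configuration; the
part that matters here is that the pattern matches "index-anchor". Since you
use your own IndexWriter, its plugin id would take the place of indexer-solr:

```xml
<!-- Sketch only: the plugin list is an assumed example, adapt it to your setup. -->
<property>
  <name>plugin.includes</name>
  <!-- "index-(basic|anchor)" is the part that activates index-anchor. -->
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
```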

Second, there are a couple of properties which affect how links and anchors are
stored in linkdb. Most important to check for your purpose:

<property>
  <name>db.ignore.internal.links</name>
  <value>true</value>
  <description>If true, when adding new links to a page, links from
  the same host are ignored.  This is an effective way to limit the
  size of the link database, keeping only the highest quality
  links.
  </description>
</property>
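
If you also want links from the same host to show up, an override along
these lines in conf/nutch-site.xml should do it (a sketch, assuming you do
want internal links and can live with a larger linkdb):

```xml
<!-- Sketch only: keeps links from the same host when the linkdb is built. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

A linkdb that was built while the value was still true will not contain the
internal links, so it would presumably have to be rebuilt (bin/nutch
invertlinks) before you run the indexer again.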


Sebastian

On 04/28/2014 04:30 PM, Harald Kirsch wrote:
> After implementing my own org.apache.nutch.indexer.IndexWriter I checked the
> incoming data and I only see
> 
> url
> tstamp
> digest
> boost
> segment
> cache
> host
> title
> content
> 
> In particular I would like to see all incoming links for the document.
> 
> I think I call the indexer correctly, because the linkdb is given on the
> command line and I see in the logs:
> 
> 2014-04-28 15:53:39,963 INFO  indexer.IndexerMapReduce - IndexerMapReduce: linkdb: nutch-crawldata/linkdb
> 
> I did not find in the code the place where a NutchDocument, as passed to
> IndexWriter.write(), is created and filled.
> 
> Is it in principle possible to get incoming links in a NutchDocument for indexing,
> or is this not even implemented?
> 
> Harald.
