I am trying to develop a news crawler and I want to prohibit it from wandering 
too far away from the seed list that I provide.
It seems like I should use the DepthScoringFilter, but I am having trouble 
getting it to work. After a few crawl cycles, every _depth_ metadata value is 
either 1 or 1000. Scores, meanwhile, vary from 0 to 1 and mostly don't look 
like depths.
I have added a scoring.depth.max property to nutch-site.xml:

<property>
  <name>scoring.depth.max</name>
  <value>3</value>
</property>

I have changed the plugin.includes list to contain scoring-depth instead of 
scoring-opic, and now it looks like this:

<property>
  <name>plugin.includes</name>
  
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
</property>

This is all using Nutch 1.12.

What do I need to do to limit the crawl frontier so it won't go more than N 
hops from the seed list, if that is possible?
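
To be concrete about what I mean by "N hops", here is a minimal sketch of the 
behavior I am after, in plain Java. It is not Nutch code and uses no Nutch API: 
the class name HopLimitedFrontier, the crawl method, and the in-memory outlinks 
map are all made up for illustration, and counting seeds as 0 hops is just my 
convention here (it may not match how the plugin numbers depths).

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Hypothetical illustration only; none of these names exist in Nutch.
public class HopLimitedFrontier {

    // Breadth-first walk over a link graph. Returns every URL that is
    // reachable within maxHops links of some seed, mapped to its minimum
    // hop count. Seeds are at 0 hops (an assumption of this sketch).
    public static Map<String, Integer> crawl(List<String> seeds,
                                             Map<String, List<String>> outlinks,
                                             int maxHops) {
        Map<String, Integer> hops = new HashMap<>();
        Queue<String> queue = new ArrayDeque<>();

        for (String seed : seeds) {
            // putIfAbsent returns null only the first time a URL is seen.
            if (hops.putIfAbsent(seed, 0) == null) {
                queue.add(seed);
            }
        }

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = hops.get(url);

            // A page at the limit is kept, but its outlinks are not followed.
            if (depth >= maxHops) {
                continue;
            }

            List<String> links = outlinks.getOrDefault(url, Collections.<String>emptyList());
            for (String next : links) {
                if (hops.putIfAbsent(next, depth + 1) == null) {
                    queue.add(next);
                }
            }
        }
        return hops;
    }
}
```

With a chain a -> b -> c -> d, seed a, and maxHops set to 2, I would expect a, b 
and c to end up in the frontier and d to be left out.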
