I am trying to develop a news crawler, and I want to keep it from wandering too far from the seed list that I provide. It seems like I should use the DepthScoringFilter, but I am having trouble getting it to work. After a few crawl cycles, every _depth_ metadata value is either 1 or 1000. Scores, meanwhile, range from 0 to 1 and mostly don't look like depths.

I have added a scoring.depth.max property to nutch-site.xml:

    <property>
      <name>scoring.depth.max</name>
      <value>3</value>
    </property>

I have changed the plugin.includes list to contain scoring-depth instead of opic, and now it looks like this:

    <property>
      <name>plugin.includes</name>
      <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-depth|urlnormalizer-(pass|regex|basic)</value>
    </property>

This is all using Nutch 1.12. What do I need to do to limit the crawl frontier so that it won't go more than N hops from the seed list, if that is possible?
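
To make the goal concrete, here is a minimal sketch of the behavior I am after. It is plain Java, not Nutch code: the class `DepthLimitedFrontier`, the method `reachable`, and the toy link graph are all hypothetical names made up for illustration, and counting seeds as depth 1 is an assumption on my part rather than something I have confirmed about the plugin.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class DepthLimitedFrontier {

    // Hypothetical helper, not part of Nutch: returns every URL reachable
    // from the seeds within maxDepth hops, mapped to its hop count.
    // Assumption: seeds themselves count as depth 1.
    public static Map<String, Integer> reachable(
            Map<String, List<String>> outlinks, List<String> seeds, int maxDepth) {
        Map<String, Integer> depth = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();

        for (String seed : seeds) {
            // putIfAbsent returns null only the first time a URL is seen.
            if (depth.putIfAbsent(seed, 1) == null) {
                queue.add(seed);
            }
        }

        // Breadth-first traversal, so the first depth recorded is the shortest.
        while (!queue.isEmpty()) {
            String url = queue.poll();
            int d = depth.get(url);
            if (d >= maxDepth) {
                continue; // do not follow outlinks past the limit
            }
            for (String target : outlinks.getOrDefault(url, List.of())) {
                if (depth.putIfAbsent(target, d + 1) == null) {
                    queue.add(target);
                }
            }
        }
        return depth;
    }

    public static void main(String[] args) {
        // Toy link graph: seed -> a -> b -> c
        Map<String, List<String>> web = Map.of(
                "seed", List.of("a"),
                "a", List.of("b"),
                "b", List.of("c"));
        System.out.println(reachable(web, List.of("seed"), 3));
    }
}
```

With a limit of 3 in this toy graph, `seed`, `a`, and `b` are kept and `c` is never added to the frontier. That is the effect I expected scoring.depth.max to have on the real crawl.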

