Hey all! For the past few days we have been playing around with Nutch, and today we encountered the following issue.
Our very simple test URL structure looks like this:

index.html -> 1.1.html -> 1.1.1.html -> 1.1.1.1.html -> 1.1.1.1.1.html

We start a crawl on index.html (it is the only page in the seed list) with a depth of 3. In this case the first three pages (index.html, 1.1.html and 1.1.1.html) are crawled and indexed, which is absolutely fine. Now we start a second crawl (recrawl) with the same depth and the same crawl db, and this time all of the pages (including 1.1.1.1.html and 1.1.1.1.1.html) are crawled. Nutch seems to also take the pages indexed during the first crawl (such as 1.1.1.html) as starting points for crawling.

In our case we would like to force Nutch to only ever crawl pages within a depth of 3 from the real seed page, which here is index.html. Is there any way to do this?

We have already tried passing the '-noAdditions' option to 'updatedb', as mentioned in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but with that only the first URL (index.html) is crawled. In addition, we are afraid that new URLs (for example, if we now add 1.2.html as a link on index.html) would then not be crawled either.

Thanks a lot in advance!

--
View this message in context: http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-tp4008320.html
Sent from the Nutch - User mailing list archive at Nabble.com.
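P.S. To make our observation concrete, here is a toy Python model of the crawl cycle as we understand it. This is our own sketch, not Nutch code: as far as we can tell, "depth" limits the number of generate/fetch/updatedb rounds rather than the link distance from the seed, which would explain why a recrawl that reuses the crawl db goes deeper.

```python
# Hypothetical link graph matching our test site (names are ours).
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}

def crawl(crawldb, fetched, depth):
    """One 'crawl' = `depth` generate/fetch/updatedb cycles.

    Mutates `crawldb` (known URLs) and `fetched` (already-fetched URLs)
    in place, mimicking a crawl db that is reused across crawls.
    """
    for _ in range(depth):
        for url in [u for u in crawldb if u not in fetched]:  # generate
            fetched.add(url)                                   # fetch + parse
            crawldb.update(LINKS[url])                         # updatedb adds outlinks

crawldb, fetched = {"index.html"}, set()
crawl(crawldb, fetched, depth=3)
first_crawl = set(fetched)   # index.html, 1.1.html, 1.1.1.html

# Recrawl with the same crawl db: 1.1.1.1.html is already known, so it
# is generated in the first round and its outlink in the second.
crawl(crawldb, fetched, depth=3)
second_crawl = set(fetched)  # all five pages

def pages_within(seed, depth):
    """The behavior we would like: pages within `depth` levels of the
    seed, regardless of how often the crawl is repeated."""
    frontier, seen = {seed}, {seed}
    for _ in range(depth - 1):
        frontier = {o for u in frontier for o in LINKS[u] if o not in seen}
        seen |= frontier
    return seen
```

Re-running `pages_within("index.html", 3)` always yields the same three pages, which is the idempotent recrawl behavior we are after.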

