Hi Alexandre,

The use of the term 'depth' in the crawl tool is very misleading. What it means is the number of rounds of generate/fetch/parse/update; it has nothing to do with the actual logical depth from a start seed.
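To illustrate the difference between rounds and logical depth, here is a toy simulation of the scenario from the quoted message. It is only a sketch and not Nutch code: `LINKS`, `crawl_rounds`, `new_db`, `fetched` and the `max_depth` parameter are hypothetical names made up for this example, and it assumes the seed counts as depth 1 and that already fetched pages are not due for a refetch.

```python
# Hypothetical link graph mirroring the chain in the quoted message.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}


def new_db():
    """Toy crawl db containing only the seed (assumed to be depth 1)."""
    return {"index.html": {"fetched": False, "depth": 1}}


def crawl_rounds(crawldb, rounds, max_depth=None):
    """Run `rounds` generate/fetch/parse/update cycles on the toy crawl db.

    With max_depth=None the only limit is the number of rounds.
    With max_depth set, outlinks deeper than max_depth are never added.
    """
    for _ in range(rounds):
        # generate: pick every known URL that has not been fetched yet
        batch = [url for url, entry in crawldb.items() if not entry["fetched"]]
        for url in batch:
            crawldb[url]["fetched"] = True  # fetch + parse
            child_depth = crawldb[url]["depth"] + 1
            if max_depth is not None and child_depth > max_depth:
                continue  # update: drop outlinks beyond the logical depth
            for link in LINKS.get(url, []):
                # update: add newly discovered URLs, keep existing entries
                crawldb.setdefault(link, {"fetched": False, "depth": child_depth})
    return crawldb


def fetched(crawldb):
    """Return the URLs that have been fetched so far."""
    return [url for url, entry in crawldb.items() if entry["fetched"]]


# Rounds only: the recrawl picks up where the first crawl stopped.
db = crawl_rounds(new_db(), rounds=3)
print(len(fetched(db)), "pages after the first crawl")
crawl_rounds(db, rounds=3)
print(len(fetched(db)), "pages after the recrawl")

# Logical depth limit: the recrawl finds nothing new beyond depth 3.
limited = crawl_rounds(new_db(), rounds=3, max_depth=3)
crawl_rounds(limited, rounds=3, max_depth=3)
print(len(fetched(limited)), "pages with a depth limit of 3")
```

In the first run the crawl db still holds 1.1.1.1.html as an unfetched entry after three rounds, so the next three rounds simply continue from there. Recording a depth per URL, as in the second run, is one way to express a limit relative to the seed; the actual patch may well do this differently.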
You can limit the depth of a crawl using the patch from https://issues.apache.org/jira/browse/NUTCH-1331. BTW, I'd use the new script in the SVN trunk instead of the all-in-one crawl command, as it gives more control and a better understanding of what happens.

HTH

Julien

On 17 September 2012 15:06, Alexandre <[email protected]> wrote:

> Hey all!
>
> We have been playing around with Nutch for a few days. Today we
> encountered the following issue.
>
> Our very simple test "URL structure" looks like this:
> index.html -> 1.1.html -> 1.1.1.html -> 1.1.1.1.html -> 1.1.1.1.1.html
>
> We start a crawl on index.html (index.html is the only page in the seed
> list) with a depth of 3. In this case the first three pages (index.html,
> 1.1.html and 1.1.1.html) are crawled and indexed, which is absolutely fine.
> Now we start a second crawl (recrawl) with the same depth and crawl db, and
> in this case all of the pages (including 1.1.1.1.html and 1.1.1.1.1.html)
> are crawled. Nutch seems to also take the indexed pages from the first
> crawl (like 1.1.1.html) as a starting point for crawling.
>
> In our case we'd like to force Nutch to always crawl only pages within a
> depth of 3 from the real seed page, which is index.html in this case. Is
> there any way to do this?
>
> We have already tried to use the '-noAdditions' option to 'updatedb' as
> mentioned in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but
> the result is that only the first URL (index.html) is crawled.
> In addition, we are afraid that new URLs (for example if we now add
> 1.2.html as a link to index.html) would not be crawled either.
>
> Thanks a lot in advance!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-tp4008320.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble

