Salut Alexandre,

The use of the term 'depth' in the crawl tool is very misleading. What it
means is the number of rounds of generate/fetch/parse/update; it has nothing
to do with the actual logical depth from a start seed.

You can limit the depth of a crawl using the patch from
https://issues.apache.org/jira/browse/NUTCH-1331.
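
To make the difference between rounds and logical depth concrete, here is a
rough toy model in Python. It is not Nutch code: the LINKS table, the crawl()
function and its max_depth parameter are all made up for illustration, and
max_depth only reflects my assumption of what a depth limit would do, not
how the NUTCH-1331 patch is actually implemented.

```python
# Toy model only, not Nutch code. LINKS mirrors the test chain from the
# original question; crawl() and max_depth are hypothetical names.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}


def crawl(db, rounds, max_depth=None):
    """Run `rounds` generate/fetch/parse/update cycles, updating db in place.

    db maps url -> {"depth": int, "fetched": bool}; the seed has depth 1.
    Returns the sorted list of URLs fetched so far.
    """
    for _ in range(rounds):
        # generate: select every known URL that has not been fetched yet
        batch = [url for url, entry in db.items() if not entry["fetched"]]
        for url in batch:
            entry = db[url]
            entry["fetched"] = True  # fetch + parse
            child_depth = entry["depth"] + 1
            if max_depth is not None and child_depth > max_depth:
                continue  # assumed depth limit: drop outlinks that are too deep
            # update: add newly discovered outlinks to the db
            for link in LINKS.get(url, []):
                db.setdefault(link, {"depth": child_depth, "fetched": False})
    return sorted(url for url, entry in db.items() if entry["fetched"])


# Two crawls of 3 rounds each against the same db, without a depth limit.
db = {"index.html": {"depth": 1, "fetched": False}}
print(len(crawl(db, 3)))  # 3 pages after the first crawl
print(len(crawl(db, 3)))  # 5 pages after the recrawl

# The same two crawls with a logical depth limit of 3.
limited = {"index.html": {"depth": 1, "fetched": False}}
crawl(limited, 3, max_depth=3)
print(len(crawl(limited, 3, max_depth=3)))  # still 3 pages
```

Without the limit, the second run simply continues from the unfetched URLs
left in the crawl db, which matches what you are seeing on the recrawl.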

BTW I'd use the new crawl script in the SVN trunk instead of the all-in-one
crawl command, as it gives more control and a better understanding of what
happens.

HTH

Julien

On 17 September 2012 15:06, Alexandre <[email protected]> wrote:

> Hey all!
>
> For a few days now we have been playing around a bit with Nutch. Today we
> encountered the following issue.
>
> Our very simple test "URL structure" looks like this:
> index.html  ->  1.1.html  ->  1.1.1.html  ->  1.1.1.1.html  ->
> 1.1.1.1.1.html
>
> We start a crawl on the index.html (index.html is the only page in the seed
> list) with a depth of 3. In this case the first three pages (index.html,
> 1.1.html and 1.1.1.html) are crawled and indexed, which is absolutely fine.
> Now we start a second crawl (recrawl) with the same depth and crawl db and
> in this case all of the pages (including 1.1.1.1.html and 1.1.1.1.1.html)
> are crawled. Nutch seems to take the indexed pages from the first crawl
> (like 1.1.1.html) also as a starting point for crawling.
>
> In our case we'd like to force Nutch to always just crawl stuff within a
> depth of 3 from the real seed page, which is index.html in this case. Is
> there any possible way to do this?
>
> We have already tried to use the '-noAdditions' option to 'updatedb' as
> mentioned in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but
> this results in only the first URL (index.html) being crawled.
> In addition, we are afraid that new URLs (for example if we now add 1.2.html
> as a link in index.html) would also not be crawled.
>
> Thanks a lot in advance!
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-tp4008320.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>



-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble