Absolute depth for recrawling

Alexandre Mon, 17 Sep 2012 07:06:58 -0700

Hey all!

We have been experimenting with Nutch for a few days, and today we ran into
the following issue.


Our very simple test "URL structure" looks like this:
index.html  ->  1.1.html  ->  1.1.1.html  ->  1.1.1.1.html  ->  1.1.1.1.1.html

We start a crawl on index.html (the only page in the seed list) with a
depth of 3. In this case the first three pages (index.html, 1.1.html and
1.1.1.html) are crawled and indexed, which is absolutely fine. Now we start
a second crawl (recrawl) with the same depth and the same crawl db, and
this time all of the pages (including 1.1.1.1.html and 1.1.1.1.1.html) are
crawled. Nutch seems to also use the pages indexed during the first crawl
(like 1.1.1.html) as starting points for crawling.
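
To make the observed behaviour concrete, here is a toy Python model of it.
It is not Nutch code, only our assumption of what happens: the function
crawl() and the names crawl_db and fetched are made up for illustration,
and the model ignores refetching of pages that are already known.

```python
# Toy model of the behaviour described above; this is NOT Nutch code.
# Each page of the test site links only to the next page in the chain.
LINKS = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
    "1.1.1.1.1.html": [],
}


def crawl(crawl_db, fetched, depth):
    """Run `depth` generate/fetch/update rounds against a shared crawl db.

    crawl_db: set of all known URLs (mutated in place)
    fetched:  set of URLs fetched so far (mutated in place)
    Returns the URLs fetched during this run, in fetch order.
    """
    fetched_now = []
    for _ in range(depth):
        # "generate": every known URL that has not been fetched yet
        batch = sorted(crawl_db - fetched)
        for url in batch:
            fetched.add(url)
            fetched_now.append(url)
            # "update": links found on the page are added to the crawl db
            crawl_db.update(LINKS[url])
    return fetched_now


# Both runs share the same crawl db, as in our recrawl.
crawl_db, fetched = {"index.html"}, set()
first_run = crawl(crawl_db, fetched, depth=3)
second_run = crawl(crawl_db, fetched, depth=3)
print(first_run)   # index.html, 1.1.html, 1.1.1.html
print(second_run)  # 1.1.1.1.html, 1.1.1.1.1.html
```

In this model the depth only limits the number of rounds per run, so the
second run simply continues where the first one stopped.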

In our case we'd like to force Nutch to always crawl only pages within a
depth of 3 from the real seed page, which is index.html in this case. Is
there any way to do this?
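
As a sketch of what we mean by "absolute depth", the following Python
snippet walks the test site breadth-first and records each page's distance
from the seed. It only illustrates the behaviour we are after and is not an
existing Nutch feature; crawl_from_seeds() and depth_of are hypothetical
names, and we count the seed page as depth 1.

```python
from collections import deque


def crawl_from_seeds(links, seeds, max_depth):
    """Breadth-first walk limited to max_depth levels from the seeds.

    Seeds count as depth 1 (max_depth is assumed to be >= 1), so
    max_depth=3 covers the seed, its children and its grandchildren.
    Returns {url: depth} for every page visited.
    """
    depth_of = {seed: 1 for seed in seeds}
    queue = deque(seeds)
    while queue:
        url = queue.popleft()
        if depth_of[url] == max_depth:
            continue  # page is visited, but its links are not followed
        for link in links.get(url, []):
            if link not in depth_of:  # first visit is the shortest path
                depth_of[link] = depth_of[url] + 1
                queue.append(link)
    return depth_of


# The link chain of our test site.
links = {
    "index.html": ["1.1.html"],
    "1.1.html": ["1.1.1.html"],
    "1.1.1.html": ["1.1.1.1.html"],
    "1.1.1.1.html": ["1.1.1.1.1.html"],
}

first = crawl_from_seeds(links, ["index.html"], max_depth=3)
second = crawl_from_seeds(links, ["index.html"], max_depth=3)

# A page newly linked from index.html (1.2.html as an example) should
# still be picked up by a later run.
links["index.html"].append("1.2.html")
third = crawl_from_seeds(links, ["index.html"], max_depth=3)
```

Here first and second contain the same three pages, because the depth is
always measured from index.html and not from where the last run stopped.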

We have already tried passing the '-noAdditions' option to 'updatedb', as
mentioned in the wiki (http://wiki.apache.org/nutch/IntranetRecrawl), but
then only the first URL (index.html) is crawled. We are also afraid that
new URLs (for example, if we now add 1.2.html as a link on index.html)
would not be crawled either.

Thanks a lot in advance!





--
View this message in context: 
http://lucene.472066.n3.nabble.com/Absolute-depth-for-recrawling-tp4008320.html
Sent from the Nutch - User mailing list archive at Nabble.com.
