Could you explain what is meant by "continuously running crawl cycles"?
Usually you run a crawl with a certain "depth", i.e. a maximum number of cycles. When that depth is reached, the crawler stops even if there are still unfetched URLs. The crawler also stops before the depth is reached if the generator produces an empty fetch list in some cycle. An empty fetch list can have several causes:

- no more unfetched URLs (trivial, but not in your case)
- recent temporary failures: after a temporary failure (network timeout, etc.) a URL is blocked for one day

Does one of these suggestions answer your question?

Sebastian

On 03/23/2012 02:46 PM, webdev1977 wrote:
I was under the impression that setting topN for crawl cycles would limit the number of items each iteration of the crawl would fetch/parse, but that eventually, after continuously running crawl cycles, it would get ALL the URLs. My continuous crawl has stopped fetching/parsing, and the stats from the crawldb indicate that db_unfetched is 133,359. Why is it no longer fetching URLs if there are so many unfetched?

--
View this message in context: http://lucene.472066.n3.nabble.com/db-unfetched-large-number-but-crawling-not-fetching-any-longer-tp3851587p3851587.html
Sent from the Nutch - User mailing list archive at Nabble.com.
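For what it's worth, the stopping behavior described above (topN capping each cycle, depth capping the number of cycles, and an empty fetch list ending the crawl early) can be sketched as a toy model. This is illustrative Python, not actual Nutch code; the function and parameter names are hypothetical, and the one-day block after a temporary failure is modeled simply as a set of blocked URLs:

```python
def crawl(seeds, link_graph, depth=3, top_n=2, blocked=frozenset()):
    """Toy model of Nutch-style crawl cycles (NOT real Nutch code).

    Each cycle the generator picks at most top_n unfetched URLs,
    skipping temporarily blocked ones. The crawl stops either when
    `depth` cycles have run, or as soon as a generated fetch list is
    empty -- even if unfetched URLs remain in the db.
    """
    fetched, unfetched = set(), set(seeds)
    for cycle in range(depth):
        # generate: at most top_n eligible (non-blocked) unfetched URLs
        fetch_list = [u for u in sorted(unfetched) if u not in blocked][:top_n]
        if not fetch_list:
            # empty fetch list -> crawl stops before depth is reached,
            # possibly with many URLs still unfetched (as in your case)
            break
        for url in fetch_list:          # fetch + parse
            unfetched.discard(url)
            fetched.add(url)
            # updatedb: newly discovered outlinks become unfetched
            for out in link_graph.get(url, []):
                if out not in fetched:
                    unfetched.add(out)
    return fetched, unfetched

# depth/topN cap the crawl: 2 cycles of 1 URL each leaves c and d unfetched
graph = {"a": ["b", "c"], "b": ["d"]}
print(crawl(["a"], graph, depth=2, top_n=1))

# blocked URLs make the very first fetch list empty: nothing is fetched
# even though "a" is still unfetched
print(crawl(["a"], graph, depth=5, top_n=10, blocked={"a"}))
```

The second call mirrors the symptom in the original question: db_unfetched is non-zero, yet the crawl stops because every remaining URL is (temporarily) ineligible for generation.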