Could you explain what is meant by "continuously running crawl cycles"?
Usually you run a crawl with a certain "depth", i.e. a maximum number of cycles. When that depth is reached, the crawler stops even if there are still unfetched URLs. The crawler also stops before the depth is reached if the generator produces an empty fetch list in some cycle. An empty fetch list can have several causes:

- no more unfetched URLs (trivial, but not in your case)
- recent temporary failures: after a temporary failure (network timeout, etc.) a URL is blocked for one day

Does one of these suggestions answer your question?

Sebastian

On 03/23/2012 02:46 PM, webdev1977 wrote:
I was under the impression that setting topN for crawl cycles would limit the number of items each iteration of the crawl would fetch/parse, but that eventually, after continuously running crawl cycles, it would get ALL the URLs. My continuous crawl has stopped fetching/parsing, and the stats from the crawldb indicate that db_unfetched is 133,359. Why is it no longer fetching URLs if there are so many unfetched?

--
View this message in context: http://lucene.472066.n3.nabble.com/db-unfetched-large-number-but-crawling-not-fetching-any-longer-tp3851587p3851587.html
Sent from the Nutch - User mailing list archive at Nabble.com.
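For what it's worth, the stopping behavior described above (topN capping each cycle, depth capping the number of cycles, and an empty fetch list ending the crawl early) can be sketched as a toy model. This is illustrative Python, not actual Nutch code; the function and parameter names are hypothetical, and the one-day block after a temporary failure is modeled simply as a set of blocked URLs:

```python
def crawl(seeds, link_graph, depth=3, top_n=2, blocked=frozenset()):
    """Toy model of Nutch-style crawl cycles (NOT real Nutch code).

    Each cycle the generator picks at most top_n unfetched URLs,
    skipping temporarily blocked ones. The crawl stops either when
    `depth` cycles have run, or as soon as a generated fetch list is
    empty -- even if unfetched URLs remain in the db.
    """
    fetched, unfetched = set(), set(seeds)
    for cycle in range(depth):
        # generate: at most top_n eligible (non-blocked) unfetched URLs
        fetch_list = [u for u in sorted(unfetched) if u not in blocked][:top_n]
        if not fetch_list:
            # empty fetch list -> crawl stops before depth is reached,
            # possibly with many URLs still unfetched (as in your case)
            break
        for url in fetch_list:          # fetch + parse
            unfetched.discard(url)
            fetched.add(url)
            # updatedb: newly discovered outlinks become unfetched
            for out in link_graph.get(url, []):
                if out not in fetched:
                    unfetched.add(out)
    return fetched, unfetched

# depth/topN cap the crawl: 2 cycles of 1 URL each leaves c and d unfetched
graph = {"a": ["b", "c"], "b": ["d"]}
print(crawl(["a"], graph, depth=2, top_n=1))

# blocked URLs make the very first fetch list empty: nothing is fetched
# even though "a" is still unfetched
print(crawl(["a"], graph, depth=5, top_n=10, blocked={"a"}))
```

The second call mirrors the symptom in the original question: db_unfetched is non-zero, yet the crawl stops because every remaining URL is (temporarily) ineligible for generation.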