I run a Nutch 1.3 crawl with topN=5000 and depth=20.
For the first two crawl cycles, the Generator and CrawlDb Update phases
take ~1 hour.  Around the 3rd cycle this increases to 3.5 hours, and
around the 9th cycle these two phases take over 12 hours.  I have
plotted these times, and they are not growing smoothly, as in linearly
or exponentially or anything like that.  There are distinct discrete
steps in the Generator and CrawlDb Update times.  I expect these phases
to take longer as I have more links, but not like this.
After the crawl completed, I started crawling again, and the
Generator and CrawlDb Update times went back to ~1 hour.  It seems that
I can keep these times at 1 hour if I do not use a depth > 2.
Why is this happening?  Any ideas?
During these two phases the processor is 99% utilized, and memory is only at 11%.
