I am running a Nutch 1.3 crawl with topN=5000 and depth=20.
For the first two crawl cycles the Generator and CrawlDb Update phases
take ~1 hour. Around the 3rd cycle this increases to 3.5 hours, then
around the 9th cycle these two phases take over 12 hours. I have
plotted these times, and the growth is not smooth: it is neither linear
nor exponential. Instead, the Generator and CrawlDb Update times jump
in distinct, discrete steps. I expect these phases to take longer as
the number of links grows, but not like this.
After the crawl completed I started a new crawl, and the Generator
and CrawlDb Update times went back to ~1 hour. It seems that I can
keep these times at ~1 hour if I do not use a depth greater than 2.
Why is this happening? Any ideas?
During these two phases CPU utilization is 99%, while memory usage is only 11%.