It's likely you're normalizing and filtering in both jobs. We don't do any filtering or normalization in either job and rely on ParseOutputFormat instead.
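If you want to try the same approach, Nutch 1.x exposes command-line switches for this. A sketch of what that could look like (the exact flag names and paths below are assumptions; verify them against the usage message printed by `bin/nutch generate` and `bin/nutch updatedb` in your build):

```shell
# Generate a segment without applying URL normalizers or filters.
# -noFilter / -noNorm skip the per-URL filter and normalizer plugins,
# which is where the CPU time typically goes.
bin/nutch generate crawl/crawldb crawl/segments -topN 5000 -noFilter -noNorm

# CrawlDb update only normalizes/filters when explicitly asked,
# so simply omit -normalize and -filter here.
# <segment> is a placeholder for your fetched segment directory.
bin/nutch updatedb crawl/crawldb crawl/segments/<segment>
```

Since both jobs iterate over every record in the CrawlDb, skipping regex-based normalizer and filter plugins there can change the per-cycle cost substantially once the db grows.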
> I run a Nutch 1.3 crawl with topN=5000 and depth=20.
> For the first two crawl cycles the Generator and CrawlDb update phases
> take ~1 hour. Around the 3rd cycle this increases to 3.5 hours, then
> around the 9th cycle these two phases take over 12 hours. I have
> plotted this time and it is not growing naturally, i.e. linearly or
> exponentially or anything like that. There are distinct discrete steps
> in the Generator and CrawlDb time. I expect these phases to take
> longer as I accumulate more links, but not like this.
> After the crawl is complete I start crawling again, and the
> Generator and CrawlDb times go back to ~1 hour. It seems that
> I can keep these times at 1 hour if I do not use a depth > 2.
> Why is this happening? Any ideas?
> During these two phases the processor is 99% utilized, and the memory
> only 11%.

