Hi Michael,

both "crawldb" jobs are similar - they merge status information into the CrawlDb: fetch status and newly found links in the updatedb case, detected duplicates in the dedup case. There are two situations in which I could imagine the second job taking longer:

- if there are many duplicates, significantly more than status updates and additions in the preceding updatedb job
- if the CrawlDb has grown significantly (the preceding updatedb added many new URLs)
But you're right, I can see no reason why $commonOptions is not used for the dedup job. Please open an issue on https://issues.apache.org/jira/browse/NUTCH; we should also check the other jobs which are not run with $commonOptions. If possible, please test whether running the dedup job with the common options fixes your problem. That's easily done: just edit src/bin/crawl and run "ant runtime" (a rough sketch of such an edit is below the quoted message).

Thanks,
Sebastian

On 04/28/2017 02:54 AM, Michael Coffey wrote:
> In the standard crawl script, there is a _bin_nutch updatedb command and,
> soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs
> with "crawldb /path/to/crawl/db" in their names (in addition to the actual
> deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as
> the one launched by updatedb. Why should that be? Is it doing something
> different?
> I notice that the script passes $commonOptions to updatedb but not to dedup.
>
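Sketch referenced above - assuming the updatedb and dedup invocations in src/bin/crawl look roughly like the lines below (the __bin_nutch helper, variable names and exact arguments may differ between Nutch versions), the test is simply to add $commonOptions to the dedup call:

  # updatedb already receives the common options
  __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT

  # dedup currently does not; for the test, pass them here as well
  __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

After the edit, rebuild with "ant runtime" so the modified script ends up in the runtime directory your crawl uses, then compare the run times of the two "crawldb" jobs again.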

