Hi Michael,

the easiest way is probably to check the actual job configuration as shown in the Hadoop ResourceManager web UI (see screenshot). It also indicates where each configuration property was set.
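To look at the compression settings only, one option is to filter a saved copy of the job configuration. This is a minimal sketch. It assumes you have the job's configuration as an XML file in Hadoop's usual layout, where each property has a name element directly followed by a value element. The function name show_compression_settings and the file name job_conf.xml are made up for illustration.

```shell
#!/usr/bin/env bash

# Print every property whose name contains "compress" as name=value.
# Assumption: <name> is directly followed by <value> inside each <property>.
show_compression_settings() {
  local conf_file="$1"
  # Join the file into one line so a property split across lines still matches,
  # then pull out the name/value pairs and reformat them.
  tr -d '\n' < "$conf_file" \
    | grep -oE '<name>[^<]*compress[^<]*</name>[[:space:]]*<value>[^<]*</value>' \
    | sed -E 's#<name>([^<]*)</name>[[:space:]]*<value>([^<]*)</value>#\1=\2#'
}

# Usage (hypothetical file name):
#   show_compression_settings job_conf.xml
```

It prints one name=value line per matching property, so a job that was started without any compression options should show the cluster defaults here.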
Best,
Sebastian

On 05/02/2017 12:57 AM, Michael Coffey wrote:
> Thanks, I will do some testing with $commonOptions applied to dedup. I
> suspect that the dedup-update is not compressing its output. Any easy way to
> check for just that?
>
>
> Hi Michael,
>
> both "crawldb" jobs are similar - they merge status information
> into the CrawlDb, fetch status and newly found links resp. detected
> duplicates. There are two situations where I could think of the second
> job takes longer:
> - if there are many duplicates, significantly more than status updates
>   and additions in the preceding updatedb job
> - if the CrawlDb has grown significantly (the preceding updatedb added
>   many new URLs)
>
> But you're right. I can see no reason why $commonOptions is not used
> for the dedup job.
> Please, open an issue on https://issues.apache.org/jira/browse/NUTCH,
> should be also checked for the other jobs which are not run with
> $commonOptions.
> If possible, please test whether running the dedup job with the common
> options fixes your problem.
> That's easily done: just edit src/bin/crawl and run "ant runtime".
>
> Thanks,
> Sebastian
>
> On 04/28/2017 02:54 AM, Michael Coffey wrote:
>> In the standard crawl script, there is a _bin_nutch updatedb command and,
>> soon after that, a _bin_nutch dedup command. Both of them launch hadoop
>> jobs with "crawldb /path/to/crawl/db" in their names (in addition to the
>> actual deduplication job).
>> In my situation, the "crawldb" job launched by dedup takes twice as long as
>> the one launched by updatedb. Why should that be? Is it doing something
>> different?
>> I notice that the script passes $commonOptions to updatedb but not to dedup.
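The change discussed in this thread, passing $commonOptions to the dedup call, can be sketched as below. This is an illustration, not the actual contents of src/bin/crawl. The helper run_nutch is a hypothetical stand-in that only prints the command line, and the values of CRAWL_PATH and commonOptions are assumed placeholders.

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the crawl script's nutch helper, so the sketch
# runs without a Nutch installation. It only prints the command line.
run_nutch() {
  echo "bin/nutch $*"
}

# Assumed placeholder values; the real script defines its own.
CRAWL_PATH="/path/to/crawl"
commonOptions="-D mapreduce.map.output.compress=true"

# As described in the thread: dedup is called without the common options.
run_nutch dedup "$CRAWL_PATH"/crawldb

# Proposed change: pass $commonOptions. It is left unquoted on purpose so the
# shell splits it into separate arguments, as for the updatedb call.
run_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb
```

After editing the real script, running "ant runtime" as suggested above rebuilds the runtime with the modified script.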

