Good tip! The compression topic is interesting because we spend a lot of time reading and writing files.
For the dedup-crawldb job, I have:

    mapreduce.map.output.compress = true  (from command line)
    mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
    mapreduce.output.fileoutputformat.compress = false  (from default)
    mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
    mapreduce.output.fileoutputformat.compress.type = RECORD

For the linkdb-merge job, it is the same except:

    mapreduce.output.fileoutputformat.compress = true

But the jobhistory config page does not say where fileoutputformat.compress = true was set from.

Anyway, I'm thinking of using bzip2 for the final output compression. Anybody know a reason I shouldn't try that? (A sketch of the -D options I have in mind is below the quoted messages.)

--

Hi Michael,

the easiest way is probably to check the actual job configuration as shown by the Hadoop resource manager webapp, see screenshot. It also indicates where each configuration property was set from.

Best,
Sebastian

On 05/02/2017 12:57 AM, Michael Coffey wrote:
> Thanks, I will do some testing with $commonOptions applied to dedup. I suspect
> that the dedup-update is not compressing its output. Any easy way to check for just that?
>
> Hi Michael,
> both "crawldb" jobs are similar - they merge status information into the CrawlDb:
> fetch status and newly found links in one case, detected duplicates in the other.
> There are two situations I could think of where the second job takes longer:
> - if there are many duplicates, significantly more than the status updates and
>   additions in the preceding updatedb job
> - if the CrawlDb has grown significantly (the preceding updatedb added many new URLs)
> But you're right, I can see no reason why $commonOptions is not used for the dedup job.
> Please open an issue on https://issues.apache.org/jira/browse/NUTCH; the other jobs
> which are not run with $commonOptions should also be checked.
> If possible, please test whether running the dedup job with the common options fixes
> your problem. That's easily done: just edit src/bin/crawl and run "ant runtime".
> Thanks,
> Sebastian
>
> On 04/28/2017 02:54 AM, Michael Coffey wrote:
>> In the standard crawl script, there is a __bin_nutch updatedb command and, soon after
>> that, a __bin_nutch dedup command. Both of them launch hadoop jobs with
>> "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job).
>> In my situation, the "crawldb" job launched by dedup takes twice as long as the one
>> launched by updatedb. Why should that be? Is it doing something different?
>> I notice that the script passes $commonOptions to updatedb but not to dedup.
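For the bzip2 experiment mentioned at the top of the thread, a minimal sketch of the options involved is below. The property names and the BZip2Codec class are standard Hadoop; the invertlinks invocation, the compressOpts variable, and the /path/to/crawl paths are placeholders, not taken from the actual job setup.

    # Sketch: request bzip2-compressed job output via generic -D options.
    # These could be appended to $commonOptions in bin/crawl, or passed to a
    # single bin/nutch command by hand as shown below. Paths are placeholders.
    # Note: BLOCK usually compresses SequenceFile output better than RECORD.
    compressOpts="-D mapreduce.output.fileoutputformat.compress=true \
      -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
      -D mapreduce.output.fileoutputformat.compress.type=BLOCK"

    # Example: run the linkdb (invertlinks) job by hand with these options.
    bin/nutch invertlinks $compressOpts /path/to/crawl/linkdb -dir /path/to/crawl/segments

One thing to weigh before trying it: bzip2 compresses well and is splittable, but it is considerably more CPU-hungry than the default codec, so it mainly pays off for large outputs that are read many times.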

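On the earlier question of how to tell whether the dedup job is really compressing its output: besides the resource manager / job history config pages, one low-tech check is to peek at a part file's SequenceFile header, which names the compression codec in plain text. This is only a sketch; the /path/to/crawl/db path is a placeholder and the part file name (part-r-00000 vs. part-00000) varies with the Hadoop/Nutch version.

    # Sketch: the SequenceFile header records the codec class, so a compressed
    # CrawlDb part prints something like org.apache.hadoop.io.compress.DefaultCodec
    # here; no output suggests the data was written uncompressed.
    hadoop fs -cat /path/to/crawl/db/current/part-r-00000/data \
      | head -c 512 | strings | grep -i codec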

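And for the fix Sebastian suggests, passing $commonOptions to the dedup step in src/bin/crawl, the change is essentially a one-liner. A sketch is below; the exact surrounding lines differ between Nutch versions, so treat it as illustrative rather than a verbatim patch.

    # In src/bin/crawl, updatedb already receives $commonOptions, roughly:
    #   __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
    # The dedup invocation does not, so add the same options there:
    __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

After the edit, "ant runtime" rebuilds runtime/local and runtime/deploy so that the modified script is the one actually used.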