Hello - if you have slow disks but plenty of CPU power, bzip2 would be a good choice. Otherwise gzip is probably a more suitable candidate.
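If you do try bzip2: assuming the dedup job accepts the usual Hadoop generic options (your settings below suggest you already pass some with -D on the command line), switching the job's final output codec should only need something like the following. Untested sketch, with a placeholder crawldb path:

  bin/nutch dedup \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    /path/to/crawl/db

The same -D pair could equally be added to $commonOptions in the crawl script if you want it applied to every job that receives those options.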
Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Wednesday 3rd May 2017 20:30
> To: User <[email protected]>
> Subject: Re: crawlDb speed around deduplication
>
> Good tip!
>
> The compression topic is interesting because we spend a lot of time reading
> and writing files.
>
> For the dedup-crawldb job, I have:
>
> mapreduce.map.output.compress = true (from command line)
> mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress = false (from default)
> mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress.type = RECORD
>
> For the linkdb-merge job, it is the same except:
> mapreduce.output.fileoutputformat.compress = true
>
> But the jobhistory config page does not say how fileoutputformat.compress was
> set to true.
>
> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?
>
> --
>
> Hi Michael,
>
> the easiest way is probably to check the actual job configuration as shown by
> the Hadoop resource manager webapp, see screenshot. It also indicates where
> each configuration property is set from.
>
> Best,
> Sebastian
>
> On 05/02/2017 12:57 AM, Michael Coffey wrote:
> > Thanks, I will do some testing with $commonOptions applied to dedup. I
> > suspect that the dedup-update is not compressing its output. Any easy way
> > to check for just that?
> >
> > Hi Michael, both "crawldb" jobs are similar - they merge status information
> > into the CrawlDb: fetch status and newly found links resp. detected
> > duplicates. There are two situations where I could think of the second job
> > taking longer:
> > - if there are many duplicates, significantly more than status updates and
> >   additions in the preceding updatedb job
> > - if the CrawlDb has grown significantly (the preceding updatedb added many
> >   new URLs)
> > But you're right, I can see no reason why $commonOptions is not used for
> > the dedup job.
> > Please open an issue on https://issues.apache.org/jira/browse/NUTCH; it
> > should also be checked for the other jobs which are not run with
> > $commonOptions.
> > If possible, please test whether running the dedup job with the common
> > options fixes your problem. That's easily done: just edit src/bin/crawl and
> > run "ant runtime".
> > Thanks,
> > Sebastian
> >
> > On 04/28/2017 02:54 AM, Michael Coffey wrote:
> >> In the standard crawl script, there is a _bin_nutch updatedb command and,
> >> soon after that, a _bin_nutch dedup command. Both of them launch hadoop
> >> jobs with "crawldb /path/to/crawl/db" in their names (in addition to the
> >> actual deduplication job).
> >> In my situation, the "crawldb" job launched by dedup takes twice as long
> >> as the one launched by updatedb. Why should that be? Is it doing something
> >> different?
> >> I notice that the script passes $commonOptions to updatedb but not to
> >> dedup.
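P.S. For the $commonOptions question further down in the thread: the edit Sebastian suggests is a one-liner in src/bin/crawl. Sketch only - the exact invocation differs between Nutch versions, and I am reusing the _bin_nutch helper name from Michael's mail:

  # current: dedup is launched without the common -D options
  _bin_nutch dedup "$CRAWL_PATH"/crawldb

  # proposed: pass the same $commonOptions that updatedb already gets
  _bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

Then rebuild with "ant runtime" so the deployed copy of the script picks up the change.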

