Hi Michael,

> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?

That's not a bad choice. But more important: the records (CrawlDatum objects) in
the CrawlDb are small, so you should set

  mapreduce.output.fileoutputformat.compress.type = BLOCK

and use it in combination with a splittable codec (BZip2 is splittable). I haven't
played with the compression options for the LinkDb, but BLOCK may be worth trying
there as well. If in doubt, you need to find a good balance between CPU and I/O
usage for your hardware.
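
Concretely, that could look something like the following -- just a sketch, not
tested on my side; the codec class is the standard Hadoop BZip2Codec, and you
could set the same properties in nutch-site.xml instead of extending
$commonOptions in src/bin/crawl:

  # block-compressed, splittable bzip2 output for all jobs run by the crawl script
  commonOptions="$commonOptions \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec"

Keep in mind that bzip2 trades quite a bit of CPU for its better compression
ratio, so watch the job runtimes after switching.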

Best,
Sebastian

On 05/03/2017 08:26 PM, Michael Coffey wrote:
> Good tip!
>
> The compression topic is interesting because we spend a lot of time reading
> and writing files.
>
> For the dedup-crawldb job, I have:
>
> mapreduce.map.output.compress = true (from command line)
> mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress = false (from default)
> mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress.type = RECORD
>
> For the linkdb-merge job, it is the same except:
> mapreduce.output.fileoutputformat.compress = true
>
> But the job history config page does not say how fileoutputformat.compress
> came to be set to true.
>
> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?
>
> --
>
> Hi Michael,
>
> the easiest way is probably to check the actual job configuration as shown by
> the Hadoop resource manager webapp, see screenshot. It also indicates where
> each configuration property was set from.
>
> Best,
> Sebastian
>
> On 05/02/2017 12:57 AM, Michael Coffey wrote:
>> Thanks, I will do some testing with $commonOptions applied to dedup. I
>> suspect that the dedup-update is not compressing its output. Any easy way
>> to check for just that?
>>
>> Hi Michael,
>>
>> both "crawldb" jobs are similar - they merge status information into the
>> CrawlDb: fetch status and newly found links resp. detected duplicates.
>> There are two situations where I could imagine the second job taking longer:
>> - if there are many duplicates, significantly more than status updates and
>>   additions in the preceding updatedb job
>> - if the CrawlDb has grown significantly (the preceding updatedb added many
>>   new URLs)
>> But you're right, I can see no reason why $commonOptions is not used for
>> the dedup job. Please open an issue on
>> https://issues.apache.org/jira/browse/NUTCH; the other jobs which are not
>> run with $commonOptions should also be checked.
>> If possible, please test whether running the dedup job with the common
>> options fixes your problem. That's easily done: just edit src/bin/crawl
>> and run "ant runtime".
>>
>> Thanks,
>> Sebastian
>>
>> On 04/28/2017 02:54 AM, Michael Coffey wrote:
>>> In the standard crawl script, there is a __bin_nutch updatedb command
>>> and, soon after that, a __bin_nutch dedup command. Both of them launch
>>> hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition
>>> to the actual deduplication job).
>>>
>>> In my situation, the "crawldb" job launched by dedup takes twice as long
>>> as the one launched by updatedb. Why should that be? Is it doing
>>> something different?
>>>
>>> I notice that the script passes $commonOptions to updatedb but not to dedup.
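
For reference, the src/bin/crawl change discussed above amounts to roughly the
following (a sketch -- the exact invocation and variable names differ between
Nutch versions, so check your own script):

  # before: dedup is launched without the common -D options
  #   __bin_nutch dedup "$CRAWL_PATH"/crawldb
  # after: pass the same options that updatedb already gets
  __bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

Rebuild with "ant runtime" afterwards so the deployed script picks up the edit.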

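To double-check which settings a finished job actually ran with, the MapReduce
history server also exposes the per-job configuration through its REST API
(host name, default port 19888, and the job id below are placeholders; the JSON
layout may differ slightly between Hadoop versions):

  # dump the as-run configuration of one job and filter for compression settings
  curl -s "http://historyserver:19888/ws/v1/history/mapreduce/jobs/job_1493000000000_0042/conf" \
    | python3 -m json.tool | grep -i -A 2 compress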

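And to see whether the CrawlDb output itself is really compressed: each part of
the CrawlDb is a MapFile whose data file is a SequenceFile, and a SequenceFile
header names its compression codec. Something like this (path and part name are
placeholders; older versions use part-00000) shows it without decompressing
anything:

  # the SequenceFile header lists the codec class if block or record compression is on
  hdfs dfs -cat crawl/crawldb/current/part-r-00000/data | head -c 300 | strings
  # expect to see e.g. org.apache.hadoop.io.compress.BZip2Codec, or no codec entry at all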