Hello - if you have slow disks but plenty of CPU power, bzip2 would be a good choice. Otherwise gzip is probably a more suitable candidate.
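If you do try bzip2: assuming the dedup job accepts the usual Hadoop generic options (your settings below suggest you already pass some with -D on the command line), switching the job's final output codec should only need something like the following. Untested sketch, with a placeholder crawldb path:

  bin/nutch dedup \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    /path/to/crawl/db

The same -D pair could equally be added to $commonOptions in the crawl script if you want it applied to every job that receives those options.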
Markus

-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Wednesday 3rd May 2017 20:30
> To: User <[email protected]>
> Subject: Re: crawlDb speed around deduplication
>
> Good tip!
>
> The compression topic is interesting because we spend a lot of time reading
> and writing files.
>
> For the dedup-crawldb job, I have:
>
> mapreduce.map.output.compress = true (from command line)
> mapreduce.map.output.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress = false (from default)
> mapreduce.output.fileoutputformat.compress.codec = org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress.type = RECORD
>
> For the linkdb-merge job, it is the same except:
> mapreduce.output.fileoutputformat.compress = true
>
> But the jobhistory config page does not say how fileoutputformat.compress was
> set to true.
>
> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?
>
> --
>
> Hi Michael,
>
> the easiest way is probably to check the actual job configuration as shown by
> the Hadoop resource manager webapp, see screenshot. It also indicates where
> each configuration property is set from.
>
> Best,
> Sebastian
>
> On 05/02/2017 12:57 AM, Michael Coffey wrote:
> > Thanks, I will do some testing with $commonOptions applied to dedup. I
> > suspect that the dedup-update is not compressing its output. Any easy way
> > to check for just that?
> >
> > Hi Michael, both "crawldb" jobs are similar - they merge status information
> > into the CrawlDb: fetch status and newly found links resp. detected
> > duplicates. There are two situations where I could think of the second job
> > taking longer:
> > - if there are many duplicates, significantly more than status updates and
> >   additions in the preceding updatedb job
> > - if the CrawlDb has grown significantly (the preceding updatedb added many
> >   new URLs)
> > But you're right, I can see no reason why $commonOptions is not used for
> > the dedup job.
> > Please open an issue on https://issues.apache.org/jira/browse/NUTCH; it
> > should also be checked for the other jobs which are not run with
> > $commonOptions.
> > If possible, please test whether running the dedup job with the common
> > options fixes your problem. That's easily done: just edit src/bin/crawl and
> > run "ant runtime".
> > Thanks,
> > Sebastian
> >
> > On 04/28/2017 02:54 AM, Michael Coffey wrote:
> >> In the standard crawl script, there is a _bin_nutch updatedb command and,
> >> soon after that, a _bin_nutch dedup command. Both of them launch hadoop
> >> jobs with "crawldb /path/to/crawl/db" in their names (in addition to the
> >> actual deduplication job).
> >> In my situation, the "crawldb" job launched by dedup takes twice as long
> >> as the one launched by updatedb. Why should that be? Is it doing something
> >> different?
> >> I notice that the script passes $commonOptions to updatedb but not to
> >> dedup.
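P.S. For the $commonOptions question further down in the thread: the edit Sebastian suggests is a one-liner in src/bin/crawl. Sketch only - the exact invocation differs between Nutch versions, and I am reusing the _bin_nutch helper name from Michael's mail:

  # current: dedup is launched without the common -D options
  _bin_nutch dedup "$CRAWL_PATH"/crawldb

  # proposed: pass the same $commonOptions that updatedb already gets
  _bin_nutch dedup $commonOptions "$CRAWL_PATH"/crawldb

Then rebuild with "ant runtime" so the deployed copy of the script picks up the change.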

