Hi Michael,

> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody
> know a reason I shouldn't try that?

That's not a bad choice.

But more importantly: the records (CrawlDatum objects) in the CrawlDb are small,
so you should set
  mapreduce.output.fileoutputformat.compress.type = BLOCK
and use it in combination with a splittable codec (BZip2 is splittable).
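
For example, here is one (untested) way to set these properties from the
command line for the dedup job; the paths are placeholders and the exact
invocation may need adjusting for your setup:

  # sketch: block-compressed output with the splittable BZip2 codec
  bin/nutch dedup \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK \
    /path/to/crawl/db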

I haven't played with the compression options for the LinkDb, but BLOCK may
also be worth trying there.
If in doubt, you need to find a good balance between CPU and I/O usage for
your hardware.
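
If you want this applied to all jobs started by the crawl script, one option
(just a sketch, the exact lines in src/bin/crawl may differ between Nutch
versions) is to append the properties to the options already passed to most
jobs and rebuild with "ant runtime":

  # sketch: extend the common Hadoop options in src/bin/crawl
  commonOptions="$commonOptions \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.BZip2Codec \
    -D mapreduce.output.fileoutputformat.compress.type=BLOCK"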

Best,
Sebastian

On 05/03/2017 08:26 PM, Michael Coffey wrote:
> Good tip!
> 
> The compression topic is interesting because we spend a lot of time reading 
> and writing files.
> 
> For the dedup-crawldb job, I have:
> 
> mapreduce.map.output.compress  = true (from command line)
> mapreduce.map.output.compress.codec = 
> org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress = false (from default)
> mapreduce.output.fileoutputformat.compress.codec = 
> org.apache.hadoop.io.compress.DefaultCodec
> mapreduce.output.fileoutputformat.compress.type = RECORD
> 
> 
> For the linkdb-merge job, it is the same except:
> mapreduce.output.fileoutputformat.compress = true
> 
> 
> But the jobhistory config page does not say how
> mapreduce.output.fileoutputformat.compress got set to true.
> 
> 
> Anyway, I'm thinking of using bzip2 for the final output compression. Anybody 
> know a reason I shouldn't try that?
> 
> --
> 
> Hi Michael,
> 
> the easiest way is probably to check the actual job configuration as shown by
> the Hadoop resource manager webapp, see screenshot. It also indicates from
> where each configuration property is set.
> 
> Best,
> Sebastian
> 
> On 05/02/2017 12:57 AM, Michael Coffey wrote:
>> Thanks, I will do some testing with $commonOptions applied to dedup. I
>> suspect that the dedup-update is not compressing its output. Any easy way to
>> check for just that?
>>
>> Hi Michael,
>>
>> both "crawldb" jobs are similar - they merge status information into the
>> CrawlDb: fetch status and newly found links resp. detected duplicates.
>> There are two situations I could think of where the second job takes longer:
>> - if there are many duplicates, significantly more than status updates and
>>   additions in the preceding updatedb job
>> - if the CrawlDb has grown significantly (the preceding updatedb added many
>>   new URLs)
>> But you're right, I can see no reason why $commonOptions is not used for the
>> dedup job. Please open an issue on https://issues.apache.org/jira/browse/NUTCH;
>> it should also be checked for the other jobs which are not run with
>> $commonOptions.
>> If possible, please test whether running the dedup job with the common
>> options fixes your problem. That's easily done: just edit src/bin/crawl and
>> run "ant runtime".
>>
>> Thanks,
>> Sebastian
>>
>> On 04/28/2017 02:54 AM, Michael Coffey wrote:
>>> In the standard crawl script, there is a _bin_nutch updatedb command and,
>>> soon after that, a _bin_nutch dedup command. Both of them launch hadoop
>>> jobs with "crawldb /path/to/crawl/db" in their names (in addition to the
>>> actual deduplication job).
>>> In my situation, the "crawldb" job launched by dedup takes twice as long
>>> as the one launched by updatedb. Why should that be? Is it doing something
>>> different?
>>> I notice that the script passes $commonOptions to updatedb but not to dedup.
>>>
