RE: Increasing the number of reducers in Deduplication

2019-02-20 Thread Suraj Singh
Hi Sebastian,

Of course, I had just copied and pasted the property. My bad.
Thanks for confirming.

Regards,
Suraj Singh 

-Original Message-
From: Sebastian Nagel  
Sent: Wednesday, 20 February 2019 13:26
To: user@nutch.apache.org
Subject: Re: Increasing the number of reducers in Deduplication

Hi Suraj,

the correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining arguments,
in the form -Dname=value.

To confirm: I regularly run the dedup job with 1200 reducers on a CrawlDb with
more than 10 billion URLs. Works seamlessly.

Best,
Sebastian

On 2/20/19 12:55 PM, Suraj Singh wrote:
> Hi All,
> 
> Can I increase the number of reducers for deduplication on the crawldb?
> Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> 
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> 
> Thanks in advance.
> 
> Regards,
> Suraj Singh
> 



Re: Increasing the number of reducers in Deduplication

2019-02-20 Thread Sebastian Nagel
Hi Suraj,

the correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining arguments,
in the form -Dname=value.
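
If you want to keep the value adjustable in the crawl script, something along
these lines should work (just a sketch; the DEDUP_REDUCERS variable is only an
illustration, not an existing script option):

  DEDUP_REDUCERS=32
  __bin_nutch dedup -Dmapreduce.job.reduces=$DEDUP_REDUCERS "$CRAWL_PATH"/crawldb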

To confirm: I regularly run the dedup job with 1200 reducers on a CrawlDb with
more than 10 billion URLs. Works seamlessly.

Best,
Sebastian

On 2/20/19 12:55 PM, Suraj Singh wrote:
> Hi All,
> 
> Can I increase the number of reducers for deduplication on the crawldb?
> Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> 
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> 
> Thanks in advance.
> 
> Regards,
> Suraj Singh
> 



RE: Increasing the number of reducers in Deduplication

2019-02-20 Thread Suraj Singh
Thanks Markus.

Regards,
Suraj Singh

-Original Message-
From: Markus Jelsma  
Sent: Wednesday, 20 February 2019 13:04
To: user@nutch.apache.org
Subject: RE: Increasing the number of reducers in Deduplication

Hello Suraj,

That should be no problem. Duplicates are grouped by their signature, so you
can have as many reducers as you would like.

Regards,
Markus
 
 
-Original message-
> From: Suraj Singh
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject: Increasing the number of reducers in Deduplication
> 
> Hi All,
> 
> Can I increase the number of reducers for deduplication on the crawldb?
> Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> 
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> 
> Thanks in advance.
> 
> Regards,
> Suraj Singh
> 


RE: Increasing the number of reducers in Deduplication

2019-02-20 Thread Markus Jelsma
Hello Suraj,

That should be no problem. Duplicates are grouped by their signature, so you
can have as many reducers as you would like.
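
Whatever reducer count you pick, the set of records marked as duplicates should
come out the same. If you want to double-check after a run, the CrawlDb stats
should show the duplicate count (sketch; adjust the path to your setup):

  bin/nutch readdb "$CRAWL_PATH"/crawldb -stats

Look for the db_duplicate entry in the status breakdown.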

Regards,
Markus
 
 
-Original message-
> From: Suraj Singh
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject: Increasing the number of reducers in Deduplication
> 
> Hi All,
> 
> Can I increase the number of reducers for deduplication on the crawldb?
> Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
> 
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> 
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> 
> Thanks in advance.
> 
> Regards,
> Suraj Singh
>