Hi Suraj,

The correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining
arguments, and they need to be given as -Dname=value.
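For illustration, the ordering matters because Hadoop's generic option parsing only picks up -D properties that appear before the tool's own positional arguments; anything after them is treated as a plain argument. A sketch (assuming $CRAWL_PATH points at your crawl directory):

```shell
# Correct: -D properties first, then the tool's positional arguments
__bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

# Wrong: a property placed after the positional arguments is not
# recognized as a Hadoop configuration option
__bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
```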

To confirm: I used to run the dedup job with 1200 reducers on a CrawlDb with
more than 10 billion URLs.  Works seamlessly.


On 2/20/19 12:55 PM, Suraj Singh wrote:
> Hi All,
> Can I increase the number of reducers in the deduplication job on the
> crawldb? Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
> Current command in crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
> Thanks in advance.
> Regards,
> Suraj Singh
