RE: Increasing the number of reducer in Deduplication
Hi Sebastian,

Of course, I had just copied and pasted the property. My bad. Thanks for confirming.

Regards,
Suraj Singh

-----Original Message-----
From: Sebastian Nagel
Sent: Wednesday, 20 February 2019 13:26
To: user@nutch.apache.org
Subject: Re: Increasing the number of reducer in Deduplication

Hi Suraj,

the correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining arguments, and you need to pass them as -Dname=value.

To confirm: I routinely run the dedup job with 1200 reducers on a CrawlDb with more than 10 billion URLs. Works seamlessly.

Best,
Sebastian

On 2/20/19 12:55 PM, Suraj Singh wrote:
> Hi All,
>
> Can I increase the number of reducers in deduplication on the CrawlDb? Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
>
> Current command in the crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
>
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
>
> Thanks in advance.
>
> Regards,
> Suraj Singh
Re: Increasing the number of reducer in Deduplication
Hi Suraj,

the correct syntax would be:

  __bin_nutch dedup -Dmapreduce.job.reduces=32 "$CRAWL_PATH"/crawldb

Hadoop configuration properties must be passed before the remaining arguments, and you need to pass them as -Dname=value.

To confirm: I routinely run the dedup job with 1200 reducers on a CrawlDb with more than 10 billion URLs. Works seamlessly.

Best,
Sebastian

On 2/20/19 12:55 PM, Suraj Singh wrote:
> Hi All,
>
> Can I increase the number of reducers in deduplication on the CrawlDb? Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
>
> Current command in the crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
>
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
>
> Thanks in advance.
>
> Regards,
> Suraj Singh
RE: Increasing the number of reducer in Deduplication
Thanks Markus.

Regards,
Suraj Singh

-----Original Message-----
From: Markus Jelsma
Sent: Wednesday, 20 February 2019 13:04
To: user@nutch.apache.org
Subject: RE: Increasing the number of reducer in Deduplication

Hello Suraj,

That should be no problem. Duplicates are grouped by their signature, so you can have as many reducers as you would like.

Regards,
Markus

-----Original message-----
> From: Suraj Singh
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in Deduplication
>
> Hi All,
>
> Can I increase the number of reducers in deduplication on the CrawlDb? Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
>
> Current command in the crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
>
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
>
> Thanks in advance.
>
> Regards,
> Suraj Singh
RE: Increasing the number of reducer in Deduplication
Hello Suraj,

That should be no problem. Duplicates are grouped by their signature, so you can have as many reducers as you would like.

Regards,
Markus

-----Original message-----
> From: Suraj Singh
> Sent: Wednesday 20th February 2019 12:56
> To: user@nutch.apache.org
> Subject: Increasing the number of reducer in Deduplication
>
> Hi All,
>
> Can I increase the number of reducers in deduplication on the CrawlDb? Currently it is running with 1 reducer.
> Will it impact the crawling in any way?
>
> Current command in the crawl script:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb
>
> Can I update it to:
> __bin_nutch dedup "$CRAWL_PATH"/crawldb mapreduce.job.reduces=32
>
> Thanks in advance.
>
> Regards,
> Suraj Singh
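[Editorial note: Markus's point — that grouping by signature makes the reducer count irrelevant to correctness — can be sketched outside Hadoop. This is not Nutch code; `partition` and `group_by_reducer` are illustrative stand-ins for a MapReduce hash partitioner and the shuffle phase.]

```python
import hashlib
from collections import defaultdict

def partition(signature: str, num_reducers: int) -> int:
    # Stable hash partitioning, as a MapReduce partitioner applies to the
    # record key: the target reducer depends only on the signature itself,
    # never on the total reducer count chosen for a previous run or on
    # other records in the job.
    digest = hashlib.md5(signature.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_reducers

def group_by_reducer(records, num_reducers):
    # records: (signature, url) pairs, the key/value shape a dedup mapper
    # would emit. All URLs sharing a signature land in the same bucket, so
    # each reducer sees complete duplicate groups and can decide which
    # record to keep without coordinating with any other reducer.
    buckets = defaultdict(list)
    for sig, url in records:
        buckets[partition(sig, num_reducers)].append((sig, url))
    return buckets
```

Because duplicate groups never straddle reducers, raising mapreduce.job.reduces from 1 to 32 only changes how the groups are spread across workers, not which records get deduplicated.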