In the standard crawl script, there is a __bin_nutch updatedb command and, soon after that, a __bin_nutch dedup command. Both of them launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job). In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb. Why should that be? Is it doing something different? I notice that the script passes $commonOptions to updatedb but not to dedup.
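
For reference, the relevant lines of bin/crawl look roughly like this (quoted from memory, so the exact arguments and variable names may differ in your Nutch version):

  # update the crawldb with the newly fetched segment
  __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
  ...
  # deduplicate the crawldb (note: no $commonOptions passed here)
  __bin_nutch dedup "$CRAWL_PATH"/crawldb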

