In the standard crawl script there is a __bin_nutch updatedb command and, soon 
after it, a __bin_nutch dedup command. Both launch Hadoop jobs with 
"crawldb /path/to/crawl/db" in their names (in the dedup case, in addition to 
the actual deduplication job).
In my case, the "crawldb" job launched by dedup takes twice as long as the one 
launched by updatedb. Why would that be? Is it doing something different?
I notice that the script passes $commonOptions to updatedb but not to dedup.
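For reference, the relevant lines of bin/crawl look roughly like this (a
sketch from memory of the Nutch 1.x script; the exact echo text and arguments
vary by version):

    # CrawlDB update: $commonOptions is passed through here
    echo "CrawlDB update"
    __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT

    # Deduplication: note that $commonOptions is NOT passed here
    echo "Deduplication on crawldb"
    __bin_nutch dedup "$CRAWL_PATH"/crawldb

Since $commonOptions typically carries -D Hadoop settings such as the number
of reduce tasks and map-output compression, I could imagine its absence on the
dedup line affecting run time, but I may be misreading the script.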
