In the standard crawl script, there is a __bin_nutch updatedb command and, soon after that, a __bin_nutch dedup command. Both of them launch Hadoop jobs with "crawldb /path/to/crawl/db" in their names (in addition to the actual deduplication job). In my situation, the "crawldb" job launched by dedup takes twice as long as the one launched by updatedb. Why should that be? Is it doing something different? I notice that the script passes $commonOptions to updatedb but not to dedup.
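
For reference, the relevant lines of bin/crawl look roughly like this (quoted from memory, so the exact arguments and variable names may differ in your Nutch version):

  # update the crawldb with the newly fetched segment
  __bin_nutch updatedb $commonOptions "$CRAWL_PATH"/crawldb "$CRAWL_PATH"/segments/$SEGMENT
  ...
  # deduplicate the crawldb (note: no $commonOptions passed here)
  __bin_nutch dedup "$CRAWL_PATH"/crawldb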

