I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance, but instead the bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.
Can anyone provide an outline of such a methodology, or a link to one already published?

Also, more tactically: I read in "Hadoop: The Definitive Guide" that the numbers of mappers and reducers are the first things to check. I know how to set the number of reducers, but it's not obvious how to control the number of mappers. In my situation (1.9e8+ URLs in the crawldb), I get orders of magnitude more mappers than there are disks (or CPU cores) in my cluster. Are there things I should do to bring that down to something like less than 10x the number of disks, or 4x the number of cores?
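For what it's worth, here is roughly how I understand the knobs so far (a minimal sketch using the old 0.20-era mapred API; the property name and the 256 MB value are just my guesses at a starting point, so please correct me if I'm misreading it). Reducers can be set directly, but mappers seem to be one-per-input-split, so the only lever I can see is the split size:

    import org.apache.hadoop.mapred.JobConf;

    // Reducers: set directly on the job.
    JobConf conf = new JobConf(MyCrawlStep.class);   // hypothetical job class, for illustration
    conf.setNumReduceTasks(16);

    // Mappers: one per input split, so the count is driven by split/block size
    // rather than set directly. Raising the minimum split size should lower it.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);  // 256 MB splits (assumed value)

I assume the same could be passed on the command line as a -D option (e.g. -D mapred.min.split.size=268435456) without patching the jobs, but I haven't verified that this actually changes the mapper count for the crawldb-update or linkdb-merge steps. Is split size the right lever here, or is there something Nutch-specific I'm missing?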

