I am looking for a methodology for making the crawler cycle go faster. I had
expected the run-time to be dominated by fetcher performance, but instead the
bulk of the time is taken by linkdb-merge + indexer + crawldb-update +
generate-select.
Can anyone provide an outline of such a methodology, or a link to one already
published?
Also, more tactically speaking, I read in "Hadoop: The Definitive Guide" that
the numbers of mappers and reducers are the first things to check. I know how
to set the number of reducers, but it's not obvious how to control the number
of mappers. In my situation (1.7e8+ URLs in the crawldb), I get orders of
magnitude more mappers than there are disks (or CPU cores) in my cluster. Are
there things I should do to bring that down to something less than 10x the
number of disks or 4x the number of cores, or something like that?
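
For instance, is raising the minimum split size the right lever? Below is my
rough guess at what that would look like; the property name and the old-API
JobConf usage are my assumptions for a Hadoop 0.20-era setup, and I haven't
tried it yet:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapred.JobConf;

    public class FewerMappers {
        public static void main(String[] args) {
            // Build a job configuration the old mapred-API way.
            JobConf job = new JobConf(new Configuration());

            // My guess at the main lever: with FileInputFormat the number of
            // map tasks is driven by input splits, so a larger minimum split
            // size should make each mapper read several HDFS blocks and bring
            // the mapper count down.
            job.setLong("mapred.min.split.size", 512L * 1024 * 1024); // 512 MB

            // As I understand it, this one is only a hint to the framework,
            // unlike the reducer count.
            job.setNumMapTasks(40);

            System.out.println("min split size = "
                + job.getLong("mapred.min.split.size", 0));
        }
    }

Or is this the kind of thing that is better set cluster-wide in
hadoop-site.xml rather than per job?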