I am looking for a methodology for making the crawler cycle go faster. I had 
expected the runtime to be dominated by fetcher performance but, instead, the 
bulk of the time is taken by linkdb-merge + indexer + crawldb-update + 
generate-select.
Can anyone provide an outline of such a methodology, or a link to one already 
published?
Also, more tactically speaking, I read in "Hadoop: The Definitive Guide" that 
the numbers of mappers and reducers are the first things to check. I know how 
to set the number of reducers, but it's not obvious how to control the number 
of mappers. In my situation (1.7e8+ urls in the crawldb), I get orders of 
magnitude more mappers than there are disks (or cpu cores) in my cluster. Are 
there things I should do to bring that down to something less than 10x the 
number of disks or 4x the number of cores, or something like that?
