I am looking for a methodology for making the crawler cycle go faster. I had expected the run-time to be dominated by fetcher performance, but instead the bulk of the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.
Can anyone provide an outline of such a methodology, or a link to one already published?

Also, more tactically: I read in "Hadoop: The Definitive Guide" that the numbers of mappers and reducers are the first things to check. I know how to set the number of reducers, but it's not obvious how to control the number of mappers. In my situation (1.9e8+ URLs in the crawldb), I get orders of magnitude more mappers than there are disks (or CPU cores) in my cluster. Are there things I should do to bring that down to something like less than 10x the number of disks, or 4x the number of cores?
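For what it's worth, here is roughly how I understand the knobs so far (a minimal sketch using the old 0.20-era mapred API; the property name and the 256 MB value are just my guesses at a starting point, so please correct me if I'm misreading it). Reducers can be set directly, but mappers seem to be one-per-input-split, so the only lever I can see is the split size:

    import org.apache.hadoop.mapred.JobConf;

    // Reducers: set directly on the job.
    JobConf conf = new JobConf(MyCrawlStep.class);   // hypothetical job class, for illustration
    conf.setNumReduceTasks(16);

    // Mappers: one per input split, so the count is driven by split/block size
    // rather than set directly. Raising the minimum split size should lower it.
    conf.setLong("mapred.min.split.size", 256L * 1024 * 1024);  // 256 MB splits (assumed value)

I assume the same could be passed on the command line as a -D option (e.g. -D mapred.min.split.size=268435456) without patching the jobs, but I haven't verified that this actually changes the mapper count for the crawldb-update or linkdb-merge steps. Is split size the right lever here, or is there something Nutch-specific I'm missing?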

