Hi Michael,

operations on a CrawlDb of 200 million URLs become slow, that's a well-known
limitation of Nutch 1.x :(  The CrawlDb is one large Hadoop MapFile and needs
to be rewritten completely for every update, even a small one.

If your workflow allows it, you could process multiple segments in one cycle
(a sketch follows below):
- generate N segments in one turn
- fetch them sequentially
  (you may start fetching the next one while the previous fetch is in its
  reduce phase)
- run updatedb, invertlinks, and index in one turn for all N segments
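
For illustration, a minimal sketch of such a cycle, assuming a standard
Nutch 1.x layout under crawl/ (N, -topN, the thread count, and all paths
are example values, not taken from your setup):

  # generate N segments at once (-maxNumSegments is a Generator option)
  N=3
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments $N

  # pick the N newest segments by their timestamp names
  # (use "hadoop fs -ls" instead if the crawl dirs live on HDFS)
  SEGS=$(ls -d crawl/segments/* | sort | tail -n $N)

  # fetch and parse them sequentially
  for seg in $SEGS; do
    bin/nutch fetch "$seg" -threads 50
    bin/nutch parse "$seg"
  done

  # one updatedb / invertlinks / index run covering all N segments
  bin/nutch updatedb crawl/crawldb $SEGS
  bin/nutch invertlinks crawl/linkdb $SEGS
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb $SEGS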

Regarding mappers and reducers: take a multiple of what you can run in
parallel on your cluster, so that the cluster isn't underutilized while a job
waits for the last few tasks to finish. For example, if 40 reduce tasks can
run in parallel, 80 or 120 reducers keep the task waves evenly sized, while
50 would leave most of the cluster idle during the second wave.
The number of mappers is determined in the first place by the number of input
partitions (which equals the number of reducers that wrote the CrawlDb or
LinkDb). If the partitions are small, or if they are splittable, there are a
couple of Hadoop configuration properties to tune the amount of data processed
by a single map task.
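
A hedged example of both knobs, assuming Hadoop 2.x property names and that
the Nutch tool is launched via ToolRunner so it accepts generic -D options
($SEG stands for a segment path; 16 and 256 MB are placeholder values):

  # fewer but larger partitions: the CrawlDb written by this update will
  # consist of 16 part files, so later jobs over it start 16 map tasks
  # (or more, if the partitions are splittable)
  bin/nutch updatedb -Dmapreduce.job.reduces=16 crawl/crawldb "$SEG"

  # larger splits for splittable input: aim at ~256 MB per map task
  # (mapreduce.input.fileinputformat.split.maxsize caps it from above;
  #  the pre-2.x names were mapred.min.split.size / mapred.max.split.size)
  bin/nutch updatedb \
    -Dmapreduce.input.fileinputformat.split.minsize=268435456 \
    crawl/crawldb "$SEG"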

I would look at it from the data size: if a mapper processes only a few MBs
of the CrawlDb, the splits are too small and the per-task overhead dominates;
if it's multiple GBs (compressed), the reducers will run too long (and the
mappers as well, if the input is not splittable). The exact numbers depend on
your cluster hardware.
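
To see where you stand, you can check the partition sizes of the CrawlDb
directly (the path is an example):

  # per-partition sizes of the current CrawlDb
  hadoop fs -du -h crawl/crawldb/current

Dividing the total size by a few hundred MBs per map task gives a reasonable
first guess for the number of partitions/reducers.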

Best,
Sebastian


On 05/16/2017 08:04 PM, Michael Coffey wrote:
> I am looking for a methodology for making the crawler cycle go faster. I had 
> expected the run-time to be dominated by fetcher performance but, instead, 
> the greater bulk of the time is taken by linkdb-merge + indexer + 
> crawldb-update + generate-select.
> 
> 
> Can anyone provide an outline of such a methodology, or a link to one already 
> published?
> Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that 
> the numbers of mappers and reducers are the first things to check. I know how 
> to set the number of reducers, but it's not obvious how to control the number 
> of mappers. In my situation (1.9e8+ urls in crawldb), I get orders of 
> magnitude more mappers than there are disks (or cpu cores) in my cluster. Are 
> there things I should do to bring it down to something less than 10x the 
> number of disks or 4x the number of cores, or something like that?
