Hi Michael,

operations on a large CrawlDb of 200 million URLs become slow, that's a matter of fact and a well-known limitation of Nutch 1.x :( The CrawlDb is a large Hadoop map file and has to be rewritten completely for every update, even if the update itself is small.
If your workflow allows it, you could process multiple segments in one cycle:

- generate N segments in one turn
- fetch them sequentially (you may start fetching the next one while the previous
  one is in the reduce phase)
- run updatedb, invertlinks and index in one turn for all N segments

(see the command sketch below the quoted message)

Regarding mappers and reducers: take a multiple of what you can run in parallel on your
cluster, so that the cluster is not underutilized while a job is waiting for its last
few tasks to finish. The number of mappers is primarily determined by the number of
input partitions, which in turn is determined by the number of reducers that wrote the
CrawlDb or LinkDb. If the partitions are small or splittable, there are a couple of
Hadoop configuration properties to tune the amount of data processed by a single map
task (see the second sketch below).

I would judge by the data size: if a mapper processes only a few MB of the CrawlDb, the
tasks are too small and the per-task overhead dominates; if it processes multiple GB
(compressed), the reducers will run too long, and so will the mappers if the input is
not splittable. But the details depend on your cluster hardware.

Best,
Sebastian

On 05/16/2017 08:04 PM, Michael Coffey wrote:
> I am looking for a methodology for making the crawler cycle go faster. I had
> expected the run-time to be dominated by fetcher performance but, instead,
> the greater bulk of the time is taken by linkdb-merge + indexer +
> crawldb-update + generate-select.
>
> Can anyone provide an outline of such a methodology, or a link to one already
> published?
>
> Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that
> the numbers of mappers and reducers are the first things to check. I know how
> to set the number of reducers, but it's not obvious how to control the number
> of mappers. In my situation (1.9e8+ urls in crawldb), I get orders of
> magnitude more mappers than there are disks (or cpu cores) in my cluster. Are
> there things I should do to bring it down to something less than 10x the
> number of disks or 4x the number of cores, or something like that?
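As a rough sketch of such a multi-segment cycle, using the standard Nutch 1.x
command-line tools (the paths, -topN, thread count and number of segments are just
placeholders, adjust them to your crawl; the parse step is only needed if the fetcher
does not parse):

  # generate up to 3 segments in one turn
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -numFetchers 4 -maxNumSegments 3

  # fetch (and parse) the generated segments one after the other
  for seg in crawl/segments/*; do
    bin/nutch fetch $seg -threads 50
    bin/nutch parse $seg
  done

  # one updatedb, invertlinks and index run over all segments of this cycle
  # (assuming crawl/segments contains only the segments of this cycle)
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments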
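And a sketch of the knobs for the task numbers, assuming Hadoop 2.x property names and
that the tools accept Hadoop's generic -D options (the properties can also be set in
nutch-site.xml or mapred-site.xml instead); the values are only examples:

  # number of reduce tasks: this also sets the number of CrawlDb partitions and
  # thereby the minimum number of map tasks when the CrawlDb is read again
  bin/nutch updatedb -Dmapreduce.job.reduces=16 crawl/crawldb -dir crawl/segments

  # if the partitions are splittable: a min split size above the HDFS block size
  # gives fewer, larger map tasks, a smaller max split size gives more, smaller ones
  bin/nutch generate \
    -Dmapreduce.input.fileinputformat.split.minsize=536870912 \
    crawl/crawldb crawl/segments -topN 100000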

