Thanks for the notes. I can confirm that updating from more than one segment at a time is very helpful. Right now I am using 2, but I should be able to increase that with a little work.
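
A minimal sketch of passing more than one segment to a single CrawlDb update, assuming the stock Nutch 1.x bin/nutch tooling (the segment paths are purely illustrative; check "bin/nutch updatedb" usage for your version):

  # Sketch only: segment names are illustrative.
  # Pass every freshly fetched segment to one updatedb run instead of
  # running a separate update per segment.
  bin/nutch updatedb crawl/crawldb \
      crawl/segments/20170516120000 \
      crawl/segments/20170516150000

  # Alternatively, let the job pick up all segments under a directory:
  # bin/nutch updatedb crawl/crawldb -dir crawl/segments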
I am still not clear about how to control the number of mappers in the various jobs. For example, the generate-select spawns 152 maps, the crawldb-update spawns 156, the linkdb-merge spawns 72, and the indexer 230. Those seem to be very large multiples of what my still-small cluster can run in parallel. Each job has 8 or fewer reducers.

Regarding data size, I looked at the "bytes_read" counters for the map tasks of a few recent runs. The mean is about 1.3e08 bytes. Is that the right number to look at to determine whether the maps are too big or too small? If so, then I guess it's already in the range you suggested. If not, what is the correct metric?

Should I be using Nutch 2.x instead?

----

Hi Michael,

operations on a large CrawlDb of 200 million URLs become slow, that's a matter of fact and a well-known limitation of Nutch 1.x :( The CrawlDb is one large Hadoop map file and has to be rewritten for every update, even if the update itself is small.

If your workflow allows it, you could process multiple segments in one cycle (see the command-line sketch at the end of this thread):
- generate N segments in one turn
- fetch them sequentially (you may start fetching the next one while the previous is in its reduce phase)
- do the update, linkdb, and index steps in one turn for all N segments

Regarding mappers and reducers: pick a multiple of what you can run in parallel on your cluster, so that the cluster is not left underutilized while a job waits for its last few tasks to finish. The number of mappers is determined first by the number of input partitions (i.e. the number of reducers that wrote the CrawlDb or LinkDb). If the partitions are small or splittable, there are a couple of Hadoop configuration properties to tune the amount of data processed by a single map task.

I would judge it from the data size: if a mapper processes only a few MB of the CrawlDb, the splits are too small and there is too much overhead; if it processes multiple GB (compressed), the reducers will run too long (and so will the mappers, if the input is not splittable). But the details depend on your cluster hardware.

Best,
Sebastian

On 05/16/2017 08:04 PM, Michael Coffey wrote:
> I am looking for a methodology for making the crawler cycle go faster. I had
> expected the run time to be dominated by fetcher performance but, instead, the
> bulk of the time is taken by linkdb-merge + indexer + crawldb-update +
> generate-select.
>
> Can anyone provide an outline of such a methodology, or a link to one already
> published?
>
> Also, more tactically speaking, I read in "Hadoop: The Definitive Guide" that
> the numbers of mappers and reducers are the first things to check. I know how
> to set the number of reducers, but it's not obvious how to control the number
> of mappers. In my situation (1.9e8+ URLs in the crawldb), I get orders of
> magnitude more mappers than there are disks (or CPU cores) in my cluster. Are
> there things I should do to bring it down to something like 10x the number of
> disks, or 4x the number of cores?
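
For readers following the thread: the multi-segment cycle Sebastian describes above might look roughly like the sketch below, assuming the standard Nutch 1.x bin/nutch commands. The -topN value, thread count, segment-selection logic, and the -maxNumSegments option are illustrative assumptions; check the usage output of your own version.

  # Rough sketch of one crawl cycle handling N segments at a time (Nutch 1.x).
  # All paths, counts, and options are illustrative.
  CRAWLDB=crawl/crawldb
  LINKDB=crawl/linkdb
  SEGDIR=crawl/segments
  N=4

  # 1. Generate N segments in one turn.
  bin/nutch generate "$CRAWLDB" "$SEGDIR" -topN 50000 -maxNumSegments "$N"

  # 2. Fetch (and parse) the new segments one after another.
  #    (On a distributed cluster, list segments with "hadoop fs -ls" instead of ls.)
  NEW_SEGS=$(ls -d "$SEGDIR"/* | tail -n "$N")
  for SEG in $NEW_SEGS; do
      bin/nutch fetch "$SEG" -threads 50
      bin/nutch parse "$SEG"     # not needed if fetcher.parse=true
  done

  # 3. Update the CrawlDb, the LinkDb, and the index once for all N segments.
  bin/nutch updatedb "$CRAWLDB" $NEW_SEGS
  bin/nutch invertlinks "$LINKDB" $NEW_SEGS
  bin/nutch index "$CRAWLDB" -linkdb "$LINKDB" $NEW_SEGS

The bin/crawl script shipped with Nutch 1.x implements a similar loop (one segment per round) and is a useful reference for the exact options.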
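
On the mapper-count side, the Hadoop knobs Sebastian alludes to are the input split size limits. A sketch of raising them for a single job is below; the property names are the Hadoop 2.x ones (mapred.min.split.size / mapred.max.split.size on older versions), the 512 MB value is an arbitrary example, and it assumes the Nutch tool accepts generic Hadoop -D options (the standard 1.x tools are run through ToolRunner, so they normally do).

  # Sketch: fewer, larger map tasks for the CrawlDb update by raising the
  # minimum input split size to ~512 MB. Value and paths are examples only.
  bin/nutch updatedb \
      -D mapreduce.input.fileinputformat.split.minsize=536870912 \
      crawl/crawldb crawl/segments/20170516120000

  # The same properties can also be set globally in mapred-site.xml or
  # nutch-site.xml rather than per invocation. Note that a split never
  # spans part files, so the map count cannot drop below the number of
  # parts (reducers) that wrote the CrawlDb.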

