Thanks for the notes. I can confirm that updating from more than one segment at
a time is very helpful. Right now I am using 2, but I should be able to increase
that with a little work.

I am still not clear about how to control the number of mappers in various 
jobs. For example, the generate-select spawns 152 maps, the crawldb-update 
spawns 156, the linkdb-merge spawns 72, and the indexer 230. Those seem to be 
very large multiples of what my still-small cluster can do in parallel. Each 
job has 8 or fewer reducers.

Regarding data size, I looked at the "bytes_read" counters for the map tasks of 
a few recent runs. The mean is about 1.3e8 bytes (roughly 130 MB). Is that the 
right number to
look at to determine whether the maps are too big or too small? If this is the 
correct metric, then I guess it's already in the range you suggested. If not, 
what is the correct metric?

Should I be using Nutch 2.x instead?

----
Hi Michael,

operations on a large CrawlDb of 200 million URLs become slow; that's a 
well-known limitation of Nutch 1.x :(  The CrawlDb is a large Hadoop map file 
and needs to be rewritten for every update, even if the update itself is small.

If your workflow allows it, you could process multiple segments in one cycle 
(a rough sketch follows this list):
- generate N segments in one turn
- fetch them sequentially
- (you may start fetching the next one if the previous is in the reduce phase)
- do update, linkdb, index in one turn for all N segments
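
Roughly like this with the 1.x command-line tools (an untested sketch: the 
paths under crawl/ and N=4 are placeholders, so double-check the options 
against your Nutch version):

  # generate N segments in one turn (here N=4, paths are placeholders)
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 4

  # fetch and parse each new segment sequentially; segment names are
  # timestamps, so the newest ones sort last (on HDFS, list them with
  # "hadoop fs -ls crawl/segments" instead of a local ls)
  for seg in $(ls -d crawl/segments/* | tail -n 4); do
    bin/nutch fetch "$seg"
    bin/nutch parse "$seg"
  done

  # update, invert links, and index once for all segments; note that -dir
  # picks up every segment under crawl/segments, so move or delete segments
  # already processed in earlier cycles
  bin/nutch updatedb crawl/crawldb -dir crawl/segments
  bin/nutch invertlinks crawl/linkdb -dir crawl/segments
  bin/nutch index crawl/crawldb -linkdb crawl/linkdb -dir crawl/segments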

Regarding mappers and reducers: choose a multiple of what you can run in 
parallel on your cluster, so the cluster is not underutilized while a job is 
waiting for the last few tasks to finish. The number of mappers is determined 
first by the number of input partitions (which equals the number of reducers 
that wrote the CrawlDb or LinkDb). If the partitions are small or splittable, 
there are a couple of Hadoop configuration properties to tune the amount of 
data processed by a single map task.
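
For example (a sketch using the Hadoop 2.x property names; if the tool does 
not accept generic -D options in your setup, the same properties can be set in 
nutch-site.xml or mapred-site.xml):

  # cap each map task at ~256 MB of input -> more, smaller map tasks
  bin/nutch updatedb \
    -D mapreduce.input.fileinputformat.split.maxsize=268435456 \
    crawl/crawldb -dir crawl/segments

  # or require at least ~1 GB per split (only effective if the input files
  # are splittable) -> fewer, larger map tasks
  bin/nutch updatedb \
    -D mapreduce.input.fileinputformat.split.minsize=1073741824 \
    crawl/crawldb -dir crawl/segments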

I would judge it by data size: if a mapper processes only a few MB of the 
CrawlDb, the maps are too small and there is too much per-task overhead; if it 
is multiple GB (compressed), the reducers will run too long (and so will the 
mappers, if the input is not splittable). But the details depend on your 
cluster hardware.
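
A quick way to check is to put the total CrawlDb size next to the number of 
map tasks of the update job (paths as above):

  # total (compressed) size of the current CrawlDb on HDFS
  hadoop fs -du -s -h crawl/crawldb/current

  # divide that by the number of map tasks shown for the crawldb-update job
  # in the job counters / web UI to get the average input per map task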

Best,
Sebastian


On 05/16/2017 08:04 PM, Michael Coffey wrote:
> I am looking for a methodology for making the crawler cycle go faster. I had
> expected the run-time to be dominated by fetcher performance but, instead, the
> greater bulk of the time is taken by linkdb-merge + indexer + crawldb-update +
> generate-select.
> 
> Can anyone provide an outline of such a methodology, or a link to one already
> published?
> 
> Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that
> the numbers of mappers and reducers are the first things to check. I know how
> to set the number of reducers, but it's not obvious how to control the number
> of mappers. In my situation (1.9e8+ urls in crawldb), I get orders of magnitude
> more mappers than there are disks (or cpu cores) in my cluster. Are there
> things I should do to bring it down to something less than 10x the number of
> disks or 4x the number of cores, or something like that?

