See inline.
 
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Friday 19th May 2017 2:08
> To: [email protected]
> Subject: Re: tuning for speed
> 
> Thanks for the notes, I can confirm that updating from more than one segment 
> at a time is very helpful. Right now, I am using 2, but should be able to 
> increase that with a little work.
> 
> I am still not clear about how to control the number of mappers in various 
> jobs. For example, the generate-select spawns 152 maps, the crawldb-update 
> spawns 156, the linkdb-merge spawns 72, and the indexer 230. Those seem to be 
> very large multiples of what my still-small cluster can do in parallel. Each 
> job has 8 or fewer reducers.

The number of mappers is almost uncontrollable; the exception is the number of fetchers 
(numFetchers). Raising numFetchers will greatly increase your fetch speed. Increasing the 
number of reducers is always a good thing if you can.
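
For the fetchers, something like this should do (untested; paths and numbers are just 
placeholders for your setup):

  # generate up to 4 segments at once, each split into 8 fetch lists / fetcher map tasks
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000 \
    -numFetchers 8 -maxNumSegments 4

-numFetchers sets the number of fetch list partitions, and thus fetcher map tasks, per 
segment; -maxNumSegments lets one generate run emit the N segments for the multi-segment 
cycle Sebastian describes below.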

> 
> Regarding data size, I looked at the "bytes_read" counters for the map tasks 
> of a few recent runs. The mean is about 1.3e08 bytes. Is that the right 
> number to look at to determine whether the maps are too big or too small? If 
> this is the correct metric, then I guess it's already in the range you 
> suggested. If not, what is the correct metric?

No, this is not the correct metric. When we need to maximize hardware utilization, we look 
at the time spent in each job and at the resources left unused: increase the resources 
actually used, and the time goes down. Thinking in terms of resource usage is Hadoop-speak 
and can be hard to grasp. At the very least, use numFetchers to run more fetchers.
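
If it helps: most 1.x jobs go through Hadoop's ToolRunner, so you should be able to raise 
the reducer count per job with a generic option, roughly like this (untested; on older 
Hadoop the property is mapred.reduce.tasks, and <segment> is a placeholder):

  # run the CrawlDb update with 16 reducers instead of the default
  bin/nutch updatedb -D mapreduce.job.reduces=16 crawl/crawldb crawl/segments/<segment>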


> 
> Should I be using Nutch 2.x, instead?

Probably not. Its strategy is similar to 1.x; its data backend is not. Nutch 2.x just 
introduces another software solution to worry about.

You have some 100M URLs, and how many hosts? If you have only a few hosts and they are NOT 
Wikipedia-sized, something else is wrong; that smells of spider traps, a problem not solved 
by adding resources. What are the stats on the number of hosts and domains? And what is your 
recrawl strategy? Does the number of db_unfetched never decrease proportionally?
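
If you don't have those numbers handy, something like the following should print them 
(untested; assumes the usual crawl/crawldb layout, and the domainstats tool is only there in 
the more recent 1.x releases, if I remember correctly):

  # status counts per CrawlDb: db_unfetched, db_fetched, db_gone, ...
  bin/nutch readdb crawl/crawldb -stats

  # counts per host (or domain/suffix/tld), written to an output directory
  bin/nutch domainstats crawl/crawldb/current host_stats host

Watching db_unfetched over a few cycles tells you quickly whether you are stuck in traps.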

Regards,
Markus

> 
> ----
> Hi Michael,
> 
> operations on a large CrawlDb of 200 million become slow, that's a matter of fact and a 
> well-known limitation of Nutch 1.x :(  The CrawlDb is a large Hadoop map file and needs to 
> be rewritten for every update (even if it's small).
> 
> If your workflow does allow it, you could process multiple segments in one cycle:
> - generate N segments in one turn
> - fetch them sequentially
> - (you may start fetching the next one if the previous is in the reduce phase)
> - do update, linkdb, index in one turn for all N segments
> 
> Regarding mappers and reducers: take a multiple of what you can run in parallel on your 
> cluster, to avoid the cluster sitting underutilized while a job waits for its last few 
> tasks to finish. The number of mappers is first determined by the number of input 
> partitions (i.e. by the number of reducers that wrote the CrawlDb or LinkDb). If partitions 
> are small or splittable, there are a couple of Hadoop configuration properties to tune the 
> data size processed by a single map task.
> 
> I would judge it from the data size: if a mapper processes only a few MBs of the CrawlDb, 
> it's too small and there is too much overhead; if it's multiple GBs (compressed), reducers 
> will run too long (and so will mappers, if the input is not splittable). But the details 
> depend on your cluster hardware.
> 
> Best,
> Sebastian
> 
> 
> On 05/16/2017 08:04 PM, Michael Coffey wrote:
> > I am looking for a methodology for making the crawler cycle go faster. I had expected 
> > the run-time to be dominated by fetcher performance but, instead, the greater bulk of 
> > the time is taken by linkdb-merge + indexer + crawldb-update + generate-select.
> > 
> > Can anyone provide an outline of such a methodology, or a link to one already published?
> > Also, more tactically speaking, I read in "Hadoop, the Definitive Guide" that the numbers 
> > of mappers and reducers are the first things to check. I know how to set the number of 
> > reducers, but it's not obvious how to control the number of mappers. In my situation 
> > (1.9e8+ urls in crawldb), I get orders of magnitude more mappers than there are disks (or 
> > cpu cores) in my cluster. Are there things I should do to bring it down to something less 
> > than 10x the number of disks or 4x the number of cores, or something like that?
> >    
> > 
> 
> 
> 
