On Sep 3, 2010, at 11:07am, AJ Chen wrote:
I also tried a larger number of maps, e.g.
mapred.map.tasks=100
mapred.tasktracker.map.tasks.maximum=5
However, the hadoop console shows num map tasks = 40. Why is the total number of map tasks capped at 40? Maybe another config parameter overrides mapred.map.tasks?
The number of mappers (child JVMs launched by the TaskTracker on a slave) is controllable by you, in the Hadoop configuration XML files. The number of map tasks for a given job is essentially out of your control - it's determined by the system, based on the number of splits calculated by the input format for the specified input data. In Hadoop 0.20 they removed this configuration, IIRC, since it was confusing for users to try to set it and then have the value ignored.

Typically splits are on a per-HDFS-block basis, so if you need to get more mappers running you can configure HDFS to use a smaller block size (the default is 64MB). But typically the number of map tasks doesn't have a big impact on overall performance, other than in the case of an unsplittable input file (e.g. a .gz compressed file), which means a single map task has to process the entire file.
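To make this concrete, here is a minimal sketch using the 0.20-era JobConf API (the class name and the 32MB value are only illustrative; the slot count is normally a cluster-wide setting in the config XML rather than something you set per job):

  import org.apache.hadoop.mapred.JobConf;

  public class MapTaskSettings {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // Per-TaskTracker mapper slots; normally set cluster-wide in mapred-site.xml.
      conf.setInt("mapred.tasktracker.map.tasks.maximum", 5);
      // Only a hint: the InputFormat's split calculation decides the real task count.
      conf.setNumMapTasks(100);
      // Smaller blocks => more splits => more map tasks, but the block size only
      // applies to files written after the change, not to existing data.
      conf.setLong("dfs.block.size", 32L * 1024 * 1024);
    }
  }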
-- Ken
On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:
The other option for reducing the time spent fetching the last 1% of URLs may be using a smaller queue size, I think. In the Fetcher class, the queue size is magically determined as threadCount * 50:

feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);

Is there any good reason for the factor of 50? With 100 threads, the queue size is 5000, which seems to cause a long wait toward the end of the fetch. I want to reduce the queue size to 100 regardless of the number of threads. Does this make sense? Will a smaller queue size have any other negative effect?
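What I have in mind is something like the following minimal sketch (fetcher.queue.size is a hypothetical new property, not an existing Nutch setting):

  import org.apache.hadoop.conf.Configuration;

  public class QueueSizeSketch {
    // Replacement for the hard-coded threadCount * 50 in Fetcher; falls back
    // to the current behaviour when the property is not set.
    static int feederQueueSize(Configuration conf, int threadCount) {
      return conf.getInt("fetcher.queue.size", threadCount * 50);
    }
  }

The QueueFeeder would then be created as new QueueFeeder(input, fetchQueues, feederQueueSize(conf, threadCount)).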
-aj
On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
Thanks for suggesting the multiple-segments approach - it's the way to go for further increasing crawling throughput. I tried the -maxNumSegments 3 option in local mode, but it did not generate 3 segments. Does the option work? It may only work in distributed mode.
I also observe that, when fetching a 1M-URL segment, 99% is done in 4 hours, but the last 1% takes forever. For performance reasons, it makes sense to drop the last 1% of URLs. One option is to set fetcher.timelimit.mins to an appropriate time span, but estimating that time span may not be reliable. Is there another, smarter way to empty the queues toward the end of fetching (before the Fetcher is done)? This could potentially save several hours per fetch operation.
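For reference, setting the time limit would look something like this minimal sketch (the class name and the 240-minute value are only illustrative; the property can also go into nutch-site.xml):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class TimeLimitSketch {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      // Stop feeding and fetching roughly 4 hours after the fetch starts.
      conf.setLong("fetcher.timelimit.mins", 240);
    }
  }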
thanks,
-aj
On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M URLs, lots of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time to increase as the crawl db grows to 100M URLs. Is there any good way to reduce the non-fetching time (i.e. segment generation and crawldb update)?
That's surprisingly long for this configuration... What do you think takes most time in e.g. the updatedb job: the map, shuffle, sort or reduce phase?
One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go, then fetch one. Next, start fetching the next one, and in parallel you can start parsing/updatedb from the first segment. Note that you need to either generate multiple segments (there's an option in Generator to do so), or you need to turn on generate.update.crawldb, but you don't need both.
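For example, a minimal sketch of the second option (the class name is only illustrative; the property can equally be set in nutch-site.xml):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class OverlapSketch {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      // Mark generated URLs in the crawldb so successive generate runs don't
      // hand out the same URLs again before updatedb has run.
      conf.setBoolean("generate.update.crawldb", true);
    }
  }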
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g