On Sep 3, 2010, at 11:07am, AJ Chen wrote:

I also tried a larger number of maps, e.g.
mapred.map.tasks=100
mapred.tasktracker.map.tasks.maximum=5
However, the Hadoop console shows num map tasks = 40. Why is the total number
of map tasks capped at 40? Maybe another config parameter overrides
mapred.map.tasks?

The number of concurrent mappers (child JVMs launched by the TaskTracker on a slave) is controllable by you, in the Hadoop configuration XML files (that's what mapred.tasktracker.map.tasks.maximum sets).

The number of map tasks for a given job is essentially out of your control - it's determined by the system, based on the number of splits calculated by the input format, for the specified input data. In Hadoop 0.20 they've removed this configuration, IIRC, since it was confusing for users to try to set this, and then have the value ignored.

Typically splits are made on a per-HDFS-block basis, so if you need more mappers running you can configure HDFS to use a smaller block size (the default is 64 MB). But the number of map tasks usually doesn't have a big impact on overall performance, except in the case of an unsplittable input file (e.g. a .gz compressed file), which forces a single map task to process the entire file.
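
To make the per-block arithmetic concrete, here's a throwaway sketch (the 2.5 GB input size below is a made-up figure for illustration, not taken from your job):

    public class MapTaskEstimate {
        // Back-of-the-envelope sketch: with one split per HDFS block, the number
        // of map tasks is roughly totalInputSize / blockSize.
        public static void main(String[] args) {
            long totalInputSize = 2560L * 1024 * 1024; // hypothetical 2.5 GB of input
            long defaultBlock = 64L * 1024 * 1024;     // dfs.block.size default (64 MB)
            long smallerBlock = 32L * 1024 * 1024;     // e.g. 32 MB to get more mappers
            System.out.println("maps @ 64 MB blocks: " + (totalInputSize / defaultBlock)); // 40
            System.out.println("maps @ 32 MB blocks: " + (totalInputSize / smallerBlock)); // 80
        }
    }

So 40 map tasks would be consistent with roughly 2.5 GB of input at the default block size, though that's just a guess at your data volume.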

-- Ken

On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:

The other option for reducing the time spent fetching the last 1% of URLs may be
using a smaller queue size, I think.
In the Fetcher class, the queue size is magically determined as threadCount * 50:
   feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
Is there any good reason for the factor of 50? With 100 threads the queue size is 5000, which seems to cause a long wait toward the end of the fetch. I want to reduce the queue size to 100 regardless of the number of
threads. Does this make sense? Will a smaller queue size have any other
negative effect?
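
Concretely, the change I'm thinking of would be something like this in Fetcher (just a sketch; the cap of 100 is the value proposed above):

    // Proposed tweak: cap the feeder queue at 100 items regardless of thread count,
    // instead of the current threadCount * 50 (which is 5000 at 100 threads).
    int queueSize = Math.min(threadCount * 50, 100);
    feeder = new QueueFeeder(input, fetchQueues, queueSize);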

-aj

On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:

Thanks for suggesting the multiple-segments approach - it's the way to go for further increasing crawling throughput. I tried the -maxNumSegments 3 option in local mode, but it did not generate 3 segments. Does the option
work? Maybe it only works in distributed mode.

I also observe that, when fetching a 1M-URL segment, 99% is done in 4 hours but the last 1% takes forever. For performance reasons, it makes sense to drop the last 1% of URLs. One option is to set fetcher.timelimit.mins to an appropriate time span, but estimating that span may not be reliable. Is there a smarter way to empty the queues toward the end of fetching (before the Fetcher is done)? This could potentially save several hours per
fetch operation.
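
For reference, setting that limit programmatically would look roughly like this (a sketch assuming a Hadoop Configuration object named conf; normally it would just go in nutch-site.xml):

    // Sketch: give the fetch a fixed wall-clock budget, e.g. 5 hours, so the long
    // tail of slow URLs is dropped instead of fetched.
    conf.setLong("fetcher.timelimit.mins", 300);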

thanks,
-aj



On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:

On 2010-08-17 23:16, AJ Chen wrote:

Scott, thanks again for your insights. My 4 cheap Linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M URLs, a lot of time is spent
generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time will increase as the crawl
db grows to 100M URLs. Is there any good way to reduce the non-fetching
time (i.e. generating the segment and updating the crawldb)?


That's surprisingly long for this configuration... What do you think takes the most time in, e.g., the updatedb job: the map, shuffle, sort, or reduce phase?

One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go, then fetch one. When that's done, start fetching the next one, and in parallel start parsing/updatedb from the first segment. Note that you need to either generate multiple segments (there's an option in Generator to do so) or turn on
generate.update.crawldb, but you don't need both.
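
To illustrate the second alternative, turning it on looks roughly like this (a sketch, assuming you set it from code on a Configuration object named conf rather than in nutch-site.xml):

    // Sketch: with generate.update.crawldb enabled, Generator records in the crawldb
    // which URLs it has already handed out, so overlapping generate runs don't
    // produce overlapping fetchlists.
    conf.setBoolean("generate.update.crawldb", true);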

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com




--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA





--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g




