On Sep 3, 2010, at 11:07am, AJ Chen wrote:
I also tried a larger number of maps, e.g.
mapred.map.tasks=100
mapred.tasktracker.map.tasks.maximum=5
However, the hadoop console shows num map tasks = 40. Why is the total number of map tasks capped at 40? Maybe another config parameter overrides mapred.map.tasks?
The number of mappers (child JVMs launched by the TaskTracker on a slave) is controllable by you, in the Hadoop configuration XML files. The number of map tasks for a given job is essentially out of your control - it's determined by the system, based on the number of splits calculated by the input format for the specified input data. In Hadoop 0.20 they removed this configuration, IIRC, since it was confusing for users to try to set it and then have the value ignored.

Typically splits are on a per-HDFS-block basis, so if you need to get more mappers running you can configure HDFS to use a smaller block size (the default is 64MB). But typically the number of map tasks doesn't have a big impact on overall performance, other than in the case of an unsplittable input file (e.g. a .gz compressed file), which means a single map task has to process the entire file.
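To make this concrete, here is a minimal sketch using the 0.20-era JobConf API (the class name and the 32MB value are only illustrative; the slot count is normally a cluster-wide setting in the config XML rather than something you set per job):

  import org.apache.hadoop.mapred.JobConf;

  public class MapTaskSettings {
    public static void main(String[] args) {
      JobConf conf = new JobConf();
      // Per-TaskTracker mapper slots; normally set cluster-wide in mapred-site.xml.
      conf.setInt("mapred.tasktracker.map.tasks.maximum", 5);
      // Only a hint: the InputFormat's split calculation decides the real task count.
      conf.setNumMapTasks(100);
      // Smaller blocks => more splits => more map tasks, but the block size only
      // applies to files written after the change, not to existing data.
      conf.setLong("dfs.block.size", 32L * 1024 * 1024);
    }
  }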
-- Ken
On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:
The other option for reducing the time spent fetching the last 1% of URLs may be using a smaller queue size, I think. In the Fetcher class, the queue size is magically determined as threadCount * 50:

feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);

Is there any good reason for the factor of 50? With 100 threads, the queue size is 5000, which seems to cause a long wait toward the end of the fetch. I want to reduce the queue size to 100 regardless of the number of threads. Does this make sense? Will a smaller queue size have any other negative effect?
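What I have in mind is something like the following minimal sketch (fetcher.queue.size is a hypothetical new property, not an existing Nutch setting):

  import org.apache.hadoop.conf.Configuration;

  public class QueueSizeSketch {
    // Replacement for the hard-coded threadCount * 50 in Fetcher; falls back
    // to the current behaviour when the property is not set.
    static int feederQueueSize(Configuration conf, int threadCount) {
      return conf.getInt("fetcher.queue.size", threadCount * 50);
    }
  }

The QueueFeeder would then be created as new QueueFeeder(input, fetchQueues, feederQueueSize(conf, threadCount)).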
-aj
On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
Thanks for suggesting the multiple-segments approach - it's the way to go for further increasing crawling throughput. I tried the -maxNumSegments 3 option in local mode, but it did not generate 3 segments. Does the option work? It may only work in distributed mode.
I also observe that, when fetching a 1M-URL segment, 99% is done in 4 hours, but the last 1% takes forever. For performance reasons, it makes sense to drop the last 1% of URLs. One option is to set fetcher.timelimit.mins to an appropriate time span, but estimating that time span may not be reliable. Is there another, smarter way to empty the queues toward the end of fetching (before the Fetcher is done)? This could potentially save several hours per fetch operation.
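For reference, setting the time limit would look something like this minimal sketch (the class name and the 240-minute value are only illustrative; the property can also go into nutch-site.xml):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class TimeLimitSketch {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      // Stop feeding and fetching roughly 4 hours after the fetch starts.
      conf.setLong("fetcher.timelimit.mins", 240);
    }
  }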
thanks,
-aj
On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M URLs, lots of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time to increase as the crawl db grows to 100M URLs. Is there any good way to reduce the non-fetching time (i.e. segment generation and crawldb update)?
That's surprisingly long for this configuration... What do you think takes most time in e.g. the updatedb job: the map, shuffle, sort or reduce phase?
One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go, then fetch one. Next, start fetching the next one, and in parallel you can start parsing/updatedb from the first segment. Note that you need to either generate multiple segments (there's an option in Generator to do so), or you need to turn on generate.update.crawldb, but you don't need both.
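For example, a minimal sketch of the second option (the class name is only illustrative; the property can equally be set in nutch-site.xml):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.nutch.util.NutchConfiguration;

  public class OverlapSketch {
    public static void main(String[] args) {
      Configuration conf = NutchConfiguration.create();
      // Mark generated URLs in the crawldb so successive generate runs don't
      // hand out the same URLs again before updatedb has run.
      conf.setBoolean("generate.update.crawldb", true);
    }
  }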
--
Best regards,
Andrzej Bialecki <><
http://www.sigram.com Contact: info at sigram dot com
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g