Hi Andrzej,
During updatedb, the reduce tasks take most of the time (as seen in the
log). There are lots of messages (below) that seem to indicate some
problem, but I'm not sure what they mean. How can I prevent this slowdown?

2010-08-17 09:31:54,564 INFO  mapred.ReduceTask -
attempt_201008141418_0023_r_000004_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - header:
attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw
bytes) into RAM from attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:54,664 INFO  mapred.ReduceTask - Read 2 bytes from
map-output for attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:54,664 INFO  mapred.ReduceTask - Rec #1 from
attempt_201008141418_0023_m_000023_0 -> (-1, -1) from
vmo-crawl05-dev.healthline.com
2010-08-17 09:31:55,264 INFO  mapred.ReduceTask -
attempt_201008141418_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and0
dup hosts)
2010-08-17 09:31:55,308 INFO  mapred.ReduceTask - header:
attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:55,308 INFO  mapred.ReduceTask - Shuffling 2 bytes (6 raw
bytes) into RAM from attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:55,394 INFO  mapred.ReduceTask - Read 2 bytes from
map-output for attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:55,395 INFO  mapred.ReduceTask - Rec #1 from
attempt_201008141418_0023_m_000023_0 -> (-1, -1) from
vmo-crawl05-dev.healthline.com
2010-08-17 00:43:05,175 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000000_0 Need another 36 map output(s) where 0
is already in progress
2010-08-17 00:43:05,176 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000000_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2010-08-17 00:43:07,679 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000004_0 Need another 36 map output(s) where 0
is already in progress
2010-08-17 00:43:07,679 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000004_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)
2010-08-17 00:44:05,224 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000000_0 Need another 36 map output(s) where 0
is already in progress
2010-08-17 00:44:05,224 INFO  mapred.ReduceTask -
attempt_201008141418_0022_r_000000_0 Scheduled 0 outputs (0 slow hosts and0
dup hosts)


thanks,
-aj


On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:

> On 2010-08-17 23:16, AJ Chen wrote:
>
>> Scott, thanks again for your insights. My 4 cheap Linux boxes are now
>> crawling selected sites at about 1M pages per day. The fetch itself is
>> reasonably fast. But when the crawl db has >10M URLs, a lot of time is
>> spent generating segments (2-3 hours) and updating the crawldb (4-5
>> hours after each segment). I expect this non-fetching time to increase
>> as the crawl db grows to 100M URLs. Is there any good way to reduce the
>> non-fetching time (i.e. segment generation and crawldb update)?
>>
>
> That's surprisingly long for this configuration... What do you think takes
> most time in e.g. the updatedb job: the map, shuffle, sort, or reduce phase?
>
> One strategy to minimize the turnaround time is to overlap crawl cycles:
> e.g. generate multiple fetchlists in one go, fetch the first one, and then,
> while fetching the second, start parsing/updatedb on the first segment in
> parallel. Note that you need either to generate multiple segments (there's
> an option in Generator to do so) or to turn on generate.update.crawldb, but
> you don't need both. A rough sketch of such a cycle follows.
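>
> Something along these lines (a sketch only, assuming the Nutch 1.1-era
> CLI; the paths, the -topN value, and -maxNumSegments support are
> assumptions, so check the usage output of your version first):
>
>   # generate two fetchlists in one go; alternatively, run generate twice
>   # with generate.update.crawldb=true set in nutch-site.xml
>   bin/nutch generate crawl/crawldb crawl/segments -topN 100000 -maxNumSegments 2
>
>   # pick up the two newest segment directories
>   s1=`ls -d crawl/segments/* | tail -2 | head -1`
>   s2=`ls -d crawl/segments/* | tail -1`
>
>   # fetch the first segment
>   bin/nutch fetch $s1
>
>   # fetch the second segment while parsing/updating from the first
>   bin/nutch fetch $s2 &
>   bin/nutch parse $s1
>   bin/nutch updatedb crawl/crawldb $s1
>   wait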
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
