Hi Andrzej,

During updatedb, the reduce tasks take most of the time (as seen in the task logs). There are lots of messages (excerpted below) that look like they indicate a problem, but I'm not sure how to read them. How can I prevent these slowdowns?
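For the record, I judged this from the task logs; the per-phase breakdown (shuffle/sort/reduce times) can also be dumped from the job history files with the history viewer. Something like the following, where <updatedb-output-dir> is just a placeholder for the job's output directory on HDFS:

# Print job details plus per-task shuffle/sort/reduce timings
# recovered from the job history files
bin/hadoop job -history all <updatedb-output-dir>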
2010-08-17 09:31:54,564 INFO mapred.ReduceTask - attempt_201008141418_0023_r_000004_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:54,653 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:54,664 INFO mapred.ReduceTask - Read 2 bytes from map-output for attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:54,664 INFO mapred.ReduceTask - Rec #1 from attempt_201008141418_0023_m_000023_0 -> (-1, -1) from vmo-crawl05-dev.healthline.com
2010-08-17 09:31:55,264 INFO mapred.ReduceTask - attempt_201008141418_0023_r_000000_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2010-08-17 09:31:55,308 INFO mapred.ReduceTask - header: attempt_201008141418_0023_m_000023_0, compressed len: 6, decompressed len: 2
2010-08-17 09:31:55,308 INFO mapred.ReduceTask - Shuffling 2 bytes (6 raw bytes) into RAM from attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:55,394 INFO mapred.ReduceTask - Read 2 bytes from map-output for attempt_201008141418_0023_m_000023_0
2010-08-17 09:31:55,395 INFO mapred.ReduceTask - Rec #1 from attempt_201008141418_0023_m_000023_0 -> (-1, -1) from vmo-crawl05-dev.healthline.com
2010-08-17 00:43:05,175 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000000_0 Need another 36 map output(s) where 0 is already in progress
2010-08-17 00:43:05,176 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000000_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
2010-08-17 00:43:07,679 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000004_0 Need another 36 map output(s) where 0 is already in progress
2010-08-17 00:43:07,679 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000004_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
2010-08-17 00:44:05,224 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000000_0 Need another 36 map output(s) where 0 is already in progress
2010-08-17 00:44:05,224 INFO mapred.ReduceTask - attempt_201008141418_0022_r_000000_0 Scheduled 0 outputs (0 slow hosts and 0 dup hosts)
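On the Hadoop side I'm thinking of experimenting with the shuffle settings in conf/mapred-site.xml. This is only a sketch of the knobs as I understand them -- the values below are guesses I have not validated on our cluster:

<!-- merge into conf/mapred-site.xml; values are untested guesses -->
<configuration>
  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>20</value>
    <!-- default 5: more parallel fetch threads per reducer during shuffle -->
  </property>
  <property>
    <name>mapred.compress.map.output</name>
    <value>true</value>
    <!-- shrink map outputs before they cross the network -->
  </property>
  <property>
    <name>mapred.reduce.slowstart.completed.maps</name>
    <value>0.80</value>
    <!-- default 0.05: don't launch reducers until 80% of maps are done,
         so they don't sit idle polling for map outputs -->
  </property>
</configuration>

At least the slowstart setting should stop reducers from occupying slots while they wait for map outputs, which is what the 00:43/00:44 "Need another 36 map output(s)" messages above look like to me.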
thanks,
-aj

On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:

> On 2010-08-17 23:16, AJ Chen wrote:
>
>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>> crawling selected sites at about 1M pages per day. The fetch itself is
>> reasonably fast. But when the crawl db has >10M urls, a lot of time is
>> spent generating a segment (2-3 hours) and updating the crawldb (4-5
>> hours after each segment). I expect this non-fetching time to keep
>> increasing as the crawl db grows to 100M urls. Is there a good way to
>> reduce the non-fetching time (i.e. generate segment and update crawldb)?
>
> That's surprisingly long for this configuration... What do you think
> takes the most time in e.g. the updatedb job: the map, shuffle, sort or
> reduce phase?
>
> One strategy to minimize the turnaround time is to overlap crawl cycles.
> E.g. you can generate multiple fetchlists in one go, then fetch one.
> Next, start fetching the second one, and in parallel start
> parsing/updatedb from the first segment. Note that you either need to
> generate multiple segments up front (there's an option in Generator to
> do so), or turn on generate.update.crawldb, but you don't need both.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web * Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
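To make sure I follow the overlap idea, the cycle would look roughly like this, right? A sketch only -- I'm assuming the -maxNumSegments option that Generator gained in 1.1, and the <segment1>/<segment2> names below are placeholders for the timestamped directories it creates:

# Generate two fetchlists in one pass over the crawldb; this is the
# "multiple segments" alternative, so generate.update.crawldb stays false.
bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 2

# Fetch the first segment.
bin/nutch fetch crawl/segments/<segment1>

# Fetch the second segment in the background while parsing and
# updating the crawldb from the first one in parallel.
bin/nutch fetch crawl/segments/<segment2> &
bin/nutch parse crawl/segments/<segment1>
bin/nutch updatedb crawl/crawldb crawl/segments/<segment1>
wait  # let the background fetch finish before the next generate

If that matches what you had in mind, I'll try it on the next cycle.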
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
