On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M urls, a lot of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time will keep growing as the crawl db approaches 100M urls. Is there any good way to reduce the non-fetching time (i.e. generate segment and update crawldb)?
That's surprisingly long for this configuration... What do you think takes most time in e.g. the updatedb job: the map, shuffle, sort or reduce phase?
One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go and fetch the first one. Then start fetching the next one, and in parallel start parsing/updatedb from the first segment. Note that you need to either generate multiple segments in one pass (there's an option in the Generator to do so) or turn on generate.update.crawldb, but you don't need both.
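
A rough sketch of what such an overlapped cycle could look like from the command line (just an illustration: the paths, -topN value and segment names are made up, -maxNumSegments assumes a Generator recent enough to support it, and fetching/parsing are assumed to be separate steps, i.e. fetcher.parse=false):

  # either generate several fetchlists in one pass over the crawldb ...
  bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 2

  # ... or set generate.update.crawldb=true in nutch-site.xml, so a later
  # generate can run before updatedb has finished (no need to do both)

  # overlapped cycle: fetch segment 2 while segment 1 is parsed and merged back
  SEG1=crawl/segments/20100818000000   # example names, yours will differ
  SEG2=crawl/segments/20100818000001
  bin/nutch fetch $SEG1
  bin/nutch fetch $SEG2 &                  # start the next fetch ...
  bin/nutch parse $SEG1                    # ... while parsing the previous segment
  bin/nutch updatedb crawl/crawldb $SEG1   # and folding it back into the crawldb
  wait

This way the generate/updatedb time is hidden behind the next fetch instead of adding to the wall-clock time of each cycle.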
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

