On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap Linux boxes are now
crawling selected sites at about 1M pages per day. The fetch itself is
reasonably fast. But when the crawl db has >10M URLs, a lot of time is spent
generating segments (2-3 hours) and updating the crawldb (4-5 hours after each
segment). I expect this non-fetching time to keep increasing as the crawl
db grows to 100M URLs. Is there any good way to reduce the non-fetching
time (i.e. generate segment and update crawldb)?

That's surprisingly long for this configuration... What do you think takes the most time in e.g. the updatedb job: the map, shuffle, sort, or reduce phase?

One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go and then fetch the first one. While the second one is fetching, you can in parallel start parsing/updatedb on the first segment. Note that you need to either generate multiple segments (there's an option in Generator to do so) or turn on generate.update.crawldb, but you don't need both.
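A rough sketch of that pipelining, assuming a Nutch 1.x-style bin/nutch CLI (the -maxNumSegments flag, the -topN value, and the segment directory names below are illustrative and should be checked against your version), could look like this:

  # 1. Generate two fetchlists in one go (Generator option to emit multiple segments;
  #    check that your Nutch version supports -maxNumSegments)
  bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 2

  # 2. Pick up the two segment directories that were just created (names are illustrative)
  SEG1=crawl/segments/20100817000000
  SEG2=crawl/segments/20100817000001

  # 3. Fetch the first segment on its own
  bin/nutch fetch $SEG1

  # 4. Start fetching the second segment, and in parallel parse and
  #    updatedb from the first one
  bin/nutch fetch $SEG2 &
  bin/nutch parse $SEG1
  bin/nutch updatedb crawl/crawldb $SEG1
  wait

  # 5. Once the second fetch finishes, parse/updatedb it as well
  bin/nutch parse $SEG2
  bin/nutch updatedb crawl/crawldb $SEG2

The alternative, generate.update.crawldb=true, makes each generate pass mark the selected URLs in the crawldb so that a subsequent generate (run before updatedb has caught up) doesn't hand out the same URLs again; it costs an extra crawldb update per generate, which is why you pick one approach or the other.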

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
