On 2010-08-17 23:16, AJ Chen wrote:
Scott, thanks again for your insights. My 4 cheap linux boxes are now crawling selected sites at about 1M pages per day. The fetch itself is reasonably fast. But when the crawl db has >10M urls, a lot of time is spent generating the segment (2-3 hours) and updating the crawldb (4-5 hours after each segment). I expect this non-fetching time will keep growing as the crawl db approaches 100M urls. Is there any good way to reduce the non-fetching time (i.e. generate segment and update crawldb)?
That's surprisingly long for this configuration... What do you think takes most time in e.g. the updatedb job: the map, shuffle, sort or reduce phase?
One strategy to minimize the turnaround time is to overlap crawl cycles. E.g. you can generate multiple fetchlists in one go and fetch the first one. Then start fetching the next one, and in parallel start parsing/updatedb from the first segment. Note that you need to either generate multiple segments in one pass (there's an option in the Generator to do so) or turn on generate.update.crawldb, but you don't need both.
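
A rough sketch of what such an overlapped cycle could look like from the command line (just an illustration: the paths, -topN value and segment names are made up, -maxNumSegments assumes a Generator recent enough to support it, and fetching/parsing are assumed to be separate steps, i.e. fetcher.parse=false):

  # either generate several fetchlists in one pass over the crawldb ...
  bin/nutch generate crawl/crawldb crawl/segments -topN 500000 -maxNumSegments 2

  # ... or set generate.update.crawldb=true in nutch-site.xml, so a later
  # generate can run before updatedb has finished (no need to do both)

  # overlapped cycle: fetch segment 2 while segment 1 is parsed and merged back
  SEG1=crawl/segments/20100818000000   # example names, yours will differ
  SEG2=crawl/segments/20100818000001
  bin/nutch fetch $SEG1
  bin/nutch fetch $SEG2 &                  # start the next fetch ...
  bin/nutch parse $SEG1                    # ... while parsing the previous segment
  bin/nutch updatedb crawl/crawldb $SEG1   # and folding it back into the crawldb
  wait

This way the generate/updatedb time is hidden behind the next fetch instead of adding to the wall-clock time of each cycle.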
--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

