Thanks for suggesting the multiple-segments approach - it looks like the way
to go for further increasing crawling throughput.  I tried the
-maxNumSegments 3 option in local mode, but it did not generate 3 segments.
Does the option actually work? Perhaps it only works in distributed mode.
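
For reference, this is roughly what I ran, plus the overlapped sequence I'm
aiming for once multiple segments work (the paths, -topN value, and segment
names below are just placeholders for my setup):

  bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -maxNumSegments 3

  # fetch segment 1; while segment 2 is fetching, parse and update
  # the crawldb from segment 1 in parallel
  bin/nutch fetch crawl/segments/<segment1>
  bin/nutch fetch crawl/segments/<segment2> &
  bin/nutch parse crawl/segments/<segment1>
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment1>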

I also observe that, when fetching a 1M-url segment, 99% of it is done in 4
hours, but the last 1% takes forever. For performance reasons, it makes
sense to drop that last 1% of urls.  One option is to set
fetcher.timelimit.mins to an appropriate time span, but estimating that time
span may not be reliable. Is there a smarter way to empty the queues toward
the end of fetching (before Fetcher is done)?  This could potentially save
several hours per fetch operation.
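
For reference, this is how I'm setting the limit in conf/nutch-site.xml at
the moment (the 240 here is just a guess for my crawl, not a recommended
value):

  <property>
    <name>fetcher.timelimit.mins</name>
    <value>240</value>
  </property>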

thanks,
-aj


On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:

> On 2010-08-17 23:16, AJ Chen wrote:
>
>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>> crawling selected sites at about 1M pages per day. The fetch itself is
>> reasonably fast. But when the crawl db has >10M urls, a lot of time is
>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>> hours after each segment).  I expect this non-fetching time to grow as
>> the crawl db grows to 100M urls.  Is there any good way to reduce the
>> non-fetching time (i.e. segment generation and crawldb update)?
>>
>
> That's surprisingly long for this configuration... What do you think takes
> the most time in, e.g., the updatedb job: the map, shuffle, sort, or
> reduce phase?
>
> One strategy to minimize the turnaround time is to overlap crawl cycles.
> E.g. you can generate multiple fetchlists in one go, then fetch one. While
> the next one is fetching, you can in parallel start parsing/updatedb from
> the first segment. Note that you need either to generate multiple segments
> (there's an option in Generator to do so) or to turn on
> generate.update.crawldb, but you don't need both.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
