Another option for reducing the time spent fetching the last 1% of URLs may
be to use a smaller queue size, I think.
In the Fetcher class, the queue size is magically determined as
threadCount * 50:
    feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
Is there any good reason for the factor of 50?  With 100 threads, the queue
size is 5000, which seems to cause a long wait toward the end of the fetch.
I want to reduce the queue size to 100 regardless of the number of threads.
Does this make sense? Would a smaller queue size have any other negative
effect?
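
If it is a reasonable change, I'm thinking of something along these lines in
Fetcher.java (just a sketch; the property name fetcher.queue.depth is my own
placeholder, not an existing Nutch setting):

    // read the queue size from config instead of hard-coding threadCount * 50;
    // "fetcher.queue.depth" is a hypothetical property name, and we fall back
    // to the old behaviour when it is not set
    int queueDepth = getConf().getInt("fetcher.queue.depth", threadCount * 50);
    feeder = new QueueFeeder(input, fetchQueues, queueDepth);

That way I could set it to 100 in nutch-site.xml without touching the code
again.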

-aj

On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:

> Thanks for suggesting the multiple-segments approach - it's the way to go
> for further increasing crawling throughput.  I tried the -maxNumSegments 3
> option in local mode, but it did not generate 3 segments.  Does the option
> work? Maybe it only works in distributed mode.
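> The command was roughly like this (paths are placeholders for my actual
> dirs):
>
>     bin/nutch generate crawl/crawldb crawl/segments -maxNumSegments 3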
>
> I also observe that, when fetching a 1M-URL segment, 99% is done in 4
> hours, but the last 1% takes forever. For performance reasons, it makes
> sense to drop the last 1% of URLs.  One option is to set
> fetcher.timelimit.mins to an appropriate time span, but estimating that
> time span may not be reliable. Is there a smarter way to empty the queues
> toward the end of fetching (before the Fetcher is done)?  This could
> potentially save several hours per fetch operation.
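> (For the timelimit option, I'd put something like this in nutch-site.xml;
> the 240 is just an arbitrary guess:
>
>     <property>
>       <name>fetcher.timelimit.mins</name>
>       <value>240</value>
>     </property>
>
> but picking the right number per segment is the unreliable part.)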
>
> thanks,
> -aj
>
>
>
> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>
>> On 2010-08-17 23:16, AJ Chen wrote:
>>
>>> Scott, thanks again for your insights. My 4 cheap Linux boxes are now
>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>> reasonably fast. But when the crawl db has >10M URLs, a lot of time is
>>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>>> hours after each segment).  I expect this non-fetching time to increase
>>> as the crawl db grows to 100M URLs.  Is there any good way to reduce the
>>> non-fetching time (i.e. generating segments and updating the crawldb)?
>>>
>>
>> That's surprisingly long for this configuration... What do you think takes
>> most time in e.g. the updatedb job? The map, shuffle, sort or reduce phase?
>>
>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>> E.g. you can generate multiple fetchlists in one go and fetch the first
>> one. Then start fetching the second, and in parallel start the
>> parsing/updatedb from the first segment. Note that you need to either
>> generate multiple segments (there's an option in Generator to do so) or
>> turn on generate.update.crawldb - you don't need both.
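>>
>> Roughly, one overlapped cycle could look like this (segment names are
>> placeholders):
>>
>>   # generate two fetchlists in one go, fetch the first
>>   bin/nutch generate crawl/crawldb crawl/segments -maxNumSegments 2
>>   bin/nutch fetch crawl/segments/<seg1>
>>   # fetch the second while parsing/updating from the first
>>   bin/nutch fetch crawl/segments/<seg2> &
>>   bin/nutch parse crawl/segments/<seg1>
>>   bin/nutch updatedb crawl/crawldb crawl/segments/<seg1>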
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
