Another option for reducing the time spent fetching the last 1% of URLs
may be to use a smaller queue size. In the Fetcher class, the queue size
is magically determined as threadCount * 50:
feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
Is there any good reason for the factor of 50? With 100 threads, the
queue size is 5000, which seems to cause a long wait toward the end of
the fetch. I want to reduce the queue size to 100 regardless of the
number of threads. Does this make sense? Would a smaller queue size have
any other negative effects?
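
If the factor is indeed arbitrary, one option would be to make it
configurable instead of patching the constant. A rough sketch of the
change in Fetcher (the property name "fetcher.queue.depth.multiplier" is
just my suggestion, not an existing setting):

  // Read the multiplier from the job configuration, defaulting to the
  // current hard-coded value of 50.
  int multiplier = getConf().getInt("fetcher.queue.depth.multiplier", 50);
  feeder = new QueueFeeder(input, fetchQueues, threadCount * multiplier);

Setting the multiplier to 1 would then give a queue of 100 with 100
threads.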
-aj
On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
> Thanks for suggesting the multiple-segments approach - it's the way to go
> for further increasing crawling throughput. I tried the -maxNumSegments 3
> option in local mode, but it did not generate 3 segments. Does the option
> work? It may only work in distributed mode.
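>
> For what it's worth, one way to test the option outside the crawl script
> is to drive Generator directly (a rough sketch; the paths are
> placeholders for my local layout):
>
>   import org.apache.hadoop.util.ToolRunner;
>   import org.apache.nutch.crawl.Generator;
>   import org.apache.nutch.util.NutchConfiguration;
>
>   public class GenerateThreeSegments {
>     public static void main(String[] args) throws Exception {
>       // Ask Generator for up to 3 fetchlists in one pass.
>       int res = ToolRunner.run(NutchConfiguration.create(), new Generator(),
>           new String[] {"crawl/crawldb", "crawl/segments",
>                         "-maxNumSegments", "3"});
>       System.exit(res);
>     }
>   }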
>
> I also observe that, when fetching a 1M-URL segment, 99% is done in 4
> hours, but the last 1% takes forever. For performance reasons, it makes
> sense to drop the last 1% of URLs. One option is to set
> fetcher.timelimit.mins to an appropriate time span, but estimating that
> span may not be reliable. Is there a smarter way to empty the queues
> toward the end of fetching (before the Fetcher is done)? This could
> potentially save several hours per fetch operation.
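>
> In case it helps, the time limit can also be set per job from code
> instead of nutch-site.xml (a sketch; the 240-minute value is only an
> example):
>
>   import org.apache.hadoop.conf.Configuration;
>   import org.apache.nutch.util.NutchConfiguration;
>
>   Configuration conf = NutchConfiguration.create();
>   // Fetcher empties its queues this many minutes after the fetch
>   // starts; -1 (the default) disables the limit.
>   conf.setLong("fetcher.timelimit.mins", 240);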
>
> thanks,
> -aj
>
>
>
> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>
>> On 2010-08-17 23:16, AJ Chen wrote:
>>
>>> Scott, thanks again for your insights. My 4 cheap Linux boxes are now
>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>> reasonably fast. But when the crawl db has >10M URLs, a lot of time is
>>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>>> hours after each segment). I expect this non-fetching time to increase
>>> as the crawl db grows to 100M URLs. Is there any good way to reduce the
>>> non-fetching time (i.e. generate segment and update crawldb)?
>>>
>>
>> That's surprisingly long for this configuration... What do you think
>> takes the most time in, e.g., the updatedb job: the map, shuffle, sort,
>> or reduce phase?
>>
>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>> E.g., you can generate multiple fetchlists in one go, then fetch one.
>> Next, start fetching the second one, and in parallel start
>> parsing/updatedb on the first segment. Note that you need to either
>> generate multiple segments (there's an option in Generator to do so) or
>> turn on generate.update.crawldb, but you don't need both.
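>>
>> A rough sketch of the overlap from Java, with the fetch of the second
>> segment on its own thread (paths are placeholders and error handling is
>> trimmed):
>>
>>   import org.apache.hadoop.conf.Configuration;
>>   import org.apache.hadoop.util.ToolRunner;
>>   import org.apache.nutch.crawl.CrawlDb;
>>   import org.apache.nutch.fetcher.Fetcher;
>>   import org.apache.nutch.parse.ParseSegment;
>>   import org.apache.nutch.util.NutchConfiguration;
>>
>>   public class OverlappedCycle {
>>     public static void main(String[] args) throws Exception {
>>       final Configuration conf = NutchConfiguration.create();
>>       // Start fetching the second fetchlist in the background...
>>       Thread fetchNext = new Thread(new Runnable() {
>>         public void run() {
>>           try {
>>             ToolRunner.run(conf, new Fetcher(),
>>                 new String[] {"crawl/segments/seg2", "-threads", "100"});
>>           } catch (Exception e) {
>>             throw new RuntimeException(e);
>>           }
>>         }
>>       });
>>       fetchNext.start();
>>       // ...while the first, already-fetched segment is parsed and
>>       // merged back into the crawldb.
>>       ToolRunner.run(conf, new ParseSegment(),
>>           new String[] {"crawl/segments/seg1"});
>>       ToolRunner.run(conf, new CrawlDb(),
>>           new String[] {"crawl/crawldb", "crawl/segments/seg1"});
>>       fetchNext.join();
>>     }
>>   }
>>
>> In local mode both jobs compete for the same box, so the overlap pays
>> off mostly on a real cluster.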
>>
>> --
>> Best regards,
>> Andrzej Bialecki <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com Contact: info at sigram dot com
>>
>>
>
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA
>
--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA