I also tried a larger number of maps, e.g.
mapred.map.tasks=100
mapred.tasktracker.map.tasks.maximum=5
However, the hadoop console shows num map tasks = 40. Why is the total
number of map tasks capped at 40? Maybe another config parameter overrides
mapred.map.tasks?
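
Or is mapred.map.tasks only a hint, with the actual count decided by the
number of input splits (for the fetch job, the number of fetchlist part
files the Generator wrote)? A minimal sketch of how I'm setting these
programmatically, using the old mapred API - the class name is just for
illustration:

    import org.apache.hadoop.mapred.JobConf;

    public class MapCountCheck {
      public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Only a hint to the InputFormat; the actual number of map tasks
        // equals the number of input splits it produces.
        conf.setNumMapTasks(100);
        // Per-tasktracker concurrency cap, read by the TaskTracker daemon
        // at startup, so setting it per-job has no effect.
        conf.setInt("mapred.tasktracker.map.tasks.maximum", 5);
        System.out.println("map task hint = " + conf.getNumMapTasks());
      }
    }
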
-aj

On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:

> The other option for reducing the time spent fetching the last 1% of urls
> may be using a smaller queue size, I think.
> In the Fetcher class, the queue size is magically determined as threadCount *
> 50.
>     feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
> Is there any good reason for the factor of 50?  If using 100 threads, the
> queue size is 5000, which seems to cause a long wait toward the end of the
> fetch. I want to reduce the queue size to 100 regardless of the number of
> threads.  Does this make sense? Would a smaller queue size have any other
> negative effects?
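>
> A rough sketch of the change I have in mind - just making the factor
> configurable instead of hardcoded (the property name below is only
> illustrative, not an existing setting):
>
>     // in Fetcher, where the feeder is created:
>     int depthMult = getConf().getInt("fetcher.queue.depth.multiplier", 50);
>     feeder = new QueueFeeder(input, fetchQueues, threadCount * depthMult);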
>
> -aj
>
> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
>
>> Thanks for suggesting the multiple-segments approach - it's the way to go
>> for further increasing crawling throughput.  I tried the -maxNumSegments 3
>> option in local mode, but it did not generate 3 segments.  Does the option
>> work? Maybe it only works in distributed mode.
>>
>> I also observe that, when fetching a 1M-url segment, 99% is done in 4
>> hours, but the last 1% takes forever. For performance reasons, it makes
>> sense to drop the last 1% of urls.  One option is to set
>> fetcher.timelimit.mins to an appropriate time span. But estimating the time
>> span may not be reliable. Is there another, smarter way to empty the queues
>> toward the end of fetching (before Fetcher is done)?  This could
>> potentially save several hours per fetch operation.
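>>
>> Something along these lines is what I'm imagining - a sketch only; the
>> emptyRemainingQueues() helper and the threshold variables are made up for
>> illustration, and I'm assuming the queues expose their total size:
>>
>>     // in the fetcher's main wait loop, once the feeder has finished
>>     long elapsed = System.currentTimeMillis() - startTime;
>>     boolean longTail = fetchQueues.getTotalSize() < totalUrlCount / 100;
>>     if (elapsed > softDeadlineMs && longTail) {
>>       LOG.info("Dropping remaining " + fetchQueues.getTotalSize() + " urls");
>>       fetchQueues.emptyRemainingQueues();  // hypothetical: clear all per-host queues
>>     }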
>>
>> thanks,
>> -aj
>>
>>
>>
>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>>
>>> On 2010-08-17 23:16, AJ Chen wrote:
>>>
>>>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>>> reasonably fast. But when the crawl db has >10M urls, a lot of time is
>>>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>>>> hours after each segment).  I expect this non-fetching time to increase
>>>> as the crawl db grows to 100M urls.  Is there any good way to reduce the
>>>> non-fetching time (i.e. generating the segment and updating the crawldb)?
>>>>
>>>
>>> That's surprisingly long for this configuration... What do you think
>>> takes the most time in e.g. the updatedb job: the map, shuffle, sort or
>>> reduce phase?
>>>
>>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>>> start fetching the next one, and in parallel you can start parsing/updatedb
>>> from the first segment. Note that you need to either generate multiple
>>> segments (there's an option in Generator to do so), or you need to turn on
>>> generate.update.crawldb, but you don't need both.
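>>>
>>> For example, a possible schedule with three pre-generated fetchlists
>>> (segment names below are just placeholders):
>>>
>>>     bin/nutch generate crawl/crawldb crawl/segments -topN 1000000 -maxNumSegments 3
>>>     bin/nutch fetch crawl/segments/seg-01 -threads 100
>>>     # then overlap: fetch seg-02 while seg-01 is parsed and merged back
>>>     bin/nutch fetch crawl/segments/seg-02 -threads 100 &
>>>     bin/nutch parse crawl/segments/seg-01
>>>     bin/nutch updatedb crawl/crawldb crawl/segments/seg-01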
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki     <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com
>>>
>>>
>>



-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
