makes sense. thank you. -aj

On Fri, Sep 3, 2010 at 11:25 AM, Ken Krugler <[email protected]> wrote:

>
> On Sep 3, 2010, at 11:07am, AJ Chen wrote:
>
>> I also tried a larger number of maps, e.g.
>> mapred.map.tasks=100
>> mapred.tasktracker.map.tasks.maximum=5
>> However, the hadoop console shows num map tasks = 40.  Why is the total
>> number of map tasks capped at 40?  Maybe another config parameter overrides
>> mapred.map.tasks?
>>
>
> The number of mappers (child JVMs launched by the TaskTracker on a slave) is
> under your control, via the Hadoop configuration XML files.
>
> The number of map tasks for a given job is essentially out of your control
> - it's determined by the system, based on the number of splits calculated by
> the input format, for the specified input data. In Hadoop 0.20 they've
> removed this configuration, IIRC, since it was confusing for users to try to
> set this, and then have the value ignored.
>
> Typically splits are on a per-HDFS-block basis, so if you need to get more
> mappers running you can configure your HDFS to use a smaller block size
> (the default is 64MB). But the number of map tasks usually doesn't have a big
> impact on overall performance, other than the case of an unsplittable
> input file (e.g. a .gz compressed file), which means a single map task
> has to process the entire file.
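>
> If it helps, here's a rough, untested sketch of where those knobs live in the
> old mapred API. The property names are the classic 0.20-era ones, so
> double-check them against your Hadoop version, and the 32MB value is just an
> illustration:
>
>   import org.apache.hadoop.mapred.JobConf;
>
>   public class MapTaskKnobs {
>     public static void main(String[] args) {
>       JobConf job = new JobConf();
>
>       // Per-job hint only: the real number of map tasks is whatever the
>       // InputFormat's getSplits() returns for the input data (typically
>       // about one split per HDFS block).
>       job.setNumMapTasks(100);                             // mapred.map.tasks
>
>       // Per-node concurrency: how many map tasks one TaskTracker runs at
>       // once. This is normally read from the slave's mapred-site.xml, not
>       // from the job, so setting it here is only for illustration.
>       job.setInt("mapred.tasktracker.map.tasks.maximum", 5);
>
>       // The knob that usually changes the split (and hence map task) count:
>       // the HDFS block size, which only applies to files written after the
>       // change.
>       job.setLong("dfs.block.size", 32L * 1024 * 1024);
>
>       System.out.println("map task hint = " + job.getNumMapTasks());
>     }
>   }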
>
> -- Ken
>
>
>> On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:
>>
>>> The other option for reducing the time spent fetching the last 1% of urls
>>> may be using a smaller queue size, I think.
>>> In the Fetcher class, the queue size is magically determined as threadCount *
>>> 50:
>>>   feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
>>> Is there any good reason for the factor of 50?  With 100 threads, the queue
>>> size is 5000, which seems to cause a long waiting time toward the end of the
>>> fetch. I want to reduce the queue size to 100 regardless of the number of
>>> threads.  Does this make sense? Would a smaller queue size have any other
>>> negative effect?
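>>>
>>> Something like this is what I have in mind - an untested sketch, and
>>> "fetcher.queue.depth" is a made-up property name, not a stock Nutch setting;
>>> it assumes a Hadoop Configuration object named conf is in scope at that
>>> point in Fetcher:
>>>
>>>   // Make the feeder queue depth configurable, defaulting to the
>>>   // current threadCount * 50 behavior.
>>>   int queueDepth = conf.getInt("fetcher.queue.depth", threadCount * 50);
>>>   feeder = new QueueFeeder(input, fetchQueues, queueDepth);
>>>
>>> My guess is the main downside of a small queue would be the feeder not
>>> keeping the per-host queues full, which might slow down the bulk of the
>>> fetch.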
>>>
>>> -aj
>>>
>>> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
>>>
>>>> Thanks for suggesting the multiple-segments approach - it's the way to go
>>>> for further increasing crawling throughput.  I tried the -maxNumSegments 3
>>>> option in local mode, but it did not generate 3 segments.  Does the option
>>>> work? Maybe it only works in distributed mode.
>>>>
>>>> I also observe that, when fetching a 1M-url segment, 99% is done in 4
>>>> hours, but the last 1% takes forever. For performance reasons, it makes
>>>> sense to drop the last 1% of urls.  One option is to set
>>>> fetcher.timelimit.mins to an appropriate time span, but estimating that
>>>> time span may not be reliable. Is there a smarter way to empty the queues
>>>> toward the end of fetching (before the Fetcher is done)?  This could
>>>> potentially save several hours per fetch operation.
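>>>>
>>>> The kind of check I have in mind is something along these lines - a very
>>>> rough sketch, where getTotalSize()/emptyQueues() are hypothetical helpers
>>>> (not necessarily the real FetchItemQueues API) and totalUrls is the
>>>> segment size, tracked elsewhere:
>>>>
>>>>   // Inside the fetch loop: once only the stragglers are left,
>>>>   // drop them instead of waiting for their queues to drain.
>>>>   int remaining = fetchQueues.getTotalSize();
>>>>   if (remaining > 0 && remaining < totalUrls / 100) {  // last ~1%
>>>>     LOG.info("Dropping last " + remaining + " urls to finish the segment");
>>>>     fetchQueues.emptyQueues();
>>>>   }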
>>>>
>>>> thanks,
>>>> -aj
>>>>
>>>>
>>>>
>>>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>>>>
>>>>  On 2010-08-17 23:16, AJ Chen wrote:
>>>>>
>>>>>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>>>>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>>>>> reasonably fast. But when the crawl db has >10M urls, a lot of time is
>>>>>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>>>>>> hours after each segment).  I expect this non-fetching time will increase
>>>>>> as the crawl db grows to 100M urls.  Is there any good way to reduce the
>>>>>> non-fetching time (i.e. generate segment and update crawldb)?
>>>>>>
>>>>>>
>>>>> That's surprisingly long for this configuration... What do you think
>>>>> takes the most time in e.g. the updatedb job - the map, shuffle, sort, or
>>>>> reduce phase?
>>>>>
>>>>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>>>>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>>>>> start fetching the next one, and in parallel you can start parsing/updatedb
>>>>> from the first segment. Note that you need to either generate multiple
>>>>> segments (there's an option in Generator to do so), or turn on
>>>>> generate.update.crawldb, but you don't need both.
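>>>>>
>>>>> Roughly, the overlapped cycle could look like this (SEG1/SEG2 are
>>>>> placeholders for the generated segment names, and the exact arguments may
>>>>> differ between Nutch versions):
>>>>>
>>>>>   bin/nutch generate crawl/crawldb crawl/segments -maxNumSegments 2
>>>>>   bin/nutch fetch crawl/segments/SEG1
>>>>>   # while SEG2 is being fetched, process SEG1 in parallel:
>>>>>   bin/nutch fetch crawl/segments/SEG2 &
>>>>>   bin/nutch parse crawl/segments/SEG1
>>>>>   bin/nutch updatedb crawl/crawldb crawl/segments/SEG1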
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>> ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
>>>>>
>>>>>
>>>>>
>>>>
>>>> --
>>>> AJ Chen, PhD
>>>> Chair, Semantic Web SIG, sdforum.org
>>>> http://web2express.org
>>>> twitter @web2express
>>>> Palo Alto, CA, USA
>>>>
>>>>
>>>
>>>
>>> --
>>> AJ Chen, PhD
>>> Chair, Semantic Web SIG, sdforum.org
>>> http://web2express.org
>>> twitter @web2express
>>> Palo Alto, CA, USA
>>>
>>>
>>
>>
>> --
>> AJ Chen, PhD
>> Chair, Semantic Web SIG, sdforum.org
>> http://web2express.org
>> twitter @web2express
>> Palo Alto, CA, USA
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>


-- 
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
