makes sense. thank you. -aj

On Fri, Sep 3, 2010 at 11:25 AM, Ken Krugler <[email protected]> wrote:

> On Sep 3, 2010, at 11:07am, AJ Chen wrote:
>
>> I also tried a larger number of maps, e.g.
>> mapred.map.tasks=100
>> mapred.tasktracker.map.tasks.maximum=5
>> However, the Hadoop console shows num map tasks = 40. Why is the total
>> number of map tasks capped at 40? Maybe another config parameter
>> overrides mapred.map.tasks?
>
> The number of mappers (child JVMs launched by the TaskTracker on a slave)
> is controllable by you, in the Hadoop configuration XML files.
>
> The number of map tasks for a given job is essentially out of your
> control - it's determined by the system, based on the number of splits
> calculated by the input format for the specified input data. In Hadoop
> 0.20 they removed this setting, IIRC, since it was confusing for users to
> try to set it and then have the value ignored.
>
> Typically splits are made on a per-HDFS-block basis, so if you need to
> get more mappers running you can configure your HDFS to use a smaller
> block size (the default is 64MB). But the number of map tasks usually
> doesn't have a big impact on overall performance, other than the case of
> an unsplittable input file (e.g. a .gz compressed file), which means a
> single map task has to process the entire file.
>
> -- Ken
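A back-of-the-envelope illustration of the point above, assuming a hypothetical 40-block input (the figure is illustrative, not taken from this thread):

    // Rough sketch only: with the default FileInputFormat there is roughly one
    // split (and hence one map task) per HDFS block, so the task count follows
    // input size / block size rather than mapred.map.tasks.
    public class SplitEstimate {
        public static void main(String[] args) {
            long blockSize = 64L * 1024 * 1024;      // default dfs.block.size (64MB)
            long totalInputBytes = 40L * blockSize;  // hypothetical input: 40 blocks
            long splits = (totalInputBytes + blockSize - 1) / blockSize; // ceiling
            System.out.println("expected map tasks ~= " + splits);       // ~40 here
        }
    }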
>> On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:
>>
>>> The other option for reducing the time spent fetching the last 1% of
>>> urls may be using a smaller queue size, I think. In the Fetcher class,
>>> the queue size is magically determined as threadCount * 50:
>>>
>>>   feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
>>>
>>> Is there any good reason for the factor of 50? If using 100 threads,
>>> the queue size is 5000, which seems to cause a long waiting time toward
>>> the end of the fetch. I want to reduce the queue size to 100 regardless
>>> of the number of threads. Does this make sense? Will a smaller queue
>>> size have any other negative effect?
>>>
>>> -aj
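One way to experiment with that is a small local patch to Fetcher; a minimal sketch, assuming a made-up property name (fetcher.queue.depth.multiplier is only an illustrative choice here, not necessarily present in the Nutch release in use):

    // Hypothetical local patch to Fetcher: read the queue-depth multiplier from
    // the job configuration instead of hard-coding 50, so the feeder queue can
    // be shrunk (e.g. threadCount * 1) without reducing the thread count itself.
    int depthMultiplier = getConf().getInt("fetcher.queue.depth.multiplier", 50);
    feeder = new QueueFeeder(input, fetchQueues, threadCount * depthMultiplier);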
>>> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
>>>
>>>> Thanks for suggesting the multiple-segments approach - it's the way to
>>>> go for further increasing crawling throughput. I tried the
>>>> -maxNumSegments 3 option in local mode, but it did not generate 3
>>>> segments. Does the option work? Maybe it only works in distributed
>>>> mode.
>>>>
>>>> I also observe that, when fetching a 1M-url segment, 99% is done in 4
>>>> hours, but the last 1% takes forever. For performance reasons, it
>>>> makes sense to drop the last 1% of urls. One option is to set
>>>> fetcher.timelimit.mins to an appropriate time span, but estimating
>>>> that time span may not be reliable. Is there another, smarter way to
>>>> empty the queues toward the end of fetching (before the Fetcher is
>>>> done)? This could potentially save several hours per fetch operation.
>>>>
>>>> thanks,
>>>> -aj
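For reference, fetcher.timelimit.mins can also be set from a custom crawl driver rather than nutch-site.xml; a minimal sketch, where the 300-minute cap and the driver class are assumptions, not values from this thread:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    // Sketch: cap a fetch round at roughly 5 hours so the long tail of slow
    // queues is dropped. The property name comes from the message above.
    public class FetchTimeLimit {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            conf.setInt("fetcher.timelimit.mins", 300); // give up after ~5 hours
            System.out.println("fetcher.timelimit.mins = "
                + conf.getInt("fetcher.timelimit.mins", -1));
        }
    }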
>>>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>>>>
>>>>> On 2010-08-17 23:16, AJ Chen wrote:
>>>>>
>>>>>> Scott, thanks again for your insights. My 4 cheap linux boxes are
>>>>>> now crawling selected sites at about 1M pages per day. The fetch
>>>>>> itself is reasonably fast. But when the crawl db has more than 10M
>>>>>> urls, a lot of time is spent generating the segment (2-3 hours) and
>>>>>> updating the crawldb (4-5 hours after each segment). I expect this
>>>>>> non-fetching time to increase as the crawl db grows to 100M urls.
>>>>>> Is there any good way to reduce the non-fetching time (i.e. generate
>>>>>> segment and update crawldb)?
>>>>>
>>>>> That's surprisingly long for this configuration... What do you think
>>>>> takes most time in, e.g., the updatedb job? The map, shuffle, sort or
>>>>> reduce phase?
>>>>>
>>>>> One strategy to minimize the turnaround time is to overlap crawl
>>>>> cycles. E.g. you can generate multiple fetchlists in one go, then
>>>>> fetch one. Next, start fetching the next one, and in parallel you can
>>>>> start parsing/updatedb from the first segment. Note that you need to
>>>>> either generate multiple segments (there's an option in Generator to
>>>>> do so), or you need to turn on generate.update.crawldb, but you don't
>>>>> need both.
>>>>>
>>>>> --
>>>>> Best regards,
>>>>> Andrzej Bialecki     <><
>>>>>  ___. ___ ___ ___ _ _   __________________________________
>>>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>>>> http://www.sigram.com  Contact: info at sigram dot com
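A minimal sketch of one of the two alternatives named above (generate.update.crawldb; the other is Generator's multiple-segment option, e.g. the -maxNumSegments flag tried earlier in the thread). The driver class and the idea of setting the flag programmatically are assumptions, not Nutch code:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.nutch.util.NutchConfiguration;

    // Sketch: mark generated URLs in the crawldb so several fetchlists can be
    // generated back to back, letting the fetch of segment N overlap with
    // parse/updatedb of segment N-1. Use this OR multiple segments, not both.
    public class OverlappedCycles {
        public static void main(String[] args) {
            Configuration conf = NutchConfiguration.create();
            conf.setBoolean("generate.update.crawldb", true); // option 1 from the thread
            System.out.println("generate.update.crawldb = "
                + conf.getBoolean("generate.update.crawldb", false));
        }
    }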
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA