I also tried a larger number of maps, e.g. mapred.map.tasks=100 and
mapred.tasktracker.map.tasks.maximum=5. However, the Hadoop console shows num
map tasks = 40. Why is the total number of map tasks capped at 40? Maybe
another config parameter overrides mapred.map.tasks?

-aj
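For reference, a minimal sketch of how these two properties behave with the
old org.apache.hadoop.mapred API (the class name below is made up for
illustration). mapred.map.tasks is only a hint: the actual map count comes
from the number of input splits the InputFormat reports.
mapred.tasktracker.map.tasks.maximum is a per-node slot limit that each
tasktracker reads from its own configuration at startup, so it does not
change the total number of map tasks.

  // Sketch only; MapTaskSettingsSketch is a hypothetical class name.
  import org.apache.hadoop.mapred.JobConf;

  public class MapTaskSettingsSketch {
    public static void main(String[] args) {
      JobConf job = new JobConf();
      // Same effect as -D mapred.map.tasks=100: a hint only, the real map
      // count equals the number of input splits.
      job.setNumMapTasks(100);
      // Per-tasktracker slot limit; normally set in each tasktracker's own
      // config file and ignored when set per job. Shown here only to
      // illustrate the property name.
      job.setInt("mapred.tasktracker.map.tasks.maximum", 5);
      System.out.println("map task hint = " + job.getNumMapTasks());
    }
  }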
On Thu, Sep 2, 2010 at 10:43 AM, AJ Chen <[email protected]> wrote:

> The other option for reducing the time spent fetching the last 1% of urls
> may be using a smaller queue size, I think. In the Fetcher class, the queue
> size is magically determined as threadCount * 50:
>
>   feeder = new QueueFeeder(input, fetchQueues, threadCount * 50);
>
> Is there any good reason for the factor of 50? If using 100 threads, the
> queue size is 5000, which seems to cause a long waiting time toward the end
> of the fetch. I want to reduce the queue size to 100 regardless of the
> number of threads. Does this make sense? Will a smaller queue size have any
> other negative effect?
>
> -aj
>
> On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
>
>> Thanks for suggesting the multiple-segments approach - it's the way to go
>> for further increasing crawling throughput. I tried the -maxNumSegments 3
>> option in local mode, but it did not generate 3 segments. Does the option
>> work? It may only work in distributed mode.
>>
>> I also observe that, when fetching a 1M-url segment, 99% is done in 4
>> hours, but the last 1% takes forever. For performance reasons, it makes
>> sense to drop the last 1% of urls. One option is to set
>> fetcher.timelimit.mins to an appropriate time span, but estimating that
>> time span may not be reliable. Is there another, smarter way to empty the
>> queues toward the end of fetching (before the Fetcher is done)? This could
>> potentially save several hours per fetch operation.
>>
>> thanks,
>> -aj
>>
>> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>>
>>> On 2010-08-17 23:16, AJ Chen wrote:
>>>
>>>> Scott, thanks again for your insights. My 4 cheap linux boxes are now
>>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>>> reasonably fast. But when the crawl db has >10M urls, lots of time is
>>>> spent in generating the segment (2-3 hours) and updating the crawldb
>>>> (4-5 hours after each segment). I expect this non-fetching time will
>>>> increase as the crawl db grows to 100M urls. Is there any good way to
>>>> reduce the non-fetching time (i.e. generate segment and update crawldb)?
>>>
>>> That's surprisingly long for this configuration... What do you think
>>> takes most time in e.g. the updatedb job: the map, shuffle, sort or
>>> reduce phase?
>>>
>>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>>> E.g. you can generate multiple fetchlists in one go, then fetch one. Next,
>>> start fetching the next one, and in parallel you can start parsing/updatedb
>>> from the first segment. Note that you need to either generate multiple
>>> segments (there's an option in Generator to do so), or you need to turn on
>>> generate.update.crawldb, but you don't need both.
>>>
>>> --
>>> Best regards,
>>> Andrzej Bialecki <><
>>>  ___. ___ ___ ___ _ _   __________________________________
>>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>>> ___|||__||  \|  ||  |  Embedded Unix, System Integration
>>> http://www.sigram.com  Contact: info at sigram dot com

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
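Regarding the threadCount * 50 line quoted above: a low-risk way to
experiment would be to make the factor configurable instead of hard-coded.
A minimal sketch of that change inside Fetcher, assuming a property named
fetcher.queue.depth.multiplier (a placeholder, not an existing setting in
this version):

  // Sketch only, at the point in Fetcher where the feeder is created.
  // "fetcher.queue.depth.multiplier" is a placeholder property name.
  int queueDepth = getConf().getInt("fetcher.queue.depth.multiplier", 50);
  feeder = new QueueFeeder(input, fetchQueues, threadCount * queueDepth);

Setting the multiplier to 1 would give a queue of threadCount items; going
much lower than the thread count probably just starves the fetcher threads
whenever the queued items belong to hosts that are sitting out their crawl
delay.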

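On the "last 1% takes forever" point in the quoted thread: besides
fetcher.timelimit.mins, one could imagine aborting on throughput rather than
wall-clock time. The fragment below is purely illustrative; pagesPerSecond,
getTotalSize() and emptyQueues() are assumed names, not guaranteed Nutch API
in this version.

  // Hypothetical check inside the Fetcher's status-reporting loop: once the
  // aggregate rate falls below a floor, drop whatever is left in the queues
  // instead of waiting out the long tail.
  float minPagesPerSecond = 0.5f;   // tune to the crawl
  if (pagesPerSecond < minPagesPerSecond && fetchQueues.getTotalSize() > 0) {
    LOG.warn("Throughput below " + minPagesPerSecond
        + " pages/sec; emptying remaining queues");
    fetchQueues.emptyQueues();      // assumed helper that discards queued urls
  }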
