In distributed mode, "generate -topN 1000000 -maxNumSegments 3" creates 3 segments, but the sizes are very uneven: 1.7M, 0.8M, and 0.5M.
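For reference, a quick way to see how uneven the split actually is would be the segment reader's -list mode. The crawl/segments path below is only an example layout, not taken from the thread, and in distributed mode the segments would sit in HDFS rather than on local disk:

  # Sketch: print a per-segment synopsis (generated/fetched/parsed counts)
  # for every segment under one directory.
  bin/nutch readseg -list -dir crawl/segments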
I also tried fetcher.timelimit.mins=240 in distributed mode, but the fetcher
did not stop after 4 hours. Any idea?

-aj

On Tue, Aug 31, 2010 at 4:24 PM, AJ Chen <[email protected]> wrote:
> Thanks for suggesting the multiple-segments approach - it's the way to go
> for further increasing crawling throughput. I tried the -maxNumSegments 3
> option in local mode, but it did not generate 3 segments. Does the option
> work? Maybe it only works in distributed mode.
>
> I also observe that, when fetching a 1M-URL segment, 99% is done in 4
> hours, but the last 1% takes forever. For performance reasons, it makes
> sense to drop the last 1% of URLs. One option is to set
> fetcher.timelimit.mins to an appropriate time span, but estimating that
> time span may not be reliable. Is there a smarter way to empty the queues
> toward the end of fetching (before the Fetcher is done)? This could
> potentially save several hours per fetch operation.
>
> thanks,
> -aj
>
> On Tue, Aug 17, 2010 at 2:31 PM, Andrzej Bialecki <[email protected]> wrote:
>
>> On 2010-08-17 23:16, AJ Chen wrote:
>>
>>> Scott, thanks again for your insights. My 4 cheap Linux boxes are now
>>> crawling selected sites at about 1M pages per day. The fetch itself is
>>> reasonably fast, but when the crawl db has >10M URLs, a lot of time is
>>> spent generating the segment (2-3 hours) and updating the crawldb (4-5
>>> hours after each segment). I expect this non-fetching time to keep
>>> increasing as the crawl db grows to 100M URLs. Is there a good way to
>>> reduce the non-fetching time (i.e. generating the segment and updating
>>> the crawldb)?
>>
>> That's surprisingly long for this configuration... What do you think
>> takes most time in e.g. the updatedb job? The map, shuffle, sort or
>> reduce phase?
>>
>> One strategy to minimize the turnaround time is to overlap crawl cycles.
>> E.g. you can generate multiple fetchlists in one go, then fetch one.
>> Next, start fetching the next one, and in parallel you can start
>> parsing/updatedb from the first segment. Note that you need to either
>> generate multiple segments (there's an option in Generator to do so), or
>> you need to turn on generate.update.crawldb, but you don't need both.
>>
>> --
>> Best regards,
>> Andrzej Bialecki     <><
>>  ___. ___ ___ ___ _ _   __________________________________
>> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
>> ___|||__||  \|  || |   Embedded Unix, System Integration
>> http://www.sigram.com  Contact: info at sigram dot com
>
> --
> AJ Chen, PhD
> Chair, Semantic Web SIG, sdforum.org
> http://web2express.org
> twitter @web2express
> Palo Alto, CA, USA

--
AJ Chen, PhD
Chair, Semantic Web SIG, sdforum.org
http://web2express.org
twitter @web2express
Palo Alto, CA, USA
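Regarding the time limit not kicking in: a minimal sketch of one thing to try, assuming the fetch job goes through Hadoop's ToolRunner so that -D generic options are honored (if it does not, the property has to be in the conf/nutch-site.xml that the submitting node ships with the job). The segment path and thread count below are made-up examples:

  # Sketch: pass the limit on the job itself rather than relying on a locally
  # edited config file, so every fetcher task sees the same deadline.
  # fetcher.timelimit.mins counts from the start of the fetch job; once it
  # expires, the remaining fetch queues should be dropped.
  bin/nutch fetch -D fetcher.timelimit.mins=240 \
      crawl/segments/20100831120000 -threads 50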
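And for the overlapped crawl cycles Andrzej describes, a rough sketch of how the steps could be interleaved with the stock Nutch 1.x command line. The paths and thread count are assumptions (the -topN / -maxNumSegments values just mirror the ones above), it assumes parsing runs as a separate step (fetcher.parse=false), and on a cluster the segment listing would go through hadoop fs -ls instead of ls:

  #!/bin/bash
  # Sketch: generate several fetchlists in one pass over the crawldb, then
  # parse/updatedb each segment while the next one is being fetched.
  CRAWLDB=crawl/crawldb
  SEGMENTS=crawl/segments

  # 1. One pass over the crawldb produces several fetchlists.
  bin/nutch generate $CRAWLDB $SEGMENTS -topN 1000000 -maxNumSegments 3

  # Pick up the three newly created (timestamp-named) segment directories;
  # the names sort chronologically, so tail -3 gives the newest ones.
  SEGS=$(ls -d $SEGMENTS/* | tail -3)

  PREV=""
  for SEG in $SEGS; do
    # 2. Start fetching the current segment in the background...
    bin/nutch fetch $SEG -threads 50 &
    FETCH_PID=$!

    # 3. ...and meanwhile parse and updatedb the segment fetched before it.
    if [ -n "$PREV" ]; then
      bin/nutch parse $PREV
      bin/nutch updatedb $CRAWLDB $PREV
    fi

    wait $FETCH_PID
    PREV=$SEG
  done

  # Finish off the last segment once its fetch completes.
  bin/nutch parse $PREV
  bin/nutch updatedb $CRAWLDB $PREV

Since all the fetchlists are generated up front with -maxNumSegments, the generate.update.crawldb switch is not needed here; as Andrzej notes, you want one of the two, not both.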

