These CrawlDb statistics do not indicate the number of distinct hosts. Please use the HostDb tool to generate host statistics.

Markus.
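For what it's worth, building and reading the HostDb looks roughly like the two commands below. The flag spellings are an approximation from Nutch 1.12's UpdateHostDb and ReadHostDb tools, so run each command without arguments first to confirm the exact usage; the paths reuse this thread's layout.

    # Build or refresh a HostDb from the CrawlDb, then dump per-host statistics.
    # Flag names are approximate; run the tools without arguments for exact usage.
    bin/nutch updatehostdb -crawldb /orgs/data/crawldb -hostdb /orgs/data/hostdb
    bin/nutch readhostdb /orgs/data/hostdb -dump /orgs/data/hostdb_dump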
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Saturday 5th November 2016 15:59
> To: [email protected]
> Subject: Re: crawling speed when polite
>
> Yes, after a couple of rounds there are many, many hosts in the crawldb.
> Here are statistics after a bunch of rounds. It seems like we should be
> able to have a bunch of threads going.
>
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: TOTAL urls: 4635265
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: retry 0: 4634831
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: retry 1: 434
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: min score: 0.0
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: avg score: 1.7258992E-6
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: max score: 1.0
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 4530150
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 2 (db_fetched): 70219
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 3 (db_gone): 21209
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 3747
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 9222
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 718
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Saturday, November 5, 2016 4:23 AM
> Subject: RE: crawling speed when polite
>
> Hi - If you crawl politely (>= 1 second per URL per host/domain), then that
> is obviously your maximum speed; no setting will ever change that. If you
> want to do 2k URLs per second, you just have to crawl 2k different hosts
> or domains, depending on your generate queue mode.
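To make that politeness arithmetic explicit: fetcher.queue.mode, fetcher.server.delay and fetcher.threads.fetch are the standard Nutch 1.x properties involved, and the sketch below (including the rough host-counting pipeline) only assumes the values quoted in this thread and a local-filesystem run.

    # With the default fetcher.queue.mode=byHost and fetcher.server.delay=1.0,
    # each host queue releases at most one URL per second, so per fetch round:
    #
    #   max pages/sec <= min(distinct hosts in the fetch list, fetcher.threads.fetch)
    #
    # A fetch list drawn from 6 seed domains therefore tops out near 6 pages/sec
    # even with -threads 50; 2000 pages/sec needs roughly 2000 distinct hosts.
    #
    # Rough check of how many distinct hosts a generated segment actually covers
    # (readseg -dump is standard; the awk host extraction is a sketch):
    bin/nutch readseg -dump /orgs/data/segments/20161104110458 /tmp/seg_dump \
        -nocontent -noparse -noparsedata -noparsetext
    grep -h '^URL::' /tmp/seg_dump/dump | awk -F/ '{print $3}' | sort -u | wc -l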
> -----Original message-----
> > From: Michael Coffey <[email protected]>
> > Sent: Friday 4th November 2016 22:31
> > To: [email protected]
> > Subject: crawling speed when polite
> >
> > Can anyone point me to some good information on how to optimize crawling
> > speed while maintaining politeness?
> >
> > My current situation is that Nutch is running reliably for me on a single
> > hadoop node. Before bringing up additional nodes, I want to make it go
> > reasonably fast on this one node. At the moment it is only trying to fetch
> > less than 1 url per second. It seems like it should be able to do much more
> > than this, but it is utilizing very little internet bandwidth and CPU time.
> >
> > I originally seeded it with 6 urls, each on a different domain. I generate
> > topN 1000 in each round. I have set generate.max.count to 100 and
> > fetcher.server.delay to 1.0. I do not explicitly set any number of threads.
> > After 10 rounds, I get the following statistics. This took about 12 hours
> > of elapsed time.
> >
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0: 56949
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1: 27
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score: 0.0
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score: 1.2285875E-4
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score: 1.0
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 47486
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched): 6697
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone): 2424
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 38
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 202
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 129
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> > Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations
> >
> > I use the standard crawl script, with only sizeFetchlist changed. It issues
> > the following generate command:
> >
> > /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate -D
> > mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true /orgs/data/crawldb /orgs/data/segments
> > -topN 1000 -numFetchers 1 -noFilter -adddays 30
> >
> > It issues the following fetch command:
> >
> > /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch -D
> > mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> > /orgs/data/segments/20161104110458 -noParsing -threads 50
> >
> > Any suggestions would be greatly appreciated. By the way, thanks for all
> > the help so far!
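Once the CrawlDb holds millions of unfetched URLs across many hosts, one way to act on Markus's advice is to generate a much larger fetch list while keeping the per-host cap, so the 50 fetcher threads have many queues to draw from. A possible variant of the generate command above, not taken from the thread: the -topN value is illustrative, and generate.count.mode=host only spells out the Nutch 1.x default.

    /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate \
        -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m \
        -D generate.max.count=100 -D generate.count.mode=host \
        /orgs/data/crawldb /orgs/data/segments \
        -topN 20000 -numFetchers 1 -noFilter -adddays 30

Raising fetcher.threads.per.queue above 1 (paired with fetcher.server.min.delay) trades politeness for speed, so for a polite crawl it is usually left at its default of 1.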

