Hi - if you crawl politely (>= 1 second per URL per host/domain), then that is 
obviously your maximum speed per host; no setting will ever change that. If you 
want to do 2k URLs per second, you simply have to crawl 2k different hosts or 
domains, depending on how your fetch queues are partitioned at generate time.
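
To make the arithmetic concrete: with a 1.0 second per-queue delay, the 
theoretical ceiling is roughly (distinct hosts in the fetchlist) / 
fetcher.server.delay fetches per second, so a fetchlist that only covers a 
handful of hosts will never go much past a handful of URLs per second, no 
matter how many threads the fetcher gets. The sketch below just restates your 
fetch command with the standard 1.x properties that control this written out 
as -D overrides; the values are only an illustration, not a recommendation:

    # ceiling ~= distinct queues in the fetchlist / fetcher.server.delay
    # fetcher.queue.mode decides what counts as one queue (byHost, byDomain or byIP)
    /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch \
      -D fetcher.server.delay=1.0 \
      -D fetcher.queue.mode=byHost \
      -D fetcher.threads.per.queue=1 \
      -D fetcher.threads.fetch=50 \
      /orgs/data/segments/20161104110458 -noParsing

Raising fetcher.threads.per.queue above 1 switches the delay to 
fetcher.server.min.delay, which is faster but no longer polite in the 
one-request-per-second sense. The real lever is getting more distinct hosts 
into each fetchlist: generate.max.count (with generate.count.mode) caps how 
many URLs per host or domain go into a segment, and a larger topN can then 
fill the remaining slots from other hosts in the CrawlDb.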

-----Original message-----
> From:Michael Coffey <[email protected]>
> Sent: Friday 4th November 2016 22:31
> To: [email protected]
> Subject: crawling speed when polite
> 
> Can anyone point me to some good information on how to optimize crawling 
> speed while maintaining politeness?
> My current situation is that Nutch is running reliably for me on a single 
> Hadoop node. Before bringing up additional nodes, I want to make it go 
> reasonably fast on this one node. At the moment it is fetching less than 
> 1 URL per second. It seems like it should be able to do much more than 
> this, but it is using very little internet bandwidth and CPU time.
> 
> I originally seeded it with 6 URLs, each on a different domain. I generate 
> topN 1000 in each round. I have set generate.max.count to 100 and 
> fetcher.server.delay to 1.0. I do not explicitly set any number of threads.
> After 10 rounds, I get the following statistics. This took about 12 hours of 
> elapsed time.
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0:    56949
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1:    27
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score:  0.0
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score:  1.2285875E-4
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score:  1.0
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched):    47486
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched):      6697
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone): 2424
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp):   38
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm):   202
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate):    129
> 16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations
> I use the standard crawl script, with only sizeFetchlist changed. It issues 
> the following generate command:
> /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate -D 
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D 
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D 
> mapreduce.map.output.compress=true /orgs/data/crawldb /orgs/data/segments 
> -topN 1000 -numFetchers 1 -noFilter -adddays 30
> 
> It issues the following fetch command:
> /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch -D 
> mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D 
> mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D 
> mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180 
> /orgs/data/segments/20161104110458 -noParsing -threads 50
> 
> 
> Any suggestions would be greatly appreciated. By the way, thanks for all the 
> help so far!
> 
> 
