These CrawlDb statistics do not indicate the number of distinct hosts. Please use the HostDb tool to generate host statistics.

Markus.
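For what it's worth, building and reading the HostDb looks roughly like the two commands below. The flag spellings are an approximation from Nutch 1.12's UpdateHostDb and ReadHostDb tools, so run each command without arguments first to confirm the exact usage; the paths reuse this thread's layout.

    # Build or refresh a HostDb from the CrawlDb, then dump per-host statistics.
    # Flag names are approximate; run the tools without arguments for exact usage.
    bin/nutch updatehostdb -crawldb /orgs/data/crawldb -hostdb /orgs/data/hostdb
    bin/nutch readhostdb /orgs/data/hostdb -dump /orgs/data/hostdb_dump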
-----Original message-----
> From: Michael Coffey <[email protected]>
> Sent: Saturday 5th November 2016 15:59
> To: [email protected]
> Subject: Re: crawling speed when polite
>
> Yes, after a couple of rounds there are many, many hosts in the crawldb.
> Here are statistics after a bunch of rounds. It seems like we should be
> able to have a bunch of threads going.
>
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: TOTAL urls: 4635265
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: retry 0: 4634831
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: retry 1: 434
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: min score: 0.0
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: avg score: 1.7258992E-6
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: max score: 1.0
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 4530150
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 2 (db_fetched): 70219
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 3 (db_gone): 21209
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 3747
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 9222
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 718
> 16/11/05 06:38:45 INFO crawl.CrawlDbReader: CrawlDb statistics: done
>
> From: Markus Jelsma <[email protected]>
> To: "[email protected]" <[email protected]>
> Sent: Saturday, November 5, 2016 4:23 AM
> Subject: RE: crawling speed when polite
>
> Hi - If you crawl politely (>= 1 second per URL per host/domain), then that
> is obviously your maximum speed; no setting will ever change that. If you
> want to do 2k URLs per second, you just have to crawl 2k different hosts
> or domains, depending on your generate queue mode.
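To make that politeness arithmetic explicit: fetcher.queue.mode, fetcher.server.delay and fetcher.threads.fetch are the standard Nutch 1.x properties involved, and the sketch below (including the rough host-counting pipeline) only assumes the values quoted in this thread and a local-filesystem run.

    # With the default fetcher.queue.mode=byHost and fetcher.server.delay=1.0,
    # each host queue releases at most one URL per second, so per fetch round:
    #
    #   max pages/sec <= min(distinct hosts in the fetch list, fetcher.threads.fetch)
    #
    # A fetch list drawn from 6 seed domains therefore tops out near 6 pages/sec
    # even with -threads 50; 2000 pages/sec needs roughly 2000 distinct hosts.
    #
    # Rough check of how many distinct hosts a generated segment actually covers
    # (readseg -dump is standard; the awk host extraction is a sketch):
    bin/nutch readseg -dump /orgs/data/segments/20161104110458 /tmp/seg_dump \
        -nocontent -noparse -noparsedata -noparsetext
    grep -h '^URL::' /tmp/seg_dump/dump | awk -F/ '{print $3}' | sort -u | wc -l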
> -----Original message-----
> > From: Michael Coffey <[email protected]>
> > Sent: Friday 4th November 2016 22:31
> > To: [email protected]
> > Subject: crawling speed when polite
> >
> > Can anyone point me to some good information on how to optimize crawling
> > speed while maintaining politeness?
> >
> > My current situation is that Nutch is running reliably for me on a single
> > hadoop node. Before bringing up additional nodes, I want to make it go
> > reasonably fast on this one node. At the moment it is only trying to fetch
> > less than 1 url per second. It seems like it should be able to do much more
> > than this, but it is utilizing very little internet bandwidth and CPU time.
> >
> > I originally seeded it with 6 urls, each on a different domain. I generate
> > topN 1000 in each round. I have set generate.max.count to 100 and
> > fetcher.server.delay to 1.0. I do not explicitly set any number of threads.
> > After 10 rounds, I get the following statistics. This took about 12 hours
> > of elapsed time.
> >
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: Statistics for CrawlDb: /orgs/data/crawldb
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: TOTAL urls: 56976
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 0: 56949
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: retry 1: 27
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: min score: 0.0
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: avg score: 1.2285875E-4
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: max score: 1.0
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 1 (db_unfetched): 47486
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 2 (db_fetched): 6697
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 3 (db_gone): 2424
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 4 (db_redir_temp): 38
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 5 (db_redir_perm): 202
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: status 7 (db_duplicate): 129
> > 16/11/04 08:17:47 INFO crawl.CrawlDbReader: CrawlDb statistics: done
> > Fri Nov 4 08:17:48 PDT 2016 : Finished loop with 10 iterations
> >
> > I use the standard crawl script, with only sizeFetchlist changed. It issues
> > the following generate command:
> >
> > /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate -D
> > mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true /orgs/data/crawldb /orgs/data/segments
> > -topN 1000 -numFetchers 1 -noFilter -adddays 30
> >
> > It issues the following fetch command:
> >
> > /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch fetch -D
> > mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m -D
> > mapreduce.reduce.speculative=false -D mapreduce.map.speculative=false -D
> > mapreduce.map.output.compress=true -D fetcher.timelimit.mins=180
> > /orgs/data/segments/20161104110458 -noParsing -threads 50
> >
> > Any suggestions would be greatly appreciated. By the way, thanks for all
> > the help so far!
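Once the CrawlDb holds millions of unfetched URLs across many hosts, one way to act on Markus's advice is to generate a much larger fetch list while keeping the per-host cap, so the 50 fetcher threads have many queues to draw from. A possible variant of the generate command above, not taken from the thread: the -topN value is illustrative, and generate.count.mode=host only spells out the Nutch 1.x default.

    /home/mjc/apache-nutch-1.12/runtime/deploy/bin/nutch generate \
        -D mapreduce.job.reduces=2 -D mapred.child.java.opts=-Xmx1000m \
        -D generate.max.count=100 -D generate.count.mode=host \
        /orgs/data/crawldb /orgs/data/segments \
        -topN 20000 -numFetchers 1 -noFilter -adddays 30

Raising fetcher.threads.per.queue above 1 (paired with fetcher.server.min.delay) trades politeness for speed, so for a polite crawl it is usually left at its default of 1.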

