Hi,

On Nov 3, 2010, at 9:12 AM, Eric Martin wrote:

> Thank you very much. I can understand what you wrote! My crawl has been
> running for days. Here are some new settings I would like to put into
> nutch-site.xml.
>
> 1.) Does anyone see them as rude?
> 2.) How can I stop the crawler without losing the crawled data (I'm running
> 1.2 and should be able to use kill-nutch but can't find out where to do that)
> 3.) how do I restart the currently stopped crawler with the new
> nutch-site.xml settings?
>
> <name>fetcher.server.min.delay</name>
> <value>0.5</value>
>
> <name>fetcher.server.delay</name>
> <value>1.0</value>
>
> <name>fetcher.threads.per.host</name>
> <value>50</value>

If fetcher.threads.per.host is greater than 1, the fetcher will have more than one request active against the same server at a time. I don't know about the sites below, but that is generally frowned upon.

>
> <name>fetcher.max.crawl.delay</name>
> <value>1</value>
>
> <name>http.threads.per.host</name>
> <value>50</value>
>
> <name>http.max.delays</name>
> <value>1</value>
>
> Target Sites:
>
> http://www.ecasebriefs.com/blog/law/
> http://www.lawnix.com/cases/cases-index/
> http://www.oyez.org/
> http://www.4lawnotes.com/
> http://www.docstoc.com/documents/education/law-school/case-briefs
> http://www.lawschoolcasebriefs.com/
> http://dictionary.findlaw.com
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 2:55 AM
> To: [email protected]
> Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic
>
> On 2010-11-03 02:40, Eric Martin wrote:
>> Hi,
>>
>> I am getting these logs and I have no idea what they mean. I have searched
>> google and found very little documentation on it. That doesn't mean it
>> doesn't exist just that I have a hard time finding it. I may have missed an
>> obvious discussion of this and I am sorry if I did. Can someone point me to
>> the documentation or an answer? I'm a law student. Thanks!
>
> Given the composition of your fetch list (all remaining URLs in the
> queue are from the same host) what you see is perfectly normal. There
> are 50 fetching threads that can fetch items from any host. However, all
> remaining items are from the same single host. Due to the politeness
> limits Nutch won't make more than one connection to the host, and it
> will space its requests N seconds apart - otherwise a multi-threaded
> distributed crawler could easily overwhelm the target host.
>
> So the logs indicate that only one thread is fetching at any given time,
> there are at least 2500 items in the queue, and every N seconds the
> thread is allowed to fetch an item. All other threads are spinning idle.
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
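
To put rough numbers on Andrzej's explanation, here is a minimal Python sketch of the politeness model he describes. It is a simplified model and not Nutch's actual code: the function names `drain_time_seconds` and `busy_fraction` are made up for illustration, download time is ignored, and the 5-second delay in the usage note is only an example value for the "N seconds" above.

```python
import math


def drain_time_seconds(queue_size, delay_seconds, threads_per_host=1):
    """Rough lower bound on the time needed to empty one host's queue.

    Assumed model (not Nutch's implementation): each of the
    threads_per_host connections fetches one URL, then waits
    delay_seconds before its next request. Download time is ignored.
    """
    if queue_size <= 0:
        return 0.0
    # Number of "rounds" of requests needed to cover the whole queue.
    rounds = math.ceil(queue_size / threads_per_host)
    # The first round starts immediately; every later round waits one delay.
    return (rounds - 1) * delay_seconds


def busy_fraction(total_threads, active_hosts, threads_per_host=1):
    """Fraction of fetcher threads that can be fetching at the same time."""
    usable = min(total_threads, active_hosts * threads_per_host)
    return usable / total_threads
```

With 2500 URLs left for a single host and an assumed 5-second delay, `drain_time_seconds(2500, 5.0)` gives 12495.0 seconds, or roughly three and a half hours, while `busy_fraction(50, 1)` gives 0.02: one of the 50 threads works and the other 49 spin. Raising the threads per host shortens the drain time in this model only by sending concurrent requests to the same server, which is the behaviour that is frowned upon.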

