Thank you very much. I understand what you wrote! My crawl has been running for days. Here are some new settings I would like to put into nutch-site.xml.
1.) Does anyone see them as rude?
2.) How can I stop the crawler without losing the crawled data? (I'm running 1.2 and should be able to use kill-nutch, but I can't find out where to do that.)
3.) How do I restart the currently stopped crawler with the new nutch-site.xml settings?

<name>fetcher.server.min.delay</name> <value>0.5</value>
<name>fetcher.server.delay</name> <value>1.0</value>
<name>fetcher.threads.per.host</name> <value>50</value>
<name>fetcher.max.crawl.delay</name> <value>1</value>
<name>http.threads.per.host</name> <value>50</value>
<name>http.max.delays</name> <value>1</value>

Target Sites:
http://www.ecasebriefs.com/blog/law/
http://www.lawnix.com/cases/cases-index/
http://www.oyez.org/
http://www.4lawnotes.com/
http://www.docstoc.com/documents/education/law-school/case-briefs
http://www.lawschoolcasebriefs.com/
http://dictionary.findlaw.com

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]]
Sent: Wednesday, November 03, 2010 2:55 AM
To: [email protected]
Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic

On 2010-11-03 02:40, Eric Martin wrote:
> Hi,
>
> I am getting these logs and I have no idea what they mean. I have searched
> Google and found very little documentation on it. That doesn't mean it
> doesn't exist, just that I have a hard time finding it. I may have missed an
> obvious discussion of this, and I am sorry if I did. Can someone point me to
> the documentation or an answer? I'm a law student. Thanks!

Given the composition of your fetch list (all remaining URLs in the queue are from the same host), what you see is perfectly normal. There are 50 fetching threads that can fetch items from any host. However, all remaining items are from the same single host. Due to the politeness limits, Nutch won't make more than one connection to that host, and it will space its requests N seconds apart; otherwise a multi-threaded distributed crawler could easily overwhelm the target host.
So the logs indicate that only one thread is fetching at any given time, there are at least 2500 items in the queue, and every N seconds the thread is allowed to fetch an item. All other threads are spinning idle.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com
Contact: info at sigram dot com
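As an aside on the settings quoted at the top of this message: in nutch-site.xml each name/value pair normally sits inside a <property> element (the same Hadoop-style configuration layout as nutch-default.xml). A sketch of that layout using two of the values from the message above; the values themselves are the poster's, not recommendations:

```xml
<property>
  <name>fetcher.server.delay</name>
  <value>1.0</value>
</property>
<property>
  <name>fetcher.threads.per.host</name>
  <value>50</value>
</property>
```

Bare <name>/<value> pairs without the enclosing <property> wrapper will not be picked up by the configuration loader.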
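A quick back-of-the-envelope check of what draining that queue looks like. This is only a sketch: the 2500-item figure and the one-request-every-N-seconds behavior come from Andrzej's reply above, and the helper function below is purely illustrative, not part of Nutch.

```python
def single_host_drain_time(num_items: int, delay_seconds: float) -> float:
    """Estimate wall-clock seconds to drain a fetch queue when politeness
    limits allow only one connection to a host, with requests spaced
    delay_seconds apart. Ignores the actual download time per page."""
    return num_items * delay_seconds

# With ~2500 queued URLs from a single host and a 1.0 s politeness delay:
seconds = single_host_drain_time(2500, 1.0)
print(f"{seconds / 3600:.2f} hours")  # prints "0.69 hours"
```

In other words, even with 50 fetcher threads configured, a single-host tail of the queue is serialized by the politeness delay, which is why the logs show one active thread and the rest spin-waiting.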

