You can also specify a non-default value for

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list
  is skipped and all active queues are emptied. The default value of -1
  deactivates the time limit.
  </description>
</property>

which will prevent the fetching from taking too long, typically when a
single host is holding everything up, which I think was your case.

HTH

Julien
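For illustration, overriding that default in conf/nutch-site.xml only needs
the property name and a new value; the 180 minutes below is an arbitrary
example, not a recommended setting:

<property>
  <name>fetcher.timelimit.mins</name>
  <value>180</value>
</property>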
On 3 November 2010 16:12, Eric Martin <[email protected]> wrote:
> Thank you very much. I can understand what you wrote! My crawl has been
> running for days. Here are some new settings I would like to put into
> nutch-site.xml.
>
> 1.) Does anyone see them as rude?
> 2.) How can I stop the crawler without losing the crawled data? (I'm
> running 1.2 and should be able to use kill-nutch but can't find out
> where to do that.)
> 3.) How do I restart the currently stopped crawler with the new
> nutch-site.xml settings?
>
> <name>fetcher.server.min.delay</name>
> <value>0.5</value>
>
> <name>fetcher.server.delay</name>
> <value>1.0</value>
>
> <name>fetcher.threads.per.host</name>
> <value>50</value>
>
> <name>fetcher.max.crawl.delay</name>
> <value>1</value>
>
> <name>http.threads.per.host</name>
> <value>50</value>
>
> <name>http.max.delays</name>
> <value>1</value>
>
> Target Sites:
>
> http://www.ecasebriefs.com/blog/law/
> http://www.lawnix.com/cases/cases-index/
> http://www.oyez.org/
> http://www.4lawnotes.com/
> http://www.docstoc.com/documents/education/law-school/case-briefs
> http://www.lawschoolcasebriefs.com/
> http://dictionary.findlaw.com
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 2:55 AM
> To: [email protected]
> Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic
>
> On 2010-11-03 02:40, Eric Martin wrote:
> > Hi,
> >
> > I am getting these logs and I have no idea what they mean. I have
> > searched Google and found very little documentation on it. That doesn't
> > mean it doesn't exist, just that I have a hard time finding it. I may
> > have missed an obvious discussion of this and I am sorry if I did. Can
> > someone point me to the documentation or an answer? I'm a law student.
> > Thanks!
>
> Given the composition of your fetch list (all remaining URLs in the queue
> are from the same host), what you see is perfectly normal. There are 50
> fetching threads that can fetch items from any host. However, all
> remaining items are from the same single host. Due to the politeness
> limits Nutch won't make more than one connection to the host, and it
> will space its requests N seconds apart; otherwise a multi-threaded
> distributed crawler could easily overwhelm the target host.
>
> So the logs indicate that only one thread is fetching at any given time,
> there are at least 2500 items in the queue, and every N seconds the
> thread is allowed to fetch an item. All other threads are spinning idle.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
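The settings Eric lists in his message above are bare name/value pairs; in
nutch-site.xml each pair sits inside a <property> element under a single
<configuration> root. A minimal sketch of such a file, copying two of his
proposed values purely as an illustration, would look like this:

<?xml version="1.0"?>
<configuration>

  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>50</value>
  </property>

  <!-- the remaining entries follow the same pattern -->

</configuration>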

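To make Andrzej's point about spin-waiting concrete, here is a minimal,
stand-alone Java sketch (not Nutch's actual Fetcher code; the class and
method names are invented) of the politeness rule he describes: each host
records the earliest time it may be contacted again, so when every queued
URL belongs to a single host, only one thread can fetch at a time and the
other 49 threads have nothing to do but wait and retry.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch of a per-host politeness gate: at most one request to a given host
 * every delayMs milliseconds, regardless of how many fetcher threads run.
 */
public class PolitenessSketch {

    // earliest time (epoch millis) at which each host may be contacted again
    private final Map<String, Long> nextAllowedFetch = new HashMap<>();
    private final long delayMs;

    public PolitenessSketch(long delayMs) {   // e.g. 1000 ms between requests
        this.delayMs = delayMs;
    }

    /** Returns true if the calling thread may fetch from this host right now. */
    public synchronized boolean tryAcquire(String host) {
        long now = System.currentTimeMillis();
        long next = nextAllowedFetch.getOrDefault(host, 0L);
        if (now < next) {
            return false;   // slot taken: the thread has to spin-wait and retry
        }
        nextAllowedFetch.put(host, now + delayMs);  // reserve the next slot
        return true;
    }
}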
