Thank you very much. I can understand what you wrote! My crawl has been running 
for days. Here are some new settings I would like to put into nutch-site.xml. 

1.) Does anyone see them as rude? 
2.) How can I stop the crawler without losing the crawled data? (I'm running 1.2 
and should be able to use kill-nutch, but I can't find where to do that.) 
3.) How do I restart the stopped crawler so that it picks up the new 
nutch-site.xml settings?

  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.5</value>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>50</value>
  </property>

  <property>
    <name>fetcher.max.crawl.delay</name>
    <value>1</value>
  </property>

  <property>
    <name>http.threads.per.host</name>
    <value>50</value>
  </property>

  <property>
    <name>http.max.delays</name>
    <value>1</value>
  </property>
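For completeness, nutch-site.xml is a Hadoop-style configuration file, so each name/value pair must be wrapped in a `<property>` element inside the top-level `<configuration>` element. A minimal sketch of the file shape (the single entry shown is just illustrative):

```xml
<?xml version="1.0"?>
<configuration>
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>
  <!-- further <property> entries go here -->
</configuration>
```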

Target Sites:

http://www.ecasebriefs.com/blog/law/
http://www.lawnix.com/cases/cases-index/
http://www.oyez.org/
http://www.4lawnotes.com/
http://www.docstoc.com/documents/education/law-school/case-briefs
http://www.lawschoolcasebriefs.com/
http://dictionary.findlaw.com

-----Original Message-----
From: Andrzej Bialecki [mailto:[email protected]] 
Sent: Wednesday, November 03, 2010 2:55 AM
To: [email protected]
Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic

On 2010-11-03 02:40, Eric Martin wrote:
> Hi,
> 
>  
> 
> I am getting these logs and I have no idea what they mean. I have searched
> Google and found very little documentation on them. That doesn't mean the
> documentation doesn't exist, just that I have a hard time finding it. I may
> have missed an obvious discussion of this, and I am sorry if I did. Can
> someone point me to the documentation or an answer? I'm a law student. Thanks!

Given the composition of your fetch list (all remaining URLs in the
queue are from the same host), what you see is perfectly normal. There
are 50 fetching threads that can fetch items from any host. However, all
remaining items are from the same single host. Due to the politeness
limits Nutch won't make more than one connection to the host, and it
will space its requests N seconds apart - otherwise a multi-threaded
distributed crawler could easily overwhelm the target host.

So the logs indicate that only one thread is fetching at any given time,
there are at least 2500 items in the queue, and every N seconds the
thread is allowed to fetch an item. All other threads are spinning idle.
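The arithmetic behind this can be sketched as a toy model (this is not Nutch source code, and the names are made up): with politeness enabled, at most one request is in flight per host, spaced the configured delay apart, so a single-host queue drains serially no matter how many fetcher threads are configured.

```java
// Toy model of the per-host politeness limit (not Nutch code).
public class PolitenessSketch {

    // Rough total fetch time: effective parallelism is capped by the
    // number of distinct hosts, because requests to the same host must
    // be spaced delaySec apart.
    static long estimateSeconds(int items, int hosts, double delaySec, int threads) {
        int parallel = Math.min(threads, hosts);
        return (long) Math.ceil(items * delaySec / parallel);
    }

    public static void main(String[] args) {
        // 2500 URLs on one host, 1 s delay, 50 threads: ~2500 s,
        // with 49 threads spinning idle the whole time.
        System.out.println(estimateSeconds(2500, 1, 1.0, 50));
        // The same 2500 URLs spread over 50 hosts: ~50 s.
        System.out.println(estimateSeconds(2500, 50, 1.0, 50));
    }
}
```

The point of the model is that raising the thread count does nothing for a single-host queue; only shortening the delay (at the cost of politeness) or diversifying hosts speeds it up.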

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
