Hi,
On Nov 3, 2010, at 9:12 AM, Eric Martin wrote:

> Thank you very much. I can understand what you wrote! My crawl has been 
> running for days. Here are some new settings I would like to put into 
> nutch-site.xml. 
> 
> 1.) Does anyone see them as rude? 
> 2.) How can I stop the crawler without losing the crawled data? (I'm running 
> 1.2 and should be able to use kill-nutch, but I can't figure out where to do that.) 
> 3.) How do I restart the currently stopped crawler with the new 
> nutch-site.xml settings?
> 
>  <property>
>    <name>fetcher.server.min.delay</name>
>    <value>0.5</value>
>  </property>
> 
>  <property>
>    <name>fetcher.server.delay</name>
>    <value>1.0</value>
>  </property>
> 
>  <property>
>    <name>fetcher.threads.per.host</name>
>    <value>50</value>
>  </property>

If fetcher.threads.per.host > 1, then the fetcher will make more than one 
concurrent request to the same host. I don't know about the sites below, 
but that is generally frowned upon.
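
For what it's worth, a politer setup in conf/nutch-site.xml would look roughly 
like the sketch below: one connection per host and a few seconds between 
requests. The 5-second delay is only illustrative (I believe it is also the 
shipped default), so adjust it to whatever the target sites can tolerate:

  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>5.0</value>
  </property>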

> 
>  <property>
>    <name>fetcher.max.crawl.delay</name>
>    <value>1</value>
>  </property>
> 
>  <property>
>    <name>http.threads.per.host</name>
>    <value>50</value>
>  </property>
> 
>  <property>
>    <name>http.max.delays</name>
>    <value>1</value>
>  </property>
> 
> Target Sites:
> 
> http://www.ecasebriefs.com/blog/law/
> http://www.lawnix.com/cases/cases-index/
> http://www.oyez.org/
> http://www.4lawnotes.com/
> http://www.docstoc.com/documents/education/law-school/case-briefs
> http://www.lawschoolcasebriefs.com/
> http://dictionary.findlaw.com
> 
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[email protected]] 
> Sent: Wednesday, November 03, 2010 2:55 AM
> To: [email protected]
> Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic
> 
> On 2010-11-03 02:40, Eric Martin wrote:
>> Hi,
>> 
>> 
>> 
>> I am getting these logs and I have no idea what they mean. I have searched
>> Google and found very little documentation on them. That doesn't mean it
>> doesn't exist, just that I have a hard time finding it. I may have missed an
>> obvious discussion of this, and I am sorry if I did. Can someone point me to
>> the documentation or an answer? I'm a law student. Thanks!
> 
> Given the composition of your fetch list (all remaining URLs in the
> queue are from the same host) what you see is perfectly normal. There
> are 50 fetching threads that can fetch items from any host. However, all
> remaining items are from the same single host. Due to the politeness
> limits Nutch won't make more than one connection to the host, and it
> will space its requests N seconds apart - otherwise a multi-threaded
> distributed crawler could easily overwhelm the target host.
> 
> So the logs indicate that only one thread is fetching at any given time,
> there are at least 2500 items in the queue, and every N seconds the
> thread is allowed to fetch an item. All other threads are spinning idle.
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
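
To put Andrzej's explanation into rough numbers: with only one host left in 
the queue, Nutch will fetch at most one URL every fetcher.server.delay 
seconds, no matter how many threads are configured. Assuming, purely for 
illustration, a 5-second delay and the ~2500 queued items he mentions, that 
is about 2500 * 5 = 12,500 seconds, i.e. roughly three and a half hours, 
while the other 49 threads sit in spin-wait. Plug in whatever delay you 
actually have set to get your own estimate.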
