You can also specify a non-default value for

<property>
  <name>fetcher.timelimit.mins</name>
  <value>-1</value>
  <description>This is the number of minutes allocated to the fetching.
  Once this value is reached, any remaining entry from the input URL list is skipped
  and all active queues are emptied. The default value of -1 deactivates the time limit.
  </description>
</property>

which will prevent the fetch from taking too long, typically when a
single host is holding everything up, as I think happened in your case.
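
As a rough sketch, an override in nutch-site.xml could look like the
fragment below. The value of 180 minutes is only an illustrative
assumption, not a recommendation; choose whatever limit suits your crawl.

```xml
<?xml version="1.0"?>
<!-- nutch-site.xml: values here override those in nutch-default.xml -->
<configuration>
  <property>
    <name>fetcher.timelimit.mins</name>
    <!-- 180 is an arbitrary example value; -1 (the default) disables the limit -->
    <value>180</value>
  </property>
</configuration>
```

With a setting like this, once the limit is reached the remaining URLs
from the slow host are skipped for that fetch rather than holding up the
whole segment.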

HTH

Julien


On 3 November 2010 16:12, Eric Martin <[email protected]> wrote:

> Thank you very much. I can understand what you wrote! My crawl has been
> running for days. Here are some new settings I would like to put into
> nutch-site.xml.
>
> 1.) Does anyone see them as rude?
> 2.) How can I stop the crawler without losing the crawled data (I'm running
> 1.2 and should be able to use kill-nutch but can't find out where to do
> that)
> 3.) how do I restart the currently stopped crawler with the new
> nutch-site.xml settings?
>
>  <name>fetcher.server.min.delay</name>
>  <value>0.5</value>
>
>  <name>fetcher.server.delay</name>
>  <value>1.0</value>
>
>  <name>fetcher.threads.per.host</name>
>  <value>50</value>
>
>  <name>fetcher.max.crawl.delay</name>
>  <value>1</value>
>
>  <name>http.threads.per.host</name>
>  <value>50</value>
>
>  <name>http.max.delays</name>
>  <value>1</value>
>
> Target Sites:
>
> http://www.ecasebriefs.com/blog/law/
> http://www.lawnix.com/cases/cases-index/
> http://www.oyez.org/
> http://www.4lawnotes.com/
> http://www.docstoc.com/documents/education/law-school/case-briefs
> http://www.lawschoolcasebriefs.com/
> http://dictionary.findlaw.com
>
> -----Original Message-----
> From: Andrzej Bialecki [mailto:[email protected]]
> Sent: Wednesday, November 03, 2010 2:55 AM
> To: [email protected]
> Subject: Re: Logs Spin - Active Thread - Spin Waiting - Basic
>
> On 2010-11-03 02:40, Eric Martin wrote:
> > Hi,
> >
> >
> >
> > I am getting these logs and I have no idea what they mean. I have searched
> > google and found very little documentation on it. That doesn't mean it
> > doesn't exist just that I have a hard time finding it. I may have missed an
> > obvious discussion of this and I am sorry if I did. Can someone point me to
> > the documentation or an answer? I'm a law student. Thanks!
>
> Given the composition of your fetch list (all remaining URLs in the
> queue are from the same host) what you see is perfectly normal. There
> are 50 fetching threads that can fetch items from any host. However, all
> remaining items are from the same single host. Due to the politeness
> limits Nutch won't make more than one connection to the host, and it
> will space its requests N seconds apart - otherwise a multi-threaded
> distributed crawler could easily overwhelm the target host.
>
> So the logs indicate that only one thread is fetching at any given time,
> there are at least 2500 items in the queue, and every N seconds the
> thread is allowed to fetch an item. All other threads are spinning idle.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
