I think you're looking at https://issues.apache.org/jira/browse/NUTCH-1182. Logging of hung threads was added in 1.9, so the same thing is probably happening in 1.8 as well, it just isn't being logged there. Markus
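One way to check this (a rough sketch; the log path `logs/hadoop.log` assumes the default local-runtime layout, and the message patterns are taken from the output quoted below) is to grep the Nutch log for the fetcher status lines and the hung-thread abort that NUTCH-1182 added:

```shell
# Hypothetical check, assuming the default local-runtime log at logs/hadoop.log.
# Pulls out the periodic fetcher status lines (activeThreads/spinWaiting) and
# the "hung threads" abort message that 1.9 logs via NUTCH-1182.
grep -E 'activeThreads=|hung threads' logs/hadoop.log
```

If `spinWaiting` climbs to match `activeThreads` while `fetchQueues.totalSize` stays above zero, the threads are stalled rather than working, which matches the symptom described.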
-----Original message-----
> From: Issam Maamria <[email protected]>
> Sent: Friday 28th November 2014 15:37
> To: [email protected]
> Cc: Mourad K <[email protected]>; Vitaly Savicks <[email protected]>
> Subject: Nutch 1.9 Fetchers Hung
>
> Hi all,
>
> I am running the crawl command with depth 2 using a seed file containing
> 120 urls (about 2000 documents). Halfway through, the following output is
> logged:
>
> -activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
> fetchQueues.getQueueCount=24
>
> And after a while:
>
> Aborting with 50 hung threads.
>
> I am trying exactly the same thing using 1.8, and it is working fine.
> Please note that I am not applying any customisations apart from the
> following nutch-site.xml:
>
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>
> <!-- Put site-specific property overrides in this file. -->
>
> <configuration>
>
> <property>
>   <name>http.agent.name</name>
>   <value>MyAgent</value>
> </property>
>
> <property>
>   <name>http.robots.agents</name>
>   <value>MyAgent,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
>
> <!-- HTTP properties -->
>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, the fetcher won't immediately
>   follow redirected URLs; instead it will record them for later fetching.
>   </description>
> </property>
>
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
>
> <!-- web db properties -->
>
> <!-- fetcher properties -->
>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>4.0</value>
>   <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>20</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection). The total
>   number of threads running in distributed mode will be the number of
>   fetcher threads * number of nodes, as the fetcher has one map task per node.
>   </description>
> </property>
>
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>10</value>
>   <description>This number is the maximum number of threads that
>   should be allowed to access a queue at one time.
>   </description>
> </property>
>
> <!-- plugin properties -->
>
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
>   <!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least to include the nutch-extensionpoints plugin. By
>   default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with the
>   underlying commons-httpclient library.
>   </description>
> </property>
>
> <!-- parser properties -->
>
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
>
> <property>
>   <name>parser.timeout</name>
>   <value>-1</value>
>   <description>Timeout in seconds for the parsing of a document; otherwise
>   it is treated as an exception and the parser moves on to the following
>   documents. This parameter is applied to any Parser implementation.
>   Set to -1 to deactivate, bearing in mind that this could cause
>   the parsing to crash because of a very long or corrupted document.
>   </description>
> </property>
>
> </configuration>
>
> ----
>
> Help is greatly appreciated.
>
> Kind regards,
>
> Issam

