Hi Issam, hi Markus,

the warning that there are hung threads is also shown in 1.8. With NUTCH-1182 the hung threads are logged (if they are alive):
- URL in process / being fetched
- with DEBUG logging: stack where the thread is hanging
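As a rough JDK-only sketch (not the actual Nutch Fetcher code), the kind of per-thread report described above can be produced with `Thread.getAllStackTraces()`; the class and method names here are illustrative only:

```java
import java.util.Map;

public class HungThreadReport {

    // Illustrative only: report every live thread whose name starts with
    // the given prefix, and (as with DEBUG logging) dump its current stack.
    public static int report(String namePrefix, boolean debug) {
        int hung = 0;
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (t.isAlive() && t.getName().startsWith(namePrefix)) {
                hung++;
                System.err.println("Thread " + t.getName() + " is hung");
                if (debug) { // stack only with DEBUG logging enabled
                    for (StackTraceElement frame : e.getValue()) {
                        System.err.println("  at " + frame);
                    }
                }
            }
        }
        return hung;
    }

    public static void main(String[] args) {
        // The current thread is named "main" and is always alive,
        // so this reports at least one thread.
        int n = report("main", true);
        System.out.println("Reported " + n + " thread(s)");
    }
}
```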
If the problem persists, would it be possible to see more context from the log file? Ideally with debug logging turned on in $NUTCH_HOME/conf/log4j.properties:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout

Thanks,
Sebastian

On 11/28/2014 03:41 PM, Markus Jelsma wrote:
> I think you're looking at https://issues.apache.org/jira/browse/NUTCH-1182
> Logging of hung threads was added in 1.9, so the hang should happen in 1.8
> as well, just without being logged.
> Markus
>
>
>
> -----Original message-----
>> From: Issam Maamria <[email protected]>
>> Sent: Friday 28th November 2014 15:37
>> To: [email protected]
>> Cc: Mourad K <[email protected]>; Vitaly Savicks <[email protected]>
>> Subject: Nutch 1.9 Fetchers Hung
>>
>> Hi all,
>>
>> I am running the crawl command with depth 2 using a seed file containing
>> 120 urls (about 2000 documents). Halfway through, the following output is
>> logged:
>>
>> *-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
>> fetchQueues.getQueueCount=24*
>>
>> And after a while:
>>
>> *Aborting with 50 hung threads.*
>>
>> I am trying exactly the same thing using 1.8, and it is *working fine*.
>> Please note that I am not applying any customisations apart from the
>> following nutch-site.xml:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>MyAgent</value>
>> </property>
>>
>> <property>
>>   <name>http.robots.agents</name>
>>   <value> MyAgent,*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>   </description>
>> </property>
>>
>> <!-- HTTP properties -->
>>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow when
>>   trying to fetch a page. If set to negative or 0, the fetcher won't
>>   immediately follow redirected URLs; instead it will record them for
>>   later fetching.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> <!-- web db properties -->
>>
>> <!-- fetcher properties -->
>>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>4.0</value>
>>   <description>The number of seconds the fetcher will delay between
>>   successive requests to the same server.</description>
>> </property>
>>
>> <property>
>>   <name>fetcher.threads.fetch</name>
>>   <value>20</value>
>>   <description>The number of FetcherThreads the fetcher should use.
>>   This also determines the maximum number of requests that are
>>   made at once (each FetcherThread handles one connection). The total
>>   number of threads running in distributed mode will be the number of
>>   fetcher threads * number of nodes, as the fetcher has one map task
>>   per node.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>fetcher.threads.per.queue</name>
>>   <value>10</value>
>>   <description>This number is the maximum number of threads that
>>   should be allowed to access a queue at one time.
>>   </description>
>> </property>
>>
>> <!-- plugin properties -->
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
>>   <!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
>>   <description>Regular expression naming plugin directory names to
>>   include. Any plugin not matching this expression is excluded.
>>   In any case you need at least the nutch-extensionpoints plugin. By
>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please
>>   enable protocol-httpclient, but be aware of possible intermittent
>>   problems with the underlying commons-httpclient library.
>>   </description>
>> </property>
>>
>> <!-- parser properties -->
>>
>> <property>
>>   <name>parser.character.encoding.default</name>
>>   <value>utf-8</value>
>>   <description>The character encoding to fall back to when no other
>>   information is available</description>
>> </property>
>>
>> <property>
>>   <name>parser.timeout</name>
>>   <value>-1</value>
>>   <description>Timeout in seconds for the parsing of a document; otherwise
>>   it is treated as an exception and the parser moves on to the following
>>   documents. This parameter is applied to any Parser implementation.
>>   Set to -1 to deactivate, bearing in mind that this could cause
>>   the parsing to crash because of a very long or corrupted document.
>>   </description>
>> </property>
>>
>> </configuration>
>>
>> ----
>>
>> Help is greatly appreciated.
>>
>> Kind regards,
>>
>> Issam
>>
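For anyone following along: the debug setting suggested above goes into $NUTCH_HOME/conf/log4j.properties alongside the existing logger definitions (the cmdstdout console appender is defined in the stock Nutch log4j.properties):

```properties
# Log fetcher activity, including the hung-thread stacks added by
# NUTCH-1182, at DEBUG level via the console appender shipped with Nutch.
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
```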

