Hi Issam, hi Markus,

the warning that there are hung threads is also shown in 1.8. With NUTCH-1182 the hung threads are logged (if they are alive):
- URL in process / being fetched
- with DEBUG logging: stack where the thread is hanging
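As a rough JDK-only sketch (not the actual Nutch Fetcher code), the kind of per-thread report described above can be produced with `Thread.getAllStackTraces()`; the class and method names here are illustrative only:

```java
import java.util.Map;

public class HungThreadReport {

    // Illustrative only: report every live thread whose name starts with
    // the given prefix, and (as with DEBUG logging) dump its current stack.
    public static int report(String namePrefix, boolean debug) {
        int hung = 0;
        for (Map.Entry<Thread, StackTraceElement[]> e
                : Thread.getAllStackTraces().entrySet()) {
            Thread t = e.getKey();
            if (t.isAlive() && t.getName().startsWith(namePrefix)) {
                hung++;
                System.err.println("Thread " + t.getName() + " is hung");
                if (debug) { // stack only with DEBUG logging enabled
                    for (StackTraceElement frame : e.getValue()) {
                        System.err.println("  at " + frame);
                    }
                }
            }
        }
        return hung;
    }

    public static void main(String[] args) {
        // The current thread is named "main" and is always alive,
        // so this reports at least one thread.
        int n = report("main", true);
        System.out.println("Reported " + n + " thread(s)");
    }
}
```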
If the problem persists, would it be possible to see more context from the log file? Ideally with debug logging turned on in $NUTCH_HOME/conf/log4j.properties:

log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout

Thanks,
Sebastian

On 11/28/2014 03:41 PM, Markus Jelsma wrote:
> I think you're looking at https://issues.apache.org/jira/browse/NUTCH-1182
> Logging of hung threads was added in 1.9, so the hang should happen in 1.8
> as well, just without being logged.
> Markus
>
>
>
> -----Original message-----
>> From: Issam Maamria <[email protected]>
>> Sent: Friday 28th November 2014 15:37
>> To: [email protected]
>> Cc: Mourad K <[email protected]>; Vitaly Savicks <[email protected]>
>> Subject: Nutch 1.9 Fetchers Hung
>>
>> Hi all,
>>
>> I am running the crawl command with depth 2 using a seed file containing
>> 120 urls (about 2000 documents). Halfway through, the following output is
>> logged:
>>
>> *-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
>> fetchQueues.getQueueCount=24*
>>
>> And after a while:
>>
>> *Aborting with 50 hung threads.*
>>
>> I am trying exactly the same thing using 1.8, and it is *working fine*.
>> Please note that I am not applying any customisations apart from the
>> following nutch-site.xml:
>>
>> <?xml version="1.0"?>
>> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
>>
>> <!-- Put site-specific property overrides in this file. -->
>>
>> <configuration>
>>
>> <property>
>>   <name>http.agent.name</name>
>>   <value>MyAgent</value>
>> </property>
>>
>> <property>
>>   <name>http.robots.agents</name>
>>   <value> MyAgent,*</value>
>>   <description>The agent strings we'll look for in robots.txt files,
>>   comma-separated, in decreasing order of precedence. You should
>>   put the value of http.agent.name as the first agent name, and keep the
>>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>>   </description>
>> </property>
>>
>> <!-- HTTP properties -->
>>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow when
>>   trying to fetch a page. If set to negative or 0, the fetcher won't
>>   immediately follow redirected URLs; instead it will record them for
>>   later fetching.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>http.content.limit</name>
>>   <value>-1</value>
>>   <description>The length limit for downloaded content using the http://
>>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>>   than it will be truncated; otherwise, no truncation at all. Do not
>>   confuse this setting with the file.content.limit setting.
>>   </description>
>> </property>
>>
>> <!-- web db properties -->
>>
>> <!-- fetcher properties -->
>>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>4.0</value>
>>   <description>The number of seconds the fetcher will delay between
>>   successive requests to the same server.</description>
>> </property>
>>
>> <property>
>>   <name>fetcher.threads.fetch</name>
>>   <value>20</value>
>>   <description>The number of FetcherThreads the fetcher should use.
>>   This also determines the maximum number of requests that are
>>   made at once (each FetcherThread handles one connection). The total
>>   number of threads running in distributed mode will be the number of
>>   fetcher threads * number of nodes, as the fetcher has one map task
>>   per node.
>>   </description>
>> </property>
>>
>> <property>
>>   <name>fetcher.threads.per.queue</name>
>>   <value>10</value>
>>   <description>This number is the maximum number of threads that
>>   should be allowed to access a queue at one time.
>>   </description>
>> </property>
>>
>> <!-- plugin properties -->
>>
>> <property>
>>   <name>plugin.includes</name>
>>   <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
>>   <!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
>>   <description>Regular expression naming plugin directory names to
>>   include. Any plugin not matching this expression is excluded.
>>   In any case you need at least the nutch-extensionpoints plugin. By
>>   default Nutch includes crawling just HTML and plain text via HTTP,
>>   and basic indexing and search plugins. In order to use HTTPS please
>>   enable protocol-httpclient, but be aware of possible intermittent
>>   problems with the underlying commons-httpclient library.
>>   </description>
>> </property>
>>
>> <!-- parser properties -->
>>
>> <property>
>>   <name>parser.character.encoding.default</name>
>>   <value>utf-8</value>
>>   <description>The character encoding to fall back to when no other
>>   information is available</description>
>> </property>
>>
>> <property>
>>   <name>parser.timeout</name>
>>   <value>-1</value>
>>   <description>Timeout in seconds for the parsing of a document; otherwise
>>   it is treated as an exception and the parser moves on to the following
>>   documents. This parameter is applied to any Parser implementation.
>>   Set to -1 to deactivate, bearing in mind that this could cause
>>   the parsing to crash because of a very long or corrupted document.
>>   </description>
>> </property>
>>
>> </configuration>
>>
>> ----
>>
>> Help is greatly appreciated.
>>
>> Kind regards,
>>
>> Issam
>>
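For anyone following along: the debug setting suggested above goes into $NUTCH_HOME/conf/log4j.properties alongside the existing logger definitions (the cmdstdout console appender is defined in the stock Nutch log4j.properties):

```properties
# Log fetcher activity, including the hung-thread stacks added by
# NUTCH-1182, at DEBUG level via the console appender shipped with Nutch.
log4j.logger.org.apache.nutch.fetcher.Fetcher=DEBUG,cmdstdout
```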

