I think you're looking at https://issues.apache.org/jira/browse/NUTCH-1182.
Logging of hung threads was added in 1.9, so the hang probably occurs in 1.8 as
well, just without being logged.
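
If it is the same hang just surfacing in the log now, one knob worth checking: the fetcher declares threads hung after mapred.task.timeout / fetcher.threads.timeout.divisor milliseconds without progress, so slow servers combined with your 4-second per-server delay can trip it. A hedged nutch-site.xml sketch for giving the fetcher more headroom (the values are illustrative, not recommendations):

```xml
<!-- Illustrative values only: the hung-thread check fires after
     mapred.task.timeout / fetcher.threads.timeout.divisor ms
     without fetcher progress. Raising the timeout (here 30 min)
     delays the "Aborting with N hung threads" abort. -->
<property>
  <name>mapred.task.timeout</name>
  <value>1800000</value>
</property>
<property>
  <name>fetcher.threads.timeout.divisor</name>
  <value>2</value>
</property>
```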
Markus

 
 
-----Original message-----
> From:Issam Maamria <[email protected]>
> Sent: Friday 28th November 2014 15:37
> To: [email protected]
> Cc: Mourad K <[email protected]>; Vitaly Savicks <[email protected]>
> Subject: Nutch 1.9 Fetchers Hung
> 
> Hi all,
> 
> I am running the crawl command with depth 2 using a seed file containing
> 120 URLs (about 2000 documents). Halfway through, the following output is
> logged:
> 
> *-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
> fetchQueues.getQueueCount=24*
> 
> And after a while:
> 
> *Aborting with 50 hung threads.*
> 
> I am trying exactly the same thing using 1.8, and it is *working fine*.
> Please note that I am not applying any customisations apart from the
> following nutch-site.xml:
> 
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> 
> <!-- Put site-specific property overrides in this file. -->
> 
> <configuration>
> 
> <property>
>   <name>http.agent.name</name>
>   <value>MyAgent</value>
> </property>
> 
> <property>
>   <name>http.robots.agents</name>
>   <value>MyAgent,*</value>
>   <description>The agent strings we'll look for in robots.txt files,
>   comma-separated, in decreasing order of precedence. You should
>   put the value of http.agent.name as the first agent name, and keep the
>   default * at the end of the list. E.g.: BlurflDev,Blurfl,*
>   </description>
> </property>
> 
> <!-- HTTP properties -->
> 
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, the fetcher won't
>   immediately follow redirected URLs; instead it will record them for later
>   fetching.
>   </description>
> </property>
> 
> <property>
>   <name>http.content.limit</name>
>   <value>-1</value>
>   <description>The length limit for downloaded content using the http://
>   protocol, in bytes. If this value is nonnegative (>=0), content longer
>   than it will be truncated; otherwise, no truncation at all. Do not
>   confuse this setting with the file.content.limit setting.
>   </description>
> </property>
> 
> <!-- web db properties -->
> 
> <!-- fetcher properties -->
> 
> <property>
>   <name>fetcher.server.delay</name>
>   <value>4.0</value>
>   <description>The number of seconds the fetcher will delay between
>   successive requests to the same server.</description>
> </property>
> 
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>20</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection). The total
>   number of threads running in distributed mode will be the number of
>   fetcher threads * number of nodes, as the fetcher has one map task per
>   node.
>   </description>
> </property>
> 
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>10</value>
>   <description>This number is the maximum number of threads that
>   should be allowed to access a queue at one time.
>   </description>
> </property>
> 
> <!-- plugin properties -->
> 
> <property>
>   <name>plugin.includes</name>
>   <value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
>   <!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
>   <description>Regular expression naming plugin directory names to
>   include. Any plugin not matching this expression is excluded.
>   In any case you need at least to include the nutch-extensionpoints plugin.
>   By default Nutch includes crawling just HTML and plain text via HTTP,
>   and basic indexing and search plugins. In order to use HTTPS please enable
>   protocol-httpclient, but be aware of possible intermittent problems with
>   the underlying commons-httpclient library.
>   </description>
> </property>
> 
> <!-- parser properties -->
> 
> <property>
>   <name>parser.character.encoding.default</name>
>   <value>utf-8</value>
>   <description>The character encoding to fall back to when no other
>   information is available</description>
> </property>
> 
> <property>
>   <name>parser.timeout</name>
>   <value>-1</value>
>   <description>Timeout in seconds for the parsing of a document; on timeout
>   the parser treats it as an exception and moves on to the following
>   documents. This parameter is applied to any Parser implementation.
>   Set to -1 to deactivate, bearing in mind that this could cause
>   the parsing to crash because of a very long or corrupted document.
>   </description>
> </property>
> 
> </configuration>
> 
> ----
> 
> Help is greatly appreciated.
> 
> Kind regards,
> 
> Issam
> 
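
For anyone hitting the same symptom, a minimal sketch (the log path is an assumption; adjust it to your installation) for pulling the fetcher-hang evidence out of the crawl log:

```shell
# Sketch: scan a Nutch crawl log for fetcher-hang symptoms.
# LOG is an assumed default path; override it for your setup.
LOG="${LOG:-logs/hadoop.log}"

if [ -f "$LOG" ]; then
  # Status lines where spinWaiting equals activeThreads mean every fetcher
  # thread is idling on politeness delays; "hung threads" marks the abort.
  grep -E 'spinWaiting=|hung threads' "$LOG" | tail -n 20
fi
```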
