Hi all,
I am running the crawl command with depth 2, using a seed file containing
120 URLs (about 2000 documents). Halfway through, the following output is
logged:
*-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
fetchQueues.getQueueCount=24*
And after a while:
*Aborting with 50 hung threads.*
I am trying exactly the same thing using Nutch 1.8, and it is *working fine*.
Please note that I am not applying any customisations apart from the
following nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>MyAgent</value>
</property>
<property>
<name>http.robots.agents</name>
<value>MyAgent,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<!-- HTTP properties -->
<property>
<name>http.redirect.max</name>
<value>2</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, the fetcher won't
immediately follow redirected URLs; instead it will record them for later
fetching.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<!-- web db properties -->
<!-- fetcher properties -->
<property>
<name>fetcher.server.delay</name>
<value>4.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes, as the fetcher has one map task
per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>10</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.
</description>
</property>
<!-- plugin properties -->
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
<!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use
HTTPS please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
<!-- parser properties -->
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available.</description>
</property>
<property>
<name>parser.timeout</name>
<value>-1</value>
<description>Timeout in seconds for the parsing of a document; if
exceeded, the parse is treated as an exception and the parser moves
on to the following documents. This parameter is applied to any
Parser implementation. Set to -1 to deactivate, bearing in mind that
this could cause the parsing to crash because of a very long or
corrupted document.
</description>
</property>
</configuration>
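For completeness, here is a minimal sketch of how the crawl is set up and launched. The seed URL, directory names, and -topN value are placeholders of mine, not the actual values from my run, and the launch command assumes the classic bin/nutch crawl entry point:

```shell
# Placeholder setup: seed directory and seed URL are examples only.
mkdir -p urls
printf 'http://example.com/\n' > urls/seed.txt

# The crawl itself (requires a Nutch 1.x installation; shown as a
# comment so the sketch stays self-contained):
# bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

# Sanity-check the seed file:
cat urls/seed.txt
```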
----
Help is greatly appreciated.
Kind regards,
Issam