Hi all,
I am running the crawl command with depth 2, using a seed file containing
120 URLs (about 2000 documents). Halfway through, the following output is
logged:
*-activeThreads=50, spinWaiting=50, fetchQueues.totalSize=51,
fetchQueues.getQueueCount=24*
And after a while:
*Aborting with 50 hung threads.*
I am trying exactly the same thing using Nutch 1.8, and it is *working fine*.
Please note that I am not applying any customisations apart from the
following nutch-site.xml:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>MyAgent</value>
</property>
<property>
<name>http.robots.agents</name>
<value>MyAgent,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<!-- HTTP properties -->
<property>
<name>http.redirect.max</name>
<value>2</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, the fetcher won't
immediately follow redirected URLs; instead it will record them for later
fetching.
</description>
</property>
<property>
<name>http.content.limit</name>
<value>-1</value>
<description>The length limit for downloaded content using the http://
protocol, in bytes. If this value is nonnegative (>=0), content longer
than it will be truncated; otherwise, no truncation at all. Do not
confuse this setting with the file.content.limit setting.
</description>
</property>
<!-- web db properties -->
<!-- fetcher properties -->
<property>
<name>fetcher.server.delay</name>
<value>4.0</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>20</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes, as the fetcher has one map task
per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>10</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time.
</description>
</property>
<!-- plugin properties -->
<property>
<name>plugin.includes</name>
<value>protocol-http|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)</value>
<!-- <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value> -->
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need to at least include the nutch-extensionpoints
plugin. By default Nutch includes crawling just HTML and plain text
via HTTP, and basic indexing and search plugins. In order to use
HTTPS please enable protocol-httpclient, but be aware of possible
intermittent problems with the underlying commons-httpclient library.
</description>
</property>
<!-- parser properties -->
<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other
information is available.</description>
</property>
<property>
<name>parser.timeout</name>
<value>-1</value>
<description>Timeout in seconds for the parsing of a document; if
exceeded, the parse is treated as an exception and the parser moves
on to the following documents. This parameter is applied to any
Parser implementation. Set to -1 to deactivate, bearing in mind that
this could cause the parsing to crash because of a very long or
corrupted document.
</description>
</property>
</configuration>
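For completeness, here is a minimal sketch of how the crawl is set up and launched. The seed URL, directory names, and -topN value are placeholders of mine, not the actual values from my run, and the launch command assumes the classic bin/nutch crawl entry point:

```shell
# Placeholder setup: seed directory and seed URL are examples only.
mkdir -p urls
printf 'http://example.com/\n' > urls/seed.txt

# The crawl itself (requires a Nutch 1.x installation; shown as a
# comment so the sketch stays self-contained):
# bin/nutch crawl urls -dir crawl -depth 2 -topN 1000

# Sanity-check the seed file:
cat urls/seed.txt
```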
----
Help is greatly appreciated.
Kind regards,
Issam