Hi,
I am using Nutch 1.10 and crawling with the crawl script. Occasionally I get a
SocketTimeoutException.
Below are the properties I have overridden in nutch-site.xml, followed by the
exception stack trace. I set http.timeout to 30000, but I still get the same
exception. When I crawl the same URL separately, it is fetched successfully.
<configuration>
<property>
<name>http.timeout</name>
<value>30000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>.250</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server. Note that this might get
overridden by a Crawl-Delay from a robots.txt and is used ONLY if
fetcher.threads.per.queue is set to 1.
</description>
</property>
<property>
<name>fetcher.threads.fetch</name>
<value>100</value>
<description>The number of FetcherThreads the fetcher should use.
This also determines the maximum number of requests that are
made at once (each FetcherThread handles one connection). The total
number of threads running in distributed mode will be the number of
fetcher threads * number of nodes as fetcher has one map task per node.
</description>
</property>
<property>
<name>fetcher.threads.per.queue</name>
<value>25</value>
<description>This number is the maximum number of threads that
should be allowed to access a queue at one time. Setting it to
a value > 1 will cause the Crawl-Delay value from robots.txt to
be ignored and the value of fetcher.server.min.delay to be used
as a delay between successive requests to the same server instead
of fetcher.server.delay.
</description>
</property>
<property>
<name>fetcher.server.min.delay</name>
<value>1</value>
<description>The minimum number of seconds the fetcher will delay between
successive requests to the same server. This value is applicable ONLY
if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
is turned off).</description>
</property>
<property>
<name>parser.timeout</name>
<value>-1</value>
<description>Timeout in seconds for the parsing of a document, otherwise
treats it as an exception and moves on to the following documents.
This parameter is applied to any Parser implementation.
Set to -1 to deactivate, bearing in mind that this could cause
the parsing to crash because of a very long or corrupted document.
</description>
</property>
<property>
<name>fetcher.queue.mode</name>
<value>byHost</value>
<description>Determines how to put URLs into queues. Default value is
'byHost', also takes 'byDomain' or 'byIP'.
</description>
</property>
<property>
<name>http.redirect.max</name>
<value>2</value>
<description>The maximum number of redirects the fetcher will follow when
trying to fetch a page. If set to negative or 0, fetcher won't immediately
follow redirected URLs, instead it will record them for later fetching.
</description>
</property>
</configuration>
2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.read(InputRecord.java:480)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
2015-12-17 15:04:55,722 INFO fetcher.Fetcher - fetch of https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with: java.net.SocketTimeoutException: Read timed out
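In case it helps narrow this down, here is a minimal sketch of how I understand the error at the socket level. This is not Nutch code. The class name ReadTimeoutDemo and the method readTimesOut are made up for illustration, and I am assuming (I have not verified this in the Nutch source) that http.timeout ends up as the socket read timeout. The trace above shows the read blocking in parseStatusLine, so the connection was open but no response line arrived in time. The sketch reproduces that situation against a local server that accepts the connection and then stays silent:

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutDemo {

    /**
     * Connects to a local server socket that never sends any data and
     * attempts a single read with the given timeout.
     * Returns true if the read fails with SocketTimeoutException.
     */
    public static boolean readTimesOut(int timeoutMillis) throws IOException {
        // Port 0 lets the OS choose a free port; the server is bound to loopback only.
        // The server never calls accept(), but the TCP handshake still completes
        // in the listen backlog, so the client connects and then waits for data.
        try (ServerSocket silentServer =
                 new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket client =
                 new Socket(silentServer.getInetAddress(), silentServer.getLocalPort())) {

            // Read timeout in milliseconds (the role I assume http.timeout plays).
            client.setSoTimeout(timeoutMillis);

            InputStream in = client.getInputStream();
            try {
                in.read();      // blocks until data arrives or the timeout expires
                return false;   // data or end-of-stream arrived: no timeout
            } catch (SocketTimeoutException e) {
                return true;    // same exception type as in the Nutch log
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("timed out: " + readTimesOut(500));
    }
}
```

If that assumption is right, raising http.timeout only lengthens the wait, and the exception would still occur whenever the server holds a connection open without answering within that time.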
Thanks