Hi,

I am using Nutch 1.10 and crawling with the crawl script. Occasionally I get a
SocketTimeoutException.
Below are the properties I have overridden in nutch-site.xml, followed by the
exception stack trace. I set http.timeout to 30000, but I still get the same
exception. When I crawl the same URL separately, it is fetched successfully.

<configuration>
<property>
  <name>http.timeout</name>
  <value>30000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>.250</value>
  <description>The number of seconds the fetcher will delay between 
   successive requests to the same server. Note that this might get
   overridden by a Crawl-Delay from a robots.txt and is used ONLY if 
   fetcher.threads.per.queue is set to 1.
   </description>
</property>
<property>
  <name>fetcher.threads.fetch</name>
  <value>100</value>
  <description>The number of FetcherThreads the fetcher should use.
  This also determines the maximum number of requests that are
  made at once (each FetcherThread handles one connection). The total
  number of threads running in distributed mode will be the number of
  fetcher threads * number of nodes as fetcher has one map task per node.
  </description>
</property>
<property>
  <name>fetcher.threads.per.queue</name>
  <value>25</value>
  <description>This number is the maximum number of threads that
    should be allowed to access a queue at one time. Setting it to 
    a value > 1 will cause the Crawl-Delay value from robots.txt to
    be ignored and the value of fetcher.server.min.delay to be used
    as a delay between successive requests to the same server instead 
    of fetcher.server.delay.
   </description>
</property>
<property>
  <name>fetcher.server.min.delay</name>
  <value>1</value>
  <description>The minimum number of seconds the fetcher will delay between 
  successive requests to the same server. This value is applicable ONLY
  if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
  is turned off).</description>
</property>
<property>
  <name>parser.timeout</name>
  <value>-1</value>
  <description>Timeout in seconds for the parsing of a document, otherwise
  treats it as an exception and moves on to the following documents. This
  parameter is applied to any Parser implementation.
  Set to -1 to deactivate, bearing in mind that this could cause
  the parsing to crash because of a very long or corrupted document.
  </description>
</property>
<property>
  <name>fetcher.queue.mode</name>
  <value>byHost</value>
  <description>Determines how to put URLs into queues. Default value is
  'byHost', also takes 'byDomain' or 'byIP'.
  </description>
</property>
<property>
  <name>http.redirect.max</name>
  <value>2</value>
  <description>The maximum number of redirects the fetcher will follow when
  trying to fetch a page. If set to negative or 0, fetcher won't immediately
  follow redirected URLs, instead it will record them for later fetching.
  </description>
</property>
</configuration>
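
One variant I could try next is sketched below. It assumes the timeouts come from the remote server throttling many parallel connections to a single host, which I have not confirmed. The values are hypothetical and only use properties already listed above: with fetcher.threads.per.queue set to 1, fetcher.server.delay (and any Crawl-Delay from robots.txt) applies again, as the descriptions above state.

```xml
<!-- Hypothetical, more conservative per-host settings (values are assumptions) -->
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- one thread per host queue, so host blocking is turned back on -->
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- seconds between successive requests to the same server -->
  <value>1.0</value>
</property>
```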

2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.read(SocketInputStream.java:152)
        at java.net.SocketInputStream.read(SocketInputStream.java:122)
        at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
        at sun.security.ssl.InputRecord.read(InputRecord.java:480)
        at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
        at java.io.FilterInputStream.read(FilterInputStream.java:83)
        at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
        at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
        at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
        at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
        at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
        at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
        at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
2015-12-17 15:04:55,722 INFO  fetcher.Fetcher - fetch of https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with: java.net.SocketTimeoutException: Read timed out

Thanks

