Hello Manish - connection timeouts can also occur on the server side, over which Nutch has no control. It is also possible that you have been firewalled: if the server's iptables decides to drop your packets, a connection timeout will occur.
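If the server is throttling or dropping requests, one thing that may be worth trying is a less aggressive per-host configuration. Below is a minimal sketch for nutch-site.xml that uses only properties already present in your configuration. The values are illustrative assumptions, not tested recommendations. Going by the property descriptions you quoted, setting fetcher.threads.per.queue to 1 means fetcher.server.delay (and any Crawl-Delay from robots.txt) is applied instead of fetcher.server.min.delay.

```xml
<!-- Illustrative values only (assumptions); tune them for the target site. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- One thread per host queue, so requests to a host are serialized. -->
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- Seconds to wait between successive requests to the same server. -->
  <value>5.0</value>
</property>
```

If the timeouts stop with settings like these, the server was probably reacting to the request rate rather than to a network problem on your side.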
Markus

-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Friday 18th December 2015 0:14
> To: [email protected]
> Subject: SocketTimeoutException
>
> Hi,
>
> I am using nutch 1.10 and crawling using crawl script. Occasionally I get
> SocketTimeoutException.
> Here are few properties I have overridden using notch-site.xml and exception
> stack trace. I set http.timeout to 30000 , still getting same. When crawling
> same url separately it get crawled.
>
> <property>
> <name>http.timeout</name>
> <value>30000</value>
> <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
> <name>fetcher.server.delay</name>
> <value>.250</value>
> <description>The number of seconds the fetcher will delay between
> successive requests to the same server. Note that this might get
> overriden by a Crawl-Delay from a robots.txt and is used ONLY if
> fetcher.threads.per.queue is set to 1.
> </description>
> </property>
> <property>
> <name>fetcher.threads.fetch</name>
> <value>100</value>
> <description>The number of FetcherThreads the fetcher should use.
> This is also determines the maximum number of requests that are
> made at once (each FetcherThread handles one connection). The total
> number of threads running in distributed mode will be the number of
> fetcher threads * number of nodes as fetcher has one map task per node.
> </description>
> </property>
> <property>
> <name>fetcher.threads.per.queue</name>
> <value>25</value>
> <description>This number is the maximum number of threads that
> should be allowed to access a queue at one time. Setting it to
> a value > 1 will cause the Crawl-Delay value from robots.txt to
> be ignored and the value of fetcher.server.min.delay to be used
> as a delay between successive requests to the same server instead
> of fetcher.server.delay.
> </description>
> </property>
> <property>
> <name>fetcher.server.min.delay</name>
> <value>1</value>
> <description>The minimum number of seconds the fetcher will delay between
> successive requests to the same server. This value is applicable ONLY
> if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
> is turned off).</description>
> </property>
> <property>
> <name>parser.timeout</name>
> <value>-1</value>
> <description>Timeout in seconds for the parsing of a document, otherwise
> treats it as an exception and
> moves on the the following documents. This parameter is applied to any
> Parser implementation.
> Set to -1 to deactivate, bearing in mind that this could cause
> the parsing to crash because of a very long or corrupted document.
> </description>
> </property>
> <property>
> <name>fetcher.queue.mode</name>
> <value>byHost</value>
> <description>Determines how to put URLs into queues. Default value is
> 'byHost',
> also takes 'byDomain' or 'byIP'.
> </description>
> </property>
> <property>
> <name>http.redirect.max</name>
> <value>2</value>
> <description>The maximum number of redirects the fetcher will follow when
> trying to fetch a page. If set to negative or 0, fetcher won't immediately
> follow redirected URLs, instead it will record them for later fetching.
> </description>
> </property>
> <property>
> <name>fetcher.queue.mode</name>
> <value>byHost</value>
> <description>Determines how to put URLs into queues. Default value is
> 'byHost',
> also takes 'byDomain' or 'byIP'.
> </description>
> </property>
> </configuration>
>
> 2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
> at java.net.SocketInputStream.socketRead0(Native Method)
> at java.net.SocketInputStream.read(SocketInputStream.java:152)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> at sun.security.ssl.InputRecord.read(InputRecord.java:480)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
> at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
> at java.io.FilterInputStream.read(FilterInputStream.java:83)
> at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
> at
> org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
> at
> org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
> at
> org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
> at
> org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
> at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
> 2015-12-17 15:04:55,722 INFO fetcher.Fetcher - fetch of
> https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with:
> java.net.SocketTimeoutException: Read timed out
>
> Thanks
>
>
>

