Hello Manish - connection timeouts can also occur on the server side, over which
Nutch has no control. It is also possible that you have been firewalled: if
the server's iptables rules drop your packets, a connection timeout will
occur.
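
If the server is throttling or dropping your requests, a more polite fetcher
configuration may help. Below is a minimal sketch for nutch-site.xml that reuses
the property names from your message. The values are illustrative assumptions,
not tested recommendations:

```xml
<!-- Illustrative values only; tune them for the host being crawled. -->
<property>
  <name>fetcher.threads.per.queue</name>
  <!-- One thread per host queue, so fetcher.server.delay and any
       Crawl-Delay from robots.txt are honored (per the property
       descriptions quoted below). -->
  <value>1</value>
</property>
<property>
  <name>fetcher.server.delay</name>
  <!-- Seconds between successive requests to the same server;
       5.0 is an assumed starting point. -->
  <value>5.0</value>
</property>
```

Whether this removes the timeouts depends on what the remote side is actually
doing, which cannot be seen from the client.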

Markus

 
 
-----Original message-----
> From:Manish Verma <[email protected]>
> Sent: Friday 18th December 2015 0:14
> To: [email protected]
> Subject: SocketTimeoutException
> 
> Hi,
> 
> I am using Nutch 1.10 and crawling with the crawl script. Occasionally I get a 
> SocketTimeoutException.
> Here are a few properties I have overridden in nutch-site.xml, and the exception 
> stack trace. I set http.timeout to 30000, but I still get the same error. When I 
> crawl the same URL separately, it gets crawled.
> 
> <property>
>   <name>http.timeout</name>
>   <value>30000</value>
>   <description>The default network timeout, in milliseconds.</description>
> </property>
> <property>
>   <name>fetcher.server.delay</name>
>   <value>.250</value>
>   <description>The number of seconds the fetcher will delay between 
>    successive requests to the same server. Note that this might get
>    overridden by a Crawl-Delay from a robots.txt and is used ONLY if 
>    fetcher.threads.per.queue is set to 1.
>    </description>
> </property>
> <property>
>   <name>fetcher.threads.fetch</name>
>   <value>100</value>
>   <description>The number of FetcherThreads the fetcher should use.
>   This also determines the maximum number of requests that are
>   made at once (each FetcherThread handles one connection). The total
>   number of threads running in distributed mode will be the number of
>   fetcher threads * number of nodes as fetcher has one map task per node.
>   </description>
> </property>
> <property>
>   <name>fetcher.threads.per.queue</name>
>   <value>25</value>
>   <description>This number is the maximum number of threads that
>     should be allowed to access a queue at one time. Setting it to 
>     a value > 1 will cause the Crawl-Delay value from robots.txt to
>     be ignored and the value of fetcher.server.min.delay to be used
>     as a delay between successive requests to the same server instead 
>     of fetcher.server.delay.
>    </description>
> </property>
> <property>
>   <name>fetcher.server.min.delay</name>
>   <value>1</value>
>   <description>The minimum number of seconds the fetcher will delay between 
>   successive requests to the same server. This value is applicable ONLY
>   if fetcher.threads.per.queue is greater than 1 (i.e. the host blocking
>   is turned off).</description>
> </property>
> <property>
>   <name>parser.timeout</name>
>   <value>-1</value>
>   <description>Timeout in seconds for the parsing of a document, otherwise
>   treats it as an exception and moves on to the following documents. This
>   parameter is applied to any Parser implementation.
>   Set to -1 to deactivate, bearing in mind that this could cause
>   the parsing to crash because of a very long or corrupted document.
>   </description>
> </property>
> <property>
>   <name>fetcher.queue.mode</name>
>   <value>byHost</value>
>   <description>Determines how to put URLs into queues. Default value is
>   'byHost', also takes 'byDomain' or 'byIP'.
>   </description>
> </property>
> <property>
>   <name>http.redirect.max</name>
>   <value>2</value>
>   <description>The maximum number of redirects the fetcher will follow when
>   trying to fetch a page. If set to negative or 0, fetcher won't immediately
>   follow redirected URLs, instead it will record them for later fetching.
>   </description>
> </property>
> </configuration>
> 
> 2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
> java.net.SocketTimeoutException: Read timed out
>       at java.net.SocketInputStream.socketRead0(Native Method)
>       at java.net.SocketInputStream.read(SocketInputStream.java:152)
>       at java.net.SocketInputStream.read(SocketInputStream.java:122)
>       at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>       at sun.security.ssl.InputRecord.read(InputRecord.java:480)
>       at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
>       at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
>       at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>       at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>       at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>       at java.io.FilterInputStream.read(FilterInputStream.java:83)
>       at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
>       at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
>       at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
>       at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
>       at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
>       at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
>       at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
> 2015-12-17 15:04:55,722 INFO  fetcher.Fetcher - fetch of https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with: java.net.SocketTimeoutException: Read timed out
> 
> Thanks
> 
> 
> 
