Thanks for replying, Markus. If this is happening on the server side, then how does the URL get crawled when I crawl it separately?
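One thing I can try, assuming the server is throttling or dropping concurrent connections from the crawler, is to slow down per-host fetching. A minimal nutch-site.xml sketch (the values below are only guesses to experiment with, not settings taken from my current config):

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Sketch only: a single thread per host re-enables host blocking,
  so fetcher.server.delay (and any robots.txt Crawl-Delay) applies again.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
  <description>Sketch only: assumed value; wait two seconds between successive
  requests to the same host instead of 0.25 seconds.</description>
</property>

If a single manual fetch of the same URL succeeds while the full crawl times out, that would at least suggest the server is reacting to many parallel requests rather than to the URL itself.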
Thanks
Manish Verma
AML Search
+1 669 224 9924

> On Dec 18, 2015, at 4:18 AM, Markus Jelsma <[email protected]> wrote:
>
> Hello Manish - connection timeouts can also occur at the server side, over
> which Nutch has no control. It is also possible that you have been firewalled
> - if the server's iptables decides to drop your packets, a connection timeout
> will occur.
>
> Markus
>
>
>
> -----Original message-----
>> From: Manish Verma <[email protected]>
>> Sent: Friday 18th December 2015 0:14
>> To: [email protected]
>> Subject: SocketTimeoutException
>>
>> Hi,
>>
>> I am using Nutch 1.10 and crawling with the crawl script. Occasionally I get a
>> SocketTimeoutException.
>> Here are a few properties I have overridden in nutch-site.xml, along with the
>> exception stack trace. I set http.timeout to 30000 and still get the same error.
>> When I crawl the same URL separately, it gets crawled.
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>30000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>.250</value>
>>   <description>The number of seconds the fetcher will delay between
>>   successive requests to the same server. Note that this might get
>>   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
>>   fetcher.threads.per.queue is set to 1.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.threads.fetch</name>
>>   <value>100</value>
>>   <description>The number of FetcherThreads the fetcher should use.
>>   This also determines the maximum number of requests that are
>>   made at once (each FetcherThread handles one connection). The total
>>   number of threads running in distributed mode will be the number of
>>   fetcher threads * number of nodes, as the fetcher has one map task per node.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.threads.per.queue</name>
>>   <value>25</value>
>>   <description>This number is the maximum number of threads that
>>   should be allowed to access a queue at one time. Setting it to
>>   a value > 1 will cause the Crawl-Delay value from robots.txt to
>>   be ignored and the value of fetcher.server.min.delay to be used
>>   as a delay between successive requests to the same server instead
>>   of fetcher.server.delay.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.server.min.delay</name>
>>   <value>1</value>
>>   <description>The minimum number of seconds the fetcher will delay between
>>   successive requests to the same server. This value is applicable ONLY
>>   if fetcher.threads.per.queue is greater than 1 (i.e. host blocking
>>   is turned off).</description>
>> </property>
>> <property>
>>   <name>parser.timeout</name>
>>   <value>-1</value>
>>   <description>Timeout in seconds for the parsing of a document, otherwise
>>   treats it as an exception and moves on to the following documents. This
>>   parameter is applied to any Parser implementation.
>>   Set to -1 to deactivate, bearing in mind that this could cause
>>   the parsing to crash because of a very long or corrupted document.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.queue.mode</name>
>>   <value>byHost</value>
>>   <description>Determines how to put URLs into queues. Default value is 'byHost',
>>   also takes 'byDomain' or 'byIP'.
>>   </description>
>> </property>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow when
>>   trying to fetch a page. If set to negative or 0, the fetcher won't immediately
>>   follow redirected URLs, instead it will record them for later fetching.
>>   </description>
>> </property>
>> </configuration>
>>
>> 2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>         at sun.security.ssl.InputRecord.read(InputRecord.java:480)
>>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
>>         at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
>>         at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>         at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
>>         at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
>>         at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
>>         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
>>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
>>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
>>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
>> 2015-12-17 15:04:55,722 INFO fetcher.Fetcher - fetch of
>> https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with:
>> java.net.SocketTimeoutException: Read timed out
>>
>> Thanks
>>
>>

