Thanks for replying, Markus. If this is happening on the server side, then how does the URL get crawled when I crawl it separately?
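One thing I can try, assuming the server is throttling or dropping concurrent connections from the crawler, is to slow down per-host fetching. A minimal nutch-site.xml sketch (the values below are only guesses to experiment with, not settings taken from my current config):

<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Sketch only: a single thread per host re-enables host blocking,
  so fetcher.server.delay (and any robots.txt Crawl-Delay) applies again.</description>
</property>
<property>
  <name>fetcher.server.delay</name>
  <value>2.0</value>
  <description>Sketch only: assumed value; wait two seconds between successive
  requests to the same host instead of 0.25 seconds.</description>
</property>

If a single manual fetch of the same URL succeeds while the full crawl times out, that would at least suggest the server is reacting to many parallel requests rather than to the URL itself.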
Thanks
Manish Verma
AML Search
+1 669 224 9924

> On Dec 18, 2015, at 4:18 AM, Markus Jelsma <[email protected]> wrote:
>
> Hello Manish - connection timeouts can also occur at the server side, over
> which Nutch has no control. It is also possible that you have been firewalled
> - if the server's iptables decides to drop your packets, a connection timeout
> will occur.
>
> Markus
>
>
>
> -----Original message-----
>> From: Manish Verma <[email protected]>
>> Sent: Friday 18th December 2015 0:14
>> To: [email protected]
>> Subject: SocketTimeoutException
>>
>> Hi,
>>
>> I am using Nutch 1.10 and crawling with the crawl script. Occasionally I get a
>> SocketTimeoutException.
>> Here are a few properties I have overridden in nutch-site.xml, along with the
>> exception stack trace. I set http.timeout to 30000 and still get the same error.
>> When I crawl the same URL separately, it gets crawled.
>>
>> <property>
>>   <name>http.timeout</name>
>>   <value>30000</value>
>>   <description>The default network timeout, in milliseconds.</description>
>> </property>
>> <property>
>>   <name>fetcher.server.delay</name>
>>   <value>.250</value>
>>   <description>The number of seconds the fetcher will delay between
>>   successive requests to the same server. Note that this might get
>>   overridden by a Crawl-Delay from a robots.txt and is used ONLY if
>>   fetcher.threads.per.queue is set to 1.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.threads.fetch</name>
>>   <value>100</value>
>>   <description>The number of FetcherThreads the fetcher should use.
>>   This also determines the maximum number of requests that are
>>   made at once (each FetcherThread handles one connection). The total
>>   number of threads running in distributed mode will be the number of
>>   fetcher threads * number of nodes, as the fetcher has one map task per node.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.threads.per.queue</name>
>>   <value>25</value>
>>   <description>This number is the maximum number of threads that
>>   should be allowed to access a queue at one time. Setting it to
>>   a value > 1 will cause the Crawl-Delay value from robots.txt to
>>   be ignored and the value of fetcher.server.min.delay to be used
>>   as a delay between successive requests to the same server instead
>>   of fetcher.server.delay.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.server.min.delay</name>
>>   <value>1</value>
>>   <description>The minimum number of seconds the fetcher will delay between
>>   successive requests to the same server. This value is applicable ONLY
>>   if fetcher.threads.per.queue is greater than 1 (i.e. host blocking
>>   is turned off).</description>
>> </property>
>> <property>
>>   <name>parser.timeout</name>
>>   <value>-1</value>
>>   <description>Timeout in seconds for the parsing of a document, otherwise
>>   treats it as an exception and moves on to the following documents. This
>>   parameter is applied to any Parser implementation.
>>   Set to -1 to deactivate, bearing in mind that this could cause
>>   the parsing to crash because of a very long or corrupted document.
>>   </description>
>> </property>
>> <property>
>>   <name>fetcher.queue.mode</name>
>>   <value>byHost</value>
>>   <description>Determines how to put URLs into queues. Default value is 'byHost',
>>   also takes 'byDomain' or 'byIP'.
>>   </description>
>> </property>
>> <property>
>>   <name>http.redirect.max</name>
>>   <value>2</value>
>>   <description>The maximum number of redirects the fetcher will follow when
>>   trying to fetch a page. If set to negative or 0, the fetcher won't immediately
>>   follow redirected URLs, instead it will record them for later fetching.
>>   </description>
>> </property>
>> </configuration>
>>
>> 2015-12-17 15:04:55,721 ERROR http.Http - Failed to get protocol output
>> java.net.SocketTimeoutException: Read timed out
>>         at java.net.SocketInputStream.socketRead0(Native Method)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:152)
>>         at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>         at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>         at sun.security.ssl.InputRecord.read(InputRecord.java:480)
>>         at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
>>         at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:891)
>>         at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>         at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
>>         at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
>>         at java.io.FilterInputStream.read(FilterInputStream.java:83)
>>         at java.io.PushbackInputStream.read(PushbackInputStream.java:139)
>>         at org.apache.nutch.protocol.http.HttpResponse.readLine(HttpResponse.java:498)
>>         at org.apache.nutch.protocol.http.HttpResponse.parseStatusLine(HttpResponse.java:415)
>>         at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:216)
>>         at org.apache.nutch.protocol.http.Http.getResponse(Http.java:70)
>>         at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:255)
>>         at org.apache.nutch.fetcher.Fetcher$FetcherThread.run(Fetcher.java:778)
>> 2015-12-17 15:04:55,722 INFO fetcher.Fetcher - fetch of
>> https://itunes.apple.com/us/genre/music-blues/id2?letter=D failed with:
>> java.net.SocketTimeoutException: Read timed out
>>
>> Thanks
>>
>>

