The socket read timeouts used on the recovery path are actually fairly short - we should probably bump them up. Can you file a JIRA issue? It may be a symptom rather than a cause, but given a slow environment, bumping them up makes sense.
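For reference, the timeouts that are already configurable in 4.5 live in the new-style solr.xml: if memory serves, the <solrcloud> section takes distribUpdateConnTimeout/distribUpdateSoTimeout (in ms) for updates forwarded between nodes, and the shard handler factory has its own socketTimeout/connTimeout for internal distributed requests. The read timeout used on the recovery path itself appears to be set in code, which is what the JIRA should cover. Something along these lines, merged into your existing solr.xml - the values are only illustrative for a slow environment (there is also a short note on the ZooKeeper session timeout after the quoted thread below):

<solr>
  <solrcloud>
    <!-- connect/read timeouts (ms) for updates forwarded between nodes -->
    <int name="distribUpdateConnTimeout">60000</int>
    <int name="distribUpdateSoTimeout">300000</int>
  </solrcloud>
  <!-- timeouts (ms) for internal distributed requests -->
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">120000</int>
    <int name="connTimeout">15000</int>
  </shardHandlerFactory>
</solr>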
- Mark

> On Nov 11, 2013, at 8:27 AM, Henrik Ossipoff Hansen
> <h...@entertainment-trading.com> wrote:
>
> The joy was short-lived.
>
> Tonight our environment was “down/slow” a bit longer than usual. It looks
> like two of our nodes never recovered, even though clusterstate says
> everything is active. All nodes are throwing this in the log (the nodes they
> have trouble reaching are the ones that are affected) - the error shows up
> for several cores:
>
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover.
> core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:150)
>   at java.net.SocketInputStream.read(SocketInputStream.java:121)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>   at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
>   ... 4 more
>
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
>
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
>
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen
> (h...@entertainment-trading.com) wrote:
>
> Solr version is 4.5.0.
>
> I have done some tweaking.
> Doubling my ZooKeeper timeout values in zoo.cfg and the ZooKeeper timeout in
> solr.xml seemed to somewhat minimize the problem, but it still did occur. I
> next stopped all larger batch indexing in the period where the issues
> happened, which also seemed to help somewhat. Now the next thing weirds me
> out a bit - I switched from Tomcat 7 to the Jetty that ships with Solr, and
> that actually seems to have fixed the last issues (together with stopping a
> few smaller updates - very few).
>
> During the "slow period" in the night, I get something like this:
>
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>
> After this, the cluster state seems to be fine, and I'm not being spammed
> with errors in the log files.
>
> Bottom line is that the issues seem fixed for now, but I still find it weird
> that Solr was not able to fully recover.
>
> // Henrik Ossipoff
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
>
> Which version of Solr are you using? Regardless of your env, this is a
> fail-safe that you should not hit.
>
> - Mark
>
>> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen
>> <h...@entertainment-trading.com> wrote:
>>
>> I previously made a post on this, but have since narrowed down the issue
>> and am now giving this another try, with another spin to it.
>>
>> We are running a 4-node setup (over Tomcat 7) with a 3-node external
>> ZooKeeper ensemble. This is running on a total of 7 (4+3) different VMs,
>> and each VM uses our storage system (an NFS share in VMware).
>>
>> Now I do realize, and have heard, that NFS is not the greatest thing to run
>> Solr on, but we have never had this issue on non-SolrCloud setups.
>>
>> Basically, each night when we run our backup jobs, our storage becomes a
>> bit slow to respond - this is obviously something we’re trying to solve,
>> but the bottom line is that all our other systems somehow stay alive or
>> recover gracefully when bandwidth exists again.
>> SolrCloud - not so much. Typically after a session like this, 3-5 nodes
>> will either go into a Down state or a Recovering state - and stay that way.
>> Sometimes such a node will even be marked as leader. Such a node will have
>> something like this in the log:
>>
>> ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
>> ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so.
>> Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>>   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>>   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>>   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>>   at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>>   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>>   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:724)
>>
>> On the other nodes, an error similar to this will be in the log:
>>
>> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
>> 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>>
>> Does anyone have any ideas or leads towards a solution - one that doesn’t
>> involve getting a new storage system (a solution we *are* actively working
>> on, but that’s not a quick fix in our case). Shouldn’t a setup like this be
>> possible? And even more so - shouldn’t SolrCloud be able to gracefully
>> recover after issues like this?
>>
>> --
>> Henrik Ossipoff Hansen
>> Developer, Entertainment Trading
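One more note on the ZooKeeper timeout doubling mentioned earlier in the thread: on the Solr side that is zkClientTimeout in solr.xml, e.g. (illustrative value only):

<solr>
  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

Keep in mind that ZooKeeper itself caps the negotiated session timeout at maxSessionTimeout in zoo.cfg (20 x tickTime by default), so raising only the Solr-side value past that cap has no effect.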