Hi, I sometimes hit this exception too; recovery goes into a loop, and the only way I can finish it is to restart the replica that has the stuck core.
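(As an aside, and untested in this exact situation: instead of restarting the whole node, it may be enough to ask only the stuck core to re-enter recovery through the CoreAdmin API. A minimal sketch - the host, port, and core name are placeholders taken from the logs further down, so point it at the node that owns the stuck replica:

  # Ask a single core to re-run recovery (CoreAdmin REQUESTRECOVERY action)
  curl "http://solr04.cd-et.com:8080/solr/admin/cores?action=REQUESTRECOVERY&core=products_se_shard1_replica2"

If the core is wedged badly enough that this has no effect, restarting the node as described above is still the fallback.)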
In my case I have SSDs, but the replicas are 40 or 50 GB. If I have 3 replicas in recovery mode and they are replicating from the same node, I get this error. My indexing rate is high too (~500 docs/s).

--
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Monday, November 11, 2013 at 8:27 AM, Henrik Ossipoff Hansen wrote:

> The joy was short-lived.
>
> Tonight our environment was "down/slow" a bit longer than usual. It looks like two of our nodes never recovered, although clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the ones that are affected) - the error occurs for several cores:
>
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:150)
>   at java.net.SocketInputStream.read(SocketInputStream.java:121)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>   at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
>   ... 4 more
>
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (h...@entertainment-trading.com) wrote:
>
> Solr version is 4.5.0.
>
> I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still occurred. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from Tomcat7 to the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).
>
> During the "slow period" in the night, I get something like this:
>
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>
> After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.
>
> Bottom line is that the issues seem to be fixed for now, but I still find it weird that Solr was not able to fully recover.
>
> // Henrik Ossipoff
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
>
> Which version of Solr are you using? Regardless of your env, this is a fail safe that you should not hit.
>
> - Mark
>
> > On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <h...@entertainment-trading.com> wrote:
> >
> > I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
> >
> > We are running a 4-node setup (on Tomcat7) with an external 3-node ZooKeeper ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our storage system (NFS share in VMware).
> >
> > Now I do realize, and have heard, that NFS is not the greatest way to run Solr, but we have never had this issue on non-SolrCloud setups.
> >
> > Basically, each night when we run our backup jobs, our storage becomes a bit slow to respond - this is obviously something we're trying to solve - but the bottom line is that all our other systems somehow stay alive or recover gracefully once bandwidth is available again. SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such a node will even be marked as leader.
> > Such a node will have something like this in the log:
> >
> > ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
> > ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
> >   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
> >   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> >   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> >   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> >   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> >   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> >   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
> >   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
> >   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
> >   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
> >   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
> >   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> >   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
> >   at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
> >   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
> >   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >   at java.lang.Thread.run(Thread.java:724)
> >
> > On the other nodes, an error similar to this will be in the log:
> >
> > 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
> > 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
> >
> > Does anyone have any ideas or leads towards a solution - one that doesn't involve getting a new storage system (a solution we *are* actively working on, but that's not a quick fix in our case)? Shouldn't a setup like this be possible? And even more so - shouldn't SolrCloud be able to gracefully recover after issues like this?
> >
> > --
> > Henrik Ossipoff Hansen
> > Developer, Entertainment Trading
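For anyone hitting the same thing: the timeout tweaks Henrik mentions above (doubling the ZooKeeper timeouts in zoo.cfg and the ZooKeeper timeout in solr.xml) look roughly like the following sketch. The values are only illustrative, not the ones from his setup, and the solr.xml snippet assumes the new-style solr.xml format introduced in Solr 4.4:

  # zoo.cfg - ZooKeeper caps the session timeout a client may request at
  # maxSessionTimeout (default 20 * tickTime), so raise that ceiling if Solr's
  # zkClientTimeout is supposed to go above ~40 seconds.
  tickTime=2000
  maxSessionTimeout=60000

  <!-- solr.xml - zkClientTimeout is the session timeout Solr requests from
       ZooKeeper; the distribUpdate* settings set the connect/read timeouts
       for inter-node update forwarding (whether recovery's prepRecovery call
       honours them depends on the Solr version, so treat that as an assumption). -->
  <solr>
    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8080}</int>
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
      <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    </solrcloud>
  </solr>

None of this removes the underlying problem (slow NFS during backups starving both ZooKeeper sessions and recovery requests); it only gives the nodes more headroom to ride it out.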