Hi, I sometimes hit this exception too; recovery goes into a loop, and the only way I can finish it is to restart the replica that has the stuck core.
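(As an aside, and untested in this exact situation: instead of restarting the whole node, it may be enough to ask only the stuck core to re-enter recovery through the CoreAdmin API. A minimal sketch - the host, port, and core name are placeholders taken from the logs further down, so point it at the node that owns the stuck replica:

  # Ask a single core to re-run recovery (CoreAdmin REQUESTRECOVERY action)
  curl "http://solr04.cd-et.com:8080/solr/admin/cores?action=REQUESTRECOVERY&core=products_se_shard1_replica2"

If the core is wedged badly enough that this has no effect, restarting the node as described above is still the fallback.)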
In my case I have SSDs, but the replicas are 40 or 50 GB. If I have 3 replicas in recovery mode and they are replicating from the same node, I get this error. My indexing rate is high too (~500 docs/s).

--
Yago Riveiro
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)

On Monday, November 11, 2013 at 8:27 AM, Henrik Ossipoff Hansen wrote:

> The joy was short-lived.
>
> Tonight our environment was "down/slow" a bit longer than usual. It looks like two of our nodes never recovered, although clusterstate says everything is active. All nodes are throwing this in the log (the nodes they have trouble reaching are the ones that are affected) - the error occurs for several cores:
>
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:150)
>   at java.net.SocketInputStream.read(SocketInputStream.java:121)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>   at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
>   ... 4 more
>
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen (h...@entertainment-trading.com) wrote:
>
> Solr version is 4.5.0.
>
> I have done some tweaking. Doubling my Zookeeper timeout values in zoo.cfg and the Zookeeper timeout in solr.xml seemed to somewhat minimize the problem, but it still occurred. I next stopped all larger batch indexing in the period where the issues happened, which also seemed to help somewhat. Now the next thing weirds me out a bit - I switched from Tomcat7 to the Jetty that ships with Solr, and that actually seems to have fixed the last issues (together with stopping a few smaller updates - very few).
>
> During the "slow period" in the night, I get something like this:
>
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>
> After this, the cluster state seems to be fine, and I'm not being spammed with errors in the log files.
>
> Bottom line is that the issues seem to be fixed for now, but I still find it weird that Solr was not able to fully recover.
>
> // Henrik Ossipoff
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
>
> Which version of Solr are you using? Regardless of your env, this is a fail safe that you should not hit.
>
> - Mark
>
> > On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen <h...@entertainment-trading.com> wrote:
> >
> > I previously made a post on this, but have since narrowed down the issue and am now giving this another try, with another spin to it.
> >
> > We are running a 4-node setup (on Tomcat7) with an external 3-node ZooKeeper ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our storage system (NFS share in VMware).
> >
> > Now I do realize, and have heard, that NFS is not the greatest way to run Solr, but we have never had this issue on non-SolrCloud setups.
> >
> > Basically, each night when we run our backup jobs, our storage becomes a bit slow to respond - this is obviously something we're trying to solve - but the bottom line is that all our other systems somehow stay alive or recover gracefully once bandwidth is available again. SolrCloud - not so much. Typically after a session like this, 3-5 nodes will either go into a Down state or a Recovering state - and stay that way. Sometimes such a node will even be marked as leader.
> > Such a node will have something like this in the log:
> >
> > ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
> > ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
> >   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
> >   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
> >   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
> >   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
> >   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
> >   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
> >   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
> >   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
> >   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
> >   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
> >   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
> >   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
> >   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
> >   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
> >   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
> >   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
> >   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
> >   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
> >   at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
> >   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
> >   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
> >   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> >   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >   at java.lang.Thread.run(Thread.java:724)
> >
> > On the other nodes, an error similar to this will be in the log:
> >
> > 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
> > 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
> >
> > Does anyone have any ideas or leads towards a solution - one that doesn't involve getting a new storage system (a solution we *are* actively working on, but that's not a quick fix in our case)? Shouldn't a setup like this be possible? And even more so - shouldn't SolrCloud be able to gracefully recover after issues like this?
> >
> > --
> > Henrik Ossipoff Hansen
> > Developer, Entertainment Trading
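For anyone hitting the same thing: the timeout tweaks Henrik mentions above (doubling the ZooKeeper timeouts in zoo.cfg and the ZooKeeper timeout in solr.xml) look roughly like the following sketch. The values are only illustrative, not the ones from his setup, and the solr.xml snippet assumes the new-style solr.xml format introduced in Solr 4.4:

  # zoo.cfg - ZooKeeper caps the session timeout a client may request at
  # maxSessionTimeout (default 20 * tickTime), so raise that ceiling if Solr's
  # zkClientTimeout is supposed to go above ~40 seconds.
  tickTime=2000
  maxSessionTimeout=60000

  <!-- solr.xml - zkClientTimeout is the session timeout Solr requests from
       ZooKeeper; the distribUpdate* settings set the connect/read timeouts
       for inter-node update forwarding (whether recovery's prepRecovery call
       honours them depends on the Solr version, so treat that as an assumption). -->
  <solr>
    <solrcloud>
      <str name="host">${host:}</str>
      <int name="hostPort">${jetty.port:8080}</int>
      <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      <int name="distribUpdateConnTimeout">${distribUpdateConnTimeout:60000}</int>
      <int name="distribUpdateSoTimeout">${distribUpdateSoTimeout:600000}</int>
    </solrcloud>
  </solr>

None of this removes the underlying problem (slow NFS during backups starving both ZooKeeper sessions and recovery requests); it only gives the nodes more headroom to ride it out.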