The socket read timeouts used on the recovery path are actually fairly short - we should probably bump them up. Can you file a JIRA issue? It may be a symptom rather than a cause, but given a slow environment, bumping them up makes sense.
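For reference, the timeouts that are already configurable in 4.5 live in the new-style solr.xml: if memory serves, the <solrcloud> section takes distribUpdateConnTimeout/distribUpdateSoTimeout (in ms) for updates forwarded between nodes, and the shard handler factory has its own socketTimeout/connTimeout for internal distributed requests. The read timeout used on the recovery path itself appears to be set in code, which is what the JIRA should cover. Something along these lines, merged into your existing solr.xml - the values are only illustrative for a slow environment (there is also a short note on the ZooKeeper session timeout after the quoted thread below):

<solr>
  <solrcloud>
    <!-- connect/read timeouts (ms) for updates forwarded between nodes -->
    <int name="distribUpdateConnTimeout">60000</int>
    <int name="distribUpdateSoTimeout">300000</int>
  </solrcloud>
  <!-- timeouts (ms) for internal distributed requests -->
  <shardHandlerFactory name="shardHandlerFactory" class="HttpShardHandlerFactory">
    <int name="socketTimeout">120000</int>
    <int name="connTimeout">15000</int>
  </shardHandlerFactory>
</solr>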
- Mark

> On Nov 11, 2013, at 8:27 AM, Henrik Ossipoff Hansen
> <h...@entertainment-trading.com> wrote:
>
> The joy was short-lived.
>
> Tonight our environment was “down/slow” a bit longer than usual. It looks
> like two of our nodes never recovered, even though clusterstate says
> everything is active. All nodes are throwing this in the log (the nodes they
> have trouble reaching are the ones that are affected) - the error shows up
> for several cores:
>
> ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover.
> core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
>   at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
>   at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
>   at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
> Caused by: java.net.SocketTimeoutException: Read timed out
>   at java.net.SocketInputStream.socketRead0(Native Method)
>   at java.net.SocketInputStream.read(SocketInputStream.java:150)
>   at java.net.SocketInputStream.read(SocketInputStream.java:121)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
>   at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
>   at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
>   at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
>   at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
>   at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
>   at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
>   at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
>   at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
>   at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
>   at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
>   at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
>   at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
>   at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
>   ... 4 more
>
> ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
>
> --
> Henrik Ossipoff Hansen
> Developer, Entertainment Trading
>
>
> On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen
> (h...@entertainment-trading.com) wrote:
>
> Solr version is 4.5.0.
>
> I have done some tweaking.
> Doubling my ZooKeeper timeout values in zoo.cfg and the ZooKeeper timeout in
> solr.xml seemed to somewhat minimize the problem, but it still did occur. I
> next stopped all larger batch indexing in the period where the issues
> happened, which also seemed to help somewhat. Now the next thing weirds me
> out a bit - I switched from Tomcat 7 to the Jetty that ships with Solr, and
> that actually seems to have fixed the last issues (together with stopping a
> few smaller updates - very few).
>
> During the "slow period" in the night, I get something like this:
>
> 03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
> 03:06:47 ERROR Overseer Could not create Overseer node
> 03:06:47 WARN LeaderElector
> 03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
> 03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2
>
> After this, the cluster state seems to be fine, and I'm not being spammed
> with errors in the log files.
>
> Bottom line is that the issues seem fixed for now, but I still find it weird
> that Solr was not able to fully recover.
>
> // Henrik Ossipoff
>
> -----Original Message-----
> From: Mark Miller [mailto:markrmil...@gmail.com]
> Sent: 10. november 2013 19:27
> To: solr-user@lucene.apache.org
> Subject: Re: SolrCloud never fully recovers after slow disks
>
> Which version of Solr are you using? Regardless of your env, this is a
> fail-safe that you should not hit.
>
> - Mark
>
>> On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen
>> <h...@entertainment-trading.com> wrote:
>>
>> I previously made a post on this, but have since narrowed down the issue
>> and am now giving this another try, with another spin to it.
>>
>> We are running a 4-node setup (over Tomcat 7) with a 3-node external
>> ZooKeeper ensemble. This is running on a total of 7 (4+3) different VMs,
>> and each VM uses our storage system (an NFS share in VMware).
>>
>> Now I do realize, and have heard, that NFS is not the greatest thing to run
>> Solr on, but we have never had this issue on non-SolrCloud setups.
>>
>> Basically, each night when we run our backup jobs, our storage becomes a
>> bit slow to respond - this is obviously something we’re trying to solve,
>> but the bottom line is that all our other systems somehow stay alive or
>> recover gracefully when bandwidth exists again.
>> SolrCloud - not so much. Typically after a session like this, 3-5 nodes
>> will either go into a Down state or a Recovering state - and stay that way.
>> Sometimes such a node will even be marked as leader. Such a node will have
>> something like this in the log:
>>
>> ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
>> ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so.
>> Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
>>   at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
>>   at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
>>   at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
>>   at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
>>   at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
>>   at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
>>   at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
>>   at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
>>   at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
>>   at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
>>   at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
>>   at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
>>   at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
>>   at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
>>   at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
>>   at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
>>   at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
>>   at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
>>   at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
>>   at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
>>   at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>>   at java.lang.Thread.run(Thread.java:724)
>>
>> On the other nodes, an error similar to this will be in the log:
>>
>> 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
>> 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
>>
>> Does anyone have any ideas or leads towards a solution - one that doesn’t
>> involve getting a new storage system (a solution we *are* actively working
>> on, but that’s not a quick fix in our case). Shouldn’t a setup like this be
>> possible? And even more so - shouldn’t SolrCloud be able to gracefully
>> recover after issues like this?
>>
>> --
>> Henrik Ossipoff Hansen
>> Developer, Entertainment Trading
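One more note on the ZooKeeper timeout doubling mentioned earlier in the thread: on the Solr side that is zkClientTimeout in solr.xml, e.g. (illustrative value only):

<solr>
  <solrcloud>
    <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
  </solrcloud>
</solr>

Keep in mind that ZooKeeper itself caps the negotiated session timeout at maxSessionTimeout in zoo.cfg (20 x tickTime by default), so raising only the Solr-side value past that cap has no effect.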