RE: SolrCloud never fully recovers after slow disks

2013-11-11 Thread Henrik Ossipoff Hansen
The joy was short-lived.

Tonight our environment was “down/slow” a bit longer than usual, and it looks like 
two of our nodes never recovered, even though clusterstate says everything is 
active. All nodes are throwing the error below in the log (the nodes they have 
trouble reaching are the ones that are affected), and it appears for several cores:

ERROR - 2013-11-11 09:16:42.735; org.apache.solr.common.SolrException; Error while trying to recover. core=products_se_shard1_replica2:org.apache.solr.client.solrj.SolrServerException: Timeout occured while waiting response from server at: http://solr04.cd-et.com:8080/solr
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:431)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:180)
    at org.apache.solr.cloud.RecoveryStrategy.sendPrepRecoveryCmd(RecoveryStrategy.java:198)
    at org.apache.solr.cloud.RecoveryStrategy.doRecovery(RecoveryStrategy.java:342)
    at org.apache.solr.cloud.RecoveryStrategy.run(RecoveryStrategy.java:219)
Caused by: java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.read(SocketInputStream.java:150)
    at java.net.SocketInputStream.read(SocketInputStream.java:121)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.fillBuffer(AbstractSessionInputBuffer.java:166)
    at org.apache.http.impl.io.SocketInputBuffer.fillBuffer(SocketInputBuffer.java:90)
    at org.apache.http.impl.io.AbstractSessionInputBuffer.readLine(AbstractSessionInputBuffer.java:281)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:92)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:62)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:254)
    at org.apache.http.impl.AbstractHttpClientConnection.receiveResponseHeader(AbstractHttpClientConnection.java:289)
    at org.apache.http.impl.conn.DefaultClientConnection.receiveResponseHeader(DefaultClientConnection.java:252)
    at org.apache.http.impl.conn.ManagedClientConnectionImpl.receiveResponseHeader(ManagedClientConnectionImpl.java:191)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:300)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:127)
    at org.apache.http.impl.client.DefaultRequestDirector.tryExecute(DefaultRequestDirector.java:717)
    at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:522)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:906)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:805)
    at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:784)
    at org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:365)
    ... 4 more

ERROR - 2013-11-11 09:16:42.736; org.apache.solr.cloud.RecoveryStrategy; Recovery failed - trying again... (30) core=products_se_shard1_replica2
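
As a side note, the "clusterstate" above is just what Solr publishes to ZooKeeper, so it can be inspected directly. A quick sketch of how we check it (assuming a stock ZooKeeper 3.4 install; host and port are our ensemble's, adjust to yours):

    # connect to any member of the ZooKeeper ensemble
    bin/zkCli.sh -server zk01:2181
    # print the cluster state Solr maintains; every replica is listed
    # with "state":"active", "recovering" or "down"
    get /clusterstate.json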
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 10. nov. 2013 at 21.07.32, Henrik Ossipoff Hansen 
(h...@entertainment-trading.com) wrote:

[snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-11 Thread Yago Riveiro
[snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-11 Thread Mark Miller
[snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-11 Thread Henrik Ossipoff Hansen
[snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-10 Thread Mark Miller
Which version of Solr are you using? Regardless of your env, this is a failsafe 
that you should not hit. 

- Mark

 On Nov 5, 2013, at 8:33 AM, Henrik Ossipoff Hansen 
 h...@entertainment-trading.com wrote:
 
 I previously made a post on this, but have since narrowed down the issue and 
 am now giving this another try with a different spin on it.
 
 We are running a 4-node setup (on Tomcat 7) with an external 3-node ZooKeeper 
 ensemble. This runs on a total of 7 (4+3) different VMs, and each VM uses our 
 storage system (an NFS share in VMware).
  
 I do realize, and have heard, that NFS is not the greatest thing to run Solr 
 on, but we have never had this issue with non-SolrCloud setups.
  
 Basically, each night when we run our backup jobs, our storage becomes a bit 
 slow to respond - this is obviously something we’re trying to solve, but the 
 bottom line is that all our other systems somehow stay alive or recover 
 gracefully once bandwidth is available again.
 SolrCloud - not so much. Typically after an episode like this, 3-5 nodes will 
 either go into a Down state or a Recovering state - and stay that way. 
 Sometimes such a node will even be marked as leader. Such a node will have 
 something like this in the log:
 
 ERROR - 2013-11-05 08:57:45.764; org.apache.solr.update.processor.DistributedUpdateProcessor; ClusterState says we are the leader, but locally we don't think so
 ERROR - 2013-11-05 08:57:45.768; org.apache.solr.common.SolrException; org.apache.solr.common.SolrException: ClusterState says we are the leader (http://solr04.cd-et.com:8080/solr/products_fi_shard1_replica2), but locally we don't think so. Request came from http://solr01.cd-et.com:8080/solr/products_fi_shard2_replica1/
     at org.apache.solr.update.processor.DistributedUpdateProcessor.doDefensiveChecks(DistributedUpdateProcessor.java:381)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.setupRequest(DistributedUpdateProcessor.java:243)
     at org.apache.solr.update.processor.DistributedUpdateProcessor.processAdd(DistributedUpdateProcessor.java:428)
     at org.apache.solr.handler.loader.XMLLoader.processUpdate(XMLLoader.java:247)
     at org.apache.solr.handler.loader.XMLLoader.load(XMLLoader.java:174)
     at org.apache.solr.handler.UpdateRequestHandler$1.load(UpdateRequestHandler.java:92)
     at org.apache.solr.handler.ContentStreamHandlerBase.handleRequestBody(ContentStreamHandlerBase.java:74)
     at org.apache.solr.handler.RequestHandlerBase.handleRequest(RequestHandlerBase.java:135)
     at org.apache.solr.core.SolrCore.execute(SolrCore.java:1859)
     at org.apache.solr.servlet.SolrDispatchFilter.execute(SolrDispatchFilter.java:703)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:406)
     at org.apache.solr.servlet.SolrDispatchFilter.doFilter(SolrDispatchFilter.java:195)
     at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:243)
     at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:210)
     at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:224)
     at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:169)
     at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:168)
     at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:98)
     at org.apache.catalina.valves.AccessLogValve.invoke(AccessLogValve.java:927)
     at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:118)
     at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:407)
     at org.apache.coyote.http11.AbstractHttp11Processor.process(AbstractHttp11Processor.java:987)
     at org.apache.coyote.AbstractProtocol$AbstractConnectionHandler.process(AbstractProtocol.java:579)
     at org.apache.tomcat.util.net.JIoEndpoint$SocketProcessor.run(JIoEndpoint.java:307)
     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     at java.lang.Thread.run(Thread.java:724)
 
 On the other nodes, an error similar to this will be in the log:
 
 09:27:34 - ERROR - SolrCmdDistributor shard update error RetryNode: http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/:org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: Server at http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2 returned non ok status:503, message:Service Unavailable
 09:27:34 - ERROR - SolrCmdDistributor forwarding update to http://solr04.cd-et.com:8080/solr/products_dk_shard1_replica2/ failed - retrying ...
 
 Does anyone have any ideas or leads towards a solution - one that doesn’t 
 involve getting a new storage 

RE: SolrCloud never fully recovers after slow disks

2013-11-10 Thread Henrik Ossipoff Hansen
Solr version is 4.5.0.

I have done some tweaking. Doubling my ZooKeeper timeout values in zoo.cfg and 
the ZooKeeper timeout in solr.xml seemed to reduce the problem somewhat, but it 
still occurred. I then stopped all larger batch indexing during the period when 
the issues happen, which also seemed to help. Now the next thing weirds me out 
a bit - I switched from Tomcat 7 to the Jetty that ships with Solr, and that 
actually seems to have fixed the remaining issues (together with stopping a few 
smaller updates - very few).
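
For reference, the solr.xml timeout I mean is the solrcloud zkClientTimeout. A sketch of the relevant section, based on the stock new-style solr.xml that ships with 4.5 (the 30000 is just our doubled value, not a recommendation):

    <solr>
      <solrcloud>
        <str name="host">${host:}</str>
        <int name="hostPort">${jetty.port:8983}</int>
        <!-- ZooKeeper session timeout in ms; the stock default is
             15000, which we doubled while experimenting -->
        <int name="zkClientTimeout">${zkClientTimeout:30000}</int>
      </solrcloud>
    </solr>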

During the slow period in the night, I get something like this:

03:11:49 ERROR ZkController There was a problem finding the leader in zk:org.apache.solr.common.SolrException: Could not get leader props
03:06:47 ERROR Overseer Could not create Overseer node
03:06:47 WARN LeaderElector
03:06:47 WARN ZkStateReader ZooKeeper watch triggered, but Solr cannot talk to ZK
03:07:41 WARN RecoveryStrategy Stopping recovery for zkNodeName=solr04.cd-et.com:8080_solr_auto_suggest_shard1_replica2core=auto_suggest_shard1_replica2

After this, the cluster state seems to be fine, and I'm not being spammed with 
errors in the log files.

Bottom line is that the issues seem fixed for now, but I still find it weird 
that Solr was not able to fully recover.

// Henrik Ossipoff

-Original Message-
From: Mark Miller [mailto:markrmil...@gmail.com] 
Sent: 10. november 2013 19:27
To: solr-user@lucene.apache.org
Subject: Re: SolrCloud never fully recovers after slow disks

[snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-07 Thread Erick Erickson
Right, can you up your ZK timeouts significantly? It sounds like your ZK 
timeout is short enough that when your system slows down, the timeout is 
exceeded and it's throwing Solr into a tailspin.

See zoo.cfg.
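
To make that concrete, here is a sketch of the knobs involved (values and hostnames are illustrative only). ZooKeeper bounds the session timeout a client may negotiate between 2*tickTime and 20*tickTime by default, so raising tickTime, or setting maxSessionTimeout explicitly, raises the ceiling Solr's zkClientTimeout can actually use:

    # zoo.cfg - example values, not recommendations
    tickTime=2000          # base time unit in ms
    initLimit=10
    syncLimit=5
    dataDir=/var/lib/zookeeper
    clientPort=2181
    # raise the upper bound on negotiated client session timeouts
    # (default is 20*tickTime)
    maxSessionTimeout=60000
    server.1=zk01:2888:3888
    server.2=zk02:2888:3888
    server.3=zk03:2888:3888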

Best,
Erick


On Tue, Nov 5, 2013 at 3:33 AM, Henrik Ossipoff Hansen 
h...@entertainment-trading.com wrote:

 [snip]

Re: SolrCloud never fully recovers after slow disks

2013-11-07 Thread Henrik Ossipoff Hansen
Hey Erick,

I have tried upping the timeouts quite a bit now, including the zkTimeout 
setting in Solr itself (I found a few old posts on the mailing list suggesting 
this).
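
Since we now run the bundled Jetty, the simplest way I found to pass that through is the system property that the stock solr.xml already references. A sketch (our hostnames; the 30000 is just the doubled default):

    # start Solr with a 30s ZooKeeper session timeout; the stock
    # solr.xml picks this up via ${zkClientTimeout:15000}
    java -DzkClientTimeout=30000 \
         -DzkHost=zk01:2181,zk02:2181,zk03:2181 \
         -jar start.jar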

I realise this is a sort of weird situation, where we are actually trying to 
work around some horrible hardware setup.

Thank you for your post - I will make another post in a day or two after I see 
how it performs.
--
Henrik Ossipoff Hansen
Developer, Entertainment Trading


On 7. nov. 2013 at 13.23.59, Erick Erickson 
(erickerick...@gmail.com) wrote:

[snip]