SolrCloud does not recover after ZooKeeper ensemble loses (and then regains) a quorum

Kelly, Frank Tue, 15 Mar 2016 09:17:08 -0700

Just wondering whether the SolrCloud behavior I observe after ZooKeeper loses a 
quorum is normal or to be expected.


Version of Solr: 5.3.1
Version of ZooKeeper: 3.4.7
Using SolrCloud with external ZooKeeper
Deployed on AWS

Our ZooKeeper ensemble consists of three nodes with the same config, e.g.

$ more ../conf/zoo.cfg
tickTime=2000
dataDir=/var/zookeeper
dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
standaloneEnabled=false
server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888

If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a 
quorum is maintained.
Operation continues OK; we detect the terminated instance and relaunch a new 
ZK node, which comes up fine.
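
The quorum reasoning for the three-server ensemble above can be sketched as follows. This is only an illustration of majority arithmetic; the class and method names (ZkQuorum, majority, hasQuorum) are made up for this example and are not part of ZooKeeper or Solr.

```java
// Sketch only: majority-quorum arithmetic for a ZooKeeper ensemble.
// ZkQuorum, majority and hasQuorum are hypothetical names for illustration.
final class ZkQuorum {
    /** Smallest number of live servers that still forms a majority. */
    static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the surviving servers still form a majority of the ensemble. */
    static boolean hasQuorum(int ensembleSize, int liveServers) {
        return liveServers >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int ensemble = 3; // server.1 .. server.3 in the zoo.cfg above
        for (int terminated = 0; terminated <= ensemble; terminated++) {
            int live = ensemble - terminated;
            System.out.printf("terminated=%d live=%d quorum=%b%n",
                    terminated, live, hasQuorum(ensemble, live));
        }
    }
}
```

With three servers the majority is two, so losing one node keeps the quorum and losing two does not, which matches what we see.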

If we terminate two of the ZK nodes we lose the quorum, and then we observe the 
following:

1.1) Admin UI shows the following
[inline screenshot not preserved]

1.2) SolrJ returns the following

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
... 24 more

This makes sense based on our understanding.
When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix 
the DNS, etc., we regain a quorum, but at this point:

2.1) Admin UI shows the shards as “GONE”
[inline screenshot not preserved]
2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now 
bound to new IP addresses
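
One thing that may be worth ruling out (an assumption, nothing in the traces above confirms it) is that the process reporting the error still holds IP addresses it resolved before the ZooKeeper nodes were replaced. A minimal sketch of such a check, using only java.net.InetAddress; the class and method names are hypothetical:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Set;
import java.util.TreeSet;

// Sketch only: compare the addresses a host name resolved to earlier with
// what it resolves to now. DnsDriftCheck is a hypothetical helper name.
final class DnsDriftCheck {
    /** Resolves a host name to the set of IP address strings the JVM returns for it. */
    static Set<String> resolve(String host) throws UnknownHostException {
        Set<String> ips = new TreeSet<>();
        for (InetAddress address : InetAddress.getAllByName(host)) {
            ips.add(address.getHostAddress());
        }
        return ips;
    }

    /** True if the addresses seen earlier differ from the ones seen now. */
    static boolean drifted(Set<String> before, Set<String> now) {
        return !before.equals(now);
    }

    public static void main(String[] args) throws Exception {
        String host = args.length > 0 ? args[0] : "localhost";
        Set<String> before = resolve(host);
        // ... the ZooKeeper nodes would be terminated and relaunched here ...
        // Note: the JVM may cache lookups (security property
        // networkaddress.cache.ttl), so within one long-running process the
        // second call can return the cached result.
        Set<String> now = resolve(host);
        System.out.println(host + " before=" + before + " now=" + now
                + " drifted=" + drifted(before, now));
    }
}
```

Run with one of the zookeeperN host names as the argument, before and after the relaunch, to see whether the addresses changed.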

So at this point I restart the Solr nodes. After the restart:

3.1) Admin UI shows the following – yeah the nodes are back!
[inline screenshot not preserved]

3.2) SolrJ Client still shows the same error – namely

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
.
.
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
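
If the application holding the SolrJ client is not restarted along with the Solr nodes, one workaround to try is to discard the client after a failure and build a fresh one. The sketch below is not a confirmed fix and does not use SolrJ itself: RebuildingClient and Call are hypothetical names, and the factory is a placeholder where the application would construct its own client (for example a CloudSolrClient pointed at the same ZooKeeper hosts, which is an assumption about how the client is wired).

```java
import java.util.function.Supplier;

// Sketch only: on any failure, close the current client, build a new one
// from the factory, and retry the operation once.
final class RebuildingClient<C extends AutoCloseable> {
    /** An operation against the client that may throw a checked exception. */
    interface Call<T, R> {
        R apply(T client) throws Exception;
    }

    private final Supplier<C> factory;
    private C client;

    RebuildingClient(Supplier<C> factory) {
        this.factory = factory;
        this.client = factory.get();
    }

    synchronized <R> R call(Call<C, R> operation) throws Exception {
        try {
            return operation.apply(client);
        } catch (Exception firstFailure) {
            try {
                client.close();
            } catch (Exception ignored) {
                // The old client is being discarded, so close errors are not fatal.
            }
            client = factory.get();         // fresh client, fresh connections
            return operation.apply(client); // a second failure propagates to the caller
        }
    }

    public static void main(String[] args) throws Exception {
        int[] attempts = {0};
        // Stand-in client; a real factory would build the SolrJ client instead.
        Supplier<AutoCloseable> factory = () -> () -> { };
        RebuildingClient<AutoCloseable> wrapper = new RebuildingClient<>(factory);
        String result = wrapper.call(c -> {
            if (attempts[0]++ == 0) {
                throw new IllegalStateException("simulated ConnectionLoss");
            }
            return "ok";
        });
        System.out.println(result + " after " + attempts[0] + " attempts");
    }
}
```

Retrying only once keeps the wrapper from looping while the ensemble is still without a quorum; a real application would probably want backoff and logging around it.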

I have a few questions:
1) Is this behavior (lack of self-healing) known?
2) Is this the same as, or similar to, the behavior documented here: 
https://issues.apache.org/jira/browse/SOLR-5129
3) If it is not covered by #2, should I log it in JIRA?

Thanks and Best Wishes,

-Frank

P.S. I can add Solr log files if they will help.

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)
HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
