Just wondering whether the SolrCloud behavior I observe after ZooKeeper loses its quorum is normal or to be expected.

Version of Solr: 5.3.1
Version of ZooKeeper: 3.4.7
Using SolrCloud with external ZooKeeper
Deployed on AWS

Our ZooKeeper ensemble consists of three nodes with the same config, e.g.

$ more ../conf/zoo.cfg
tickTime=2000
dataDir=/var/zookeeper
dataLogDir=/var/log/zookeeper
clientPort=2181
initLimit=10
syncLimit=5
standaloneEnabled=false
server.1=zookeeper1.qa.eu-west-1.mysearch.com:2888:3888
server.2=zookeeper2.qa.eu-west-1.mysearch.com:2888:3888
server.3=zookeeper3.qa.eu-west-1.mysearch.com:2888:3888

If we terminate one of the ZooKeeper nodes we get a ZK election and (I think) a quorum is maintained. Operation continues OK, and we detect the terminated instance and relaunch a new ZK node, which comes up fine.

If we terminate two of the ZK nodes we lose the quorum, and then we observe the following:

1.1) Admin UI shows the following
[cid:7B4ADA74-9257-4B60-8109-F8EF0C4E2125]

1.2) SolrJ returns the following

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_public_index
    at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
    at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:107)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:72)
    at org.apache.solr.client.solrj.SolrClient.add(SolrClient.java:86)
    at com.here.scbe.search.solr.SolrFacadeImpl.addToSearchIndex(SolrFacadeImpl.java:112)
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_public_index/state.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)
    ... 24 more

This makes sense based on our understanding.

When our AutoScale groups launch two new ZooKeeper nodes, initialize them, fix the DNS etc., we regain a quorum, but at this point

2.1) Admin UI shows the shards as "GONE"
[cid:DC7412DD-FF95-4DE1-AA4E-9C6F7A47C74C]

2.2) SolrJ returns the same error even though the ZooKeeper DNS names are now bound to new IP addresses

So I restart the Solr nodes. After the restart:

3.1) Admin UI shows the following - yeah, the nodes are back!
[cid:765921A3-CE96-4989-9C46-838F96A8F05B]

3.2) SolrJ client still shows the same error, namely

org.apache.solr.common.SolrException: Could not load collection from ZK:qa_eu-west-1_here_account
    at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:850)
    at org.apache.solr.common.cloud.ZkStateReader$7.get(ZkStateReader.java:515)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.getDocCollection(CloudSolrClient.java:1205)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.requestWithRetryOnStaleState(CloudSolrClient.java:837)
    at org.apache.solr.client.solrj.impl.CloudSolrClient.request(CloudSolrClient.java:805)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:135)
    at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:825)
    at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:788)
    at org.apache.solr.client.solrj.SolrClient.deleteById(SolrClient.java:803)
    at com.here.scbe.search.solr.SolrFacadeImpl.deleteById(SolrFacadeImpl.java:257)
    . .
Caused by: org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /collections/qa_eu-west-1_here_account/state.json
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:345)
    at org.apache.solr.common.cloud.SolrZkClient$7.execute(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:61)
    at org.apache.solr.common.cloud.SolrZkClient.getData(SolrZkClient.java:342)
    at org.apache.solr.common.cloud.ZkStateReader.getCollectionLive(ZkStateReader.java:841)

I have a few questions:

1) Is this behavior (lack of self-healing) known?
2) Is this the same as, or similar to, the behavior documented here: https://issues.apache.org/jira/browse/SOLR-5129
3) If it is not covered by #2, should I log it in JIRA?

Thanks and Best Wishes,
-Frank

p.s. I can add Solr log files if they will help

Frank Kelly
Principal Software Engineer
Predictive Analytics Team (SCBE/HAC/CDA)
HERE
5 Wayside Rd, Burlington, MA 01803, USA
42° 29' 7" N 71° 11' 32" W
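
For reference, here is a minimal sketch of the majority arithmetic behind the one-node and two-node termination scenarios described above. It assumes ZooKeeper's default majority quorum over the three server.N entries in zoo.cfg. The class and method names (QuorumCheck, majority, hasQuorum) are made up for illustration and are not part of ZooKeeper or Solr.

```java
/** Illustrative only: majority-quorum arithmetic for a ZooKeeper ensemble. */
public class QuorumCheck {

    /** Smallest number of servers that forms a strict majority of the ensemble. */
    public static int majority(int ensembleSize) {
        return ensembleSize / 2 + 1;
    }

    /** True if the servers still running are enough to form a quorum. */
    public static boolean hasQuorum(int ensembleSize, int serversUp) {
        return serversUp >= majority(ensembleSize);
    }

    public static void main(String[] args) {
        int ensembleSize = 3; // assumption: server.1 .. server.3 from zoo.cfg
        for (int terminated = 0; terminated <= ensembleSize; terminated++) {
            int up = ensembleSize - terminated;
            System.out.printf("terminated=%d up=%d quorum=%b%n",
                    terminated, up, hasQuorum(ensembleSize, up));
        }
    }
}
```

With three servers, terminating one leaves 2 of 3 running, which is still a majority. Terminating two leaves 1 of 3, which is not, and that matches the point at which the ConnectionLoss errors start.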