Hi Team,

A gentle reminder and request for help with this issue, as we are stuck. The Solr Admin UI also fails to open, so we cannot debug further. It throws this error:
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /overseer/collection-queue-work/qn-
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
    at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:243)
    at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:240)
    at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
    at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:240)
    at org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:311)
    at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:330)
    at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:344)
    at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:309)
    at org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:628)

Our preference is to bring the cluster back to a stable state with minimal downtime and no data loss, or at least to get the Admin UI to open again. Please suggest.

Thanks

On Sun, May 12, 2024 at 9:40 PM Sarthak Sharma <jgdsart...@gmail.com> wrote:

> Hi,
>
> We have a production Solr cluster set up with 4 shards and 4 replicas on a
> legacy stack.
> The same machines also host a 5-node ZooKeeper ensemble.
>
> Solr version : 1.1.0.41
> ZK version : 3.4.6
>
> A few days back, one of the Solr processes got stuck, because of which reads
> and writes were failing. We did a few rounds of ZK/Solr restarts after
> clearing up disk space, and read operations started working fine,
> but the write operations (indexing) started failing with the error below:
>
> Caused by:
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
> ClusterState says we are the leader (<host>:<port>/solr/<collection-name>),
> but locally we don't think so. Request came from null
>
> We checked the cluster state using the steps below. (The Solr UI is not
> accessible for some reason.)
>
>
> 1. Go to the path "zookeeper/installation/directory/bin"
> 2. ./zkCli.sh -server localhost:1234
> 3. get /clusterstate.json
>
>
> We see that 3 shard replica nodes are STUCK in the 'recovering' state.
>
> The cluster is of critical importance, and we want to make the smallest
> safe set of changes needed to bring it back to a stable state. Upgrading
> versions is not possible either.
>
>
> Please help us understand this behavior and the way out of it.
> Fixing this issue is critical and urgent, and we don't have enough
> Solr expertise in the team.
> This cluster was mainly in maintenance mode and on a deprecation path,
> hence the situation.
>
> Is there a way to force replication from a stable node to an unstable node?
> Please let me know your thoughts. We really appreciate any help.
>
> Thanks,
> Sarthak
>
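
In case it helps with reading the output of step 3 above, here is a rough sketch of how the /clusterstate.json content could be scanned for replicas that are not 'active'. It is only an illustration, not something we have run against this cluster. It assumes the JSON follows the collection -> "shards" -> "replicas" layout with a "state" and "node_name" per replica, and that only the JSON part of the zkCli output (without the trailing stat lines) has been saved to a file. The function name and the file argument are hypothetical.

```python
import json
import sys


def non_active_replicas(clusterstate_text):
    """Return (collection, shard, replica, state, node_name) for every
    replica whose state is not 'active'.

    Assumed layout (not verified against this cluster):
    {collection: {"shards": {shard: {"replicas": {replica: {"state": ...}}}}}}
    """
    cluster = json.loads(clusterstate_text)
    problems = []
    for collection, coll_info in cluster.items():
        for shard, shard_info in coll_info.get("shards", {}).items():
            for replica, info in shard_info.get("replicas", {}).items():
                if info.get("state") != "active":
                    problems.append(
                        (collection, shard, replica,
                         info.get("state"), info.get("node_name"))
                    )
    return problems


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_state.py clusterstate.json
    with open(sys.argv[1]) as fh:
        for row in non_active_replicas(fh.read()):
            print(row)
```

Each row names the collection, shard, and replica that is stuck, which should make it easier to see which shards still have an active replica to recover from.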