A correction: the Solr version is actually 4.8.1. Sorry for the miss.

On Tue, May 14, 2024 at 10:55 AM Sarthak Sharma <jgdsart...@gmail.com> wrote:

> Hi Team,
>
> Gentle reminder and request to help with this issue, as we are kind of
> stuck. The Solr admin UI is also not opening up for us to do further
> debugging and throws this error:
>
> org.apache.zookeeper.KeeperException$ConnectionLossException:
> KeeperErrorCode = ConnectionLoss for /overseer/collection-queue-work/qn-
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>   at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
>   at org.apache.zookeeper.ZooKeeper.create(ZooKeeper.java:783)
>   at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:243)
>   at org.apache.solr.common.cloud.SolrZkClient$5.execute(SolrZkClient.java:240)
>   at org.apache.solr.common.cloud.ZkCmdExecutor.retryOperation(ZkCmdExecutor.java:73)
>   at org.apache.solr.common.cloud.SolrZkClient.create(SolrZkClient.java:240)
>   at org.apache.solr.cloud.DistributedQueue.createData(DistributedQueue.java:311)
>   at org.apache.solr.cloud.DistributedQueue.offer(DistributedQueue.java:330)
>   at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:344)
>   at org.apache.solr.handler.admin.CollectionsHandler.handleResponse(CollectionsHandler.java:309)
>   at org.apache.solr.handler.admin.CollectionsHandler.handleClusterStatus(CollectionsHandler.java:628
>
> Our preference is to bring the cluster back to a stable state with low
> downtime and no data loss, or at least to get the admin UI opening
> again. Please suggest.
>
> Thanks
>
> On Sun, May 12, 2024 at 9:40 PM Sarthak Sharma <jgdsart...@gmail.com>
> wrote:
>
>> Hi,
>>
>> We have a production Solr cluster set up with 4 shards and 4 replicas
>> on a legacy stack. The same machines also host a 5-node ZooKeeper
>> ensemble.
>>
>> Solr version : 4.8.1
>> ZK version : 3.4.6
>>
>> A few days back, one of the Solr processes got stuck, because of which
>> reads and writes were failing. We did a few rounds of ZK/Solr restarts
>> after clearing up disk space, and read operations started working
>> fine, but the write operations (indexing) started failing with the
>> error below:
>>
>> Caused by:
>> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
>> ClusterState says we are the leader (<host>:<port>/solr/<collection-name>),
>> but locally we don't think so. Request came from null
>>
>> We checked the cluster state using the steps below. (The Solr UI is
>> not accessible for some reason.)
>>
>> 1. Go to the path "zookeeper/installation/directory/bin"
>> 2. ./zkCli.sh -server localhost:1234
>> 3. get /clusterstate.json
>>
>> We see that 3 shard replica nodes are STUCK in the 'recovering' state.
>>
>> The cluster is of critical importance, and we want to make the
>> smallest and safest possible changes to bring it back to a stable
>> state. Upgrading versions is not possible either.
>>
>> Please help us understand this behavior and the way out of it.
>> Fixing this issue is really critical and urgent, and we don't have
>> enough Solr expertise in the team. This cluster was mainly in
>> maintenance mode and on a deprecation path, hence the situation.
>>
>> Is there a way to force replication from a stable node to an unstable
>> node? Please let me know your thoughts. Really appreciate any help.
>>
>> Thanks,
>> Sarthak
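
For anyone who wants to repeat the cluster state check from the quoted steps without reading the raw JSON by eye, here is a minimal sketch that lists every replica not marked 'active'. It assumes the output of `get /clusterstate.json` has been saved as text and follows the usual Solr 4.x layout (collection, then "shards", then "replicas", each replica carrying "state" and "base_url"). The function name, the sample collection, and the host names are made up for illustration and are not taken from our cluster.

```python
import json


def non_active_replicas(clusterstate_text):
    """Return (collection, shard, replica, state, base_url) for each
    replica whose state is anything other than 'active'.

    Assumes the Solr 4.x clusterstate.json layout:
    {collection: {"shards": {shard: {"replicas": {replica: {...}}}}}}
    """
    cluster = json.loads(clusterstate_text)
    problems = []
    for coll_name, coll in cluster.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for replica_name, replica in shard.get("replicas", {}).items():
                if replica.get("state") != "active":
                    problems.append((
                        coll_name,
                        shard_name,
                        replica_name,
                        replica.get("state"),
                        replica.get("base_url"),
                    ))
    return problems


# Hypothetical sample; real data would come from `get /clusterstate.json`.
SAMPLE = """
{
  "collection1": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "active", "leader": "true",
                         "base_url": "http://host1:8983/solr"},
          "core_node2": {"state": "recovering",
                         "base_url": "http://host2:8983/solr"}
        }
      }
    }
  }
}
"""

if __name__ == "__main__":
    for row in non_active_replicas(SAMPLE):
        print(" / ".join(str(field) for field in row))
```

This only reads the state as recorded in ZooKeeper and changes nothing on the cluster. A replica listed as 'active' may still belong to a node that is down, so the result is probably worth cross-checking against /live_nodes.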