Hi,

We have a production Solr cluster with 4 shards and 4 replicas, running on a
legacy stack.
The same machines also host a 5-node ZooKeeper ensemble.

Solr version : 1.1.0.41
ZK version : 3.4.6

A few days ago, one of the Solr processes got stuck, which caused reads and
writes to fail. After clearing up disk space we did a few rounds of ZK/Solr
restarts, and read operations started working again. However, write
operations (indexing) now fail with the error below:

Caused by:
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException:
ClusterState says we are the leader (<host>:<port>/solr/<collection-name>),
but locally we don't think so. Request came from null

We checked the cluster state using the steps below (the Solr UI is not
accessible for some reason):


   1. Go to the path "zookeeper/installation/directory/bin"
   2. ./zkCli.sh -server localhost:1234
   3. get /clusterstate.json


We see that 3 shard replicas are stuck in the 'recovering' state.
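
For anyone who wants to repeat this check without reading the JSON by eye,
here is a minimal Python sketch that lists every replica whose state is not
'active'. It assumes the usual SolrCloud 4.x layout of /clusterstate.json
(collection -> "shards" -> shard -> "replicas" -> replica, with "state" and
an optional "leader": "true"); the function name and the sample data are
made up for illustration and are not taken from our cluster. Note that
zkCli's "get" prints znode stat fields after the data, so only the JSON
portion should be passed in.

```python
import json


def non_active_replicas(clusterstate_text):
    """Return (collection, shard, replica, state, is_leader) tuples for
    every replica whose state is not 'active'.

    Assumes the SolrCloud 4.x clusterstate.json layout described above.
    """
    cluster = json.loads(clusterstate_text)
    problems = []
    for coll_name, coll in cluster.items():
        for shard_name, shard in coll.get("shards", {}).items():
            for rep_name, rep in shard.get("replicas", {}).items():
                state = rep.get("state")
                if state != "active":
                    # The leader flag is stored as the string "true".
                    is_leader = rep.get("leader") == "true"
                    problems.append(
                        (coll_name, shard_name, rep_name, state, is_leader)
                    )
    return problems


# Hypothetical sample data, only to show the expected input shape.
SAMPLE = """
{
  "mycollection": {
    "shards": {
      "shard1": {
        "state": "active",
        "replicas": {
          "core_node1": {"state": "active", "leader": "true"},
          "core_node2": {"state": "recovering"}
        }
      },
      "shard2": {
        "state": "active",
        "replicas": {
          "core_node3": {"state": "recovering"}
        }
      }
    }
  }
}
"""

if __name__ == "__main__":
    for row in non_active_replicas(SAMPLE):
        print(row)
```

A shard that shows up here with no replica marked as leader would be
consistent with the "ClusterState says we are the leader" error above,
though that is only a guess on our side.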

The cluster is of critical importance, and we want to bring it back to a
stable state with the smallest and safest possible changes. Upgrading
versions is not an option.


Please help us understand this behavior and how to get out of it.
Fixing this issue is critical and urgent, and we don't have enough
Solr expertise on the team.
The cluster has mainly been in maintenance mode and on a deprecation path,
hence the situation.

Is there a way to force replication from a stable node to an unstable node?
Please let me know your thoughts. We really appreciate any help.

Thanks,
Sarthak
