Hi,

We are running a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZooKeeper ensemble in AWS. There are about 60 collections at any point in time, and each JVM has a max heap of 8 GB.
The problem: for a few collections, some replicas are in the "recovering" state and some are "down". Since each shard has 2 replicas, the system is still functional even though a few replicas are unhealthy. Currently less than 50% of the heap is in use, free physical memory is available, and GC looks fine.

We think the issue may have started when we accidentally read the ZooKeeper transaction logs (to count ZK transactions; we now understand this is not good practice) during a Solr data load. The load failed at that time because Solr could not find the leader, with this error: "*Cannot talk to ZooKeeper - Updates are disabled*". We stopped reading the logs. However, this changed the Solr leader; since then loads have worked fine, but the leader remains switched. Detailed *error message 1 <https://pastebin.com/embed_iframe/wcp3L9nk>*

As described above, a few collection replicas remain in the recovering and down states. In the past, restarting the Solr server has brought them back to normal, but we want to understand whether there is a way to get them back to normal (all synced with ZooKeeper) through the command line or the admin interface.

Another question: can being in this state cause data issues? How would we check that (for example, comparing per-replica counts with distrib=false on each collection)? We predominantly use Solr real-time GET by key in our application.

Regards,
Ganesh
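
P.S. To make the distrib=false idea concrete, this is roughly the check we have in mind. It is only a sketch under assumptions: the helper names are ours (hypothetical), it expects the parsed JSON of a Collections API CLUSTERSTATUS call as input, and it assumes commits are aligned across replicas, since counts could differ briefly otherwise.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_cores(cluster_status, collection):
    """Yield (shard, replica, base_url, core, state) for each replica.

    cluster_status is the parsed JSON from
    /solr/admin/collections?action=CLUSTERSTATUS&wt=json
    """
    shards = cluster_status["cluster"]["collections"][collection]["shards"]
    for shard_name, shard in shards.items():
        for replica_name, replica in shard["replicas"].items():
            yield (shard_name, replica_name, replica["base_url"],
                   replica["core"], replica["state"])


def count_url(base_url, core):
    """Build a query that counts documents on one core only (no fan-out)."""
    params = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})
    return f"{base_url}/{core}/select?{params}"


def fetch_count(base_url, core, timeout=10):
    """Ask a single core for its local numFound (needs network access)."""
    with urlopen(count_url(base_url, core), timeout=timeout) as resp:
        return json.load(resp)["response"]["numFound"]


def mismatched_shards(counts):
    """counts is {shard: {replica: numFound}}; keep shards whose replicas disagree."""
    return {shard: per_replica for shard, per_replica in counts.items()
            if len(set(per_replica.values())) > 1}
```

The plan would be to call fetch_count for every replica reported as active, group the results by shard, and treat any shard returned by mismatched_shards as a candidate for a closer look.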