Hi

We are running a 2-node SolrCloud 7.2.1 cluster with an external 3-node ZK
ensemble in AWS. There are about 60 collections at any point in time. Each
JVM has a max heap of 8GB.

The problem: for a few collections, some replicas are in the "recovering"
state and some are "down". Since we have 2 replicas for each shard, the
system is still functional even though a few replicas are unhealthy.
Currently less than 50% of the heap is in use and there is free physical
memory available as well. GC seems to be fine now.
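
To make the replica states concrete, here is a minimal sketch of how the non-active replicas could be listed from the Collections API CLUSTERSTATUS response. The node address in the usage note is hypothetical, and the helper names are for illustration only; they are not part of Solr.

```python
import json
from urllib.request import urlopen


def fetch_cluster_status(solr_url):
    """Fetch CLUSTERSTATUS as a dict from any live Solr node."""
    url = solr_url.rstrip("/") + "/admin/collections?action=CLUSTERSTATUS&wt=json"
    with urlopen(url, timeout=30) as resp:
        return json.load(resp)


def unhealthy_replicas(status):
    """Return (collection, shard, replica, state) for each replica not 'active'."""
    found = []
    collections = status.get("cluster", {}).get("collections", {})
    for coll_name, coll in sorted(collections.items()):
        for shard_name, shard in sorted(coll.get("shards", {}).items()):
            for replica_name, replica in sorted(shard.get("replicas", {}).items()):
                state = replica.get("state")
                if state != "active":
                    found.append((coll_name, shard_name, replica_name, state))
    return found
```

Calling `unhealthy_replicas(fetch_cluster_status("http://localhost:8983/solr"))` (address assumed) would return one tuple per replica that is in the recovering or down state.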

We think the issue may have started when we were accidentally reading the
ZooKeeper transaction logs (to see the count of ZK transactions; we now
understand this is not a good practice) during a Solr data load. The load
failed during that time, as Solr was not able to find the leader, with this
error: "*Cannot talk to ZooKeeper - Updates are disabled*". We stopped
reading the logs at that point. However, this changed the Solr leader. Since
then we have been able to load data just fine, but the leader remains switched.
Detailed *error message 1 <https://pastebin.com/embed_iframe/wcp3L9nk>*

As described above, a few collection replicas remain in the recovering and
down states. In the past we have seen them come back to normal after
restarting the Solr server, but we would like to understand whether there is
any way to get them back to normal (all synced up with ZooKeeper) through the
command line or the admin interface. Another question: can being in this
state cause data issues? How do we check that (a count query with
distrib=false against the collection's replicas)?
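
The distrib=false check mentioned above could look roughly like the following. This is a minimal sketch, assuming each replica's `base_url` and `core` name are taken from the cluster state; the function names are hypothetical and the comparison logic is only one possible way to do it.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen


def replica_count_url(base_url, core):
    """Build a non-distributed match-all count query for one replica core."""
    params = urlencode({"q": "*:*", "rows": 0, "distrib": "false", "wt": "json"})
    return f"{base_url.rstrip('/')}/{core}/select?{params}"


def replica_count(base_url, core):
    """Return numFound as seen by this single core only."""
    with urlopen(replica_count_url(base_url, core), timeout=30) as resp:
        return json.load(resp)["response"]["numFound"]


def mismatched_shards(counts):
    """counts maps (collection, shard) -> {replica: numFound}.

    Returns only the shards whose replicas report different counts.
    """
    return {key: per_replica for key, per_replica in counts.items()
            if len(set(per_replica.values())) > 1}
```

Counts can differ briefly between replicas while indexing is in progress, so a comparison like this is probably only meaningful after a commit and while no load is running.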

We predominantly use Solr realtime GET by key in our application.

Regards,
Ganesh
