Hi all! We're seeing some behaviour with SolrCloud + ZooKeeper in a Kubernetes environment that I'm looking for some help/clarification with.
Specifically, if a ZK cluster is not serving requests because 2 of its 3 nodes are down, Solr nodes that attempt to connect to it will fail and not become Ready. Which makes sense. However, after the ZK cluster has recovered - i.e. all 3 nodes are up and running again - the Solr nodes that failed initially continue to NOT sort themselves out. From what we can see, those Solr nodes simply never attempt to reconnect to ZK, so they never discover that ZK is available again. Is this expected behaviour? And how can we automatically recover from this?

We're using the Bitnami Solr & ZooKeeper Helm charts, and our setup is:
- a ZK StatefulSet with 3 replicas
- a SolrCloud StatefulSet with 3 replicas

The way we've been able to recreate this is as follows:
~~
- Scale the ZK cluster down to 1 replica and verify that it is now unhappy and not serving requests.
- Verify that all 3 SolrCloud pods notice that ZK isn't happy. These are the error messages:

  2022-09-13 21:37:01.741 WARN (main-SendThread(solr-v1-base-zookeeper-headless:2181)) [ ] o.a.z.ClientCnxn Session 0x0 for server solr-v1-base-zookeeper-headless/ 10.64.0.69:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. => EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket

- Delete 2 of the SolrCloud pods and wait for them to start back up. They then complain about not being able to reach ZK:

  org.apache.solr.common.SolrException: Error occurred while loading solr.xml from zookeeper

- The 2 deleted SolrCloud pods stay in a not-ready state, reporting that their cores are not initialised or are shutting down. The single pod that was *not* deleted keeps working as normal.
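For reference, the reproduction steps above correspond roughly to the following commands. The resource names (solr-v1-base-*) and label selector are from our environment and will differ for other Helm release names:

```shell
# Reproduction sketch -- resource names are from our setup; adjust to yours.

# 1. Scale ZK down to 1 replica so the ensemble loses quorum.
kubectl scale statefulset solr-v1-base-zookeeper --replicas=1

# 2. Delete 2 of the 3 SolrCloud pods so they restart while ZK has no quorum.
kubectl delete pod solr-v1-base-solr-0 solr-v1-base-solr-1

# 3. Restore the ZK ensemble and wait for it to report healthy.
kubectl scale statefulset solr-v1-base-zookeeper --replicas=3

# 4. The two restarted Solr pods stay not-ready even though ZK is healthy again.
kubectl get pods -l app.kubernetes.io/name=solr
```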
- Scale the ZK cluster back up to 3 replicas and verify that it is now happy again.
- Observe that the 2 SolrCloud pods that started up while ZK was unhappy continue to NOT become ready, and continue to log the same error:

  javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
  o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down.
  o.e.j.s.HttpChannel /solr/admin/info/system => javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
~~

So what I'm trying to verify: it looks like Solr doesn't attempt to reconnect to ZK if it failed to connect at startup. Is that intentional? Is there a way to get it to do so? Thank you!
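In case it helps frame the "automatically recover" part: one workaround we've sketched (untested - this is our guess, not something the chart ships) is a liveness probe against the same endpoint that is failing above, so Kubernetes restarts a Solr container that stays stuck in the CoreContainer-not-initialized state. The path and port are from our deployment; the thresholds would need tuning so a normal slow startup isn't killed prematurely:

```yaml
# Sketch of a possible workaround (untested): restart the Solr container
# when /solr/admin/info/system keeps failing. Values are assumptions for
# our chart and would need tuning.
livenessProbe:
  httpGet:
    path: /solr/admin/info/system
    port: 8983
  initialDelaySeconds: 60
  periodSeconds: 15
  failureThreshold: 8
```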
