I don't know the Bitnami Solr chart (I used the ZK one), but I would expect it to configure a livenessProbe that causes k8s to cycle the pod if it is not alive?
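As a minimal sketch of the idea, a livenessProbe on the Solr container could hit the same endpoint the error logs below show failing. The port, delays, and thresholds here are assumptions, not the Bitnami chart's defaults; check the chart's values for the actual probe keys:

```yaml
# Hypothetical probe for the Solr container (values are assumptions).
livenessProbe:
  httpGet:
    path: /solr/admin/info/system   # endpoint that returns 503 while CoreContainer is down
    port: 8983
  initialDelaySeconds: 60           # give Solr time to start before probing
  periodSeconds: 10
  failureThreshold: 6               # ~1 minute of failures before kubelet restarts the pod
```

With something like this in place, a pod stuck in the "CoreContainer is either not initialized or shutting down" state would eventually fail the probe and be restarted, at which point it retries the ZK connection from scratch.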
Jan

> On 14 Sep 2022, at 03:26, Jonathan Tan <[email protected]> wrote:
>
> Hi All!
>
> We're coming across a behaviour in SolrCloud + ZK in a Kubernetes
> environment that I was looking for some help/clarification with.
>
> Specifically, if a ZK cluster is not serving requests because 2 of the 3
> nodes are down, Solr nodes that attempt to connect to it will fail and not
> become Ready. Which makes sense.
> However, after the ZK cluster has fixed itself - i.e. all 3 nodes are up
> and running again - the Solr nodes that were not able to sort themselves
> out initially continue to NOT sort themselves out. From what we can see,
> those Solr nodes simply do not attempt to reconnect to ZK, so they never
> discover that ZK is available again.
>
> Is this expected behaviour? And how can we automatically recover from this?
>
> We're using the Bitnami Solr & ZooKeeper Helm charts, and our setup is:
> - a ZK stateful set with 3 replicas
> - a SolrCloud stateful set with 3 replicas
>
> The way we've been able to recreate this is as follows:
>
> ~~
> - Scale the ZK cluster down to 1 replica and verify that it is now unhappy
>   and not serving requests.
> - Verify that all 3 SolrCloud pods note that ZK isn't happy.
>
> These are the error messages:
>
> 2022-09-13 21:37:01.741 WARN
> (main-SendThread(solr-v1-base-zookeeper-headless:2181)) [ ]
> o.a.z.ClientCnxn Session 0x0 for sever solr-v1-base-zookeeper-headless/
> 10.64.0.69:2181, Closing socket connection. Attempting reconnect except it
> is a SessionExpiredException. => EndOfStreamException: Unable to read
> additional data from server sessionid 0x0, likely server has closed socket
>
> - Delete 2 of the SolrCloud pods and wait for them to start back up. They
>   then complain about not being able to reach ZK:
>
> org.apache.solr.common.SolrException: Error occurred while loading solr.xml
> from zookeeper
>
> - The 2 deleted SolrCloud pods stay in a not-ready state, reporting that
>   their cores are not initialised or are shutting down. The single pod that
>   was *not* deleted keeps working as normal.
> - Scale the ZK cluster back up to 3 replicas and verify that it is now
>   happy again.
> - Observe that the 2 SolrCloud pods that started up while ZK was unhappy
>   continue to NOT become ready, and continue to have the same error in the
>   logs:
>
> javax.servlet.ServletException: javax.servlet.UnavailableException: Error
> processing the request. CoreContainer is either not initialized or shutting
> down.
> o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is
> either not initialized or shutting down.
> o.e.j.s.HttpChannel /solr/admin/info/system =>
> javax.servlet.ServletException: javax.servlet.UnavailableException: Error
> processing the request. CoreContainer is either not initialized or shutting
> down.
> ~~
>
> So what I'm trying to verify:
> It looks like Solr doesn't attempt to reconnect to ZK if it has previously
> failed. Is that intentional? Is there a way to get it to do so?
>
> Thank you!
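For anyone trying to reproduce this, the quoted steps boil down to a few kubectl commands against a disposable cluster. The statefulset and pod names below are assumptions inferred from the hostnames in the logs; substitute your actual release names:

```shell
# 1. Break quorum: scale ZK from 3 replicas to 1 (name is an assumption).
kubectl scale statefulset solr-v1-base-zookeeper --replicas=1

# 2. Restart two Solr pods while ZK has no quorum; they come back not-ready.
kubectl delete pod solr-v1-base-0 solr-v1-base-1

# 3. Restore ZK quorum.
kubectl scale statefulset solr-v1-base-zookeeper --replicas=3

# 4. Watch: the two restarted Solr pods stay not-ready even after ZK recovers.
kubectl get pods -w
```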
