I don't know the Bitnami Solr chart (I used the ZK one), but I would expect it 
to come with a livenessProbe configured that would cause k8s to cycle the pod 
if it is not alive?
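
If it doesn't, a minimal sketch of such a probe on the Solr container might look like the following. The port and endpoint here are assumptions, not values taken from the Bitnami chart: 8983 is Solr's default HTTP port, and /solr/admin/info/system is the endpoint that shows up failing in the quoted logs.

```yaml
# Sketch only - assumed values, not taken from the Bitnami Solr chart.
livenessProbe:
  httpGet:
    path: /solr/admin/info/system   # the endpoint seen erroring in the quoted logs
    port: 8983                      # Solr's default HTTP port
  initialDelaySeconds: 60           # give Solr time to start before probing
  periodSeconds: 10
  failureThreshold: 6               # restart after roughly a minute of failures
```

Since a CoreContainer that is "not initialized or shutting down" makes that endpoint return an error, a probe along these lines should eventually restart the stuck pods instead of leaving them not-ready forever.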

Jan

> 14. sep. 2022 kl. 03:26 skrev Jonathan Tan <[email protected]>:
> 
> Hi All!
> 
> We're coming across a behaviour in SolrCloud + ZK in a kubernetes
> environment that I was looking for some help/clarification with.
> 
> Specifically, if a ZK cluster is not serving requests because 2 of the 3
> nodes are down, Solr nodes that attempt to connect to it will fail and not
> become Ready - which makes sense.
> However, after the ZK cluster has recovered - i.e. all 3 nodes are up and
> running again - the Solr nodes that failed initially continue to NOT recover.
> From what we can see, those Solr nodes simply never attempt to reconnect to
> ZK, and so they never discover that ZK is available again.
> 
> Is this expected behaviour? And how can we automatically recover from this?
> We're using the Bitnami Solr & ZooKeeper Helm charts, and our setup is:
> - a ZK stateful set with 3 replicas
> - a SolrCloud stateful set with 3 replicas
> 
> The way we've been able to recreate this is as follows:
> 
> ~~
> - scale the ZK cluster down to 1 replica and verify that now it is unhappy
> and not serving requests
> - verify that all the 3 SolrCloud pods note that ZK isn't happy.
> 
> These are the error messages:
> 2022-09-13 21:37:01.741 WARN
> (main-SendThread(solr-v1-base-zookeeper-headless:2181)) [ ]
> o.a.z.ClientCnxn Session 0x0 for sever solr-v1-base-zookeeper-headless/
> 10.64.0.69:2181, Closing socket connection. Attempting reconnect except it
> is a SessionExpiredException. => EndOfStreamException: Unable to read
> additional data from server sessionid 0x0, likely server has closed socket
> 
> - delete 2 of the SolrCloud pods and wait for them to start back up. they'd
> then complain about not being able to get to ZK
> org.apache.solr.common.SolrException: Error occurred while loading solr.xml
> from zookeeper
> 
> - the 2 deleted SolrCloud pods would consistently stay in a not-ready state,
> reporting that this is because their cores are not initialised or are
> shutting down. The single pod that was *not* deleted keeps working as per
> normal.
> 
> - scale the ZK cluster back up to 3 replicas and verify that it is now
> happy again
> - observe that the 2 SolrCloud pods that had started up when ZK was unhappy
> continue to NOT become ready, and continue to have the same error in the
> logs.
> 
> javax.servlet.ServletException: javax.servlet.UnavailableException: Error
> processing the request. CoreContainer is either not initialized or shutting
> down.
> o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is
> either not initialized or shutting down.
> o.e.j.s.HttpChannel /solr/admin/info/system =>
> javax.servlet.ServletException: javax.servlet.UnavailableException: Error
> processing the request. CoreContainer is either not initialized or shutting
> down.
> 
> ~~
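> 
> For anyone wanting to script the repro, the steps above roughly correspond
> to the commands below. The statefulset and pod names are assumptions based
> on a typical Bitnami release (the headless service name from our logs) -
> substitute your own:
> 
> ```shell
> # Assumed resource names - adjust to your release.
> kubectl scale statefulset solr-v1-base-zookeeper --replicas=1   # break ZK quorum
> kubectl delete pod solr-v1-base-solr-0 solr-v1-base-solr-1      # restart 2 Solr pods
> kubectl get pods -w                                             # watch them stay not-ready
> kubectl scale statefulset solr-v1-base-zookeeper --replicas=3   # restore ZK quorum
> # The 2 restarted Solr pods still never become Ready.
> ```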
> 
> So what I'm trying to verify...
> It looks like Solr doesn't attempt to reconnect to ZK if it has previously
> failed to. Is that intentional? Is there a way to get it to do so?
> 
> Thank you!
