Hi all! We're seeing some behaviour with SolrCloud + ZooKeeper in a Kubernetes environment that I'm looking for some help/clarification with.
Specifically, if a ZK cluster is not serving requests because 2 of its 3 nodes are down, Solr nodes that attempt to connect to it will fail and not become Ready. Which makes sense. However, after the ZK cluster has recovered - i.e. all 3 nodes are up and running again - the Solr nodes that failed initially continue to NOT sort themselves out. From what we can see, those Solr nodes simply never attempt to reconnect to ZK, so they never discover that ZK is available again. Is this expected behaviour? And how can we automatically recover from this?

We're using the Bitnami Solr & ZooKeeper Helm charts, and our setup is:
- a ZK StatefulSet with 3 replicas
- a SolrCloud StatefulSet with 3 replicas

The way we've been able to recreate this is as follows:
~~
- Scale the ZK cluster down to 1 replica and verify that it is now unhappy and not serving requests.
- Verify that all 3 SolrCloud pods notice that ZK isn't happy. These are the error messages:

  2022-09-13 21:37:01.741 WARN (main-SendThread(solr-v1-base-zookeeper-headless:2181)) [ ] o.a.z.ClientCnxn Session 0x0 for server solr-v1-base-zookeeper-headless/ 10.64.0.69:2181, Closing socket connection. Attempting reconnect except it is a SessionExpiredException. => EndOfStreamException: Unable to read additional data from server sessionid 0x0, likely server has closed socket

- Delete 2 of the SolrCloud pods and wait for them to start back up. They then complain about not being able to reach ZK:

  org.apache.solr.common.SolrException: Error occurred while loading solr.xml from zookeeper

- The 2 deleted SolrCloud pods stay in a not-ready state, reporting that their cores are not initialised or are shutting down. The single pod that was *not* deleted keeps working as normal.
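For reference, the reproduction steps above correspond roughly to the following commands. The resource names (solr-v1-base-*) and label selector are from our environment and will differ for other Helm release names:

```shell
# Reproduction sketch -- resource names are from our setup; adjust to yours.

# 1. Scale ZK down to 1 replica so the ensemble loses quorum.
kubectl scale statefulset solr-v1-base-zookeeper --replicas=1

# 2. Delete 2 of the 3 SolrCloud pods so they restart while ZK has no quorum.
kubectl delete pod solr-v1-base-solr-0 solr-v1-base-solr-1

# 3. Restore the ZK ensemble and wait for it to report healthy.
kubectl scale statefulset solr-v1-base-zookeeper --replicas=3

# 4. The two restarted Solr pods stay not-ready even though ZK is healthy again.
kubectl get pods -l app.kubernetes.io/name=solr
```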
- Scale the ZK cluster back up to 3 replicas and verify that it is now happy again.
- Observe that the 2 SolrCloud pods that started up while ZK was unhappy continue to NOT become ready, and continue to log the same error:

  javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
  o.a.s.s.SolrDispatchFilter Error processing the request. CoreContainer is either not initialized or shutting down.
  o.e.j.s.HttpChannel /solr/admin/info/system => javax.servlet.ServletException: javax.servlet.UnavailableException: Error processing the request. CoreContainer is either not initialized or shutting down.
~~

So what I'm trying to verify: it looks like Solr doesn't attempt to reconnect to ZK if it failed to connect at startup. Is that intentional? Is there a way to get it to do so? Thank you!
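In case it helps frame the "automatically recover" part: one workaround we've sketched (untested - this is our guess, not something the chart ships) is a liveness probe against the same endpoint that is failing above, so Kubernetes restarts a Solr container that stays stuck in the CoreContainer-not-initialized state. The path and port are from our deployment; the thresholds would need tuning so a normal slow startup isn't killed prematurely:

```yaml
# Sketch of a possible workaround (untested): restart the Solr container
# when /solr/admin/info/system keeps failing. Values are assumptions for
# our chart and would need tuning.
livenessProbe:
  httpGet:
    path: /solr/admin/info/system
    port: 8983
  initialDelaySeconds: 60
  periodSeconds: 15
  failureThreshold: 8
```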
