Sorry for such a long post.

We have a 4-node SolrCloud running Solr 8.11.1.  There are 2 nodes in one
AWS region, and 2 nodes in another region.  All nodes are in peered VPCs.
All communications between the nodes are direct IP calls (no DNS).  One
node in each region holds replicas of multiple collections (single-shard
collections).  The other 2 nodes (1 in each region) are empty.  Why we did
that will become apparent.

ZooKeeper is a 3-node ensemble, with 1 node in each of the two SolrCloud
regions and a 3rd node in a completely different region.

We are having issues with very high query latencies.  Restarting Solr
sometimes resolves them and sometimes makes them worse.  A lot worse.
Sometimes a restart improves things, but then latency suddenly goes bad
again.

Through a serendipitous side investigation of a blank Tree in the Admin UI,
we found that the .../solr/admin/zookeeper call takes anywhere from
milliseconds to 60 seconds.  The latency is perfectly correlated with which
ZooKeeper node that particular Solr node is "attached" to: milliseconds for
the same region, 10 seconds for the other Solr region, and 60 seconds for
the outlier ZK region.
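
For anyone who wants to reproduce the measurement, here is a minimal sketch
of how the call can be timed per node.  The base URLs passed on the command
line are placeholders, the helper names are made up for illustration, and
any query parameters the Admin UI adds to the request are left out.

```python
import sys
import time
import urllib.request


def time_call(fetch, repeats=3):
    """Call fetch() `repeats` times and return each wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.monotonic()
        fetch()
        durations.append(time.monotonic() - start)
    return durations


def zk_tree_fetcher(base_url, timeout=90):
    """Build a zero-argument function that requests <base_url>/admin/zookeeper.

    base_url is a placeholder such as "http://10.0.0.1:8983/solr".
    The timeout is set above the 60 seconds observed for the slowest case.
    """
    url = base_url.rstrip("/") + "/admin/zookeeper"

    def fetch():
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            resp.read()  # drain the body so the full response time is measured

    return fetch


if __name__ == "__main__":
    # Usage: python zk_timing.py http://<node-1>:8983/solr http://<node-2>:8983/solr
    for base in sys.argv[1:]:
        times = time_call(zk_tree_fetcher(base))
        print(base, ["%.3f" % t for t in times])
```

Running it against each Solr node in turn gives one row of timings per node,
which can then be lined up against the ZK node each one is connected to.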

Seems like some network issue, yes?  I agree, but I'm trying to convince
our network engineers that it's something inherent in Solr or ZooKeeper.

The odd thing is that the query latencies seem to hinge on whether the node
which receives the query actually has at least 1 shard for the queried
collection.  We deployed Dynatrace agents to see what might be happening,
but all they show is long waits in ZkStateReader$LazyCollectionRef.get,
and only when the node doesn't have the collection being queried.

So I'd like to understand better the difference in how Solr manages these
collection configs when the collection is resident or not.
LazyCollectionRef seems to be used when the collection isn't resident on
the node, and its cache timeout is 2 seconds
(solr.OverseerStateUpdateDelay). Do
resident collections run down a different code path?  I ran across this old
change: https://issues.apache.org/jira/browse/SOLR-6629, which seems
related, but only in that it is the father of the current code. If I can
explain with conviction that Solr behaves differently with resident vs
non-resident collections, then I have a path forward to kick networking to
look at this, or suck it up and make sure that every collection is
represented with a shard on each node (which I think is a stupid
work-around especially for small collections, but I gotta do what I gotta
do).
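
To make the question concrete, here is a toy model of the behavior I think
I'm seeing.  This is not Solr's code: the class and parameter names are
invented, and the only number taken from Solr is the 2-second timeout
mentioned above.  The idea is a reference that re-fetches state once its
cached copy is older than the timeout, so any query arriving after the
cache has expired pays the full round trip to ZK.

```python
import time


class LazyRef:
    """Toy time-based lazy cache (illustrative only, not Solr's implementation)."""

    def __init__(self, loader, ttl_seconds=2.0, clock=time.monotonic):
        self._loader = loader        # stands in for a remote state fetch
        self._ttl = ttl_seconds      # assumed 2 s, per solr.OverseerStateUpdateDelay
        self._clock = clock          # injectable so the behavior can be tested
        self._value = None
        self._fetched_at = None

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or now - self._fetched_at > self._ttl
        if expired:
            # The caller blocks here for as long as the remote fetch takes.
            self._value = self._loader()
            self._fetched_at = now
        return self._value
```

If the fetch takes 10 or 60 seconds and the cache only lives for 2, then
nearly every query to a node without the collection would block on it.  If
resident collections are instead kept current some other way, they would
never pay that cost, which is exactly the difference I'd like confirmed.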

Thanks for your attention!
