Sorry for such a long post. We have a 4-node SolrCloud cluster running Solr 8.11.1. There are 2 nodes in one AWS region and 2 nodes in another region. All nodes are in peered VPCs, and all communication between the nodes uses direct IP calls (no DNS). One node in each region holds replicas of multiple single-shard collections. The other 2 nodes (1 in each region) are empty. Why we did that will become apparent.
ZooKeeper is a 3-node ensemble, with 1 node in each of the two SolrCloud regions and a 3rd node in a completely different region. We are having issues with very high query latencies. Restarting Solr sometimes resolved them and sometimes made them worse. A lot worse. Sometimes a restart improved things, but then latency would suddenly go bad again.
Through a serendipitous side investigation of a blank Tree in the Admin UI, we found that the .../solr/admin/zookeeper call takes anywhere from milliseconds to 10 seconds to 60 seconds. The latencies were perfectly correlated with which ZooKeeper node that particular Solr node was "attached" to: same region, milliseconds; other region, 10 seconds; the outlier ZK region, 60 seconds. Seems like some network issue, yes? I agree, but I'm trying to convince our network engineers that it's something inherent in Solr or ZooKeeper.
The odd thing is that the query latencies seem to hinge on whether the node that receives the query actually has at least 1 replica of the queried collection. We deployed Dynatrace agents to peer into what might be happening, but all I end up seeing is long waits in ZkStateReader$LazyCollectionRef.get, and only when the node doesn't have the collection being queried. So I'd like to understand better the difference in how Solr manages these collection configs when the collection is resident or not. LazyCollectionRef seems to be called when the collection isn't there, and the timeout for the cache is 2 seconds (solr.OverseerStateUpdateDelay). Do resident collections run down a different code path? I ran across this old change: https://issues.apache.org/jira/browse/SOLR-6629, which seems related, but only in that it is the father of the current code.
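
To make the question concrete, here is a toy Python model of how I currently picture the lazy path. It is only a sketch of my assumption, not Solr's actual code: the names LazyRef, FakeClock and simulate are made up for illustration, the "fetch" is a stand-in for a blocking ZooKeeper read, and the only numbers taken from our setup are the 2-second cache timeout and the 10-second cross-region latency. It says nothing about how resident collections are handled, which is exactly what I'm asking.

```python
import time


class LazyRef:
    """Toy time-limited cache; a hypothetical stand-in, not Solr's LazyCollectionRef."""

    def __init__(self, fetch, ttl_seconds=2.0, clock=time.monotonic):
        self._fetch = fetch        # hypothetical blocking read (e.g. from ZooKeeper)
        self._ttl = ttl_seconds    # assumed cache lifetime, 2 s as in the post
        self._clock = clock
        self._value = None
        self._fetched_at = None
        self.fetch_count = 0

    def get(self):
        now = self._clock()
        expired = self._fetched_at is None or (now - self._fetched_at) > self._ttl
        if expired:
            # The caller waits for the full round trip whenever the cache is stale.
            self._value = self._fetch()
            self._fetched_at = self._clock()
            self.fetch_count += 1
        return self._value


class FakeClock:
    """Manually advanced clock so the simulation runs instantly."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


def simulate(zk_latency, n_queries, gap, ttl=2.0):
    """Return the wait (in seconds) each query spends inside LazyRef.get()."""
    clock = FakeClock()

    def fetch():
        clock.advance(zk_latency)  # simulate the blocking round trip
        return "collection-state"

    ref = LazyRef(fetch, ttl, clock)
    waits = []
    for _ in range(n_queries):
        start = clock()
        ref.get()
        waits.append(clock() - start)
        clock.advance(gap)         # idle time until the next query arrives
    return waits


if __name__ == "__main__":
    # 10 s round trip, one query per second
    print(simulate(zk_latency=10.0, n_queries=4, gap=1.0))
```

With a 10-second round trip and one query per second, this prints [10.0, 0.0, 0.0, 10.0]: under this model, any query that arrives after the 2-second cache has expired pays the full ZooKeeper latency, which would match the intermittent long waits I see on the nodes that don't host the collection.
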
If I can explain with conviction that Solr behaves differently with resident vs non-resident collections, then I have a path forward: either kick networking to look at this, or suck it up and make sure that every collection has a replica on each node (which I think is a stupid workaround, especially for small collections, but I gotta do what I gotta do). Thanks for your attention!