Hello Solr Community,

We recently experienced a load incident on a 10-node SolrCloud cluster and are trying to understand whether Solr provides a way to stop routing traffic to replicas or nodes that are already under stress before they destabilise the cluster.

Environment

- Solr 9.6.1 / Lucene 9.10.0
- Java 17, G1GC, MaxGCPauseMillis=250
- Heap: 12 GB (-Xms12g -Xmx12g)
- 10 Solr nodes on GCP
- 3-node ZooKeeper ensemble
- allowPartialResults=true

Node hardware

- 16 vCPU
- 48 GB RAM
- No swap
- Linux (RHEL 9)

Collections

search_collection_a
- 63 shards, 66 replicas
- ~173 GB index
- ~96M documents

search_collection_b
- 63 shards, 63 replicas
- ~205 GB index
- ~95M documents

Each node hosts roughly 5-8 shards from each collection.

Incident Summary

During a period of elevated query load, several nodes became unstable, exhibiting increased latency, GC pressure, and thread pool saturation.

We enabled the Solr CPU Circuit Breaker with a threshold of 85%, expecting it to shed load from overloaded nodes. Instead:

- The cluster began returning a large number of HTTP 429 responses.
- CPU utilisation observed at the OS level remained well below the configured threshold.
- We eventually disabled the Circuit Breaker because it appeared to be worsening availability.

Our working theory is that GC pressure and request backlog may have been the primary bottlenecks rather than raw CPU utilisation, but we're trying to understand whether Solr has built-in mechanisms for handling this scenario.

Questions

1. Replica/node exclusion based on health

Is there a built-in mechanism in SolrCloud 9.x to temporarily avoid routing requests to replicas that are degraded (for example: high latency, GC pressure, thread pool saturation, or slow responses)?

2. Latency-aware or load-aware replica selection

Can replica selection be influenced by observed response latency or node load so that healthy replicas are preferred, and slow replicas are deprioritised?

3. Circuit Breaker behavior

Has anyone observed CPU Circuit Breakers triggering while node-level CPU utilisation appears to remain below the configured threshold?
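
To make question 3 concrete, here is a minimal sketch of one way the two signals could diverge on a node. It rests on an assumption we have not verified: that the value the breaker compares against its threshold behaves more like a per-core load average than like the utilisation percentage shown by OS tools. On Linux, load average counts tasks in uninterruptible sleep as well as runnable ones, so it can be high while CPU utilisation stays moderate. The helper names `load_per_core` and `exceeds_threshold` are hypothetical and are not part of Solr.

```python
import os


def load_per_core(load_avg: float, cores: int) -> float:
    """Normalise a load average by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_avg / cores


def exceeds_threshold(load_avg: float, cores: int, threshold_pct: float) -> bool:
    """Return True if the per-core load, as a percentage, reaches the threshold.

    This is an illustrative comparison only; it is NOT Solr's implementation.
    """
    return load_per_core(load_avg, cores) * 100.0 >= threshold_pct


if __name__ == "__main__":
    # Read the 1-minute load average from the OS (Unix-like systems only).
    one_minute, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    pct = load_per_core(one_minute, cores) * 100.0
    print(f"1-min load average: {one_minute:.2f} on {cores} cores ({pct:.1f}% per core)")
    print("would exceed an 85% threshold:", exceeds_threshold(one_minute, cores, 85.0))
```

For example, on a 16-vCPU node a 1-minute load average of 14 corresponds to 87.5% per core, which would exceed an 85% threshold under this assumption even if utilisation reported by the OS were lower.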

Are there additional Circuit Breakers in Solr 9.x or an external library that are generally more effective than CPU thresholds for detecting overload caused by GC pressure or resource contention?

Any guidance, documentation references, JIRA issues, or production experience would be greatly appreciated.

Thank you
Harshit Sharma
