Just had an odd scenario in our current Solr system (4.3.0 + SOLR-4829 patch): 4 shards, 2 replicas (leader + 1 other) per shard, spread across 8 machines.
We were sending all our updates into a single instance, and we shut down a leader for maintenance, expecting leadership to fail over to the other replica. What I saw was that when the shard leader went down, the instance taking the updates started seeing rejections almost instantly, yet the cluster state changes didn't come through for several seconds. During that window we had no valid leader for one of our shards, so we were losing updates and queries.

(shard4 leader, xxxxxx4)
07:10:33,124 - xxxxxx4 (shard4 leader) starts coming down
07:10:35,885 - cluster state change is detected
07:10:37,172 - nsrchnj4 publishes itself as down
07:10:37,869 - second cluster state change detected
07:10:40,202 - closing searcher
07:10:43,447 - cluster state change (live_nodes)

(instance taking updates)
07:10:33,443 - starts seeing rejections from xxxxxx4
07:10:35,937 - detects a cluster state change (red herring)
07:10:37,899 - detects another cluster state change
07:10:43,478 - detects a live_nodes change (as the shard4 leader is really down now)
07:10:44,586 - detects that shard4 has no leader any more

(xxxxx8 - new shard4 leader)
07:10:32,981 - last FROMLEADER request (from xxxxxx4)
07:10:35,980 - cluster state change detected (red herring)
07:10:37,975 - another cluster state change detected
07:10:43,868 - running election process(!)
07:10:44,069 - nsrchnj8 becomes leader, tries to sync from nsrchnj4 (which is already rejecting requests)

My question is: what should happen during a leader transition? As I understand it, the leader publishes that it is DOWN and waits until it sees that state come back (by effectively waiting for the cluster state messages), so by the time it starts shutting down its own readers/writers the cluster should already know it is unavailable... The fact that our update node took ~11s, and had to wait for the live_nodes change before it could detect that it had no leader for shard4, seems like a real hole.

From what I am seeing here, though, it is as if Jetty has shut down its HTTP interface before any of that happens, so the instance taking updates can't communicate with it at all; we see a bunch of errors like this:

2013-06-24 07:10:33,443 ERROR [qtp2128911821-403089] o.a.s.u.SolrCmdDistributor [SolrException.java:129] forwarding update to http://xxxxxx4:10600/solr/collection1/ failed - retrying ...

This is with Solr 4.3.0 + the patch for SOLR-4829. I couldn't find this in any list of existing issues, and I thought we'd seen valid leader swaps before, so is this a very specific scenario we've hit? I can get full logs and such, and will see how reproducible it is.

Surely Jetty shouldn't shut down the interface until Solr has stopped? Or are we doing our shutdowns wrong (we are just using the "--stop" option on Jetty)?

Cheers,
Daniel
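
P.S. To be explicit about how we take a node down: our stop script is essentially the plain Jetty stop shown first below. The CoreAdmin UNLOAD variant after it is only an idea we haven't tried, and I'm not sure how it interacts with core persistence on restart; the stop port/key and core name are placeholders for our real values:

    # Current shutdown: Jetty's stop mechanism, which tells the running
    # start.jar instance to exit (STOP.PORT/STOP.KEY are placeholders).
    java -DSTOP.PORT=8079 -DSTOP.KEY=stopkey -jar start.jar --stop

    # Possible gentler sequence (untested by us): unload the core first, so the
    # node gives up its shard leadership while its HTTP interface is still up,
    # then stop Jetty once the other replica has taken over.
    curl "http://xxxxxx4:10600/solr/admin/cores?action=UNLOAD&core=collection1"
    java -DSTOP.PORT=8079 -DSTOP.KEY=stopkey -jar start.jar --stop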
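
P.P.S. For reproducing it, I'm planning something like the loop below from a separate box: fire a tiny JSON update at the instance taking updates every ~200ms while we --stop the shard4 leader, and record the timestamps of the failures. The host/port, field list and interval are placeholders to adjust for our setup and schema:

    # Hypothetical reproduction harness: post a small JSON doc every 200ms and
    # log the HTTP status, so we can line the failures up against the
    # cluster-state timestamps in the Solr logs.
    while true; do
      ts=$(date +"%H:%M:%S,%3N")
      code=$(curl -s -o /dev/null -w "%{http_code}" \
        -H 'Content-Type: application/json' \
        -d '[{"id":"repro-'"$RANDOM"'"}]' \
        "http://<update-node>:10600/solr/collection1/update")
      echo "$ts $code"
      sleep 0.2
    done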