Just had an odd scenario in our current Solr system (4.3.0 + the SOLR-4829
patch): 4 shards, 2 replicas (leader + 1 other) per shard, spread across 8
machines.

We sent all our updates into a single instance, and we shut down a leader
for maintenance, expecting it to fail over to the other replica.  What I saw
was that when the shard leader went down, the instance taking updates
started seeing rejections almost instantly, yet the cluster state changes
didn't occur for several seconds.  During that time, we had no valid leader
for one of our shards, so we were losing updates and queries.

(shard4 leader)
07:10:33,124 - xxxxxx4 (shard 4 leader) starts coming down.
07:10:35,885 - cluster state change is detected
07:10:37,172 - nsrchnj4 publishes itself as down
07:10:37,869 - second cluster state change detected
07:10:40,202 - closing searcher
07:10:43,447 - cluster state change (live_nodes)

(instance taking updates)
07:10:33,443 - starts seeing rejections from xxxxxx4
07:10:35,937 - detects a cluster state change (red herring)
07:10:37,899 - detects another cluster state change
07:10:43,478 - detects a live_nodes change (as shard4 leader is really down
now)
07:10:44,586 - detects that shard4 has no leader anymore

(xxxxx8 - new shard4 leader)

07:10:32,981 - last update received FROMLEADER (xxxxxx4)
07:10:35,980 - cluster state change detected (red herring)
07:10:37,975 - another cluster state change detected
07:10:43,868 - running election process(!)
07:10:44,069 - nsrchnj8 becomes leader, tries to sync from nsrchnj4 (which
is already rejecting requests).

My question is: what should happen during a leader transition?  As I
understand it, the leader publishes that it is DOWN and waits until it sees
the response (by effectively waiting for cluster state messages), so by the
time it starts to shut down its own readers/writers, the cluster should be
aware that it is unavailable...  The fact that our update node took 11s, and
had to wait for the live_nodes change in order to detect that it no longer
had a leader for shard4, seems like a real hole?
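
For anyone who wants to see this from the outside, here is a minimal SolrJ
sketch (assuming the 4.x ZkStateReader API; the ZK address is a placeholder,
collection/shard names are just our setup) that asks the cluster state who
the shard4 leader is:

import org.apache.solr.common.cloud.ClusterState;
import org.apache.solr.common.cloud.Replica;
import org.apache.solr.common.cloud.ZkStateReader;

public class LeaderCheck {
    public static void main(String[] args) throws Exception {
        // Same ZK ensemble the cluster uses; host is a placeholder.
        ZkStateReader reader = new ZkStateReader("zkhost:2181", 30000, 15000);
        try {
            reader.createClusterStateWatchersAndUpdate();
            ClusterState state = reader.getClusterState();
            // Returns null while the shard has no registered leader.
            Replica leader = state.getLeader("collection1", "shard4");
            System.out.println("shard4 leader: "
                    + (leader == null ? "NONE" : leader.getName()));
        } finally {
            reader.close();
        }
    }
}

For most of that ~11s window the cluster state presumably still pointed at
xxxxxx4, even though it was already rejecting requests.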

From what I am seeing here though, it is like Jetty has shut down its
HTTP interface before any of that happens, so the instance taking updates
can't communicate with it; we see a bunch of errors like this:

2013-06-24 07:10:33,443 ERROR [qtp2128911821-403089]
o.a.s.u.SolrCmdDistributor [SolrException.java:129] forwarding update to
http://xxxxxx4:10600/solr/collection1/ failed - retrying ...
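
The only client-side mitigation I can see while the election sorts itself
out is to retry around that window.  A rough sketch with SolrJ's
CloudSolrServer (the ZK address, document and retry timings are just
placeholders, not what we actually run):

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.CloudSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class RetryingIndexer {
    public static void main(String[] args) throws Exception {
        CloudSolrServer solr = new CloudSolrServer("zkhost:2181");
        solr.setDefaultCollection("collection1");
        solr.connect();

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "example-1");

        // Retry with a short back-off to ride out the leaderless window
        // (we saw ~11s between the first rejection and the new leader).
        int attempts = 0;
        while (true) {
            try {
                solr.add(doc);
                break;
            } catch (SolrServerException e) {
                if (++attempts >= 5) throw e;
                Thread.sleep(3000);
            }
        }
        solr.shutdown();
    }
}

That only papers over the symptom though; it doesn't explain why the window
exists in the first place.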

This is with Solr 4.3.0 + the patch for SOLR-4829.  I couldn't find this in
any list of existing issues, and I thought we'd seen valid leader swaps
before, so is this a very specific scenario we've hit?  I can get full logs
and such, and will see how reproducible it is.

Surely, Jetty shouldn't shut down the HTTP interface until Solr has
stopped?  Or are we doing our shutdowns wrong?  (We are just using the
"--stop" option on Jetty.)
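
For what it's worth, Jetty 8 (which Solr 4.3 ships with) does have
stop-related knobs on the Server object itself; I assume these are what the
example jetty.xml can also set via <Set name="...">, though I haven't
checked whether they change the ordering relative to Solr's own shutdown.
A minimal embedded sketch just to show the settings I mean (the war path
and port are placeholders):

import org.eclipse.jetty.server.Server;
import org.eclipse.jetty.webapp.WebAppContext;

public class EmbeddedSolrJetty {
    public static void main(String[] args) throws Exception {
        Server server = new Server(10600);

        // Deploy the Solr war; path is a placeholder.
        server.setHandler(new WebAppContext("/path/to/solr.war", "/solr"));

        // Jetty 7/8 Server settings:
        server.setStopAtShutdown(true);    // stop cleanly when the JVM exits
        server.setGracefulShutdown(5000);  // let in-flight requests finish (ms)

        server.start();
        server.join();
    }
}

Even so, graceful shutdown only drains in-flight requests; as far as I can
tell it wouldn't keep the HTTP interface up until Solr has published itself
as DOWN, which is really what I'm asking about.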

Cheers, Daniel
