Alan Woodward created SOLR-6763:
-----------------------------------

             Summary: Shard leader election thread can persist across connection loss
                 Key: SOLR-6763
                 URL: https://issues.apache.org/jira/browse/SOLR-6763
             Project: Solr
          Issue Type: Bug
            Reporter: Alan Woodward


A ZK connection loss during a call to ElectionContext.waitForReplicasToComeUp() 
will result in two leader election processes for the shard running within a 
single node: the initial election that was waiting, and another spawned by the 
ReconnectStrategy.  After waitForReplicasToComeUp() returns, the first election 
will create an ephemeral leader node.  The second election will then also 
attempt to create this node, fail, and try to put itself into recovery.  The 
second election will also set the 'isLeader' value in its CloudDescriptor to 
false.

The first election, meanwhile, is happily maintaining the ephemeral leader 
node.  But any updates that are sent to the shard will cause an exception due 
to the mismatch between the cloudstate (where this node is the leader) and the 
local CloudDescriptor leader state.

I think the fix is straightforward - the call to zkClient.getChildren() in 
waitForReplicasToComeUp should pass 'retryOnReconnect=false', rather than 
'true' as it does currently, because once the connection has dropped we're 
going to launch a new election process anyway.
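
To illustrate the idea, here is a minimal, self-contained sketch of how the 
retry flag decides whether the original election attempt survives a connection 
loss.  It is not the Solr code: FlakyZkClient, ConnectionLossException, 
runElection and the ZK path are all hypothetical stand-ins, and the flag name 
simply follows the wording above.

```java
import java.util.List;

public class ElectionRetrySketch {

    /** Hypothetical stand-in for a ZK connection-loss error. */
    static class ConnectionLossException extends Exception {}

    /** Hypothetical stand-in for the ZK client; NOT the real SolrZkClient. */
    static class FlakyZkClient {
        // The first raw call simulates a dropped connection.
        private boolean failNext = true;

        List<String> getChildren(String path, boolean retryOnReconnect)
                throws ConnectionLossException {
            while (true) {
                if (failNext) {
                    failNext = false;
                    if (!retryOnReconnect) {
                        // Surface the connection loss to the caller.
                        throw new ConnectionLossException();
                    }
                    continue; // swallow the loss and retry after "reconnecting"
                }
                return List.of("replica1", "replica2");
            }
        }
    }

    /** Returns true if this election attempt goes on to claim leadership. */
    static boolean runElection(FlakyZkClient zk, boolean retryOnReconnect) {
        try {
            // Illustrative path only; assumed for this sketch.
            zk.getChildren("/election/shard1", retryOnReconnect);
            // The old attempt survived the connection loss and would now
            // create the ephemeral leader node, racing the new election.
            return true;
        } catch (ConnectionLossException e) {
            // The old attempt aborts; only the election started by the
            // reconnect handling remains.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(runElection(new FlakyZkClient(), true));
        System.out.println(runElection(new FlakyZkClient(), false));
    }
}
```

With the flag set to true the original attempt carries on after the simulated 
connection loss, which is the situation that leaves two elections running; 
with false it stops, leaving a single election.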



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
