Flavio Paiva Junqueira commented on ZOOKEEPER-140:

It seems to me that there are two unnecessary synchronized blocks: one in 
sendTo() around the call to initiateConnection, and a second where a new 
connection is accepted and receiveConnection is subsequently called. Both 
methods synchronize again on senderWorkerMap when it is time to update the 
bookkeeping information on the connections. By removing these two blocks, we 
avoid the deadlock described in this jira. I have tested the change and it 
seems to work, and the logic also looks correct to me.
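
To make the idea concrete, here is a minimal sketch of the locking structure 
described above. It is not the actual QuorumCnxManager code: the class 
CnxManagerSketch, the latch standing in for the challenge exchange, the String 
values in the map and the runRing() helper are all assumptions made for 
illustration. The only points taken from this issue are that the blocking 
handshake runs without an outer lock and that senderWorkerMap is locked only 
for the bookkeeping update.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical, simplified stand-in for QuorumCnxManager (not the real class).
class CnxManagerSketch {
    // Bookkeeping of connections keyed by peer id; the String value is a
    // placeholder for whatever worker object the real map holds.
    private final Map<Long, String> senderWorkerMap = new HashMap<>();

    // No outer lock is held here, so a blocked handshake cannot stop
    // receiveConnection from running on the same server.
    void initiateConnection(long peerId, CountDownLatch challengeAnswered)
            throws InterruptedException {
        // Stands in for the blocking s.read(msgBuffer) on the challenge.
        if (!challengeAnswered.await(5, TimeUnit.SECONDS)) {
            throw new IllegalStateException("peer never answered the challenge");
        }
        register(peerId, "outgoing");
    }

    // Answers the remote challenge, then records the connection.
    void receiveConnection(long peerId, CountDownLatch challengeAnswered) {
        challengeAnswered.countDown();
        register(peerId, "incoming");
    }

    // The only synchronized region: the short bookkeeping update.
    private void register(long peerId, String direction) {
        synchronized (senderWorkerMap) {
            senderWorkerMap.putIfAbsent(peerId, direction);
        }
    }

    // Replays the A -> B -> C -> A scenario; returns true if every
    // handshake completes within the timeout.
    static boolean runRing() {
        CnxManagerSketch[] servers = {
            new CnxManagerSketch(), new CnxManagerSketch(), new CnxManagerSketch()
        };
        // answered[i] is released once server i's challenge has been answered.
        CountDownLatch[] answered = {
            new CountDownLatch(1), new CountDownLatch(1), new CountDownLatch(1)
        };
        CountDownLatch done = new CountDownLatch(6);
        ExecutorService pool = Executors.newFixedThreadPool(6);
        for (int i = 0; i < 3; i++) {
            final int self = i;
            final int next = (i + 1) % 3;
            // Initiator thread on server "self".
            pool.execute(() -> {
                try {
                    servers[self].initiateConnection(next, answered[self]);
                    done.countDown();
                } catch (Exception e) {
                    // Leave "done" uncounted so runRing() reports failure.
                }
            });
            // Listener thread on server "next".
            pool.execute(() -> {
                servers[next].receiveConnection(self, answered[self]);
                done.countDown();
            });
        }
        try {
            return done.await(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            pool.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("ring completed: " + runRing());
    }
}
```

If initiateConnection and receiveConnection were instead wrapped in one lock 
per server, each initiator in the ring would hold its server's lock while 
waiting for a peer whose listener needs that peer's lock, which is the cycle 
reported in the description below.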

I will postpone submitting a patch because I'd like to have a patch for 127 
reviewed and committed first. 

> Deadlock in QuorumCnxManager
> ----------------------------
>                 Key: ZOOKEEPER-140
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-140
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Flavio Paiva Junqueira
> Frequently the servers deadlock in QuorumCnxManager:initiateConnection on
> s.read(msgBuffer) when reading the challenge from the peer.
> Calls to initiateConnection and receiveConnection are synchronized, so only 
> one or the other can be executing at a time. This prevents two connections 
> from opening between the same pair of servers.
> However, it seems that this leads to deadlock, as in this scenario:
> {noformat}
> A (initiate --> B)
> B (initiate --> C)
> C (initiate --> A)
> {noformat}
> initiateConnection can only complete when receiveConnection runs on the 
> remote peer and answers the challenge. If all servers are blocked in 
> initiateConnection, receiveConnection never runs and leader election halts.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
