[ 
https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12927518#action_12927518
 ] 

Vishal K commented on ZOOKEEPER-900:
------------------------------------

Hi Flavio,

I have a suggestion for changing the blocking IO code in QuorumCnxManager. It 
keeps the current code structure and requires a small amount of changes. I am 
not sure if these comments should go in ZOOKEEPER-901. ZOOKEEPER-901 is 
probably addressing netty as well. Please feel free to close this JIRA if you 
intend to make all the changes as a part of ZOOKEEPER-901.

Basically we jusy need to move parts of initiateConnection and 
receiveConnection to SenderWorker and ReceiveWorker.

A. Current flow for receiving connection:
1. accept connection in Listener.run()
2. receiveConnection()
        - Read remote server's ID
        - Take action based on my ID and remote server's ID (disconnect and 
reconnect if my ID is > remote server's ID).
        - kill current set of SenderWorker and ReciveWorker threads
        - Start a new pair

B Current flow for initiating connection:
1. In connectOne(), connect if not already connected. else return.
2. send my ID to the remote server
3. if my ID < remote server disconnect and return
4. if my ID > remote server
        - kill current set of SenderWorker and ReceiveWorkter threads for the 
remote server
        - Start a new pair


Proposed changes:
Move the code that performs any blocking IO in SenderWorker and ReceiveWorker.

A. Proposed flow for receiving connection:
1. accept connection in Listener.run()
2. receiveConnection()
        - kill current set of SenderWorker and ReciveWorker threads
        - Start a new pair

Proposed changed to SenderWorker:
        - Read remote server's ID
        - Take action based on my ID and remote server's ID (disconnect and 
reconnect if my ID is > remote server's ID).
        - Proceed to normal operation


B Proposed flow for initiating connection:
1. in connectOne(), return if already connected
2. Start a new SenderWorker and ReceiveWorker pair
2. In SenderWorker
        - connect to remote server
        - write my ID
        - if my ID < remote server disconnect and return (shutdown the pair).
        - Proceed to normal operation


Questions:
- In QuorumCnxManager, is it necessary to kill the current pair and restart a 
new one every time we receive a connect request?
- In receiveConnection we may choose to reject an accepted connection if a 
thread in
SenderWorker is in the process of connecting. Otherwise a server with ID <
remote server may keep sending frequent connect request that will result in the
remote server closing connections for this peer. But I think we add a delay
before sending notifications, which might be good enough to prevent this
problem.

Let me know what you think about this. I can also help with the implementation.

-Vishal


> FLE implementation should be improved to use non-blocking sockets
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-900
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Vishal K
>            Assignee: Flavio Junqueira
>            Priority: Critical
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which 
> does a blocking connect. While testing, I changed the code so that 
> connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run() 
> does a socketChannel.connect(). After starting AsyncConnect, connectOne 
> starts a timer. connectOne continues with normal operations if the connection 
> is established before the timer expires, otherwise, when the timer expires it 
> interrupts AsyncConnect() thread and returns. In this way, I can have an 
> upper bound on the amount of time we need to wait for connect to succeed. Of 
> course, this was a quick fix for my testing. Ideally, we should use Selector 
> to do non-blocking connects/accepts. I am planning to do that later once we 
> at least have a quick fix for the problem and consensus from others for the 
> real fix (this problem is big blocker for us). Note that it is OK to do 
> blocking IO in SenderWorker and RecvWorker threads since they block IO to the 
> respective !
 peer.
> b) The blocking IO problem is not just restricted to connectOne(), but also 
> in receiveConnection(). The Listener thread calls receiveConnection() for 
> each incoming connection request. receiveConnection does blocking IO to get 
> peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the 
> peer that had sent the connection request. All of this is happening from the 
> Listener. In short, if a peer fails after initiating a connection, the 
> Listener thread won't be able to accept connections from other peers, because 
> it would be stuck in read() or connetOne(). Also the code has an inherent 
> cycle. initiateConnection() and receiveConnection() will have to be very 
> carefully synchronized otherwise, we could run into deadlocks. This code is 
> going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Reply via email to