[ https://issues.apache.org/jira/browse/ZOOKEEPER-932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Vishal K updated ZOOKEEPER-932: ------------------------------- Attachment: ZOOKEEPER-932.patch Hi Flavio, I have attached a diff that moves all blocking read/write calls to SendWorker and RecvWorker threads. The implementation is pretty much adheres to what I had proposed. The basic idea is to perform the logic if determining which connection to keep (depending on server id) in SenderWorker thread. Listener does not perform any IO. Therefore, it is free to accept connections from other peers. Earlier, if a peer dies after establishing the connection Listener used to block on a read. Also, toSend() used to serially connect to remote peers. This used to slow down FLE. I have done some basic testing of the code. Before I do some more testing, I wanted to send out the diff and see if you are ok with these changes. The patch that I had submitted for ZOOKEEPER-900 eliminates use of SocketChannel. This attached diff eliminates the need to perform blocking IO outside of the worker threads. Let know if you have any comments. Thanks. -Vishal > Move blocking read/write calls to SendWorker and RecvWorker Threads > ------------------------------------------------------------------- > > Key: ZOOKEEPER-932 > URL: https://issues.apache.org/jira/browse/ZOOKEEPER-932 > Project: Zookeeper > Issue Type: Sub-task > Reporter: Vishal K > Assignee: Vishal K > Fix For: 3.4.0 > > Attachments: ZOOKEEPER-932.patch > > > Copying relevant comments: > Vishal K added a comment - 02/Nov/10 02:09 PM > Hi Flavio, > I have a suggestion for changing the blocking IO code in QuorumCnxManager. It > keeps the current code structure and requires a small amount of changes. I am > not sure if these comments should go in ZOOKEEPER-901. ZOOKEEPER-901 is > probably addressing netty as well. Please feel free to close this JIRA if you > intend to make all the changes as a part of ZOOKEEPER-901. > Basically we jusy need to move parts of initiateConnection and > receiveConnection to SenderWorker and ReceiveWorker. > A. Current flow for receiving connection: > 1. accept connection in Listener.run() > 2. receiveConnection() > * Read remote server's ID > * Take action based on my ID and remote server's ID (disconnect and > reconnect if my ID is > remote server's ID). > * kill current set of SenderWorker and ReciveWorker threads > * Start a new pair > B Current flow for initiating connection: > 1. In connectOne(), connect if not already connected. else return. > 2. send my ID to the remote server > 3. if my ID < remote server disconnect and return > 4. if my ID > remote server > * kill current set of SenderWorker and ReceiveWorkter threads for the > remote server > * Start a new pair > Proposed changes: > Move the code that performs any blocking IO in SenderWorker and ReceiveWorker. > A. Proposed flow for receiving connection: > 1. accept connection in Listener.run() > 2. receiveConnection() > * kill current set of SenderWorker and ReciveWorker threads > * Start a new pair > Proposed changed to SenderWorker: > * Read remote server's ID > * Take action based on my ID and remote server's ID (disconnect and > reconnect if my ID is > remote server's ID). > * Proceed to normal operation > B Proposed flow for initiating connection: > 1. in connectOne(), return if already connected > 2. Start a new SenderWorker and ReceiveWorker pair > 2. In SenderWorker > * connect to remote server > * write my ID > * if my ID < remote server disconnect and return (shutdown the pair). > * Proceed to normal operation > Questions: > * In QuorumCnxManager, is it necessary to kill the current pair and > restart a new one every time we receive a connect request? > * In receiveConnection we may choose to reject an accepted connection if > a thread in > SenderWorker is in the process of connecting. Otherwise a server with > ID < > remote server may keep sending frequent connect request that will > result in the > remote server closing connections for this peer. But I think we add a > delay > before sending notifications, which might be good enough to prevent this > problem. > Let me know what you think about this. I can also help with the > implementation. > Flavio Junqueira added a comment - 03/Nov/10 05:28 PM > Hi Vishal, I like your proposal, it seems reasonable and not difficult to > implement. > On your questions: > 1. I don't think it is necessary to kill a pair SenderWorker/RecvWorker > every time, and I'd certainly support changing it; > 2. I'm not sure where you're suggesting to introduce a delay. In the FLE > code, a server sends a new batch of notifications if it changes its vote or > if it times out waiting for a new notification. This timeout value increases > over time. I was actually thinking that we should reset the timeout value > upon receiving a notification. I think this is a bug.... > Given that it is your proposal, I'd be happy to let you take a stab at it and > help you out if you need a hand. Does it make sense for you? -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.