Hi Mahadev,

To be honest, yes, we need the quick fix. I am genuinely surprised that no
one else is seeing this problem; there is nothing special about our setup.
If you look at the JIRA, I have posted logs from various setups (different
OSes, physical machines, virtual machines, etc.). Also, the bug is evident
from the code. Pretty much every developer on our team has hit it.
Now, we have an application that is highly time-sensitive. Maybe most of
the applications running on ZK today can tolerate 60-80 seconds of FLE
convergence; for us, such long delays (under normal conditions) are not
acceptable. It would be nice if people could provide feedback on how
time-sensitive their applications are. Is a 60-80 second delay in FLE
acceptable? What has been your experience running ZK in production? How
often do you have leader reboots? Feedback will be greatly appreciated.

Thanks.
-Vishal

On Fri, Sep 3, 2010 at 1:44 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote:

> Hi Vishal,
>  Thanks for picking this up. My comments are inline:
>
> On 9/2/10 3:31 PM, "Vishal K" <vishalm...@gmail.com> wrote:
>
> > Hi All,
> >
> > I had posted this message as a comment on ZOOKEEPER-822. I thought it
> > might be a good idea to give it wider attention so that it will be
> > easier to collect feedback.
> >
> > I found a few problems in the FLE implementation while debugging for
> > https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the
> > email below might require some background; if necessary, please browse
> > the JIRA. I have a patch for 1. a) and 2). I will send them out soon.
> >
> > 1. Blocking connects and accepts:
> >
> > a) The first problem is in manager.toSend(). This invokes connectOne(),
> > which does a blocking connect. While testing, I changed the code so
> > that connectOne() starts a new thread called AsyncConnect.
> > AsyncConnect.run() does a socketChannel.connect(). After starting
> > AsyncConnect, connectOne() starts a timer. connectOne() continues with
> > normal operations if the connection is established before the timer
> > expires; otherwise, when the timer expires it interrupts the
> > AsyncConnect thread and returns. This puts an upper bound on the amount
> > of time we wait for a connect to succeed. Of course, this was a quick
> > fix for my testing. Ideally, we should use a Selector to do
> > non-blocking connects/accepts. I am planning to do that later, once we
> > at least have a quick fix for the problem and consensus from others on
> > the real fix (this problem is a big blocker for us). Note that it is OK
> > to do blocking IO in the SendWorker and RecvWorker threads, since they
> > block on IO to their respective peer.
>
> Vishal, I am really concerned about starting up new threads in the
> server. We really need a total revamp of this code (using NIO and a
> selector). Is the quick fix really required? Zookeeper servers have been
> running in production for a while, and this problem hasn't been noticed
> by anyone. Shouldn't we fix it with NIO then?
>
> > b) The blocking IO problem is not restricted to connectOne(); it also
> > affects receiveConnection(). The Listener thread calls
> > receiveConnection() for each incoming connection request.
> > receiveConnection() does blocking IO to get the peer's info
> > (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer
> > that sent the connection request. All of this happens on the Listener
> > thread. In short, if a peer fails after initiating a connection, the
> > Listener thread won't be able to accept connections from other peers,
> > because it will be stuck in read() or connectOne(). The code also has
> > an inherent cycle: initiateConnection() and receiveConnection() will
> > have to be very carefully synchronized, otherwise we could run into
> > deadlocks. This code is going to be difficult to maintain/modify.
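To make the quick fix in 1. a) concrete, here is a simplified sketch of the
AsyncConnect idea. This is illustrative only, not the actual patch;
connectWithTimeout is a made-up name for the logic that would live in
connectOne():

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SocketChannel;

class AsyncConnect extends Thread {
    private final SocketChannel channel;
    private final InetSocketAddress addr;
    volatile boolean connected = false;

    AsyncConnect(SocketChannel channel, InetSocketAddress addr) {
        this.channel = channel;
        this.addr = addr;
    }

    @Override
    public void run() {
        try {
            // Blocking connect, but isolated in this thread. SocketChannel
            // is an InterruptibleChannel, so interrupt() closes the channel
            // and unblocks a pending connect with an exception.
            channel.connect(addr);
            connected = true;
        } catch (IOException e) {
            // Connect failed or was interrupted; caller checks 'connected'.
        }
    }

    // What connectOne() would do: wait a bounded time, then give up.
    static boolean connectWithTimeout(SocketChannel ch,
                                      InetSocketAddress addr,
                                      long timeoutMs)
            throws InterruptedException {
        AsyncConnect ac = new AsyncConnect(ch, addr);
        ac.start();
        ac.join(timeoutMs);   // plays the role of the timer
        if (!ac.connected) {
            ac.interrupt();   // abandon a slow or unreachable peer
        }
        return ac.connected;
    }
}

With this, callers never block longer than timeoutMs on a dead peer, at
the cost of one short-lived thread per connect attempt (which is exactly
the concern you raised).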
> > 2. Buggy senderWorkerMap handling:
> >
> > The code that manages senderWorkerMap is very buggy. It is causing
> > multiple election rounds. While debugging, I found that sometimes after
> > FLE a node will have its senderWorkerMap empty even though it has
> > SendWorker and RecvWorker threads for each peer.
>
> It would be great to clean it up!! I'd be happy to see this class
> cleaned up! :)
>
> > a) The receiveConnection() method calls the finish() method, which
> > removes an entry from the map. Additionally, the thread itself calls
> > finish(), which could remove the newly added entry from the map. In
> > short, receiveConnection() is causing the exact condition that you
> > mentioned above.
> >
> > b) Apart from the bug in finish(), receiveConnection() is making an
> > entry in senderWorkerMap at the wrong place. Here's the buggy code:
> >
> >     SendWorker vsw = senderWorkerMap.get(sid);
> >     senderWorkerMap.put(sid, sw);
> >     if (vsw != null)
> >         vsw.finish();
> >
> > It makes an entry for the new thread and then calls finish(), which
> > causes the new thread to be removed from the map. The old thread will
> > also get terminated, since finish() will interrupt the thread.
> >
> > 3. Race condition in receiveConnection() and initiateConnection():
> >
> > *In theory*, two peers can keep disconnecting each other's connection.
> >
> > Example:
> > T0: Peer 0 initiates a connection (request 1)
> > T1: Peer 1 receives the connection from peer 0
> > T2: Peer 1 calls receiveConnection()
> > T2: Peer 0 closes its connection to Peer 1 because its ID is lower.
> > T3: Peer 0 re-initiates a connection to Peer 1 from manager.toSend()
> >     (request 2)
> > T3: Peer 1 terminates the older connection to peer 0
> > T4: Peer 1 calls connectOne(), which starts new SendWorker threads for
> >     peer 0
> > T5: Peer 1 kills the connection created in T3 because it receives
> >     another connect request (request 2) from peer 0
> >
> > The problem here is that while Peer 0 is accepting a connection from
> > Peer 1, it can also be initiating a connection to Peer 1. So if they
> > hit the right frequencies, they could sit in a connect/disconnect loop
> > and cause multiple rounds of leader election.
> >
> > I think the cause here is again blocking connects/accepts. A peer
> > starts to take action (killing existing threads and starting new ones)
> > as soon as a connection is established at the TCP level. That is, the
> > code gives us no way to synchronize connects and accepts. We could use
> > non-blocking connects and accepts instead. This would allow us to a)
> > tell a thread not to initiate a connection because the listener is
> > about to accept a connection from the remote peer (using the
> > isAcceptable() and isConnectable() methods of SelectionKey), and b)
> > prevent a thread from initiating multiple connect requests to the same
> > peer. It would simplify synchronization.
> >
> > Any thoughts?
>
> I am all for cleaning up the code :).
>
> Thanks
> mahadev
>
> > -Vishal
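P.S. For the longer-term fix, here is a rough sketch of what a single
Selector-based event loop could look like (purely illustrative; the names
and structure are made up, this is not a patch). One thread owns all
accepts and connects, so the "who keeps the connection" decision from
point 3 can be made in one place, before any worker threads are started:

import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.util.Iterator;

class FLEConnectionLoop {
    // Outgoing side: register for OP_CONNECT instead of blocking. (In real
    // code this would be queued to the selector thread, since registering
    // from another thread while select() is blocked is its own hazard.)
    static void connectOne(Selector selector, InetSocketAddress peerAddr)
            throws IOException {
        SocketChannel ch = SocketChannel.open();
        ch.configureBlocking(false);
        if (ch.connect(peerAddr)) {
            ch.register(selector, SelectionKey.OP_READ);    // connected at once
        } else {
            ch.register(selector, SelectionKey.OP_CONNECT); // still pending
        }
    }

    static void eventLoop(InetSocketAddress listenAddr) throws IOException {
        Selector selector = Selector.open();
        ServerSocketChannel listener = ServerSocketChannel.open();
        listener.configureBlocking(false);
        listener.socket().bind(listenAddr);
        listener.register(selector, SelectionKey.OP_ACCEPT);

        while (!Thread.currentThread().isInterrupted()) {
            selector.select(1000);   // bounded wait; never blocks forever
            Iterator<SelectionKey> it = selector.selectedKeys().iterator();
            while (it.hasNext()) {
                SelectionKey key = it.next();
                it.remove();
                if (key.isAcceptable()) {
                    // Incoming connection: accept without blocking the
                    // listener; the peer's id is read later via OP_READ.
                    SocketChannel peer = listener.accept();
                    if (peer != null) {
                        peer.configureBlocking(false);
                        peer.register(selector, SelectionKey.OP_READ);
                    }
                } else if (key.isConnectable()) {
                    SocketChannel ch = (SocketChannel) key.channel();
                    try {
                        if (ch.finishConnect()) {   // never blocks here
                            key.interestOps(SelectionKey.OP_READ);
                        }
                    } catch (IOException e) {
                        key.cancel();   // connect failed; retry later
                    }
                } else if (key.isReadable()) {
                    // Read the peer id, apply the "higher id wins" rule to
                    // keep exactly one connection per peer pair, and only
                    // then hand the channel off to SendWorker/RecvWorker.
                }
            }
        }
    }
}

Since both accepts and connects complete on the same thread, a peer can no
longer tear down and rebuild connections concurrently from two code paths,
which should kill the connect/disconnect loop described in 3.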