[ https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12932995#action_12932995 ]
Vishal K commented on ZOOKEEPER-900:
------------------------------------

Hi Flavio,

Thanks. I will take up the shutting down of worker threads in a separate jira. I am cleaning up QCM further to incorporate the change mentioned in my comment on 02/Nov/10 02:09 PM. I have a few more questions:

1. I have a question about the following piece of code in QCM:

    if (remoteSid == QuorumPeer.OBSERVER_ID) {
        /*
         * Choose identifier at random. We need a value to identify
         * the connection.
         */
        remoteSid = observerCounter--;
        LOG.info("Setting arbitrary identifier to observer: " + remoteSid);
    }

Should we allow this? The problem with this code is that if a peer connects twice with QuorumPeer.OBSERVER_ID, we will end up creating threads for this peer twice, which could leave redundant SendWorker/RecvWorker threads behind. I haven't used observers yet. The documentation at http://hadoop.apache.org/zookeeper/docs/r3.3.0/zookeeperObservers.html says that, just like followers, observers should have server IDs. In that case, why do we want to provide a wildcard?

2. Should I add a check to reject connections from peers that are not listed in the configuration file? Currently, we do no sanity check on server IDs. I think this might fix ZOOKEEPER-851. The fix is simple; however, I am not sure whether anyone in the community is relying on this ability.

-Vishal

> FLE implementation should be improved to use non-blocking sockets
> -----------------------------------------------------------------
>
>                 Key: ZOOKEEPER-900
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
>             Project: Zookeeper
>          Issue Type: Bug
>            Reporter: Vishal K
>            Assignee: Vishal K
>            Priority: Critical
>             Fix For: 3.4.0
>
>         Attachments: ZOOKEEPER-900.patch, ZOOKEEPER-900.patch1, ZOOKEEPER-900.patch2
>
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which does a blocking connect.
> While testing, I changed the code so that connectOne() starts a new thread called AsyncConnect. AsyncConnect.run() does a socketChannel.connect(). After starting AsyncConnect, connectOne() starts a timer. connectOne() continues with normal operations if the connection is established before the timer expires; otherwise, when the timer expires it interrupts the AsyncConnect thread and returns. In this way, I can put an upper bound on the amount of time we need to wait for a connect to succeed. Of course, this was a quick fix for my testing. Ideally, we should use a Selector to do non-blocking connects/accepts. I am planning to do that later, once we at least have a quick fix for the problem and consensus from others on the real fix (this problem is a big blocker for us). Note that it is OK to do blocking IO in the SendWorker and RecvWorker threads, since they do blocking IO only to their respective peer.
> b) The blocking IO problem is not restricted to connectOne(); it also appears in receiveConnection(). The Listener thread calls receiveConnection() for each incoming connection request. receiveConnection() does blocking IO to get the peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the peer that sent the connection request. All of this happens on the Listener thread. In short, if a peer fails after initiating a connection, the Listener thread won't be able to accept connections from other peers, because it would be stuck in read() or connectOne(). The code also has an inherent cycle: initiateConnection() and receiveConnection() will have to be very carefully synchronized; otherwise, we could run into deadlocks. This code is going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
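The "quick fix" described in the quoted text (run the blocking connect in a helper thread, interrupt it when a timer expires) can be sketched roughly as follows. This is illustrative only; the class and method names are hypothetical and this is not the attached patch. One caveat the sketch makes explicit: a plain java.net.Socket.connect() ignores thread interrupts, so the timed-out socket must also be closed.

```java
import java.io.IOException;
import java.net.Socket;
import java.net.SocketAddress;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;

// Rough sketch of the timed-connect workaround: a helper task plays the
// AsyncConnect role, and the caller bounds how long it waits for it.
// All names here are made up for illustration.
public class TimedConnectSketch {

    public static Socket connectOne(SocketAddress addr, long timeoutMs) throws IOException {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        AtomicReference<Socket> ref = new AtomicReference<>();
        Future<Socket> pending = pool.submit(() -> {
            Socket s = new Socket();
            ref.set(s);
            s.connect(addr);          // the blocking call we are bounding
            return s;
        });
        try {
            return pending.get(timeoutMs, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            pending.cancel(true);     // interrupt the helper thread...
            Socket s = ref.get();
            if (s != null) {
                s.close();            // ...and close the socket, since a plain
            }                         // Socket.connect() ignores interrupts
            throw new IOException("connect to " + addr + " timed out");
        } catch (InterruptedException | ExecutionException e) {
            throw new IOException("connect to " + addr + " failed", e);
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that java.net.Socket also offers connect(addr, timeoutMs) directly; the thread-plus-timer form above just mirrors the AsyncConnect description in the comment.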
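For the "real fix" the comment proposes (a Selector doing non-blocking connects), the core pattern looks roughly like this. It is a minimal sketch, not QuorumCnxManager code; connectWithTimeout is a made-up name.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;

// Sketch of a Selector-driven non-blocking connect: connect() returns
// immediately, and the Selector tells us when the handshake completes,
// so no thread is ever stuck inside connect().
public class NonBlockingConnectSketch {

    /** Attempts a non-blocking connect, waiting at most timeoutMs. */
    public static SocketChannel connectWithTimeout(InetSocketAddress addr, long timeoutMs)
            throws IOException {
        SocketChannel channel = SocketChannel.open();
        channel.configureBlocking(false);          // connect() no longer blocks
        Selector selector = Selector.open();
        try {
            channel.register(selector, SelectionKey.OP_CONNECT);
            if (!channel.connect(addr)) {
                // Wait for OP_CONNECT instead of blocking in connect()
                if (selector.select(timeoutMs) == 0 || !channel.finishConnect()) {
                    channel.close();
                    throw new IOException("connect to " + addr + " timed out");
                }
            }
            return channel;                        // connected within the bound
        } finally {
            selector.close();
        }
    }

    public static void main(String[] args) throws IOException {
        // Loopback demo: accept on an ephemeral port, then connect to it.
        ServerSocketChannel server = ServerSocketChannel.open();
        server.bind(new InetSocketAddress("127.0.0.1", 0));
        int port = ((InetSocketAddress) server.getLocalAddress()).getPort();
        SocketChannel c = connectWithTimeout(new InetSocketAddress("127.0.0.1", port), 2000);
        System.out.println("connected=" + c.isConnected());
        c.close();
        server.close();
    }
}
```

The same Selector can also carry OP_ACCEPT and OP_READ keys, which is what would let the Listener keep accepting peers while a slow handshake is still in flight.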
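The sanity check proposed in point 2 of the comment could look roughly like this. The names are hypothetical; the only assumption taken from the code base is that QuorumPeer.OBSERVER_ID is a wildcard value (Long.MAX_VALUE).

```java
import java.util.Set;

// Hypothetical sketch of the point-2 sanity check: reject handshakes whose
// advertised server id is not in the configured view. Rejecting the
// OBSERVER_ID wildcard also closes the duplicate-observer hole from point 1,
// since a peer can no longer connect twice under the wildcard id and leave
// redundant SendWorker/RecvWorker threads behind.
public class ServerIdCheckSketch {
    static final long OBSERVER_ID = Long.MAX_VALUE; // wildcard, as in QuorumPeer

    /** Returns true if a connection from remoteSid should be accepted. */
    static boolean acceptRemoteSid(long remoteSid, Set<Long> configuredSids) {
        if (remoteSid == OBSERVER_ID) {
            return false; // no wildcard: observers must use their real id
        }
        return configuredSids.contains(remoteSid);
    }
}
```

Whether to actually drop the wildcard is the open question in the comment; the sketch only shows where such a check would sit.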