[
https://issues.apache.org/jira/browse/ZOOKEEPER-900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12931056#action_12931056
]
Vishal K commented on ZOOKEEPER-900:
------------------------------------
Hi Flavio,
I have a question regarding the logic that determines which connection
to retain if peer 1 and peer 2 decide to communicate with each other.
Suppose peer 1 connects to peer 2. It first sends its sid as a
challenge. Peer 2 reads the sid and determines whether to keep the
connection or initiate a connection back to peer 1. Both determine
that peer 2 should be the one initiating the connection to peer 1
since sid of peer2 > sid of peer1. I am concerned that they both
may not be able to maintain any connection since the handshake is
one-way.
In the current implementation, peer1 disconnects immediately after
writing the challenge to peer 2. It can happen that peer 2 may get a
ClosedChannelException before it reads the challenge from peer 1. As a
result, peer 2 will not initiate a connection to peer 1.
Is this a legitimate problem? If it is, how about we ask peer 2 to
send back a ACK after it reads the challenge. Peer 1 will do a timed
read() after writing a challenge to peer 2. This will hopefully give
peer 2 enough time to read the challenge and take appropriate
action. If peer 2 is really slow, peer 1 will timeout on the read
operation.
-Vishal
> FLE implementation should be improved to use non-blocking sockets
> -----------------------------------------------------------------
>
> Key: ZOOKEEPER-900
> URL: https://issues.apache.org/jira/browse/ZOOKEEPER-900
> Project: Zookeeper
> Issue Type: Bug
> Reporter: Vishal K
> Assignee: Vishal K
> Priority: Critical
> Fix For: 3.4.0
>
>
> From earlier email exchanges:
> 1. Blocking connects and accepts:
> a) The first problem is in manager.toSend(). This invokes connectOne(), which
> does a blocking connect. While testing, I changed the code so that
> connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run()
> does a socketChannel.connect(). After starting AsyncConnect, connectOne
> starts a timer. connectOne continues with normal operations if the connection
> is established before the timer expires, otherwise, when the timer expires it
> interrupts AsyncConnect() thread and returns. In this way, I can have an
> upper bound on the amount of time we need to wait for connect to succeed. Of
> course, this was a quick fix for my testing. Ideally, we should use Selector
> to do non-blocking connects/accepts. I am planning to do that later once we
> at least have a quick fix for the problem and consensus from others for the
> real fix (this problem is big blocker for us). Note that it is OK to do
> blocking IO in SenderWorker and RecvWorker threads since they block IO to the
> respective !
peer.
> b) The blocking IO problem is not just restricted to connectOne(), but also
> in receiveConnection(). The Listener thread calls receiveConnection() for
> each incoming connection request. receiveConnection does blocking IO to get
> peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to the
> peer that had sent the connection request. All of this is happening from the
> Listener. In short, if a peer fails after initiating a connection, the
> Listener thread won't be able to accept connections from other peers, because
> it would be stuck in read() or connetOne(). Also the code has an inherent
> cycle. initiateConnection() and receiveConnection() will have to be very
> carefully synchronized otherwise, we could run into deadlocks. This code is
> going to be difficult to maintain/modify.
> Also see: https://issues.apache.org/jira/browse/ZOOKEEPER-822
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.