Re: Problems in FLE implementation

Flavio Junqueira Sun, 05 Sep 2010 07:17:20 -0700

Hi Vishal, I haven't tried to reboot the machines, but Ivan Kelly did, if I'm not mistaken, with VMs (comment from Aug. 6).

-Flavio

On Sep 3, 2010, at 9:14 PM, Vishal K wrote:

Hi Flavio,

On Fri, Sep 3, 2010 at 3:02 PM, Flavio Junqueira <f...@yahoo-inc.com> wrote:

Vishal, 60-80 seconds is definitely high, and I would expect people to complain if they were observing such an amount of time to recover. I personally haven't seen any such a case.

Can you describe how you were trying to reproduce the bug? On physical machines, it took me 15 retries (reboot -n leader) to reproduce the problem. On VMs it is lot more frequent.

On my end, you have good points, but I'm not entirely convinced that we need changes as you're proposing them. Seeing a patch would definitely help to determine. If you can't provide a patch due to legal issue, we should work on one or more to fix at least some of the issues you observed.

You are right, my fixes may not be the best approach. My intention was to have a quick fix for our internal use and then start-off a discussion for real fix. I will send out the diff soon.

Thanks.
-Vishal

On Sep 3, 2010, at 8:51 PM, Vishal K wrote:

Hi Mahadev,

To be honest, yes, we need the quick fix. I am really surprised why anyone
else is not seeing this problem. There is nothing special with our setup. If
you look at the JIRA, I have posted logs from various setups (different OS,
using physical machines, using virtual machines, etc). Also, the bug is
evident from the code. Pretty much every developer in our team has hit this
bug.

Now, we have an application that is highly time-sensitive. Maybe most of the
applications that ZK is running on today can tolerate a 60-80 seconds of FLE
convergence. For us such a long delays (under normal conidtions) are not
acceptable.

It will be nice if people can provide some feedback on how time sensitive
their application is? Is 60-80 seconds delay in FLE acceptable?
What has been your experience with running ZK in production? How often do
you have leader reboots?

Feedback will be greatly apprecaited.

Thanks.
-Vishal

On Fri, Sep 3, 2010 at 1:44 PM, Mahadev Konar <maha...@yahoo-inc.com> wrote:

Hi Vishal,
Thanks for picking this up. My comments are inline:

On 9/2/10 3:31 PM, "Vishal K" <vishalm...@gmail.com> wrote:

Hi All,

I had posted this message as a comment for ZOOKEEPER-822. I thought it

might
be a good idea to give a wider attention so that it will be easier to

collect feedback.

I found few problems in the FLE implementation while debugging for:

https://issues.apache.org/jira/browse/ZOOKEEPER-822. Following the email

below might require some background. If necessary, please browse the
JIRA. I

have a patch for 1. a) and 2). I will send them out soon.

1. Blocking connects and accepts:

a) The first problem is in manager.toSend(). This invokes connectOne(),

which does a blocking connect. While testing, I changed the code so that
connectOne() starts a new thread called AsyncConnct(). AsyncConnect.run()
does a socketChannel.connect(). After starting AsyncConnect, connectOne

starts a timer. connectOne continues with normal operations if the
connection is established before the timer expires, otherwise, when the

timer expires it interrupts AsyncConnect() thread and returns. In this
way,

I can have an upper bound on the amount of time we need to wait for
connect
to succeed. Of course, this was a quick fix for my testing. Ideally, we

should use Selector to do non-blocking connects/accepts. I am planning to
do

that later once we at least have a quick fix for the problem and
consensus
from others for the real fix (this problem is big blocker for us). Note

that
it is OK to do blocking IO in SenderWorker and RecvWorker threads since
they
block IO to the respective peer.
Vishal, I am really concerned about starting up new threads in the server.

We really need a total revamp of this code (using NIO and selector). Is the
quick fix really required. Zookeeper servers have been running in

production
for a while, and this problem hasn't been noticed by anyone. Shouldn't we
fix it with NIO then?

b) The blocking IO problem is not just restricted to connectOne(), but
also
in receiveConnection(). The Listener thread calls receiveConnection() for

each incoming connection request. receiveConnection does blocking IO to
get

peer's info (s.read(msgBuffer)). Worse, it invokes connectOne() back to
the
peer that had sent the connection request. All of this is happening from
the
Listener. In short, if a peer fails after initiating a connection, the

Listener thread won't be able to accept connections from other peers,
because it would be stuck in read() or connetOne(). Also the code has an
inherent cycle. initiateConnection() and receiveConnection() will have to

be
very carefully synchronized otherwise, we could run into deadlocks. This

code is going to be difficult to maintain/modify.

2. Buggy senderWorkerMap handling:
The code that manages senderWorkerMap is very buggy. It is causing

multiple
election rounds. While debugging I found that sometimes after FLE a node

will have its sendWorkerMap empty even if it has SenderWorker and
RecvWorker
threads for each peer.
IT would be great to clean it up!! I'd be happy to see this class be
cleaned
up! :)

a) The receiveConnection() method calls the finish() method, which

removes
an entry from the map. Additionally, the thread itself calls finish()

which
could remove the newly added entry from the map. In short,
receiveConnection

is causing the exact condition that you mentioned above.

b) Apart from the bug in finish(), receiveConnection is making an entry
in
senderWorkerMap at the wrong place. Here's the buggy code:
SendWorker vsw = senderWorkerMap.get(sid);

senderWorkerMap.put(sid, sw);
if(vsw != null)
vsw.finish();
It makes an entry for the new thread and then calls finish, which causes
the

new thread to be removed from the Map. The old thread will also get
terminated since finish() will interrupt the thread.

3. Race condition in receiveConnection and initiateConnection:

*In theory*, two peers can keep disconnecting each other's connection.

Example:
T0: Peer 0 initiates a connection (request 1)
                                    T1: Peer 1 receives connection from

peer 0
                                    T2: Peer 1 calls receiveConnection()

T2: Peer 0 closes connection to Peer 1 because its ID is lower.
T3: Peer 0 re-initiates connection to Peer 1 from manger.toSend()

(request
2)
                                    T3: Peer 1 terminates older

connection
to peer 0
                                    T4: Peer 1 calls connectOne() which

starts new sendWorker threads for peer 0
                                    T5: Peer 1 kills connection created

in
T3 because it receives another (request 2) connect request from 0

The problem here is that while Peer 0 is accepting a connection from Peer
1
it can also be initiating a connection to Peer 1. So if they hit the
right

frequencies they could sit in a connect/disconnect loop and cause
multiple
rounds of leader election.

I think the cause here is again blocking connects()/accepts(). A peer

starts
to take action (to kill existing threads and start new threads) as soon

as a
connection is established at the* *TCP level. That is, it does not give
us

any control to synchronized connect and accepts. We could use
non-blocking

connects and accepts. This will allow us to a) tell a thread to not
initiate
a connection because the listener is about to accept a connection from

the
remote peer (use isAcceptable() and isConnectable()methods of
SelectionKey)
and b) prevent a thread from initiating multiple connect request to the
same

peer. It will simplify synchronization.

Any thoughts?

I am all for cleaning up the code :).

Thanks
mahadev
-Vishal

flavio
junqueira

research scientist

f...@yahoo-inc.com
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300    fax (408) 349 3301

flavio
junqueira

research scientist

f...@yahoo-inc.com
direct +34 93-183-8828

avinguda diagonal 177, 8th floor, barcelona, 08018, es
phone (408) 349 3300 fax (408) 349 3301

Re: Problems in FLE implementation

Reply via email to