Hi,
I was seeing random problems in a REQ-ROUTER setup where clients
couldn't connect to the server anymore until the server got restarted.
The network traffic didn't shed any light and libzmq gave no errors.
It simply and silently didn't work.
The first glimmer of hope came when we figured out in what situation
the problem occurred and how to reproduce it. Namely, the problem occurs
whenever a client system crashes (the cluster had some power problems,
so we had some sudden cold restarts). This is easily reproducible with
"echo b >/proc/sysrq-trigger" on a virtual machine.
Now I've found the problem. Each client has a fixed identity, its
hostname. And the ROUTER socket does not allow 2 connections with the
same identity (how could it?). After a crash and reboot the client
reconnects, the ROUTER sees an identity collision and ignores the new
client. This can be changed by setting ZMQ_ROUTER_HANDOVER for the
socket to 1, which is how I fixed the problem in the server now.
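
To make the collision concrete, here is a toy model in Python of how I
understand the identity table to behave. This is not libzmq code:
RouterModel, connect and crash_and_reconnect are names made up for this
sketch, "node17" is a placeholder hostname, and the real ROUTER socket
does a lot more than this.

```python
class RouterModel:
    """Toy stand-in for the ROUTER identity table (not libzmq's code)."""

    def __init__(self, handover=False):
        self.handover = handover  # plays the role of ZMQ_ROUTER_HANDOVER
        self.peers = {}           # identity -> connection label
        self.closed = []          # connections dropped by a handover

    def connect(self, identity, conn):
        old = self.peers.get(identity)
        if old is not None and not self.handover:
            # Identity collision: the new connection is ignored and the
            # caller gets no feedback beyond this return value.
            return False
        if old is not None:
            # Handover: the stale connection loses the identity.
            self.closed.append(old)
        self.peers[identity] = conn
        return True


def crash_and_reconnect(handover):
    """Same identity connects twice, as after a crash and reboot."""
    router = RouterModel(handover)
    router.connect("node17", "conn-before-crash")
    accepted = router.connect("node17", "conn-after-reboot")
    return accepted, router.peers["node17"]
```

In this model crash_and_reconnect(False) leaves the identity bound to
the dead connection, while crash_and_reconnect(True) hands it to the
rebooted client.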
But I think there still is a problem here:
1) The new client is simply ignored. There is no feedback that
anything has gone wrong. Shouldn't ZMQ send an error reply saying that
the identity was unacceptable? Or should it close the connection? Could
it do both, i.e. close the connection with a message that the identity
was unacceptable? How about a monitoring event for this?
2) Why doesn't the old connection die? Without any traffic the TCP
socket wouldn't detect an error immediately. But I would expect it to
die eventually. I haven't reproduced this, but I think the reconnect
problem persisted for days, meaning the connection to the long since
crashed client never went away.
3) What happens to the messages from the duplicate client? Debugging
this problem I saw that the initial handshake and CURVE complete just
fine. I also see a ZAP request and reply, as expected. Then the
identity message comes in, gets received by the system socket, decoded
through CURVE, put in the incoming pipe for the ROUTER socket and the
other end of the pipe gets woken up. That causes the identity to get
checked against existing identities and the message is ignored. So far
so good. But the connection isn't closed and the pipe isn't put back
into sleeping state. The follow-up messages also get decoded and added
to the pipe. BUT at that point the peer is detected to be already
awake and isn't woken up. The ROUTER socket never resets the pipe to
the sleeping state so it never gets woken up for this pipe again. This
causes all further messages from this client to be ignored.
In my case the client only sends one message and then waits for a
reply. But what if you have a PUSH-ROUTER connection and the client
just keeps on sending messages? I think they will all get added to the
pipe and never processed or freed.
So I'm thinking this allows for a DoS attack.
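
Here is a small Python model of the wake-up logic as I described it in
3), to show why the queue can only grow. Again this is a sketch of my
understanding and not libzmq's actual pipe code: Pipe, draining_reader,
ignoring_reader and simulate are hypothetical names, and the model
assumes the writer only signals the reader when the reader is marked
as asleep.

```python
from collections import deque


class Pipe:
    """Toy pipe: the writer wakes the reader only if it is asleep."""

    def __init__(self, on_activate):
        self.queue = deque()
        self.reader_awake = False
        self.on_activate = on_activate
        self.wakeups = 0

    def write(self, msg):
        self.queue.append(msg)
        if not self.reader_awake:
            self.reader_awake = True
            self.wakeups += 1
            self.on_activate(self)

    def read(self):
        if self.queue:
            return self.queue.popleft()
        # Only a failed read puts the reader back to sleep.
        self.reader_awake = False
        return None


def draining_reader(pipe):
    # Well-behaved reader: reads until empty, which re-arms the wake-up.
    while pipe.read() is not None:
        pass


def ignoring_reader(pipe):
    # Models the rejected duplicate identity: nothing is read, so the
    # pipe stays marked awake and is never signalled again.
    pass


def simulate(reader, n):
    """Write n messages; return (wake-ups delivered, messages left queued)."""
    pipe = Pipe(reader)
    for i in range(n):
        pipe.write(b"msg %d" % i)
    return pipe.wakeups, len(pipe.queue)
```

With the draining reader every write triggers a wake-up and nothing
stays queued. With the ignoring reader there is exactly one wake-up and
every message stays in the queue, which is the unbounded growth I am
worried about for a PUSH-ROUTER pair.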
Best regards,
Goswin