Hi All,

I'm trying to use 0MQ in our project, but recently I found a very tricky 
problem in the pub/sub pattern.

The setup is like this:

Version: 3.2.3, C++ API.

Transport protocol: TCP

One SUB socket, subscribed to everything, high water mark = 1
One PUB socket, high water mark = 1

There is an application-level heartbeat between PUB and SUB, so if the SUB 
socket sees no heartbeat from PUB, it assumes PUB is gone and resets itself.
This means that whenever PUB is shut down, SUB closes the socket, creates it 
again, reconnects, and resubscribes.
(The heartbeat is not necessary for a normal PUB shutdown, but it is essential 
in situations like the PUB server losing power or a cable being cut.)
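
The reset logic described above can be sketched roughly as follows. This is 
only an illustration, not our actual code: the class name HeartbeatMonitor, its 
methods and the timeout value are hypothetical, and the callback stands in for 
the place where the application closes the SUB socket, creates a new one, 
reconnects and resubscribes.

```cpp
#include <chrono>
#include <functional>
#include <utility>

// Hypothetical helper: remembers when the last heartbeat arrived and fires a
// callback once the peer has been silent for longer than the timeout.
class HeartbeatMonitor {
public:
    using Clock = std::chrono::steady_clock;

    HeartbeatMonitor(std::chrono::milliseconds timeout,
                     std::function<void()> on_timeout)
        : timeout_(timeout),
          on_timeout_(std::move(on_timeout)),
          last_seen_(Clock::now()) {}

    // Call whenever a heartbeat message is received from PUB.
    void on_heartbeat(Clock::time_point now = Clock::now()) {
        last_seen_ = now;
    }

    // Call periodically. Returns true if the timeout expired and the
    // callback (close, recreate, reconnect, resubscribe) was invoked.
    bool check(Clock::time_point now = Clock::now()) {
        if (now - last_seen_ < timeout_) {
            return false;
        }
        on_timeout_();
        last_seen_ = now;  // start a fresh waiting period after the reset
        return true;
    }

private:
    std::chrono::milliseconds timeout_;
    std::function<void()> on_timeout_;
    Clock::time_point last_seen_;
};
```

In the receive loop, on_heartbeat() would be called for every heartbeat message 
and check() after every poll timeout.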

What I found is that if I keep restarting the PUB server, e.g. once every 5 
seconds, the SUB socket can fail to subscribe and then never receives anything 
from PUB.
This is very hard to reproduce and happens only once in a few hundred restarts.

Unfortunately we are not allowed to use tcpdump here, so all I can do is use 
strace on the I/O thread to monitor the socket calls.
When it is broken, the messages SUB exchanges with PUB after reconnecting look 
like this:

recvfrom(53, "\377\0\0\0\0\0\0\0\1\177", 12, 0, NULL, NULL) = 10
| 00000  ff 00 00 00 00 00 00 00  01 7f
recvfrom(53, 0x7f881000b9db, 2, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable)
sendto(53, "\377\0\0\0\0\0\0\0\1\177\1\2", 12, 0, NULL, 0) = 12
| 00000  ff 00 00 00 00 00 00 00  01 7f 01 02
recvfrom(53, "\1\1", 2, 0, NULL, NULL)  = 2
| 00000  01 01
recvfrom(53, "\0\0", 8192, 0, NULL, NULL) = 2
| 00000  00 00
sendto(53, "\0\0", 2, 0, NULL, 0)       = 2

The last line is what breaks it. In the normal case, the last line looks like this:
sendto(54, "\0\0\0\1\1", 5, 0, NULL, 0) = 5

After reading this ZMTP document:
http://rfc.zeromq.org/spec:23
I still can't work out the exact meaning of these messages, but here is my 
attempt at explaining them:
After connecting, SUB tries to read 12 bytes from PUB. It gets 10 initially and 
eventually receives the last two as 01 01. (I guess the 12 bytes are some kind 
of signature.)
SUB then sends 12 bytes to PUB; the first 10 bytes are the same, but the last 
two are 01 02.
SUB then tries to read up to 8192 bytes but gets only two, which are 00 00.
SUB then sends 00 00 back to PUB, and that's it: nothing more comes from this 
SUB socket.
The TCP connection is still there, so there must be something wrong with the 
subscription.

In the normal case, SUB sends 00 00 00 01 01 instead of 00 00. I guess this 
must include the subscription message, but I don't understand why it is 5 bytes.
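
One tentative way to read these bytes is sketched below. It assumes the traffic 
uses ZMTP 2.0 style framing (a 10-byte signature ff ... 7f, a revision byte, a 
socket-type byte with 01 = PUB and 02 = SUB, then frames made of one flags 
byte, one length byte and a body). That is an assumption I have not confirmed 
against the libzmq source, and all function names here are made up for 
illustration. Under that reading, 00 00 would be an empty frame and 00 01 01 a 
one-byte frame whose body is 01, which would make the 5 bytes two frames.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

typedef std::vector<std::uint8_t> Bytes;

// Assumed socket-type codes; only the two values seen in the trace are named.
inline std::string socket_type_name(std::uint8_t code) {
    switch (code) {
        case 0x01: return "PUB";
        case 0x02: return "SUB";
        default:   return "other";
    }
}

// True if buf starts with the assumed 10-byte signature: ff, 8 bytes, 7f.
inline bool has_signature(const Bytes& buf) {
    return buf.size() >= 10 && buf[0] == 0xff && buf[9] == 0x7f;
}

// Splits a byte stream into short frames (flags byte, length byte, body).
// Long frames (flags bit 1 set) are not handled; parsing stops at the first
// one, or at the first frame that is incomplete.
inline std::vector<Bytes> split_short_frames(const Bytes& buf) {
    std::vector<Bytes> frames;
    std::size_t pos = 0;
    while (pos + 2 <= buf.size()) {
        const std::uint8_t flags = buf[pos];
        const std::size_t len = buf[pos + 1];
        if ((flags & 0x02) != 0 || pos + 2 + len > buf.size()) {
            break;
        }
        frames.push_back(Bytes(buf.begin() + pos + 2,
                               buf.begin() + pos + 2 + len));
        pos += 2 + len;
    }
    return frames;
}
```

Feeding it the normal-case bytes 00 00 00 01 01 gives two frames (an empty one 
and one containing 01), while the broken-case bytes 00 00 give only the empty 
frame.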

So can someone please explain the exchange sequence for PUB/SUB, and maybe shed 
some light on what is broken here?

Thanks very much
Shan