Hi folks,

As you can tell, I'm investigating zmq and so I've been going through the Guide 
and the myriad examples to learn. I'm testing this on a MBP running OS X 10.6.7 
using the latest zmq v2.1.7 from github.

In http://zguide.zeromq.org/page:all#Node-Coordination , it discusses the 
syncpub & syncsub examples. At the end it says:
<<<<<
A more robust model could be:
        • Publisher opens PUB socket and starts sending "Hello" messages (not 
data).
        • Subscribers connect SUB socket and when they receive a Hello message 
they tell the publisher via a REQ/REP socket pair.
        • When the publisher has had all the necessary confirmations, it starts 
to send real data.
>>>>>

So I made various versions of those examples to try out different features. 
E.g., I was able to add the above handshake using blocking send/recv. So far 
so good.

Then I worked my way up to hacking zmq_poll() into syncpub to explore the 
non-blocking support. I purposely kept the syncsub code using blocking 
send/recv. The code is: https://gist.github.com/978103

The code basically works. If you start up to 10 subscribers, they do the 
dance, ingest the 1M messages, and everybody dies peacefully.

However, with more than 10 subscribers I start seeing surprising behavior. 
If I start, say, 15 all together, they behave as expected. But if I start 
new subscribers after the first 10 have begun receiving the 1M messages, 
the latecomers end up hung, waiting for the publisher to notice they exist 
at all (in the recv at syncsub4.c, line 23).

Steps to reproduce:
in terminal 1:
% syncpub3
in terminal 2:
% syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; 
% syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; syncsub4 &; 
...wait for the publisher to print "Switching to Spew mode..." and then start 
up some more subscribers...
% syncsub4 &; syncsub4 &; syncsub4 &;
% syncsub4 &; syncsub4 &; syncsub4 &;

The later a subscriber arrives, the more likely it is to hang in that 
first recv() call.

I've tried various experiments (and fixed a few bugs in my 
code/understanding :-), but at the moment it comes down to this: the 
usleep() call at line 92 of syncpub3.c lets me tune how easy or hard the 
problem is to reproduce. With no usleep() there at all, the problem always 
happens; as I lengthen the delay, it shows up less and less. With the 
usleep(10) that's in the gist, I see it saturate all 8 cores on this box 
and then reliably hang every latecomer.

The observed behavior seems to be that new connections are *not* completed 
while the publisher is busy spewing as fast as it can, so the latecomers 
are effectively ignored until/unless there's a gap for them to sneak 
through.

Is this something missing in how zmq handles its internal scheduling 
fairness, is it an issue in the kernel itself, or am I missing something 
blatantly obvious?

Thanks,
John

_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev
