[zeromq-dev] Handling network failures in parallel pipeline ventilator?

Joe Planisky Thu, 23 Aug 2012 14:02:17 -0700

I'm a little stumped about how to handle network failures in a system that uses 
PUSH/PULL sockets as in the Parallel Pipeline case in chapter 2 of The Guide.


As in the Guide, suppose my ventilator is pushing tasks to 3 workers.  It 
doesn't matter which task gets pushed to which worker, but it's very important 
that all tasks eventually get sent to a worker.  

Everything is working fine; tasks are being load balanced to workers, workers 
are doing their thing and sending the results on to a sink.  Now suppose 
there's a network failure between the ventilator and one of the workers. 
Suppose the ethernet cable to one of the worker machines is unplugged.

Based on what we've seen in practice, the ventilator socket will still attempt 
to push some number of tasks to the now disconnected worker before realizing 
there's a problem.  Tasks intended for that worker start backing up, presumably 
in ZMQ buffers and/or in buffers in the underlying OS (Ubuntu 10.04 in our 
case).  Eventually, the PUSH socket figures out that something is wrong and 
stops trying to send additional tasks to that worker. All new tasks are then 
load balanced to the remaining workers.  

However, the tasks that are queued up for the disconnected worker are stuck and 
are never sent anywhere unless or until the original worker comes back online.  
If the original worker never comes back, those tasks never get executed. (If it 
does come back, it gets a burst of all the backed up tasks and the PUSH socket 
resumes load balancing new tasks to all 3 workers.)

We'd like to prevent this backup from happening or at least minimize the number 
of tasks that get stuck.  We've tried setting high water marks, send and 
receive timeouts, and send and receive buffer sizes in ZMQ to small values 
(e.g. 1) hoping that it would cause the PUSH socket to notice the problem 
sooner, but at best we still get several dozen task messages backed up before 
the socket notices the problem and stops trying.  (Our task messages are small, 
about 520 bytes each.)

If we have to, we can deal with the same task getting sent to more than one 
worker on an occasional basis, but we'd like to avoid that if possible.

We're using ZMQ 2.2.0, but are also investigating the 3.2.0 release candidate.  
If it matters, we're accessing ZMQ with Java using the jzmq bindings.  The 
underlying OS is Ubuntu 10.04.

Any suggestions for how to deal with this?

--
Joe


_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev

[zeromq-dev] Handling network failures in parallel pipeline ventilator?

Reply via email to