Hi Martin,
04.04.2011, 09:16, "Martin Sustrik" <[email protected]>:
> Hi Paul,
>
>> The documentation is actually a bit misleading. After
>> you call shutdown(s, SHUT_RD) you *can* read, up to the point
>> when shutdown was called. It means everything already buffered can
>> be read, and you will read until 0 (zero) is returned from the read call.
>
> What implementation is that? Both POSIX spec and Stevens seem to suggest
> that you can't read after shutdown(s, SHUT_RD).
This is Linux behavior. I just noted a different behavior on FreeBSD.
I currently have no access to other platforms to test on. But if
Linux does that for TCP, why can't ZeroMQ?
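For example, here is a minimal sketch of the Linux behavior I mean (plain
POSIX sockets, not ZeroMQ; the helper name is just illustrative):

#include <sys/socket.h>
#include <unistd.h>

/* On Linux, data received before shutdown(fd, SHUT_RD) stays readable;
   read() returns 0 once everything buffered up to the shutdown is drained. */
static void drain_after_shutdown(int fd)
{
    char buf[4096];
    ssize_t n;

    shutdown(fd, SHUT_RD);                 /* refuse any further incoming data */
    while ((n = read(fd, buf, sizeof buf)) > 0) {
        /* process bytes that were already buffered before the shutdown */
    }
    /* n == 0: the buffered data has been fully consumed */
}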
>
>>> 2. The handshake with all the peers during the shutdown can take an
>>> arbitrarily long time and even cause a deadlock.
>> Probably yes. It's up to you to use it. Many small applications will
>> actually never care. Many huge applications will probably use failure
>> resiliency to overcome the reconfiguration problem. But there are plenty
>> of others where you would stop the entire system when you add, remove
>> or replace a node in the configuration if you have no chance to
>> shut down the socket cleanly. And time is not always very important.
>
> Well, I don't like introducing an API that works well for small apps and
> deadlocks for large apps. Let's rather think of something that works
> consistently in either scenario.
I don't understand your fear of deadlocks. You are always polling before
sending/receiving messages; if you don't, you are in a bit of trouble anyway.
The I/O thread works in a fully non-blocking manner, so it can't deadlock on
sending/receiving bytes on the wire. And if you don't want to shut down
properly, you can always skip it.
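To illustrate, this is roughly the poll-before-send pattern I have in mind
(0MQ 2.x C API; `s` is assumed to be some writable socket such as PUSH):

#include <string.h>
#include <zmq.h>

/* Sketch only: send a message once zmq_poll says the socket is writable,
   so the application itself never blocks on the wire. */
static void send_when_writable(void *s)
{
    zmq_pollitem_t items[1] = { { s, 0, ZMQ_POLLOUT, 0 } };

    if (zmq_poll(items, 1, 1000000) > 0 &&        /* 1 s; microseconds in 2.x */
        (items[0].revents & ZMQ_POLLOUT)) {
        zmq_msg_t msg;
        zmq_msg_init_size(&msg, 5);
        memcpy(zmq_msg_data(&msg), "hello", 5);
        zmq_send(s, &msg, 0);                     /* won't block here */
        zmq_msg_close(&msg);
    }
}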
>
>> Consider the following scenario. You have a node A which
>> pushes work to node B along with 10 other nodes. And
>> you need to remove node B (probably to replace it
>> with C). Node A has the bound socket. Currently you have two
>> ways:
>>
>> * stop producing on A until all the work is consumed by B, then
>> disconnect it, connect C and continue. This can take a lot of time,
>> during which the other workers are also stopped
>> * accept losing messages and react to them downstream,
>> which takes a lot of time to notice (because of the timeout), and
>> probably some more time to find the lost messages in the logs at node A.
>
> You have to keep in mind that messages may be lost at any point on route
> from A to B. Thus, you can't count on being notified about the loss. The
> only way to handle it is a timeout. Btw, even in a simple single-hop
> scenario, the TCP spec mandates keep-alives of at least 2 hours. So, if B is
> killed brutally (such as when power is switched off) A won't be notified
> for at least 2 hours.
>
Yes, I do have timeouts. The timeout is about 5 seconds. I have
lots of messages per second (at least a hundred and up to 10 thousand in
critical places), which means the TCP connection will break very fast if
a machine goes down. If I had a small application which can be idle
for 2 hours, I'd never bother to pause all producers for a reconfiguration
or software update.
>
> 1. Request/reply. In this case the requester can re-send the request after
> the timeout has expired. There are a couple of nice properties of this
> system: it's fully end-to-end, so you are resilient against middle node
> failures. What you get is actually TCP-level reliability ("as long as
> the app is alive it will ultimately get the data through"). The downside
> is that in some rare circumstances a message may be processed twice,
> which does not really matter as the services are assumed to be stateless.
>
Actually about 50% of the services I write are stateful. And indeed it's quite
easy to implement the bookkeeping needed to re-send a request. It could easily
be done in libzapi, in a language binding, or in your own project. Probably the
socket option would work for the small applications you've said you don't care
about :) For big applications I guess they would write it in their own way
(e.g. I usually send both stateful and idempotent requests over the same
socket, determining which is which by application-specific means).
Frankly, it would be great if I could just send a request and receive a reply
without a loop around zmq_poll. But you can't disable EINTR, so the
loop will obviously be there anyway, unless it's a language binding (which can
throw an exception) or a networking library with its own main loop (which can
reap the signal itself and resume), and both of those can do this sort of thing
today and would not become much simpler either way.
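Just to be concrete, the bookkeeping I mean is roughly this (0MQ 2.x C API;
`endpoint`, RETRY_TIMEOUT and the function name are my own illustrative
assumptions, not an existing API):

#include <string.h>
#include <zmq.h>

#define RETRY_TIMEOUT (5 * 1000000)   /* 5 s, in microseconds for 2.x zmq_poll */

/* Send a request and wait for the reply; on timeout, throw the REQ socket
   away (a REQ socket can't re-send before receiving) and retry with a new one. */
static int request_with_retry(void *ctx, const char *endpoint,
                              const char *req, size_t len, zmq_msg_t *reply)
{
    while (1) {
        void *s = zmq_socket(ctx, ZMQ_REQ);
        int zero = 0;
        zmq_setsockopt(s, ZMQ_LINGER, &zero, sizeof zero);  /* 2.1+: discard on close */
        zmq_connect(s, endpoint);

        zmq_msg_t m;
        zmq_msg_init_size(&m, len);
        memcpy(zmq_msg_data(&m), req, len);
        zmq_send(s, &m, 0);
        zmq_msg_close(&m);

        zmq_pollitem_t item = { s, 0, ZMQ_POLLIN, 0 };
        if (zmq_poll(&item, 1, RETRY_TIMEOUT) > 0 && (item.revents & ZMQ_POLLIN)) {
            zmq_msg_init(reply);
            zmq_recv(s, reply, 0);
            zmq_close(s);
            return 0;                                        /* got a reply */
        }
        zmq_close(s);                                        /* timed out: retry */
    }
}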
> 2. Publish/subscribe. In this case we assume infinite feed of messages
> that individual subscribers consume. When shutting down, the subscriber
> has to cut off the feed at some point. Whether it cuts off immediately,
> dropping the messages on the fly or whether it waits for messages on the
> fly to be delivered is irrelevant. The only difference is that the point
> of cut off is slightly delayed in the latter case.
>
Currently, if a subscriber is disconnected for some reason, it can reconnect
and continue where it left off (if its identity is the same). But that's
useless, since you can assume you haven't lost any messages only if you
stopped the publishers. With a proper shutdown you could do a software update
and resume where you left off without losing anything. Without it you must
stop propagating to all subscribers, or have a separate device in front of
every subscriber which queues the data. The latter is quite a big performance
problem.
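(For reference, the reconnect-and-continue setup I mean is just a durable
subscriber with a fixed identity, roughly like this; the identity string and
endpoint are made up:)

#include <zmq.h>

/* Durable subscriber setup: the same identity across restarts lets the
   publisher side keep/resume this peer's queue. */
static void *connect_durable_sub(void *ctx)
{
    void *sub = zmq_socket(ctx, ZMQ_SUB);
    zmq_setsockopt(sub, ZMQ_IDENTITY, "worker-b", 8);   /* stable identity */
    zmq_setsockopt(sub, ZMQ_SUBSCRIBE, "", 0);          /* everything */
    zmq_connect(sub, "tcp://publisher-host:5556");
    return sub;
}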
> 3. Pipeline (push/pull). This is the interesting case. The communication
> is uni-directional, meaning that we can't timeout and resend while at
> the same time we want every message to be delivered. In this case
> automatic acks from consumer to producer can improve reliability
> heavily. When consumer disconnects, the unacked messages would be
> rescheduled to be delivered to a different peer. That messes with the
> message ordering, however, we don't care as the parallelised pipeline
> architecture does not guarantee ordering anyway. NB: this is a
> heuristic, it will improve the reliability, but won't make it perfect.
>
Well, it's also very controversial. TCP is quite reliable itself; it has
ACKs and resending and so on, so you probably can't
compensate for network failure this way. You also can't compensate for
crashes, because you can only be sure the message was delivered to the IO
thread on the other side, not that it was processed by the application (and
in my case, having hundreds to thousands of messages per second makes me
sure I would lose a lot). And it actually doesn't work even for software
updates, because you must call zmq_close(). From that call onward,
messages will not be acked, and ones already acked and already in
the queue will be discarded, because there is no way to read them.
So the only problem I can imagine you would solve is someone
mistakenly killing a connection with tcpkill, or working with some very
buggy router which drops connections often. Please prove me wrong :)
> We should give some more thought to how the reset would interact with
> half-sent/received multipart messages.
>
It's probably a good idea to discard the messages, and to let this option do
that even for non-REQ/REP sockets. It's useful if we started to produce a
multipart message and then got e.g. an out-of-memory error. For a REP socket
it's useful if we know from the first part(s) that we won't accept the request.
But for the latter it's easy to read and discard, so any variant will do.
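Something like this would do for the read-and-discard part (0MQ 2.x, where
ZMQ_RCVMORE is a 64-bit option; the helper name is mine):

#include <stdint.h>
#include <zmq.h>

/* After reading the first part(s) of an unwanted multipart message,
   receive and throw away the remaining parts. */
static void discard_remaining_parts(void *s)
{
    int64_t more = 0;
    size_t more_size = sizeof more;

    zmq_getsockopt(s, ZMQ_RCVMORE, &more, &more_size);
    while (more) {
        zmq_msg_t part;
        zmq_msg_init(&part);
        zmq_recv(s, &part, 0);
        zmq_msg_close(&part);
        zmq_getsockopt(s, ZMQ_RCVMORE, &more, &more_size);
    }
}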
--
Paul
_______________________________________________
zeromq-dev mailing list
[email protected]
http://lists.zeromq.org/mailman/listinfo/zeromq-dev